Chinese names are distinctive in several respects.

The characters used for surnames and for given names are also used to form ordinary words. This makes it difficult to identify names solely by the characters they contain.

Although the common Chinese surnames can be viewed as a finite set, the set of Chinese given names is effectively open-ended.

Chinese given names are usually one to three characters long, with two being the most common. Very often, these two-character names are also ordinary words. It is equally common, however, for the two characters to have no semantic or morphological relationship to each other.

As a result, identifying Chinese personal names (and other proper names) is a difficult part of tokenizing Chinese character strings. Many successful word identification systems rely on dedicated algorithms or name databases to handle the problem. Even with an algorithmic approach, a good database is still needed to discover the distinguishing characteristics that can be used to identify names.
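To illustrate why a surname list alone is not enough, here is a minimal sketch of a database-driven name spotter. It is not the method of any particular system: the tiny SURNAMES set, the function names, and the max_given_len default of two (the most common given-name length) are all assumptions made for the example. It proposes every span that starts with a listed surname and is followed by one or two Chinese characters.

```python
# Hypothetical, tiny surname list for illustration only;
# a real system would load a much larger name database.
SURNAMES = {"王", "李", "张", "陈", "欧阳"}


def is_cjk(ch):
    """True if ch lies in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def candidate_names(text, surnames=SURNAMES, max_given_len=2):
    """Return (start, surname, given) triples for every span that begins
    with a listed surname followed by 1..max_given_len CJK characters."""
    candidates = []
    max_surname_len = max(len(s) for s in surnames)
    for i in range(len(text)):
        for slen in range(1, max_surname_len + 1):
            surname = text[i:i + slen]
            if surname not in surnames:
                continue
            for glen in range(1, max_given_len + 1):
                given = text[i + slen:i + slen + glen]
                # Skip spans that run off the end or contain non-CJK characters.
                if len(given) == glen and all(is_cjk(c) for c in given):
                    candidates.append((i, surname, given))
    return candidates


# A genuine name is found, but with two competing boundaries.
print(candidate_names("王小明去北京"))
# → [(0, '王', '小'), (0, '王', '小明')]

# Here 王 is part of the ordinary word 国王 ("king"), yet it still
# triggers two false candidates.
print(candidate_names("国王很高兴"))
# → [(1, '王', '很'), (1, '王', '很高')]
```

The second call shows the over-generation that follows from surname characters also appearing in ordinary words. A practical system would need additional evidence, such as statistics drawn from a large name database, to rank or reject these candidates.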

It is not easy to construct a good database of personal names. Governments hold very large lists of names, but for privacy reasons ordinary people are unlikely to have access to them. Research institutes studying natural language processing may have their own databases, but these may not be open to the general public.


