
Chinese Romanization and its application in information processing


1. Challenge of Computer to Chinese Characters

We live in the information age, in which computers and networks play an increasingly important role in human life. Language is an effective carrier of information. In this age, the computer, with a history of little more than 60 years, poses a challenge to Chinese characters, which have a history of more than 6,000 years.

The Chinese character is a kind of ideophonographic character, that is, a graphic character that represents an object or a concept together with an associated sound element.

Chinese characters form a very large character set.

Most character sets in the world include only a limited number of characters. The number of characters in the character sets of different languages is as follows:

Language      Number of characters
Latin         26
Slavic        33
Armenian      38
Tamil         36
Burmese       52
Thai          44
Lao           27
Tibetan       35
Korean        24
Japanese      48

The number of Chinese characters is far greater than in the languages above. The following table gives the number of Chinese characters recorded in Chinese dictionaries from ancient to modern China:

Editor           Dictionary                        Year       Number of Chinese characters
Xu Shen          说文解字 (SHUOWENJIEZI)           100 A.D.   9,353
Gu Yewang        玉篇 (YUPIAN)                     543        16,917
Chen Pengnian    广韵 (GUANGYUN)                   1008       26,194
Ding Du          集韵 (JIYUN)                      1067       53,525
Mei Yingzuo      字汇 (ZIHUI)                      1615       33,179
Chen Tingjing    康熙字典 (KANGXIZIDIAN)           1716       47,043
Zhang Qiyun      中文大字典 (ZHONGWEN DAZIDIAN)    1971       49,888
Xu Zhongshu      汉语大字典 (HANYU DAZIDIAN)       1990       54,678
Leng Yulong      中华字海 (ZHONGHUA ZIHAI)         1994       85,000

ZHONGHUA ZIHAI records as many as 85,000 Chinese characters, but some of the entries in this dictionary are signs without meaning or pronunciation and cannot be considered authentic Chinese characters. Generally speaking, the number of Chinese characters is more than 60,000, which makes it the largest character set in the world.

In the 20th century, some experts tried to invent a Chinese typewriter for typing Chinese characters. The Chinese typewriter is different from the Remington typewriter, which is based on the Latin alphabet: it is extremely complicated and cumbersome.

One example is the Chinese typewriter invented by Wally Johnson, which is now kept in the office of Vickie Fu Doll, Chinese and Korean Studies Librarian in the East Asian Library of the University of Kansas, USA¹.


Fig-1 Wally Johnson’s Chinese typewriter

The main tray — which is like a typesetter's font of lead type — has about two thousand of the most frequent Chinese characters. Two thousand Chinese characters are not nearly enough for literary and scholarly purposes, so there are also a number of supplementary trays from which less frequent Chinese characters may be retrieved when necessary.


Fig-2 The tray of typesetter's font of lead type

The pieces of type in the tray are tiny and all of a single metallic shade, so finding the right character becomes a maddening task for the typist.

Another problem is the principle upon which the characters are ordered in the tray.

By radical of Chinese character? By total stroke count of Chinese character? Both of these methods would result in numerous Chinese characters under the same heading.

By rough frequency of Chinese character? By telegraph code of Chinese character? Both of these methods demand a good memory on the part of the typist.

Unfortunately, nobody seems to have thought to use the easiest and most user-friendly method of arranging the Chinese characters according to their pronunciation.

For all of the above reasons, using a Chinese typewriter was an excruciating experience. Following is a precious photograph of Wally Johnson working at his typewriter:


Fig-3 Wally Johnson working at his typewriter

Another photograph shows Wally Johnson taking a short break in place:


Fig-4 Wally Johnson taking a short rest

These photos vividly convey the suffering that is associated with using a Chinese typewriter.

Computers also use a keyboard derived from the Remington typewriter for human-computer interaction. Obviously, the Chinese typewriter described above cannot serve as a computer keyboard for human-computer interaction.

The design of the computer keyboard is based on the Latin alphabet. If we use the Latin alphabet to represent the pronunciation of Chinese characters, we obtain the easiest and most user-friendly method of inputting and outputting Chinese characters, namely according to their pronunciation. The Romanization of Chinese is therefore very helpful for Chinese Information Processing (CIP).

2. Romanization of Chinese

The words in a language, which are written according to a given script (the converted system), sometimes have to be rendered according to a different system (the conversion system). The conversion is indispensable in that it permits the univocal transmission of a written message between two countries using different writing systems or exchanging a message, the writing of which is different from their own.

There are two basic methods of conversion of a system of writing: transliteration and transcription.

Transliteration is the operation which consists of representing the characters of an entirely alphabetical or alphanumeric system of writing by the characters of the conversion alphabet. In principle, this conversion should be made character by character: each character of the converted alphabet is rendered by one character, and only one character, of the conversion alphabet, to ensure the complete and unambiguous reversibility of the conversion alphabet into the converted alphabet (re-transliteration).

Transcription is the operation which consists of representing the characters of a language, whatever the original system of writing, by the phonetic system of letters or signs of the conversion language. A transcription system is of necessity based on the orthographical conventions of a conversion language and its alphabet. The users of a transcription system must therefore have knowledge of the conversion language to be able to pronounce the characters correctly. Transcription is not strictly reversible. Transcription may be used for the conversion of all writing systems. It is the only method that can be used for systems that are not entirely alphabetical and for all ideophonographic writing systems, such as Chinese.

Romanization is the conversion of non-Latin writing systems to the Latin alphabet by means of transliteration or transcription. To carry out Romanization it is possible to use either transliteration or transcription or a combination of these two methods, according to the nature of the converted system.
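The difference in reversibility between the two methods can be illustrated with a minimal Python sketch. The four-letter Cyrillic table and the four-character Chinese table are assumptions chosen only for illustration; they are not taken from any standard, and tones are omitted.

```python
# Illustrative one-to-one table for a few Cyrillic letters (an assumption
# made for this sketch, not a table taken from any standard).
TRANSLIT = {"д": "d", "о": "o", "м": "m", "а": "a"}
REVERSE = {latin: cyrillic for cyrillic, latin in TRANSLIT.items()}

def transliterate(text, table):
    """Convert character by character using a one-to-one table."""
    return "".join(table[ch] for ch in text)

# Illustrative transcription table: several Chinese characters share
# the same Pinyin syllable (tones omitted).
TRANSCRIBE = {"北": "bei", "背": "bei", "京": "jing", "景": "jing"}

def transcribe(text):
    """Represent each character by the syllable that spells its sound."""
    return "".join(TRANSCRIBE[ch] for ch in text)

# Transliteration can be undone exactly (re-transliteration).
latin = transliterate("дома", TRANSLIT)
restored = transliterate(latin, REVERSE)

# Transcription cannot: two different words give the same Pinyin string.
same_spelling = transcribe("北京") == transcribe("背景")
```

Here `restored` equals the original Cyrillic word, whereas `same_spelling` is true, so the Pinyin string alone does not determine which Chinese word was transcribed.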

On 11 February 1958, the National People’s Congress of China approved the Scheme for the Chinese Phonetic Alphabet (Hanyu Pinyin, or Pinyin). This scheme is based on the principle of transcription in Romanization, so we refer to it as Chinese Romanization.


Fig-5 The Scheme for the Chinese Phonetic Alphabet was approved

3. Pinyin Scheme of Chinese

This scheme provides rules for alphabetic spelling of syllables in Standard Chinese Language of China (Putonghua).

In the Pinyin scheme, each Chinese character generally represents one syllable. One word may consist of one or more syllables. A Chinese syllable can be divided into two parts: initial part and final part.
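The division of a syllable into initial and final can be sketched in Python as follows. The list of 21 initials is that of the Hanyu Pinyin scheme; the function name split_syllable is hypothetical, tone marks are not handled, and syllables spelled with y or w are treated here as having no initial, which is a simplification made for this sketch.

```python
# The 21 initials of Hanyu Pinyin; two-letter initials come first so that
# "zh", "ch" and "sh" are matched before "z", "c" and "s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]

def split_syllable(syllable):
    """Return (initial, final) for a toneless Pinyin syllable.

    A syllable that begins with none of the initials is returned with
    an empty initial.
    """
    s = syllable.lower()
    for initial in INITIALS:
        if s.startswith(initial):
            return initial, s[len(initial):]
    return "", s

print(split_syllable("bei"))    # ('b', 'ei')
print(split_syllable("zhong"))  # ('zh', 'ong')
print(split_syllable("an"))     # ('', 'an')
```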

Initial part of Chinese syllable:

Final part of Chinese syllable:

The table of Chinese syllabic forms is as follows.


Fig-6 The table of Chinese syllable forms

Notes:

This table covers all syllables of Putonghua except the syllable ê, the syllable er and the retroflex syllables.

The table includes 392 syllables. Together with the syllable ê, the syllable er and the retroflex syllables, Putonghua has 405 basic syllables.

The structure of the Chinese syllable is simple, and it is easy to learn and to remember.

Generally speaking, a Chinese character can be represented by a syllable. Therefore we can use the syllables in Pinyin form to represent all Chinese characters in order to realize Chinese Romanization.

Because the computer keyboard is designed on the basis of the Latin alphabet, we can use Pinyin to represent Chinese characters in human-computer interaction.

4. ISO 7098 Information and Documentation: Romanization of Chinese

In 1979, the Chinese delegate proposed at the ISO/TC 46 meeting (Paris, Warsaw) that the Scheme for the Chinese Phonetic Alphabet be adopted as an international standard. In 1982, ISO 7098 Documentation and Information – Chinese Romanization was approved as the first edition at the ISO/TC 46 meeting (Nanjing). In 1991, ISO 7098 was technically revised and became the second edition (ISO 7098:1991).

In China, Pinyin, the international standard for the Romanization of Chinese, gives impetus to new information technology in the information age. In computer applications and mobile communication, it is used to input and output Chinese characters on computers, the web and mobile phones. More than 80% of Chinese people now use Pinyin for Chinese information processing, and Pinyin has become an important tool for human-computer interaction.


Fig-7 Input and output of Chinese characters by Pinyin


Fig-8 Transcription from Pinyin to Chinese characters

In China, Pinyin is also used in natural language processing and language technology (machine translation, information extraction, information retrieval, text data mining, etc.).

At the international level, Pinyin has been adopted by most libraries around the world. It provides access to bibliographic material in the Chinese language in documentation (both traditional and computerized). In the field of computerized documentation, Pinyin plays an active role in human-computer interaction. At the end of the 20th century, the Library of Congress (USA) used Pinyin to catalogue the Chinese books in its collection (700,000 books). At the same time, the Bibliothèque universitaire des langues et civilisations in Paris consulted a team of sinological librarians from all over the country, including the Bibliothèque Nationale de France, on the Chinese word segmentation of ISO 7098, in order to establish a common guideline on Chinese word segmentation in Pinyin. The National Library of Australia has also adopted Pinyin for Chinese Romanization in documentation.


Fig-9 Discussion of Chinese Romanization in the Bibliothèque universitaire des langues et civilisations, Paris

More and more people around the world now learn Chinese as a foreign language by means of Pinyin. Pinyin has become an important tool for teaching and learning Chinese. In Computer-Assisted Chinese Language Learning, Pinyin is used for the input and output of Chinese characters in human-computer interaction.

These facts show that Pinyin is a useful tool in human-computer interaction not only in China but also around the world.

5. Index of ambiguity for Chinese syllables

However, the number of basic Chinese syllables is only 405. These 405 syllables have to represent the pronunciation of all Chinese characters (more than 8,000 characters)². On average, one Chinese syllable therefore has to represent more than 19 Chinese characters (8,000/405 ≈ 19.75).

For example, the Pinyin syllable /bei/ can represent the following 66 Chinese characters:

北 邶 苝 輩 鉳 貝 狈 呗 珼 钡 垻 唄 鋇 梖 蛽 郥 备 惫 鞴 俻 偹 備 僃 憊 犕 糒 卑 碑 椑 諀 庳 箄 鞞 鹎 背 褙 偝 揹 鄁 禙 倍 蓓 碚 棓 焙 輩 悲 棑 琲 被 陂 骳 杯 盃 孛 臂 鐴 牬 桮 誖 韝 愂 鐾  藣 昁

The Pinyin syllable /jing/ can represent the following 88 Chinese characters:

京 惊 猄 澋 燝 綡 鶁 景 鲸 璟 憬 婛 暻 倞 幜 经 径 劲 茎 泾 胫 迳 痉 俓 桱 殌 涇 痙 鵛 颈 弪 坙 刭 静 精 婧 菁 睛 靓 儬 腈 箐 聙 䴖 敬 儆 擏 曔 璥 驚 警 憼 井 荆 汫 妌 宑 丼 肼 汬 镜 竟 境 璄 獍 璄 傹 净 瀞 婙 竫 婙 晶 橸 粳 仱 旌 劤 坓 坕 旍 梷 竧 殑 旌 兢 麠

This means that a Pinyin syllable is ambiguous in its representation of Chinese characters.

We can use the ambiguity index to describe the degree of ambiguity of a Pinyin syllable. The ambiguity index of a Pinyin syllable (I) equals the number of Chinese characters represented by this Pinyin syllable (N) minus 1. The formula is as follows:

I = N – 1

This formula means that if one Pinyin syllable can represent N Chinese characters, its ambiguity index (I) equals N – 1.

Therefore we may use the ambiguity index of Pinyin to describe the ambiguity degree of Pinyin syllable in representation of Chinese characters.

If a Pinyin syllable can represent one Chinese character, its ambiguity index is zero. If it can represent two Chinese characters, its ambiguity index is 2 – 1 = 1. If it can represent three Chinese characters, its ambiguity index is 3 – 1 = 2, and so on.

In our example, the Pinyin syllable /bei/ can represent 66 Chinese characters, so its ambiguity index is 66 – 1 = 65; the Pinyin syllable /jing/ can represent 88 Chinese characters, so its ambiguity index is 88 – 1 = 87.
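The calculation can be written as a short Python sketch. The function name ambiguity_index is hypothetical, and the character counts are simply the figures quoted in the text for /bei/ and /jing/.

```python
def ambiguity_index(n_characters):
    """Return I = N - 1 for a syllable that represents N characters."""
    if n_characters < 1:
        raise ValueError("a syllable must represent at least one character")
    return n_characters - 1

# Character counts quoted in the text for the two example syllables.
COUNTS = {"bei": 66, "jing": 88}
indices = {syllable: ambiguity_index(n) for syllable, n in COUNTS.items()}

print(indices)  # {'bei': 65, 'jing': 87}
```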

However, if we combine the two monosyllables /bei/ and /jing/ to form the bi-syllabic word /beijing/, the ambiguity index is reduced, because /beijing/ can represent only three Chinese bi-syllabic words:

北京, 背景, 背静

The ambiguity index of /beijing/ is thus reduced to 3 – 1 = 2.

If we further capitalize the first letter of /beijing/ to give /Beijing/, the ambiguity index is reduced to 1 – 1 = 0. This means that /Beijing/ is a Pinyin word without ambiguity: it has only one sense, the capital of China:

北京

Therefore, if we link Pinyin monosyllables to form a polysyllabic Chinese word, the ambiguity index is reduced. This is the advantage of linking monosyllables into one polysyllabic Chinese word.
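The same index can be computed at the word level. The following Python sketch uses a two-entry lexicon that contains only the readings given in the text; the lexicon structure, the case-sensitive keys and the function name word_ambiguity_index are assumptions made for illustration.

```python
# Toy lexicon limited to the readings listed in the text. Keys are
# case-sensitive, so the capitalized proper noun has its own entry.
LEXICON = {
    "beijing": ["北京", "背景", "背静"],
    "Beijing": ["北京"],
}

def word_ambiguity_index(pinyin_word, lexicon=LEXICON):
    """Return I = N - 1, where N is the number of words spelled this way."""
    readings = lexicon.get(pinyin_word)
    if not readings:
        raise KeyError(f"no entry for {pinyin_word!r}")
    return len(readings) - 1

print(word_ambiguity_index("beijing"))  # 2
print(word_ambiguity_index("Beijing"))  # 0
```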

However, Chinese linguistics at present has no clear definition of the Chinese word. It is difficult to decide the boundary (dividing line) of a Chinese word, which in turn makes it difficult to link monosyllables into a polysyllabic Chinese word. The definition of a Chinese proper noun, by contrast, is relatively clear. It is not so difficult to link monosyllables into a polysyllabic Chinese proper noun (personal names, geographic names, language names, ethnic names, tribe names, religion names, etc.), because the boundary of a polysyllabic proper noun is easy to decide according to the standards or regulations.

For this reason, at the 38th plenary meeting of ISO/TC 46 (6 May 2011, Sydney), the Chinese delegate proposed to further update ISO 7098:1991 to reflect current Chinese Romanization practice and new developments not only in China but also around the world. At the 39th plenary meeting of ISO/TC 46 (11 May 2012, Berlin), ISO/TC 46 resolved to accept China’s proposal at the Working Draft (WD) stage. On 5 November 2013, the CD ballot was approved. At the 41st plenary meeting of ISO/TC 46 (4 May 2014, Washington D.C.), the Chinese delegate will submit the Draft International Standard (DIS), revised according to the comments received at the CD ballot stage.

For the updated version of ISO 7098, the Chinese delegate has proposed, and will propose, the following suggestions for the transcription rules of personal names, geographic names, language names, ethnic names, tribe names and religion names in the Chinese language. We believe that this kind of transcription will be the first step towards Chinese transcription based on the Chinese word (including monosyllabic words and polysyllabic words).

6. Suggestions for Updating ISO 7098

6.1 Chinese personal names are to be written separately with the surname first, followed by the given name written as one word, with the initial letters of both capitalized. The traditional compound surnames are to be written together without a hyphen. The double two-character surnames are to be written together with a hyphen and the initial letters of both capitalized.

6.2 A surname, given name or seniority term that follows one of the adjuncts “Xiao”, “Lao”, “Da” or “A” is to be written separately from the adjunct, with the initial letters of both capitalized.

For example, Xiao Liu (小刘, younger Liu), Lao Qian (老钱, older Qian).

6.3 Certain proper names and titles have already fused and are written as one word with the initial letter capitalized.

For example, Kongzi (孔子, Master Confucius), Xishi (西施, acme of beauty, 5th cent. B.C.).

6.4 Chinese place names should separate the geographical proper name from the geographical feature name and capitalize the first letter of both.

For example, Beijing Shi (北京市, Beijing Municipality), Hebei Sheng (河北省, Hebei Province).

6.5 If a geographical proper name or geographical feature name has a monosyllabic adjunct, write them together as one word.

For example, Jingshan Houjie (景山后街, Jingshan Back Street), Chaoyangmennei Nanxiaojie (朝阳门内南小街, South Street inside Chaoyangmen Gate).

6.6 The names of smaller villages and towns and other place names in which it is not necessary to distinguish between the proper place name and the geographical feature name are to be written together as one unit.

For example, Wangcun (王村, Wang Village), Zhoukoudian (周口店, a historical site).

6.7 In accordance with the principle of adhering to the original, non-Chinese personal names and place names are to be written in their original Roman (Latin) spelling, while personal names and place names from non-romanized scripts are to be spelled according to the rules of Romanization for that language. For reference, the Chinese characters or their Hanyu Pinyin equivalent may be noted after the original name. Under certain conditions, the Hanyu Pinyin may precede or replace the original spelling.

For example, Marx (马克思, Makesi), Darwin (达尔文, Daerwen).

6.8 Transcribed names which have already become Chinese words are to be spelled according to their Chinese pronunciation.

For example, Feizhou (非洲, Africa), Nanmei (南美, South America), Deguo (德国, Germany), Dongnanya (东南亚, Southeast Asia).

6.9 In some cases, all the letters in personal name and geographical name may be capitalized.

For example, BEIJING (北京, Beijing), LI HUA (李华, Li Hua).

6.10 In abbreviated personal names, the surname is written either with an initial capital letter or entirely in capital letters; each syllable of the given name is reduced to its first letter, which is capitalized and followed by a dot.

For example, Li H. or LI H. for Li Hua (李华), Wang J.G. or WANG J.G. for Wang Jianguo (王建国).

6.11 A geographical name that is written together as one word is abbreviated by taking the capitalized first letter of each syllable; the capital letters are written together.

For example, BJ for Beijing (北京), HZ for Hangzhou (杭州).
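Rules 6.9 to 6.11 can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the caller supplies the syllables of the name already separated and without tone marks; it does not handle compound or double surnames.

```python
def full_name(surname, given_syllables, all_caps=False):
    """Surname first, given name written as one word (all capitals optional)."""
    given = "".join(given_syllables).capitalize()
    name = f"{surname.capitalize()} {given}"
    return name.upper() if all_caps else name

def abbreviate_personal_name(surname, given_syllables, all_caps=False):
    """Keep the surname; reduce each given-name syllable to 'X.'."""
    family = surname.upper() if all_caps else surname.capitalize()
    initials = "".join(s[0].upper() + "." for s in given_syllables)
    return f"{family} {initials}"

def abbreviate_place_name(syllables):
    """Join the capitalized first letter of every syllable."""
    return "".join(s[0].upper() for s in syllables)

print(full_name("li", ["hua"], all_caps=True))                            # LI HUA
print(abbreviate_personal_name("wang", ["jian", "guo"]))                  # Wang J.G.
print(abbreviate_place_name(["hang", "zhou"]))                            # HZ
```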

6.12 Language names are written as one word with the initial letter capitalized.

For example, Hanyu 汉语 (Chinese), Yingyu 英语 (English).

6.13 Ethnic names and tribe names are written as one word with the initial letter capitalized.

For example, Hanzu 汉族 (Chinese ethnic group), Maolizu 毛利族 (Maori tribe).

6.14 Religion names are written as one word with the initial letter capitalized.

For example, Jidujiao 基督教 (Christianity), Tianzhujiao 天主教 (Catholicism).

In detail, personal names and geographical names should be spelled according to the regulations Spelling Rules for Chinese Personal Names and Spelling Rules for Chinese Geographical Place Names (the part of Chinese Geographical Names).

The detailed spelling rules for common words are more complex than the rules for these proper nouns (named entities). The rules of Pinyin orthography for Chinese common words are included in the National Standard of China Basic Rules for Hanyu Pinyin Orthography (GB/T 16159-2011).

These facts show that Pinyin is a useful tool in human-computer interaction not only in China but also around the world.

References

[1] Scheme of Chinese phonetic alphabet, Selections of Norms and Standards for Language and Script of China, Beijing: Standards Press of China, 1997, p. 441.

[2] Directives for the promotion of Putonghua, promulgated by the State Council of China, Selections of Norms and Standards for Language and Script of China, Beijing: Standards Press of China, 1997, pp. 439-440.