Chinese Language Interface


This paper briefs Members on the way forward for the development of a Chinese language interface in Hong Kong.


2. When information is stored inside a computer, it must be encoded according to a pre-defined coding scheme. For information in Chinese, several coding schemes now exist. The vast majority of Chinese applications in Hong Kong use the "BIG-5" coding scheme, which is based on the traditional Chinese character set. The coding standard currently used in the Mainland of China is called "GB" (guobiao), which is based on the simplified character set and is not widely used in Hong Kong. Both coding schemes have variants. For example, BIG-5+ was developed to enhance BIG-5, and GB has an expanded variant called GBK.
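As an illustration of what "different coding schemes" means in practice, the short sketch below encodes one character under several schemes. It uses Python's built-in codecs, whose names (`big5`, `gb2312`, `gbk`) are Python's own labels and are assumed here to stand in for the schemes named above; the helper `encode_all` is hypothetical.

```python
# Encode the same character under different Chinese coding schemes.
# Codec names are Python's built-in labels, used here for illustration.
SCHEMES = ["big5", "gb2312", "gbk"]


def encode_all(char):
    """Return {scheme: hex string of the encoded bytes} for one character."""
    return {scheme: char.encode(scheme).hex() for scheme in SCHEMES}


codes = encode_all("中")
for scheme, hex_bytes in codes.items():
    print(f"{scheme:7s} {hex_bytes}")
```

The same character is stored as different byte values under BIG-5 and GB, so text written under one scheme cannot be read under the other without conversion.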

3. BIG-5 and GB are limited in scope to different degrees. Of the roughly 60,000 known Chinese characters, BIG-5 contains approximately 13,000. The original GB version contained 6,763 characters and the expanded GBK version covers around 20,000. A common shortcoming of these coding schemes is that they omit a fair number of characters commonly used in Hong Kong. Many of the omitted characters were developed in Hong Kong or have a Cantonese lineage. To cover them, users resort to assigning their own internal codes to these characters for use in their computer systems. This works well on stand-alone computers. But when these computers are connected to a wider network, the user-defined characters cause problems in communication and data exchange, because the same internal code may stand for different characters on different systems.
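The difference in coverage can be checked roughly with the sketch below. It counts how many ideographs in the basic ISO 10646 block (U+4E00 to U+9FA5) each scheme can represent, using Python's built-in codecs as a proxy for the published standards. The helpers `coverage` and `can_encode` are hypothetical names, and the codec tables may differ slightly from the standards themselves, so the counts should be read as approximate.

```python
# Count how many ideographs in U+4E00..U+9FA5 each scheme can encode,
# using Python's built-in codecs as a proxy for the published standards.
FIRST, LAST = 0x4E00, 0x9FA5


def can_encode(text, scheme):
    """True if every character of text exists in the given scheme."""
    try:
        text.encode(scheme)
        return True
    except UnicodeEncodeError:
        return False


def coverage(scheme):
    """Number of code points in the block that the scheme can encode."""
    return sum(can_encode(chr(cp), scheme) for cp in range(FIRST, LAST + 1))


for scheme in ("gb2312", "big5", "gbk"):
    print(f"{scheme:7s} {coverage(scheme)} of {LAST - FIRST + 1}")

# A traditional-form character is absent from the original GB set,
# and a simplified-form character is absent from BIG-5.
print(can_encode("龍", "gb2312"), can_encode("龙", "big5"))
```

A character that fails this check under a given scheme is exactly the kind that users end up assigning to their own internal codes.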

4. The current critical issue of using the Chinese language in electronic communication thus revolves around the existence of multiple coding standards with each of them covering only a subset of known Chinese characters. If the Chinese communities around the world are to participate fully in the digitally connected world in the Information Age, we must resolve this issue.

Government Initiative

5. An important element in our "Digital 21" Information Technology Strategy is the development of an open and common Chinese language interface for users in the community who prefer to communicate in Chinese in carrying out electronic transactions.

6. To facilitate electronic communication within Government, we have built up a set of Chinese characters, the Government Common Character Set (GCCS), to encompass common characters which are not included in the BIG-5 coding scheme. The GCCS now consists of around 3,000 characters, most of which are unique to Hong Kong, such as characters used in addresses or in colloquial Cantonese. Since 1995, we have been using BIG-5, supplemented by the GCCS, in the implementation of office automation within Government. We have also made the GCCS freely available through the Government web site (http://www.info.gov.hk/gccs). As a result, the GCCS is gaining popularity in the local community. Each month Internet users download the GCCS from the Government web site around 5,000 times. The Information Technology Services Department and the Official Languages Agency work closely together to update the GCCS to take account of newly developed characters.

7. Separately, the Electronic News Media and Publishing Consortium (ENMPC), which is a non-profit-making body comprising most of the major operators in the local publishing industry, and the Information Networking Laboratories of the Chinese University of Hong Kong have been working together on the "Universal Hong Kong Cantonese Characters Set Font Development Project". The project is financed by the Industrial Support Fund and aims to establish a set of industry standard Cantonese characters for use by local publishers and for facilitating electronic transfer of information in Chinese. Publishers, vendors, Internet service providers and related parties have worked together under this project to develop a standard set of Cantonese characters which will be free for all publishers to use. A set of electronic filters will be developed so that local publishers can convert publications prepared with their own character sets to the standard character set of the ENMPC. The ENMPC standard character set will also be widely published through the Internet and distributed on CD-ROMs.

8. Notwithstanding all the efforts described above, we should ultimately adopt a coding scheme which can cater for as large a set of Chinese characters as possible. We are now working closely with other governments and institutions under the aegis of the International Organization for Standardization (ISO) in the development of the ISO 10646 standard. This is an international coding standard which aims at developing one single common character set to encompass the "Han" characters of all Asian languages. The current version of the ISO 10646 standard has defined 20,902 Chinese characters. Its next version will cover around 60,000 Chinese characters. We are actively working within the framework of the ISO to have the commonly used Hong Kong characters included in this international standard. We have also widely consulted local bodies in the process of providing input to the ISO. When the new ISO 10646 is published (scheduled for release in phases starting in 2000), we expect that, as a superset of all known coding schemes, it will form the basis of data conversion and exchange between the various coding schemes and will ease the existing limitations of Chinese language computing.
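The role of a superset standard as the basis of conversion can be sketched as follows. Python strings hold Unicode code points, which are aligned with ISO 10646, so decoding from one scheme and encoding into another passes through the common character set. The function `convert` is a hypothetical helper and the codec names are Python's, not part of any Government specification.

```python
def convert(data, source, target):
    """Convert bytes between coding schemes via the common character set."""
    text = data.decode(source)   # source scheme -> ISO 10646 / Unicode
    return text.encode(target)   # ISO 10646 / Unicode -> target scheme


big5_bytes = "中文".encode("big5")
gbk_bytes = convert(big5_bytes, "big5", "gbk")

print(big5_bytes.hex(), "->", gbk_bytes.hex())
print(gbk_bytes.decode("gbk"))
```

With a common pivot, each scheme needs only one mapping to and from the superset rather than a separate table for every pair of schemes. Note that this changes only the coding of each character; it does not turn traditional forms into simplified forms or vice versa.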

Public Access to Electronic Service Delivery Scheme

9. We are now preparing for the implementation of the Electronic Service Delivery (ESD) scheme to provide public services to the community electronically via an information infrastructure with an open and common interface. The first phase of the scheme is scheduled for implementation in the latter half of 2000. Developing a Chinese language interface is crucial to the success of the implementation of the scheme as Chinese is the mother tongue of the majority of Hong Kong people. We need to adopt a common Chinese character set to facilitate data exchange between Government departments and the public under the scheme. Furthermore, we will have to provide easy-to-use tools for the community to transact business with Government under the scheme using Chinese as the medium of communication.

10. Given that we predominantly use traditional Chinese characters in Hong Kong, the ESD system will initially use traditional Chinese characters to interface with the local community. We intend to use the BIG-5 character set, complemented by the updated version of the GCCS, to be the definitive set of Chinese characters for ESD information interchange. For Government departments' systems which are not using the BIG-5 scheme, data interchange between them and the ESD system will be achieved by conversion.
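Conversion into the definitive character set will not always be lossless, because a departmental system may hold characters that the target set lacks. The sketch below, under the assumption that Python's `big5` codec approximates the target set (it does not include the GCCS), converts data and reports the characters that could not be carried over. The function name `convert_with_report` is hypothetical.

```python
def convert_with_report(data, source, target="big5"):
    """Convert bytes to the target scheme and list unmappable characters.

    Unmappable characters are replaced by "?" in the converted output.
    """
    text = data.decode(source)
    unmapped = []
    for ch in text:
        try:
            ch.encode(target)
        except UnicodeEncodeError:
            unmapped.append(ch)
    converted = text.encode(target, errors="replace")
    return converted, unmapped


source_data = "中龙".encode("gbk")
converted, unmapped = convert_with_report(source_data, "gbk")
print(converted, unmapped)
```

Reporting the failures rather than silently substituting them lets the owning department decide whether a character needs to be added to the supplementary set.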

11. We will also require the future ESD operator to be ISO 10646 ready. This will allow flexibility for the ESD infrastructure to support applications in both traditional and simplified Chinese characters as and when there is such a need and minimise major adjustments to the ESD system to cope with the introduction of future versions of the ISO 10646 standards.

Inputting Chinese

12. Chinese input methods can be classified into three types : keyboard for word form, keyboard for phonetics and non-keyboard. The choice of Chinese input method is essentially a matter of personal preference on the basis of efficiency and applicability. ESD service users with their own devices (e.g. PCs) will be free to choose their own preferred input methods. For providing ESD access facilities in public places, we intend to adopt the following input methods -

  1. keyboard for word form : Cangjie and KanYi as they are the most commonly used input methods in this category;

  2. keyboard for phonetics : both Putonghua and Cantonese pin-yin methods; and

  3. non-keyboard : a pen-based method or other pointing devices.

13. The ESD interface will also allow name input by the Chinese Commercial Code (as stated on identity cards) to facilitate the use of ESD services, many of which require the users to input their own names.
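Name input by Chinese Commercial Code amounts to a table lookup from four-digit codes to characters. The sketch below shows the idea with a three-entry table. The function `name_from_codes` and the table `CCC_TABLE` are hypothetical, and the sample entries are illustrative only and should be verified against an authoritative code table before any use.

```python
# Illustrative subset of a Chinese Commercial Code table. The entries
# are assumptions for demonstration and must be verified before use.
CCC_TABLE = {"7115": "陳", "1129": "大", "2429": "文"}


def name_from_codes(codes):
    """Translate space-separated four-digit codes into characters."""
    chars = []
    for code in codes.split():
        if len(code) != 4 or not code.isdigit():
            raise ValueError(f"not a four-digit code: {code!r}")
        if code not in CCC_TABLE:
            raise KeyError(f"code not in table: {code}")
        chars.append(CCC_TABLE[code])
    return "".join(chars)


print(name_from_codes("7115 1129 2429"))
```

Because the user types only digits, this method works on any device with a numeric keypad and does not depend on knowledge of a particular Chinese input method.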

Consultation with the Information Infrastructure Advisory Committee

14. We have consulted the Information Infrastructure Advisory Committee which supported our proposed approach as described in this paper.

Chinese Language Interface

15. Through our efforts in developing the GCCS, providing input to the ISO 10646 initiative, supporting initiatives developed by local industry and academic institutions, and implementing the ESD scheme, we seek to develop a standard Chinese character set and a common Chinese language interface to facilitate communication both within Hong Kong and with other Chinese speaking communities elsewhere. The standards and techniques thus developed for a common Chinese language interface will provide impetus for the private sector to develop tools and systems to support the conduct of electronic commerce in the Chinese language. These developments will in turn help promote Hong Kong as an Internet content hub.

Information Technology and Broadcasting Bureau
January 1999