SIGHAN First International Chinese Word Segmentation Bakeoff


The bakeoff was held as part of the Second Meeting of SIGHAN (the ACL Special Interest Group on Chinese Language Processing), July 11-12, 2003, in Sapporo, Japan, in conjunction with ACL 2003.


Instructions for the bakeoff. (Chinese versions: traditional; simplified.)

See below for important dates.


There is a large literature on segmenting Chinese text into words, and many approaches have been proposed. However, it has been very difficult to compare the results of different approaches, because researchers have not been testing their systems on common test corpora. While it is recognized that there is no single correct segmentation, and that different applications may require different segmentations, it is nonetheless desirable to compare segmentation algorithms on common datasets so that one can understand which algorithms are most promising, independent of a particular application.

We aim to address this issue by inviting researchers who work on Chinese word segmentation to put their systems to the test on a common set of training and test corpora. The results of this competition will be published, and we hope they will inform future work in this area.
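The announcement does not specify how submissions will be scored. As a minimal sketch of how two segmentations of the same text could be compared, the Python below computes word-level precision, recall, and F-measure by matching character spans. The function names `word_spans` and `score_segmentation` are hypothetical, and this metric is an assumption for illustration, not the bakeoff's official scoring procedure.

```python
def word_spans(words):
    """Return the set of (start, end) character spans covered by each word."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def score_segmentation(gold, predicted):
    """Compare two segmentations of the same text.

    Returns (precision, recall, f_measure), counting a predicted word as
    correct only if both of its boundaries match a gold-standard word.
    """
    if "".join(gold) != "".join(predicted):
        raise ValueError("segmentations must cover the same text")
    if not gold or not predicted:
        return 0.0, 0.0, 0.0
    gold_spans = word_spans(gold)
    predicted_spans = word_spans(predicted)
    correct = len(gold_spans & predicted_spans)
    precision = correct / len(predicted_spans)
    recall = correct / len(gold_spans)
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative data only: a two-word reference and a three-word system output.
gold = ["中国", "人民"]
predicted = ["中", "国", "人民"]
precision, recall, f_measure = score_segmentation(gold, predicted)
```

In this toy example only the word 人民 matches in both boundaries, so one of three predicted words and one of two gold words is correct.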


Training and test corpora will come from four sources:

  1. The Academia Sinica (Taiwan) treebank (Taiwan Big Five encoding).
  2. The Beijing University Institute of Computational Linguistics Corpus (GB encoding).
  3. The Penn Chinese treebank (GB encoding).
  4. Hong Kong City University corpus (HK Big Five encoding).

Each of these corpora has been hand-segmented according to its own standard. Sizes of training and test corpora are to be determined and will depend upon the amounts available from the four sources.
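Because the four corpora use different encodings, a system tested on more than one of them has to decode each corpus appropriately. The sketch below assumes that the Python codecs `big5`, `gb2312`, and `big5hkscs` correspond to the encodings named above; the corpus keys and the `decode_corpus_bytes` helper are hypothetical, and the actual files may need a different codec (for example `gbk`).

```python
# Assumed mapping from corpus to Python codec name (not from the announcement).
CORPUS_CODECS = {
    "academia_sinica": "big5",          # Taiwan Big Five
    "beijing_university": "gb2312",     # GB
    "penn_chinese_treebank": "gb2312",  # GB
    "hk_cityu": "big5hkscs",            # HK Big Five
}


def decode_corpus_bytes(corpus, raw):
    """Decode raw bytes from the named corpus into a Python string."""
    codec = CORPUS_CODECS[corpus]
    return raw.decode(codec)


# The same text is stored as different bytes under each encoding.
sample = "中文"
gb_bytes = sample.encode("gb2312")
big5_bytes = sample.encode("big5")
decoded_gb = decode_corpus_bytes("penn_chinese_treebank", gb_bytes)
decoded_big5 = decode_corpus_bytes("academia_sinica", big5_bytes)
```

Decoding to Unicode strings up front lets the same segmentation code run on all four corpora; output can be re-encoded with the same codec before submission.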

Participants will be able to elect to be tested on any or all of the corpora, except that participants from the institutions providing the corpora will not be allowed to test on their own corpus.

In addition to electing one or more corpora, participants may enter either or both of two tracks: an Open Track and a Closed Track. For the Closed Track, participants will be allowed to use ONLY the materials from the training corpus corresponding to each elected test corpus. For the Open Track, participants may use any resources they choose, including proprietary dictionaries; however, they will be required, in their summaries (see below), to document which of their segmentation decisions were based on material other than the training corpus or what their systems inferred algorithmically from it.

The training and testing materials will be made available according to a strict schedule as outlined below. Specific instructions on the format of the segmented test data will be provided, and these instructions must be followed exactly.

After the results are reported back to the participants, the participants will be asked to provide a two-page summary of their system for inclusion in the SIGHAN Workshop proceedings.

More details of the process will be posted in due course to the web page listed at the end of this message.


March 15, 2003: Training materials and complete instructions available at the website (see below), along with information on and references to the various segmentation standards.
April 22, 2003: Testing materials available at the website.
April 25, 2003: Segmented test materials due back to FTP site by 5:00 PM, U.S. Eastern Daylight Time. The format of the returned segmentations must adhere to the guidelines described in the published instructions.
May 10, 2003: Bakeoff results announced privately to participants.
May 25, 2003: Four-page system descriptions due.
July 11, 2003: Full results published at the SIGHAN Workshop.


Check the bakeoff web page for developments.

You may also contact Richard Sproat with questions regarding the contest.
