SIGHAN First International Chinese Word Segmentation Bakeoff


The following comprises the complete description of the training and testing for the First International Chinese Word Segmentation Bakeoff. By participating in this competition, you are declaring that you understand these descriptions, and that you agree to abide by the specific terms as laid out below.


Dimension 1: Corpora

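Four corpora are available for this bakeoff: the Academia Sinica (AS) corpus, the CityU corpus, the UPenn corpus, and the Beijing University corpus.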

Training materials from these corpora will be available March 15. Links to descriptions of these corpora and the segmentation standards assumed will also be provided.

You may declare that you will return results on any subset of these corpora. For example, you may decide that you will test on the UPenn Corpus and the Beijing University corpus. The only constraint is that you must not select a corpus where you have knowingly had previous access to the testing portion of the corpus. A corollary of this is that a team may not test on the data from their own institution.

Dimension 2: Open/Closed

You may decide to participate in either an open test or a closed test, or both.

In the open test you will be allowed to train on the training set for a particular corpus, and in addition you may use any other material including material from other training corpora, proprietary dictionaries, material from the WWW and so forth. If you elect the open test, you will be required, in the two-page writeup of your results, to explain what percentage of your correct/incorrect results came from which sources. For example, if you score an F measure of 0.7 on words in the testing corpus that are out-of-vocabulary with respect to the training corpus, you must explain how you got that result: was it just because you have a good coverage dictionary, do you have a good unknown word detection algorithm, etc?

In the closed test you may only use training material from the training data for the particular corpus you are testing on. No other material is allowed.


When you download the training corpora, you will be asked to register and provide various information about your site, including the contact person, and you will be asked to declare which tracks you will be participating in. A possible declaration would be:

Corpus           Open   Closed
AS Corpus         X       X
CityU Corpus      X
UPenn Corpus
Beijing Corpus            X

In this case, the participant would be declaring that they intend to participate in both the open and closed tests on the Academia Sinica corpus, in the open test only on the CityU corpus, in the closed test only on the Beijing corpus, and that they will not be participating in the tests on the UPenn corpus.

Once the data are downloaded, you will be assigned a participant ID. This ID should be used for submitting your final results and for uploading your two-page description.

Format of the data

Both training and testing data will be published in the original coding schemes used by the data sources. The training data will be formatted as follows.

  1. There will be one sentence per line.
  2. Words and punctuation symbols will be separated by spaces.
  3. There will be no further annotations, such as part-of-speech tags: if the original corpus includes those, those will be removed.
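The format above can be read with a few lines of code. The following is a minimal sketch, not an official tool: the function names are hypothetical, and the `encoding` argument must be set to whatever coding scheme the corpus description states (codec names such as "big5" or "gb2312" are assumptions here).

```python
def parse_segmented_line(line: str) -> list[str]:
    """Return the words and punctuation symbols on one training line."""
    # Drop the line terminator, then split on ASCII spaces only; empty
    # strings produced by repeated spaces are discarded.
    return [token for token in line.rstrip("\r\n").split(" ") if token]


def read_training_file(path: str, encoding: str) -> list[list[str]]:
    """Read a training file: one sentence per line, tokens separated by spaces."""
    # `encoding` is an assumption supplied by the caller, e.g. "big5" or
    # "gb2312"; it must match the corpus's original coding scheme.
    with open(path, encoding=encoding) as handle:
        return [parse_segmented_line(line) for line in handle]
```

Splitting on the ASCII space only, rather than on all whitespace, avoids treating a full-width space inside the text as a word separator.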


The corpora have been made available by the providers for the purposes of this competition only. By downloading the training and testing corpora, you agree that you will not use these corpora for any other purpose than as material for this competition. Petitions to use the data for any other purpose MUST be directed to the original providers of the data. Neither SIGHAN nor the ACL will assume any liability for a participant's misuse of the data.


The test data will be available for each corpus at the website at 00:30, U.S. Eastern Daylight time, April 22, 2003. The test data will be in the same format as described for the training data, but of course spaces will be removed.

You will have roughly three days to process the data, format the results and return them to the designated FTP site. The final due date/time is:

April 25, 2003, 17:00, U.S. Eastern Daylight Time.

You should upload your results in a single ZIP file called <participant_number>.zip to the following FTP site:

Late submissions will not be scored.
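One way to build the ZIP file is sketched below using Python's standard zipfile module. The function name, the example ID, and the choice to store files without directory paths are assumptions made for illustration; only the archive name <participant_number>.zip comes from the instructions above.

```python
import zipfile
from pathlib import Path


def package_results(participant_id: str, result_files: list[str], out_dir: str = ".") -> Path:
    """Bundle result files into <participant_id>.zip and return the archive path."""
    archive_path = Path(out_dir) / f"{participant_id}.zip"
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name in result_files:
            # Assumption: store each file under its base name so that the
            # archive contains no directory structure.
            archive.write(name, arcname=Path(name).name)
    return archive_path
```

For example, `package_results("17", ["as_closed.txt", "pk_closed.txt"])` would write `17.zip` in the current directory; both the ID and the file names here are made up.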

The format of the result must adhere to the format described for the training data. In particular, there must be one line per sentence, and there must be the same number of lines in the returned data as in the data available from the site. Segmented words and punctuation must be separated by spaces, and there should be no further annotations (e.g. part-of-speech tags) on the segmented words. The data must be returned in the same coding scheme as they were published in. Participants are reminded that ASCII character codes may occur in Chinese text to represent Latin letters, numbers and so forth: such codes should be left in their original coding scheme. Do not convert them to their GB/Big5 equivalents. Similarly, GB/Big5 codings of Latin letters or Arabic numerals should be left in their original coding, and not converted to ASCII.
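Before uploading, you may wish to check your output against these requirements. The sketch below is a hypothetical self-check, not part of the official scoring: it assumes both files have already been decoded into lists of lines, and it verifies only the line count and that removing the spaces from each result line gives back the characters of the corresponding test line.

```python
def check_submission(test_lines: list[str], result_lines: list[str]) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(test_lines) != len(result_lines):
        problems.append(
            f"line count differs: expected {len(test_lines)}, got {len(result_lines)}"
        )
        return problems
    for number, (source, result) in enumerate(zip(test_lines, result_lines), start=1):
        # Removing the inserted spaces must reproduce the original characters.
        # This also catches ASCII letters or digits that were converted to
        # their GB/Big5 full-width forms, or the reverse.
        original = source.rstrip("\r\n").replace(" ", "")
        returned = result.rstrip("\r\n").replace(" ", "")
        if original != returned:
            problems.append(f"line {number}: characters differ from the test data")
    return problems
```

A check of this kind does not detect leftover annotations that happen to be absent from the test line in a different way, such as tags attached with a separator; those would show up as character differences and are reported by line number.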

The results will be scored completely automatically. The scripts that were used to score will be made publicly available. The measures that will be reported are precision, recall, and an evenly-weighted F-measure. We will also report scores for in-vocabulary and out-of-vocabulary words.
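To make the reported measures concrete, the following sketch computes word precision, recall, the evenly-weighted F-measure, and recall on out-of-vocabulary words. It is an illustration only and is not the official scoring script: the function names are hypothetical, and matching words by their character offsets is an assumed method that may differ in detail from the scripts that will be released.

```python
def spans(words):
    """Map a list of words to the set of (start, end) character offsets."""
    result, start = set(), 0
    for word in words:
        result.add((start, start + len(word)))
        start += len(word)
    return result


def score(gold_sentences, test_sentences):
    """Return (precision, recall, f_measure) over all sentences."""
    correct = gold_total = test_total = 0
    for gold, test in zip(gold_sentences, test_sentences):
        gold_spans, test_spans = spans(gold), spans(test)
        # A word is correct when both of its boundaries match the gold standard.
        correct += len(gold_spans & test_spans)
        gold_total += len(gold_spans)
        test_total += len(test_spans)
    precision = correct / test_total if test_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def oov_recall(gold_sentences, test_sentences, training_vocabulary):
    """Recall restricted to gold words that never occur in the training data."""
    found = total = 0
    for gold, test in zip(gold_sentences, test_sentences):
        test_spans = spans(test)
        start = 0
        for word in gold:
            end = start + len(word)
            if word not in training_vocabulary:
                total += 1
                found += (start, end) in test_spans
            start = end
    return found / total if total else 0.0
```

For a gold sentence segmented as 我们 / 是 / 学生 / 。 and a system output of 我们 / 是 / 学 / 生 / 。, three of the five proposed words are correct and three of the four gold words are found, giving a precision of 0.6, a recall of 0.75, and an F-measure of about 0.667.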

Note: by downloading the test material and submitting results on this material you are thereby declaring that you have not previously seen the test material for the given corpus.

You are also declaring that your testing will be fully automatic. This means that any kind of manual intervention is disallowed, including, but not limited to:

  1. Manual correction of the output of your segmentation.
  2. Prepopulating the dictionary with words derived by a manual inspection of the test corpus.


Results will be provided in two phases: privately to individual participants by May 10, 2003, and then publicly to all participants and to the community at large at the SIGHAN Workshop. By participating in this contest, you are agreeing that the results of the test may be published, including the names of the participants.


By electing to participate in any part of this contest, you are agreeing to provide, by May 25, 2003, a two-page writeup that briefly describes your segmentation system, and a summary of your results. In the closed tests you may describe the technical details of how you came by the particular results. In the open test you must describe the technical details of how you came by the particular results.

The format of the two-page paper must adhere to the style guidelines for ACL 2003, except for the two-page limit.

You should upload your two-page description as <participant_number>.<doctype>, where "doctype" is one of the allowed document formats (e.g. "doc", "pdf"), to the following FTP site: