The following are the training and test data used in the First International Chinese Word Segmentation Bakeoff. Depending on each provider's preference, the data are available either from this site or via a link to the content provider's site.
The organizations named below have made arrangements to share data for the SIGHAN Bakeoff and for ongoing research in segmentation. The data are provided for research purposes only. Those wishing to use these data for commercial purposes must contact the providers of these corpora directly.
| Corpus | Training | Testing | Other link |
| --- | --- | --- | --- |
| Academia Sinica (AS) | -- | -- | |
| UPenn Chinese Treebank (CTB) | -- | -- | -- |
| HK CityU (HK) | -- | -- | -- |
| Peking University (PK) | -- | -- | -- |