DCU-Huawei Chinese-English Dialogue Corpus 1.0


The DCU-Huawei Chinese-English Dialogue Corpus is a parallel corpus in the movie-subtitle domain, annotated with dialogue information and intended for research and development purposes. This work is supported by the Science Foundation Ireland (SFI) ADAPT project (Grant No.: 13/RC/2106), and partly supported by the DCU-Huawei Joint Project (Grant No.: 201504032-A (DCU), YB2015090061 (Huawei)).

This version provides a 100 thousand (100K) English-Chinese aligned corpus extracted from the classic American TV series Friends (seasons 1-10). It also contains speaker tags and scene boundaries, all manually annotated according to the corresponding screenplay scripts.

To generate a larger corpus, we also provide an automatic method that labels speaker tags and scene boundaries by projecting information from the monolingual script onto the bilingual subtitles.
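
As a rough illustration of what such a projection could look like, the sketch below matches each English subtitle line against nearby script utterances and copies over the speaker and scene of the best match. It is not the method shipped with this corpus: the function names (normalize, similarity, project), the tuple layouts, the fuzzy-matching approach based on Python's difflib, and the threshold and window parameters are all assumptions made for this example.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and drop punctuation so matching ignores formatting."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).strip()


def similarity(subtitle, utterance):
    """Score how well a subtitle line matches a script utterance (0.0 to 1.0)."""
    a, b = normalize(subtitle), normalize(utterance)
    if not a or not b:
        return 0.0
    if a in b:
        # A subtitle is often a fragment of one longer script utterance.
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def project(script, subtitles, threshold=0.6, window=5):
    """Copy speaker and scene labels from a script onto subtitle pairs.

    script:    list of (scene_id, speaker, english_text)   -- assumed layout
    subtitles: list of (english_text, chinese_text)        -- assumed layout
    Returns one dict per subtitle; speaker and scene are None when no
    script utterance in the search window scores at least `threshold`.
    """
    labeled = []
    pos = 0  # script position of the last match; the search only moves forward
    for en, zh in subtitles:
        best, best_score = None, 0.0
        for i in range(pos, min(pos + window, len(script))):
            score = similarity(en, script[i][2])
            if score > best_score:
                best, best_score = i, score
        if best is not None and best_score >= threshold:
            scene, speaker = script[best][0], script[best][1]
            pos = best
        else:
            scene, speaker = None, None
        labeled.append({"scene": scene, "speaker": speaker, "en": en, "zh": zh})
    return labeled


# Toy data invented for this example; it is not taken from the corpus.
script = [
    (1, "A", "Hello there. How are you today?"),
    (1, "B", "I am fine, thanks."),
    (2, "A", "Let us go home."),
]
subtitles = [
    ("Hello there.", "你好。"),
    ("How are you today?", "你今天好吗？"),
    ("I'm fine, thanks.", "我很好，谢谢。"),
    ("Let us go home.", "我们回家吧。"),
]

for row in project(script, subtitles):
    print(row["scene"], row["speaker"], row["en"], row["zh"])
```

A scene boundary can then be read off wherever the projected scene value changes between consecutive subtitle lines. The forward-only search window assumes that script and subtitles follow the same order, and lines left unlabeled would still need manual checking.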

