The DCU-Huawei Chinese-English Dialogue Corpus is a parallel corpus in the movie-subtitle domain, annotated with dialogue information and intended for research and development purposes. This work is supported by the Science Foundation Ireland (SFI) ADAPT project (Grant No.: 13/RC/2106), and partly supported by the DCU-Huawei Joint Project (Grant No.: 201504032-A (DCU), YB2015090061 (Huawei)).
This version provides a 100 thousand (100K) English-Chinese aligned corpus extracted from the classic American TV series Friends (seasons 1-10). It also contains speaker tags and scene boundaries, all manually annotated according to the corresponding screenplay scripts.
To generate a larger corpus, we also provide an automatic method that labels speaker tags and scene boundaries by projecting information from monolingual scripts onto bilingual subtitles.
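As a rough illustration of the projection idea (not the method from the paper), the sketch below copies speaker tags from script lines onto subtitle pairs by fuzzy string matching on the English side. The "Speaker: utterance" script format, the function names, the 0.6 threshold, and the toy data are all assumptions made for this example; the real corpus format and matching procedure may differ.

```python
from difflib import SequenceMatcher


def parse_script_line(line):
    """Split a hypothetical 'Speaker: utterance' script line.

    Returns (speaker, text), or (None, text) when no colon is present.
    """
    speaker, sep, text = line.partition(":")
    if not sep:
        return None, line.strip()
    return speaker.strip(), text.strip()


def similarity(a, b):
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def project_speakers(script_lines, subtitle_pairs, threshold=0.6):
    """Attach a speaker tag to each (english, chinese) subtitle pair.

    Each English subtitle is compared with every script utterance; the
    speaker of the best match is used when the score reaches the
    (assumed) threshold, otherwise the tag is None.
    """
    script = [parse_script_line(line) for line in script_lines]
    script = [(s, t) for s, t in script if s is not None]

    labeled = []
    for english, chinese in subtitle_pairs:
        best_speaker, best_score = None, 0.0
        for speaker, text in script:
            score = similarity(english, text)
            if score > best_score:
                best_speaker, best_score = speaker, score
        if best_score < threshold:
            best_speaker = None
        labeled.append((best_speaker, english, chinese))
    return labeled


# Toy data, invented for illustration only.
toy_script = [
    "A: Where are you going?",
    "B: I am going to the store.",
]
toy_subtitles = [
    ("where are you going", "你要去哪里"),
    ("I'm going to the store.", "我要去商店"),
    ("Zzzz qqqq", "无匹配"),
]

for row in project_speakers(toy_script, toy_subtitles):
    print(row)
```

A brute-force comparison like this is quadratic in the number of lines; a practical system would likely restrict the search to a window around the previous match, since scripts and subtitles follow the same order.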
A detailed description is given in this paper:
Longyue Wang, Xiaojun Zhang, Zhaopeng Tu, Andy Way, Qun Liu. (2016). “The Automatic Construction of Discourse Corpus for Dialogue Translation”. To appear in Proceedings of the 10th Language Resources and Evaluation Conference (LREC2016). [pdf] [slides] [bibtex]
This corpus can be used for dialogue machine translation, as described in the following papers:
Longyue Wang, Zhaopeng Tu, Xiaojun Zhang, Hang Li, Andy Way and Qun Liu. (2016). "A Novel Approach for Dropped Pronoun Translation". To appear in Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT2016). [pdf] [bibtex]
Longyue Wang, Xiaojun Zhang, Zhaopeng Tu, Hang Li, Qun Liu. (2016). "Dropped Pronoun Generation for Dialogue Machine Translation". To appear in Proceedings of the IEEE International Conference of Acoustics, Speech and Signal Processing (ICASSP2016). [pdf] [poster] [bibtex]
Please acknowledge the DCU-Huawei Chinese-English Dialogue Corpus with an appropriate citation in any publication or presentation containing research results obtained in whole or in part through its use.
Click here to read the License Agreement.