Building an English-Chinese Domain-Comparable Corpus
- Date
- 01 Apr 06 - 31 May 06
- Sponsor
- British Academy
- Project ID
- SG-42140
- Award
- £7,485
- Keywords
- corpus linguistics, translation
- Internal Code
- CSD7771
Summary
In this project, we will develop a framework, consisting of a methodology and tools, for compiling an English/Chinese comparable corpus for a given domain. In practice, we will test the keyword approach and the feature vector space model for identifying similar texts and controlling their level of similarity. We will use an XML/TEI-conformant framework for marking up the comparable corpus. We will test the LDC (Linguistic Data Consortium) English/Chinese bilingual lexicon for cross-lingual matching of the texts. As test data, we will use the LDC and ProQuest English journalistic corpora available at Lancaster and the Chinese journalistic corpora available at CCID. Finally, we will jointly organise a workshop with CCID in Beijing to exchange research experiences and ideas.
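As an illustration only, the idea of matching texts across languages with a bilingual lexicon and a vector space model might be sketched as follows. This is a minimal sketch under stated assumptions, not the project's tools: the lexicon entries, the function names (`translate_tokens`, `cosine`, `comparable_pairs`) and the threshold value are all hypothetical, the Chinese texts are assumed to be already segmented into words, and plain term frequencies stand in for whatever feature weighting the project may adopt.

```python
import math
from collections import Counter

# Toy bilingual lexicon (hypothetical entries, standing in for a real
# Chinese-to-English lexicon, which would be far larger).
LEXICON = {
    "经济": ["economy", "economic"],
    "增长": ["growth"],
    "银行": ["bank"],
    "足球": ["football"],
    "比赛": ["match", "game"],
}


def translate_tokens(zh_tokens, lexicon):
    """Map pre-segmented Chinese tokens to English; skip tokens not in the lexicon."""
    translated = []
    for token in zh_tokens:
        translated.extend(lexicon.get(token, []))
    return translated


def cosine(tokens_a, tokens_b):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def comparable_pairs(en_docs, zh_docs, lexicon, threshold):
    """Return (en_id, zh_id, score) for pairs scoring at or above the threshold."""
    pairs = []
    for en_id, en_tokens in en_docs.items():
        for zh_id, zh_tokens in zh_docs.items():
            score = cosine(en_tokens, translate_tokens(zh_tokens, lexicon))
            if score >= threshold:
                pairs.append((en_id, zh_id, score))
    # Highest-scoring pairs first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


en_docs = {
    "en1": ["economy", "growth", "bank"],
    "en2": ["football", "match"],
}
zh_docs = {
    "zh1": ["经济", "增长", "银行"],
    "zh2": ["足球", "比赛"],
}

# The threshold (an assumed value) is the knob that controls how similar
# two texts must be before they are paired.
pairs = comparable_pairs(en_docs, zh_docs, LEXICON, threshold=0.5)
for en_id, zh_id, score in pairs:
    print(en_id, zh_id, round(score, 3))
```

In this sketch, raising the threshold keeps only closely matching text pairs, while lowering it admits looser, more broadly comparable ones.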
- Principal Investigator - Lancaster
- Paul Rayson
- Researchers
- Scott Songlin Piao
- Partner
- China Centre for Information Industry Development (CCID)
- Research Theme
- Cooperative & Interactive Systems
