Building an English-Chinese Domain-Comparable Corpus

Date
01 Apr 06 - 31 May 06
Sponsor
British Academy
Project ID
SG-42140
Award
£7,485
Keywords
corpus linguistics, translation
Internal Code
CSD7771

Summary

In this project, we will develop a framework, consisting of a methodology and tools, for compiling an English/Chinese comparable corpus for a given domain. In practice, we will test the keyword approach and the feature vector space model for identifying similar texts and controlling the level of similarity between texts. We will use an XML/TEI-conformant framework to mark up the comparable corpus. We will test the LDC (Linguistic Data Consortium) English/Chinese bilingual lexicon for cross-lingual matching of the texts. As test data, we will use the LDC and ProQuest English journalistic corpora available at Lancaster and the Chinese journalistic corpora available at CCID. Finally, we will jointly organise a workshop with CCID in Beijing to exchange research experiences and ideas.
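
As a rough illustration of the vector space idea mentioned in the summary, the sketch below maps Chinese tokens to English through a bilingual lexicon, builds term-frequency vectors, and compares two texts by cosine similarity. It is a minimal sketch under stated assumptions, not the project's actual method: the toy lexicon, the function names (translate_tokens, cosine_similarity, is_comparable) and the 0.5 threshold are all hypothetical, and texts are assumed to be tokenised already.

```python
import math
from collections import Counter

# Hypothetical toy lexicon standing in for a real bilingual lexicon.
TOY_LEXICON = {
    "经济": ["economy"],
    "市场": ["market"],
    "增长": ["growth"],
}


def translate_tokens(tokens, lexicon):
    """Map each source token to its lexicon translations; drop unknown tokens."""
    translated = []
    for token in tokens:
        translated.extend(lexicon.get(token, []))
    return translated


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def is_comparable(en_tokens, zh_tokens, lexicon, threshold=0.5):
    """Treat two texts as comparable if similarity reaches the (assumed) threshold."""
    score = cosine_similarity(en_tokens, translate_tokens(zh_tokens, lexicon))
    return score >= threshold


en_text = ["economy", "market", "growth"]
zh_text = ["经济", "市场", "增长"]
print(is_comparable(en_text, zh_text, TOY_LEXICON))
```

Raising or lowering the threshold is one simple way to control how similar two texts must be before they are paired; a keyword approach could be sketched the same way by restricting the vectors to a chosen list of domain terms.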

Principal Investigator - Lancaster
Paul Rayson
Researchers
Scott Songlin Piao
Partner
China Centre for Information Industry Development (CCID)
Research Theme
Cooperative & Interactive Systems