Abstract
Parallel corpora are indispensable resources for a variety of multilingual natural language processing tasks. This paper presents a technique for fully automatic construction of constantly growing parallel corpora. We propose a simple and effective dictionary-based algorithm to extract parallel document pairs from a large collection of articles retrieved from the Internet, potentially containing manually translated texts. This algorithm was implemented and tested on Hebrew-English parallel texts. With properly selected thresholds, precision of 100% can be obtained.
Original language | English |
---|---|
Title of host publication | Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010 |
Editors | Daniel Tapias, Irene Russo, Olivier Hamon, Stelios Piperidis, Nicoletta Calzolari, Khalid Choukri, Joseph Mariani, Helene Mazo, Bente Maegaard, Jan Odijk, Mike Rosner |
Publisher | European Language Resources Association (ELRA) |
Pages | 3389-3392 |
Number of pages | 4 |
ISBN (Electronic) | 2951740867, 9782951740860 |
State | Published - 2010 |
Event | 7th International Conference on Language Resources and Evaluation, LREC 2010 - Valletta, Malta. Duration: 17 May 2010 → 23 May 2010 |
Publication series
Name | Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010 |
---|---|
Conference
Conference | 7th International Conference on Language Resources and Evaluation, LREC 2010 |
---|---|
Country/Territory | Malta |
City | Valletta |
Period | 17/05/10 → 23/05/10 |
Bibliographical note
Funding Information: We wish to thank Gennadi Lembersky for his help. This research was supported by the Israel Science Foundation (grants No. 137/06, 1269/07).
ASJC Scopus subject areas
- Education
- Library and Information Sciences
- Linguistics and Language
- Language and Linguistics