Abstract
Parallel corpora are indispensable resources for a variety of multilingual natural language processing tasks. This paper presents a technique for fully automatic construction of constantly growing parallel corpora. We propose a simple and effective dictionary-based algorithm to extract parallel document pairs from a large collection of articles retrieved from the Internet, potentially containing manually translated texts. This algorithm was implemented and tested on Hebrew-English parallel texts. With properly selected thresholds, precision of 100% can be obtained.
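The abstract does not spell out the scoring function, so the following is only a minimal sketch of how a dictionary-based matcher with a threshold might look, not the authors' implementation. The names `coverage` and `find_parallel_pairs`, the dictionary layout (source word mapped to a set of target translations), and the default threshold of 0.8 are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def coverage(src_tokens: List[str], tgt_tokens: List[str],
             dictionary: Dict[str, Set[str]]) -> float:
    """Share of dictionary-covered source tokens that have at least one
    translation present in the target document (hypothetical score)."""
    tgt_vocab = set(tgt_tokens)
    known = [tok for tok in src_tokens if tok in dictionary]
    if not known:
        return 0.0
    hits = sum(1 for tok in known if dictionary[tok] & tgt_vocab)
    return hits / len(known)


def find_parallel_pairs(src_docs: Dict[str, List[str]],
                        tgt_docs: Dict[str, List[str]],
                        dictionary: Dict[str, Set[str]],
                        threshold: float = 0.8) -> List[Tuple[str, str, float]]:
    """For each source document, keep its best-scoring target document,
    but only if the score reaches the (assumed) threshold."""
    pairs = []
    for src_id, src_tokens in src_docs.items():
        best_id, best_score = None, 0.0
        for tgt_id, tgt_tokens in tgt_docs.items():
            score = coverage(src_tokens, tgt_tokens, dictionary)
            if score > best_score:
                best_id, best_score = tgt_id, score
        if best_id is not None and best_score >= threshold:
            pairs.append((src_id, best_id, best_score))
    return pairs
```

Raising `threshold` in a sketch like this trades recall for precision, which is the kind of trade-off the abstract's remark about properly selected thresholds points to.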
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010 |
| Editors | Daniel Tapias, Irene Russo, Olivier Hamon, Stelios Piperidis, Nicoletta Calzolari, Khalid Choukri, Joseph Mariani, Helene Mazo, Bente Maegaard, Jan Odijk, Mike Rosner |
| Publisher | European Language Resources Association (ELRA) |
| Pages | 3389-3392 |
| Number of pages | 4 |
| ISBN (Electronic) | 2951740867, 9782951740860 |
| State | Published - 2010 |
| Event | 7th International Conference on Language Resources and Evaluation, LREC 2010 - Valletta, Malta. Duration: 17 May 2010 → 23 May 2010 |
Publication series
| Name | Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010 |
|---|---|
Conference
| Conference | 7th International Conference on Language Resources and Evaluation, LREC 2010 |
|---|---|
| Country/Territory | Malta |
| City | Valletta |
| Period | 17/05/10 → 23/05/10 |
Bibliographical note
Funding Information: We wish to thank Gennadi Lembersky for his help. This research was supported by the Israel Science Foundation (grants No. 137/06, 1269/07).
ASJC Scopus subject areas
- Education
- Library and Information Sciences
- Linguistics and Language
- Language and Linguistics