Automatic acquisition of parallel corpora from websites with dynamic content

Yulia Tsvetkov, Shuly Wintner

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Parallel corpora are indispensable resources for a variety of multilingual natural language processing tasks. This paper presents a technique for fully automatic construction of constantly growing parallel corpora. We propose a simple and effective dictionary-based algorithm to extract parallel document pairs from a large collection of articles retrieved from the Internet, potentially containing manually translated texts. This algorithm was implemented and tested on Hebrew-English parallel texts. With properly selected thresholds, precision of 100% can be obtained.
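
The abstract does not spell out the algorithm, so the following is only a minimal sketch of what dictionary-based document pairing with a threshold could look like. It is not the authors' implementation. The toy dictionary, the scoring function `overlap_score`, the helper `find_pairs`, the sample documents, and the threshold value of 0.8 are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy Hebrew-to-English dictionary (hypothetical; a real system would load a
# full bilingual lexicon and would normally lemmatize tokens first).
TOY_DICT: Dict[str, Set[str]] = {
    "כלב": {"dog"},
    "בית": {"house", "home"},
    "ספר": {"book"},
    "מים": {"water"},
    "שמש": {"sun"},
}


def overlap_score(src_tokens: List[str], tgt_tokens: List[str],
                  dictionary: Dict[str, Set[str]]) -> float:
    """Fraction of dictionary-covered source tokens that have at least one
    translation present in the target document."""
    tgt = {t.lower() for t in tgt_tokens}
    known = [t for t in src_tokens if t in dictionary]
    if not known:
        return 0.0
    hits = sum(1 for t in known if dictionary[t] & tgt)
    return hits / len(known)


def find_pairs(src_docs: Dict[str, List[str]], tgt_docs: Dict[str, List[str]],
               dictionary: Dict[str, Set[str]],
               threshold: float = 0.8) -> List[Tuple[str, str, float]]:
    """For each source document, keep its best-scoring target document,
    but only if the score reaches the threshold (assumed value)."""
    pairs: List[Tuple[str, str, float]] = []
    for sid, stoks in src_docs.items():
        best_id: Optional[str] = None
        best = 0.0
        for tid, ttoks in tgt_docs.items():
            score = overlap_score(stoks, ttoks, dictionary)
            if score > best:
                best_id, best = tid, score
        if best_id is not None and best >= threshold:
            pairs.append((sid, best_id, best))
    return pairs


# Hypothetical sample collections, already tokenized.
SRC_DOCS: Dict[str, List[str]] = {
    "he1": ["כלב", "בית", "מים"],
    "he2": ["ספר", "שמש"],
}
TGT_DOCS: Dict[str, List[str]] = {
    "en1": "the dog drank water at home".split(),
    "en2": "a book about the sun".split(),
    "en3": "unrelated text".split(),
}

PAIRS = find_pairs(SRC_DOCS, TGT_DOCS, TOY_DICT)
```

In this sketch, raising `threshold` rejects more candidate pairs, trading recall for precision; this is one way a threshold choice could push precision toward the level the abstract reports.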

Original language: English
Title of host publication: Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010
Editors: Daniel Tapias, Irene Russo, Olivier Hamon, Stelios Piperidis, Nicoletta Calzolari, Khalid Choukri, Joseph Mariani, Helene Mazo, Bente Maegaard, Jan Odijk, Mike Rosner
Publisher: European Language Resources Association (ELRA)
Pages: 3389-3392
Number of pages: 4
ISBN (Electronic): 2951740867, 9782951740860
State: Published - 2010
Event: 7th International Conference on Language Resources and Evaluation, LREC 2010 - Valletta, Malta
Duration: 17 May 2010 - 23 May 2010

Publication series

Name: Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010

Conference

Conference: 7th International Conference on Language Resources and Evaluation, LREC 2010
Country/Territory: Malta
City: Valletta
Period: 17/05/10 - 23/05/10

Bibliographical note

Funding Information:
We wish to thank Gennadi Lembersky for his help. This research was supported by the Israel Science Foundation (grants No. 137/06, 1269/07).

ASJC Scopus subject areas

  • Education
  • Library and Information Sciences
  • Linguistics and Language
  • Language and Linguistics
