Extraction of multi-word expressions from small parallel corpora

Yulia Tsvetkov, Shuly Wintner

Research output: Contribution to journal, Article, peer-review

Abstract

We present a general, novel methodology for extracting multi-word expressions (MWEs) of various types, along with their translations, from small, word-aligned parallel corpora. Unlike existing approaches, we focus on misalignments; these typically indicate expressions in the source language that are translated to the target in a non-compositional way. We introduce a simple algorithm that proposes MWE candidates based on such misalignments, relying on 1:1 alignments as anchors that delimit the search space. We use a large monolingual corpus to rank and filter these candidates. Evaluation of the quality of the extraction algorithm reveals significant improvements over naïve alignment-based methods. The extracted MWEs, with their translations, are used in the training of a statistical machine translation system, showing a small but significant improvement in its performance.
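
The anchor idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `propose_candidates`, the representation of alignments as (source index, target index) pairs, the rule that a candidate is a run of at least two consecutive non-anchor source tokens, and the toy sentence pair are all assumptions made for illustration. The ranking and filtering step that uses a monolingual corpus is not shown.

```python
from collections import defaultdict


def propose_candidates(src, tgt, links):
    """Propose (source phrase, target phrase) pairs from misaligned regions.

    Assumption: a source token is an "anchor" when it is linked to exactly one
    target token and that target token is linked to no other source token.
    Maximal runs of two or more consecutive non-anchor source tokens are
    returned as candidates, paired with the target tokens linked to them.
    """
    src_to_tgt = defaultdict(set)
    tgt_to_src = defaultdict(set)
    for i, j in links:
        src_to_tgt[i].add(j)
        tgt_to_src[j].add(i)

    def is_anchor(i):
        targets = src_to_tgt.get(i, set())
        if len(targets) != 1:
            return False
        (j,) = targets
        return len(tgt_to_src[j]) == 1

    candidates = []
    run = []
    # Iterate one position past the end so the final run is flushed.
    for i in range(len(src) + 1):
        if i < len(src) and not is_anchor(i):
            run.append(i)
            continue
        if len(run) >= 2:
            tgt_idx = sorted({j for k in run for j in src_to_tgt.get(k, ())})
            candidates.append(
                (" ".join(src[k] for k in run), " ".join(tgt[j] for j in tgt_idx))
            )
        run = []
    return candidates


# Invented toy data: "he" and "yesterday" are 1:1 anchors; the idiom is not.
src = ["he", "kicked", "the", "bucket", "yesterday"]
tgt = ["il", "est", "mort", "hier"]
links = {(0, 0), (1, 1), (1, 2), (3, 2), (4, 3)}
print(propose_candidates(src, tgt, links))
# [('kicked the bucket', 'est mort')]
```

In this sketch the anchors only split the source sentence into regions. How the search space is delimited on the target side, and how candidates are scored, is left open here.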

Original language: English
Pages (from-to): 549-573
Number of pages: 25
Journal: Natural Language Engineering
Volume: 18
Issue number: 4
State: Published - Oct 2012

ASJC Scopus subject areas

  • Software
  • Language and Linguistics
  • Linguistics and Language
  • Artificial Intelligence
