There's no Data Like Better Data: Using QE Metrics for MT Data Filtering

Jan Thorsten Peter, David Vilar, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, Markus Freitag

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Quality Estimation (QE), the evaluation of machine translation output without the need for explicit references, has improved substantially in recent years with the use of neural metrics. In this paper we analyze the viability of using QE metrics to filter out low-quality sentence pairs from the training data of neural machine translation (NMT) systems. While most corpus filtering methods focus on detecting noisy examples in collections of texts, usually huge amounts of web-crawled data, QE models are trained to discriminate more fine-grained quality differences. We show that by selecting the highest-quality sentence pairs in the training data, we can improve translation quality while reducing the training size by half. We also provide a detailed analysis of the filtering results, which highlights the differences between the two approaches.
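
As a rough illustration of the selection step the abstract describes, the sketch below keeps the highest-scoring fraction of a parallel corpus. It is not the authors' implementation: the function name `filter_by_qe`, the `keep_fraction` parameter, and the assumption that one QE score per sentence pair has already been computed (higher meaning better) are all hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple


def filter_by_qe(
    pairs: Sequence[Tuple[str, str]],
    scores: Sequence[float],
    keep_fraction: float = 0.5,
) -> List[Tuple[str, str]]:
    """Keep the top `keep_fraction` of sentence pairs ranked by QE score.

    Assumption: `scores[i]` is a precomputed QE score for `pairs[i]`,
    with higher values indicating better translations.
    """
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not pairs:
        return []

    # Number of pairs to keep; always keep at least one.
    n_keep = max(1, int(len(pairs) * keep_fraction))

    # Rank indices from highest to lowest score.
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)

    # Restore the original corpus order among the kept pairs.
    kept = sorted(ranked[:n_keep])
    return [pairs[i] for i in kept]


corpus = [
    ("Guten Morgen", "Good morning"),
    ("Danke", "Click here to subscribe"),
    ("Wie geht es dir?", "How are you?"),
    ("Das ist ein Haus", "This is a mouse"),
]
qe_scores = [0.9, 0.1, 0.7, 0.3]  # made-up scores for illustration only
filtered = filter_by_qe(corpus, qe_scores, keep_fraction=0.5)
```

With `keep_fraction=0.5` the corpus is cut to half its size, mirroring the reduction mentioned in the abstract; the scores here are invented, and in practice they would come from whichever QE metric is being evaluated.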

Original language: English
Title of host publication: Proceedings of the 8th Conference on Machine Translation, WMT 2023
Publisher: Association for Computational Linguistics
Pages: 559-575
Number of pages: 17
ISBN (Electronic): 9798891760417
State: Published - 2023
Externally published: Yes
Event: 8th Conference on Machine Translation, WMT 2023 - Singapore, Singapore
Duration: 6 Dec 2023 to 7 Dec 2023

Publication series

Name: Conference on Machine Translation - Proceedings
ISSN (Electronic): 2768-0983

Conference

Conference: 8th Conference on Machine Translation, WMT 2023
Country/Territory: Singapore
City: Singapore
Period: 6/12/23 to 7/12/23

Bibliographical note

Publisher Copyright:
© 2023 Association for Computational Linguistics.

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Software
