Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy

Brian Thompson, Nitika Mathur, Daniel Deutsch, Huda Khayrallah

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Selecting an automatic metric that best emulates human annotators is often non-trivial, because there is no clear definition of “best emulates.” A meta-metric is required to compare the human judgments to the automatic metric scores, and metric rankings depend on the choice of meta-metric. We propose Soft Pairwise Accuracy (SPA), a new meta-metric that builds on Pairwise Accuracy (PA) but incorporates the statistical significance of both the human judgments and the metric scores. We show that SPA is more stable than PA with respect to changes in the number of systems/segments used for evaluation. We also show that PA can only assign a small set of distinct output values to metrics, and this results in many metrics being artificially assigned the exact same PA score. We demonstrate that SPA fixes this issue. Finally, we show that SPA is more discriminative than PA, producing more statistically significant comparisons between metrics. SPA was selected as the official system-level metric for the 2024 WMT Metrics Shared Task.
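To make the contrast between PA and SPA concrete, here is a minimal sketch in Python. The abstract does not give formulas, so the definitions below are assumptions: PA is taken as the fraction of system pairs that the metric orders the same way as the human scores, and SPA is taken as the mean over system pairs of 1 - |p_human - p_metric|, where each p-value measures how significantly one system beats the other. The function names, the example scores, and the p-values are all hypothetical. The p-values are assumed to be precomputed (for instance with a paired permutation test).

```python
from itertools import combinations


def pairwise_accuracy(human, metric):
    """Fraction of system pairs ordered the same way by humans and by the metric.

    `human` and `metric` are lists of system-level scores (higher is better).
    Ties are not treated specially in this sketch.
    """
    pairs = list(combinations(range(len(human)), 2))
    agree = sum(
        (human[i] > human[j]) == (metric[i] > metric[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def soft_pairwise_accuracy(p_human, p_metric):
    """Assumed SPA form: mean over pairs of 1 - |p_human - p_metric|.

    Both arguments map a system pair (i, j) to a p-value in [0, 1].
    """
    pairs = list(p_human)
    return sum(1.0 - abs(p_human[k] - p_metric[k]) for k in pairs) / len(pairs)


# Hypothetical system-level scores for three systems.
human = [0.80, 0.70, 0.69]
metric = [0.90, 0.60, 0.61]
pa = pairwise_accuracy(human, metric)

# Hypothetical precomputed p-values for each system pair.
p_human = {(0, 1): 0.01, (0, 2): 0.01, (1, 2): 0.45}
p_metric = {(0, 1): 0.02, (0, 2): 0.02, (1, 2): 0.55}
spa = soft_pairwise_accuracy(p_human, p_metric)

# With N systems, PA is a count of agreeing pairs divided by N*(N-1)/2,
# so it can take at most N*(N-1)/2 + 1 distinct values.
n = len(human)
max_distinct_pa_values = n * (n - 1) // 2 + 1
```

In this toy case the metric swaps systems 1 and 2, which the human scores barely separate. PA charges a full pair for that swap, while the assumed SPA form loses little because both p-values indicate a near tie. The last lines illustrate why PA is coarse: with three systems it can only take four distinct values.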

Original language: English
Title of host publication: WMT 2024 - 9th Conference on Machine Translation, Proceedings of the Conference
Editors: Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Publisher: Association for Computational Linguistics
Pages: 1222-1234
Number of pages: 13
ISBN (Electronic): 9798891761797
State: Published - 2024
Externally published: Yes
Event: 9th Conference on Machine Translation, WMT 2024 - Miami, United States
Duration: 15 Nov 2024 to 16 Nov 2024

Publication series

Name: Conference on Machine Translation - Proceedings
Volume: 2024-November
ISSN (Electronic): 2768-0983

Conference

Conference: 9th Conference on Machine Translation, WMT 2024
Country/Territory: United States
City: Miami
Period: 15/11/24 to 16/11/24

Bibliographical note

Publisher Copyright:
©2024 Association for Computational Linguistics.

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Software
