Abstract
Human evaluation is crucial for assessing rapidly evolving language models but is influenced by annotator proficiency and task design. This study explores the integration of comparative judgment into human annotation for machine translation (MT) and evaluates three annotation setups: point-wise Multidimensional Quality Metrics (MQM), side-by-side (S×S) MQM, and its simplified version, S×S relative ranking (RR). In MQM, annotators mark error spans with categories and severity levels. S×S MQM extends MQM to pairwise error annotation for two translations of the same input, while S×S RR focuses on selecting the better output without labeling errors. Key findings are: (1) the S×S settings achieve higher inter-annotator agreement than MQM; (2) S×S MQM enhances inter-translation error marking consistency compared to MQM by, on average, 38.5% for explicitly compared MT systems and 19.5% for others; (3) all annotation settings return stable system rankings, with S×S RR offering a more efficient alternative to (S×S) MQM; (4) the S×S settings highlight subtle errors overlooked in MQM without altering absolute system evaluations. To spur further research, we release the triply annotated datasets comprising 377 ZhEn and 104 EnDe annotation examples, each covering 10 systems.
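To make the three setups described in the abstract concrete, the following is a minimal Python sketch of how their annotation records might be represented. It is not the authors' data format or released schema: the class names, field names, category label, severity weights, and the option of a "tie" preference are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: illustrative severity weights. The abstract does not state
# a weighting scheme, so these values are placeholders.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}


@dataclass
class ErrorSpan:
    start: int     # character offset where the marked span begins
    end: int       # character offset where the marked span ends (exclusive)
    category: str  # hypothetical label, e.g. "accuracy/mistranslation"
    severity: str  # key into SEVERITY_WEIGHTS


@dataclass
class MQMAnnotation:
    """Point-wise MQM: one translation, annotated on its own."""
    translation: str
    errors: List[ErrorSpan] = field(default_factory=list)

    def penalty(self) -> float:
        # Sum of severity weights over all marked error spans.
        return sum(SEVERITY_WEIGHTS[e.severity] for e in self.errors)


@dataclass
class SxSMQMAnnotation:
    """S×S MQM: two translations of the same input, error-annotated together."""
    source: str
    a: MQMAnnotation
    b: MQMAnnotation

    def preference(self) -> str:
        # Derive a pairwise preference from the two penalties (lower is better).
        pa, pb = self.a.penalty(), self.b.penalty()
        if pa == pb:
            return "tie"
        return "a" if pa < pb else "b"


@dataclass
class SxSRRAnnotation:
    """S×S RR: only the better output is selected; no error spans are labeled."""
    source: str
    translation_a: str
    translation_b: str
    preference: str  # "a", "b", or (assumed here) "tie"
```

Under these assumptions, an S×S MQM record carries strictly more information than an S×S RR record: a preference can be derived from its error spans, whereas S×S RR stores only the preference itself, which is consistent with the abstract describing RR as the simplified, more efficient variant.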
| Original language | English |
|---|---|
| Title of host publication | Long Papers |
| Editors | Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 20536-20551 |
| Number of pages | 16 |
| ISBN (Electronic) | 9798891762510 |
| State | Published - 2025 |
| Externally published | Yes |
| Event | 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 - Vienna, Austria; Duration: 27 Jul 2025 → 1 Aug 2025 |
Publication series
| Name | Proceedings of the Annual Meeting of the Association for Computational Linguistics |
|---|---|
| Volume | 1 |
| ISSN (Print) | 0736-587X |
Conference
| Conference | 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 |
|---|---|
| Country/Territory | Austria |
| City | Vienna |
| Period | 27/07/25 → 1/08/25 |
Bibliographical note
Publisher Copyright: © 2025 Association for Computational Linguistics.
ASJC Scopus subject areas
- Language and Linguistics
- Linguistics and Language
- Computer Science Applications
Fingerprint
Dive into the research topics of 'Enhancing Human Evaluation in Machine Translation with Comparative Judgment'. Together they form a unique fingerprint.