Abstract
Question answering (QA)-based summarization evaluation metrics must automatically determine whether the QA model's prediction is correct, a task known as answer verification. In this work, we benchmark the lexical answer verification methods that have been used by current QA-based metrics, as well as two more sophisticated text comparison methods, BERTScore and LERC. We find that LERC outperforms the other methods in some settings while remaining statistically indistinguishable from lexical overlap in others. However, our experiments reveal that improved verification performance does not necessarily translate to better overall QA-based metric quality: in some scenarios, using a worse verification method, or using none at all, performs comparably to using the best verification method, a result we attribute to properties of the datasets.
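The lexical verification baseline the abstract refers to is typically a SQuAD-style token-overlap comparison between the QA model's prediction and the gold answer. Below is a minimal sketch of that idea, assuming whitespace tokenization, article/punctuation stripping, and an illustrative F1 threshold; the function names and threshold are not taken from the paper's implementation.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and articles, and split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 between a QA prediction and a gold answer."""
    pred_tokens, ref_tokens = normalize(prediction), normalize(reference)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def is_correct(prediction: str, reference: str, threshold: float = 0.5) -> bool:
    """Verify an answer by thresholding lexical overlap (threshold is illustrative)."""
    return token_f1(prediction, reference) >= threshold


# A paraphrased prediction gets partial credit from token overlap,
# which is exactly the failure mode learned verifiers like LERC target.
print(token_f1("the United States president", "president of the United States"))
```

Learned verifiers such as BERTScore or LERC replace the thresholded overlap score with a model-based similarity or correctness judgment, but plug into the same prediction-versus-gold comparison step.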
Original language | English |
---|---|
Title of host publication | ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, Findings of ACL 2022 |
Editors | Smaranda Muresan, Preslav Nakov, Aline Villavicencio |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 3759-3765 |
Number of pages | 7 |
ISBN (Electronic) | 9781955917254 |
DOIs | |
State | Published - 2022 |
Externally published | Yes |
Event | 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022 - Dublin, Ireland. Duration: 22 May 2022 → 27 May 2022 |
Publication series
Name | Proceedings of the Annual Meeting of the Association for Computational Linguistics |
---|---|
ISSN (Print) | 0736-587X |
Conference
Conference | 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022 |
---|---|
Country/Territory | Ireland |
City | Dublin |
Period | 22/05/22 → 27/05/22 |
Bibliographical note
Publisher Copyright: © 2022 Association for Computational Linguistics.
ASJC Scopus subject areas
- Computer Science Applications
- Linguistics and Language
- Language and Linguistics