Abstract
A desirable property of a reference-based evaluation metric that measures the content quality of a summary is that it should estimate how much information that summary has in common with a reference. Traditional text-overlap-based metrics such as ROUGE fail to achieve this because they are limited to matching tokens, either lexically or via embeddings. In this work, we propose a metric to evaluate the content quality of a summary using question-answering (QA). QA-based methods directly measure a summary’s information overlap with a reference, making them fundamentally different from text overlap metrics. We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval. QAEval outperforms current state-of-the-art metrics on most evaluations using benchmark datasets, and is competitive on the others, where it is held back by limitations of state-of-the-art models. Through a careful analysis of each component of QAEval, we identify its performance bottlenecks and estimate that its potential upper-bound performance surpasses all other automatic metrics, approaching that of the gold-standard Pyramid Method.
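The general idea of scoring a summary by answering questions can be illustrated with a minimal sketch. This is not QAEval's implementation: the question-answer pairs are assumed to have been derived from the reference already, the QA model is replaced by a hypothetical lookup stub (`make_toy_answerer`), and token-level F1 is assumed as the answer-scoring function. All function names here are illustrative.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Score 1.0 only if both answers are empty, otherwise 0.0.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def qa_content_score(qa_pairs, answer_fn, summary: str) -> float:
    """Average answer F1 over questions derived from the reference.

    qa_pairs:  list of (question, gold_answer) tuples (assumed given).
    answer_fn: callable (question, summary) -> predicted answer string.
    """
    if not qa_pairs:
        return 0.0
    scores = [token_f1(answer_fn(q, summary), gold) for q, gold in qa_pairs]
    return sum(scores) / len(scores)


def make_toy_answerer(known_answers):
    """Hypothetical stand-in for a trained QA model.

    Returns the known answer only if it literally appears in the summary,
    and an empty string (unanswerable) otherwise.
    """
    def answer(question: str, summary: str) -> str:
        candidate = known_answers.get(question, "")
        return candidate if candidate.lower() in summary.lower() else ""
    return answer


# Question-answer pairs assumed to come from the reference summary.
qa_pairs = [
    ("Who won the election?", "Smith"),
    ("When was the vote held?", "Tuesday"),
]
answerer = make_toy_answerer(dict(qa_pairs))
score = qa_content_score(qa_pairs, answerer, "Smith won the election.")
```

In this toy case the candidate summary supports the first question but not the second, so half of the reference's questions are answerable from it. A real system would replace the stub with a trained QA model and an automatic question generator.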
| Original language | English |
|---|---|
| Pages (from-to) | 774-789 |
| Number of pages | 16 |
| Journal | Transactions of the Association for Computational Linguistics |
| Volume | 9 |
| State | Published - 2 Aug 2021 |
| Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2021, MIT Press Journals. All rights reserved.
ASJC Scopus subject areas
- Communication
- Human-Computer Interaction
- Linguistics and Language
- Computer Science Applications
- Artificial Intelligence
Title
Towards question-answering as an automatic metric for evaluating the content quality of a summary