Abstract
Recent advances in LLMs have led to an abundance of evaluation benchmarks, which typically rely on a single instruction template per task. We create a large-scale collection of instruction paraphrases and comprehensively analyze the brittleness introduced by single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. We find that different instruction templates lead to very different performance, both absolute and relative. Instead, we propose a set of diverse metrics on multiple instruction paraphrases, specifically tailored for different use cases (e.g., LLM vs. downstream development), ensuring a more reliable and meaningful assessment of LLM capabilities. We show that our metrics provide new insights into the strengths and limitations of current LLMs.
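To illustrate the general idea of scoring a model over several instruction paraphrases rather than a single template, the sketch below aggregates per-template accuracy into a few summary statistics (mean, spread, best/worst prompt). This is only a minimal illustration under assumed inputs: the `evaluate` callable and the specific aggregates are hypothetical and are not the paper's proposed metrics.

```python
# Minimal sketch of multi-prompt evaluation (illustrative only).
# Assumes a hypothetical `evaluate(model, template, examples) -> float`
# that returns task accuracy for one instruction paraphrase.
from statistics import mean, stdev


def multi_prompt_scores(model, templates, examples, evaluate):
    """Score one model under every instruction paraphrase of a task."""
    per_template = {t: evaluate(model, t, examples) for t in templates}
    scores = list(per_template.values())
    return {
        "mean_accuracy": mean(scores),                        # average over paraphrases
        "spread": stdev(scores) if len(scores) > 1 else 0.0,  # sensitivity to prompt wording
        "best_prompt_accuracy": max(scores),                  # what careful prompt selection could reach
        "worst_prompt_accuracy": min(scores),                 # robustness lower bound
        "per_template": per_template,
    }
```

Reporting a spread alongside the mean makes the brittleness of single-prompt evaluation visible: two models with similar single-template scores can differ sharply in how much their performance moves across paraphrases.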
| Original language | English |
|---|---|
| Pages (from-to) | 933-949 |
| Number of pages | 17 |
| Journal | Transactions of the Association for Computational Linguistics |
| Volume | 12 |
| DOIs | |
| State | Published - 2024 |
Bibliographical note
Publisher Copyright: © 2024 Association for Computational Linguistics.
ASJC Scopus subject areas
- Communication
- Human-Computer Interaction
- Linguistics and Language
- Computer Science Applications
- Artificial Intelligence