State of What Art? A Call for Multi-Prompt LLM Evaluation

Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, Gabriel Stanovsky

Research output: Contribution to journal › Article › peer-review

Abstract

Recent advances in LLMs have led to an abundance of evaluation benchmarks, which typically rely on a single instruction template per task. We create a large-scale collection of instruction paraphrases and comprehensively analyze the brittleness introduced by single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. We find that different instruction templates lead to very different performance, both absolute and relative. Instead, we propose a set of diverse metrics on multiple instruction paraphrases, specifically tailored for different use cases (e.g., LLM vs. downstream development), ensuring a more reliable and meaningful assessment of LLM capabilities. We show that our metrics provide new insights into the strengths and limitations of current LLMs.
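The abstract sketches an evaluation protocol: score each model under many instruction paraphrases of the same task and report use-case-specific aggregates rather than a single-prompt number. A minimal sketch of that idea in Python follows; the aggregate names below (average, max, min, spread) are illustrative assumptions, not the paper's exact metric definitions.

```python
# Illustrative sketch of multi-prompt evaluation (not the paper's code).
# Given per-template accuracies for one model on one task, report several
# aggregate views instead of a single-prompt score.
from statistics import mean


def multi_prompt_summary(template_scores: dict[str, float]) -> dict[str, float]:
    """Aggregate a model's accuracy across instruction paraphrases.

    template_scores maps each instruction template (paraphrase) to the
    model's accuracy when evaluated with that template.
    """
    scores = list(template_scores.values())
    return {
        "average": mean(scores),              # expected performance over prompts
        "max": max(scores),                   # best case, e.g. for downstream use
        "min": min(scores),                   # worst case / robustness lower bound
        "spread": max(scores) - min(scores),  # brittleness to prompt wording
    }


# Hypothetical accuracies of one model on one task under four paraphrases.
scores = {"template_1": 0.71, "template_2": 0.64, "template_3": 0.69, "template_4": 0.55}
print(multi_prompt_summary(scores))
```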

Original language: English
Pages (from-to): 933-949
Number of pages: 17
Journal: Transactions of the Association for Computational Linguistics
Volume: 12
DOIs
State: Published - 2024

Bibliographical note

Publisher Copyright:
© 2024 Association for Computational Linguistics.

ASJC Scopus subject areas

  • Communication
  • Human-Computer Interaction
  • Linguistics and Language
  • Computer Science Applications
  • Artificial Intelligence
