Probing Multimodal LLMs as World Models for Driving

  • Shiva Sreeram
  • Tsun Hsuan Wang
  • Alaa Maalouf
  • Guy Rosman
  • Sertac Karaman
  • Daniela Rus

Research output: Contribution to journal › Article › peer-review

Abstract

We provide a sober look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving, challenging common assumptions about their ability to interpret dynamic driving scenarios. Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored. Our experimental study assesses various MLLMs as world models using in-car camera perspectives and reveals that while these models excel at interpreting individual images, they struggle to synthesize coherent narratives across frames, leading to considerable inaccuracies in understanding (i) ego vehicle dynamics, (ii) interactions with other road actors, (iii) trajectory planning, and (iv) open-set scene reasoning. We introduce the Eval-LLM-Drive dataset and DriveSim simulator to enhance our evaluation, highlighting gaps in current MLLM capabilities and the need for improved models in dynamic real-world environments.

Original language: English
Pages (from-to): 11403-11410
Number of pages: 8
Journal: IEEE Robotics and Automation Letters
Volume: 10
Issue number: 11
State: Published - 2025

Bibliographical note

Publisher Copyright:
© 2016 IEEE.

Keywords

  • Performance evaluation and benchmarking
  • autonomous vehicle navigation
  • data sets for robotic vision

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Biomedical Engineering
  • Human-Computer Interaction
  • Mechanical Engineering
  • Computer Vision and Pattern Recognition
  • Computer Science Applications
  • Control and Optimization
  • Artificial Intelligence
