Documents and queries as random variables: History and implications

David Bodoff, Samuel Po Shing Wong

Research output: Contribution to journal › Review article › peer-review


The view of documents and/or queries as random variables is gaining importance in the theory of information retrieval. We argue that traditional probabilistic models consider documents and queries as random variables, but that newer models such as language modeling and our unified model take this one step further. The additional step is called error in predictors. Such models consider that we don't observe the document and query random variables that are modeled to predict relevance probabilistically. Rather, there are additional random variables, which are the observed documents and queries. We discuss some important implications of this idea for parameter estimation, relevance prediction, and even test-collection construction. By clarifying the positions of various probabilistic models on this question, and presenting in one place many of its implications, this article aims to deepen our common understanding of the theories behind traditional probabilistic models, and to strengthen the theoretical basis for further development of more recent approaches such as language modeling.

Original language: English
Pages (from-to): 1138-1154
Number of pages: 17
Journal: Journal of the American Society for Information Science and Technology
Issue number: 9
State: Published - Jul 2006

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Human-Computer Interaction
  • Computer Networks and Communications
  • Artificial Intelligence

