Enhancing information retrieval through statistical natural language processing: A study of collocation indexing

Ofer Arazy, Carson Woo

Research output: Contribution to journal › Article › peer-review


Although the management of information assets (specifically, of text documents, which make up 80 percent of these assets) can provide organizations with a competitive advantage, the ability of information retrieval (IR) systems to deliver relevant information to users is severely hampered by the difficulty of disambiguating natural language. The word ambiguity problem is addressed with moderate success in restricted settings, but it continues to be the main challenge for general settings, which are characterized by large, heterogeneous document collections. In this paper, we provide preliminary evidence for the usefulness of statistical natural language processing (NLP) techniques, and specifically of collocation indexing, for IR in general settings. We investigate the effect of three key parameters on collocation indexing performance: directionality, distance, and weighting. We build on previous work in IR to (1) advance our knowledge of key design elements for collocation indexing, (2) demonstrate gains in retrieval precision from the use of statistical NLP for general-settings IR, and, finally, (3) provide practitioners with a useful cost-benefit analysis of the methods under investigation.
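As a rough illustration of two of the parameters named in the abstract, the sketch below counts word pairs that co-occur within a fixed token distance, either respecting word order (directional) or ignoring it. It is a minimal example under stated assumptions, not the authors' method: the function name `extract_collocations`, its parameters, and the use of raw pair counts as a stand-in for weighting are all hypothetical.

```python
from collections import Counter


def extract_collocations(tokens, max_distance=2, directional=True):
    """Count word pairs that co-occur within `max_distance` tokens.

    Hypothetical helper for illustration only.
    - max_distance: how many positions to the right a partner word may be.
    - directional: if True, (a, b) and (b, a) are distinct pairs;
      if False, word order is ignored and both map to one sorted pair.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look only to the right, so each co-occurrence is counted once.
        for j in range(i + 1, min(i + 1 + max_distance, len(tokens))):
            right = tokens[j]
            pair = (left, right) if directional else tuple(sorted((left, right)))
            counts[pair] += 1
    return counts


tokens = ["bank", "account", "bank"]
directional_counts = extract_collocations(tokens, max_distance=1, directional=True)
undirected_counts = extract_collocations(tokens, max_distance=1, directional=False)
```

With order respected, "bank account" and "account bank" are separate index terms; with order ignored, they collapse into one pair with a higher count. A real index would replace the raw counts with a weighting scheme, which this sketch does not attempt to model.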

Original language: English
Pages (from-to): 525-546
Number of pages: 22
Journal: MIS Quarterly: Management Information Systems
Issue number: 3
State: Published - Sep 2007
Externally published: Yes


Keywords

  • Collocations
  • Directionality
  • Distance
  • Document management
  • General settings
  • Information retrieval (IR)
  • Natural language processing (NLP)
  • Weighting
  • Word ambiguity

ASJC Scopus subject areas

  • Management Information Systems
  • Information Systems
  • Computer Science Applications
  • Information Systems and Management

