The Hebrew CHILDES corpus: Transcription and morphological analysis

Aviad Albert, Brian MacWhinney, Bracha Nir, Shuly Wintner

Research output: Contribution to journal › Article › peer-review

Abstract

We present a corpus of transcribed spoken Hebrew that reflects spoken interactions between children and adults. The corpus is an integral part of the CHILDES database, which distributes similar corpora for over 25 languages. We introduce a dedicated transcription scheme for the spoken Hebrew data that is sensitive to both the phonology and the standard orthography of the language. We also introduce a morphological analyzer that was specifically developed for this corpus. The analyzer adequately covers the entire corpus, producing detailed correct analyses for all tokens. Evaluation on a new corpus reveals high coverage as well. Finally, we describe a morphological disambiguation module that selects the correct analysis of each token in context. The result is a high-quality morphologically-annotated CHILDES corpus of Hebrew, along with a set of tools that can be applied to new corpora.

Original language: English
Pages (from-to): 973-1005
Number of pages: 33
Journal: Language Resources and Evaluation
Volume: 47
Issue number: 4
State: Published - Dec 2013

Bibliographical note

Funding Information:
Acknowledgments: This research was supported by Grant No. 2007241 from the United States-Israel Binational Science Foundation (BSF). We are grateful to Hadass Zaidenberg, Maayan Bloch and Ezer Rasin for their meticulous lexicographic work, to Arnon Lazerson for developing the conversion script, and to Shai Gretz for helping with the manual annotation.

Keywords

  • CHILDES
  • Hebrew
  • Morphological analysis
  • Morphological disambiguation
  • Transcription of spoken language

ASJC Scopus subject areas

  • Language and Linguistics
  • Education
  • Linguistics and Language
  • Library and Information Sciences
