The Denglisch Corpus of German-English Code-Switching

Doreen Osmelak, Shuly Wintner

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

When multilingual speakers engage in a conversation, they inevitably introduce code-switching (CS), i.e., the mixing of more than one language between and within utterances. CS is still an understudied phenomenon, especially in the written medium, and relatively few computational resources for studying it are available. We describe a corpus of German-English code-switching in social media interactions. We focus on some challenges in annotating CS, especially those due to words whose language ID cannot be easily determined. We introduce a novel schema for such word-level annotation, with which we manually annotated a subset of the corpus. We then trained classifiers to predict and identify switches, and applied them to the remainder of the corpus. Thereby, we created a large-scale corpus of German-English mixed utterances with precise indications of CS points.
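
To make the notion of "CS points" concrete, here is a minimal sketch of how switch points could be read off a sequence of word-level language tags. It is an illustration under stated assumptions, not the authors' method: the tag names (`DE`, `EN`, `OTHER`), the function `switch_points`, and the example sentence are all hypothetical, and the corpus defines its own annotation schema.

```python
from typing import List

# Hypothetical tag set, for illustration only; the corpus uses its own schema.
LANG_TAGS = {"DE", "EN"}


def switch_points(tags: List[str]) -> List[int]:
    """Return each index whose language tag differs from the most recent
    language-tagged token before it. Tokens with any other tag are skipped."""
    points = []
    prev = None
    for i, tag in enumerate(tags):
        if tag not in LANG_TAGS:
            continue  # punctuation or tokens without a clear language
        if prev is not None and tag != prev:
            points.append(i)
        prev = tag
    return points


# Invented example utterance with invented word-level tags.
tokens = ["Das", "ist", "ein", "nice", "feature", ",", "oder", "?"]
tags = ["DE", "DE", "DE", "EN", "EN", "OTHER", "DE", "OTHER"]
print(switch_points(tags))  # [3, 6]
```

Skipping tokens outside the two language tags is a simplifying choice in this sketch; how ambiguous words are handled is precisely the annotation challenge the abstract describes.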

Original language: English
Title of host publication: SIGTYP 2023 - 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Proceedings of the Workshop
Editors: Lisa Beinborn, Koustava Goswami, Saliha Muradoglu, Alexey Sorokin, Ritesh Kumar, Andreas Shcherbakov, Edoardo M. Ponti, Ryan Cotterell, Ekaterina Vylomova
Publisher: Association for Computational Linguistics
Pages: 42-51
Number of pages: 10
ISBN (Electronic): 9781959429562
State: Published - 2023
Event: 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, SIGTYP 2023, co-located with the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023 - Hybrid, Dubrovnik, Croatia
Duration: 6 May 2023 → …

Publication series

Name: SIGTYP 2023 - 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, Proceedings of the Workshop

Conference

Conference: 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, SIGTYP 2023, co-located with the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023
Country/Territory: Croatia
City: Hybrid, Dubrovnik
Period: 6/05/23 → …

Bibliographical note

Publisher Copyright:
© 2023 Association for Computational Linguistics.

ASJC Scopus subject areas

  • Language and Linguistics
  • Artificial Intelligence
  • Human-Computer Interaction
  • Linguistics and Language
