Abstract
We describe a corpus of social media posts that include utterances in Arabizi, a Roman-script rendering of Arabic, mixed with other languages, notably English, French, and Arabic written in the Arabic script. We manually annotated a subset of the texts with word-level language IDs; this is a non-trivial task due to the nature of mixed-language writing, especially on social media. We developed classifiers that can accurately predict the language ID tags. Then, we extended the word-level predictions to identify sentences that include Arabizi (and code-switching), and applied the classifiers to the raw corpus, thereby harvesting a large number of additional instances. The result is a large-scale dataset of Arabizi, with precise indications of code-switching between Arabizi and English, French, and Arabic.
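The step from word-level tags to sentence-level labels can be illustrated with a minimal sketch. It is not the authors' implementation: the tag names (`arabizi`, `english`, `french`, `arabic`, `other`), the `SentenceLabel` type, and the rule that any single Arabizi token marks the sentence are all assumptions made for illustration.

```python
from typing import List, NamedTuple

# Hypothetical tag names; the tag set used in the paper may differ.
ARABIZI = "arabizi"
SWITCH_LANGS = {"english", "french", "arabic"}


class SentenceLabel(NamedTuple):
    has_arabizi: bool    # at least one token tagged as Arabizi
    code_switched: bool  # Arabizi co-occurs with English, French, or Arabic


def label_sentence(word_tags: List[str]) -> SentenceLabel:
    """Derive sentence-level labels from predicted word-level language tags."""
    tags = set(word_tags)
    has_arabizi = ARABIZI in tags
    # Assumed rule: a sentence is code-switched if it mixes Arabizi with
    # at least one of the other three languages of interest.
    code_switched = has_arabizi and bool(tags & SWITCH_LANGS)
    return SentenceLabel(has_arabizi, code_switched)


def harvest(tagged_sentences: List[List[str]]) -> List[List[str]]:
    """Keep only the sentences whose tags include Arabizi."""
    return [tags for tags in tagged_sentences if label_sentence(tags).has_arabizi]
```

Under these assumptions, a sentence tagged `["arabizi", "english", "other"]` would be kept and marked as code-switched, while `["english", "french"]` would be dropped. A real pipeline might instead require a minimum number or proportion of Arabizi tokens to reduce the effect of word-level tagging errors.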
| Original language | English |
|---|---|
| Title of host publication | WANLP 2022 - 7th Arabic Natural Language Processing - Proceedings of the Workshop |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 194-204 |
| Number of pages | 11 |
| ISBN (Electronic) | 9781959429272 |
| State | Published - 2022 |
| Event | 7th Arabic Natural Language Processing Workshop, WANLP 2022 held with EMNLP 2022 - Abu Dhabi, United Arab Emirates; Duration: 8 Dec 2022 → … |
Publication series
| Name | WANLP 2022 - 7th Arabic Natural Language Processing - Proceedings of the Workshop |
|---|---|
Conference
| Conference | 7th Arabic Natural Language Processing Workshop, WANLP 2022 held with EMNLP 2022 |
|---|---|
| Country/Territory | United Arab Emirates |
| City | Abu Dhabi |
| Period | 8/12/22 → … |
Bibliographical note
Publisher Copyright: © 2022 Association for Computational Linguistics.
ASJC Scopus subject areas
- Language and Linguistics
- Computational Theory and Mathematics
- Software
- Linguistics and Language