DARIJA-C: A Crowdsourced Corpus for Moroccan DARIJA Speech-to-Text Translation

	© 2024 by IJETT Journal
	Volume-72 Issue-10
	Year of Publication : 2024
	Author : Maria Labied, Abdessamad Belangour, Mouad Banane
	DOI : 10.14445/22315381/IJETT-V72I10P125

How to Cite?
Maria Labied, Abdessamad Belangour, Mouad Banane, "DARIJA-C: A Crowdsourced Corpus for Moroccan DARIJA Speech-to-Text Translation," International Journal of Engineering Trends and Technology, vol. 72, no. 10, pp. 257-266, 2024. Crossref, https://doi.org/10.14445/22315381/IJETT-V72I10P125

Abstract
This paper describes the development of DARIJA-C, a Moroccan Darija speech corpus. The corpus is intended to support the automatic translation of spoken Moroccan Darija into Modern Standard Arabic (MSA) text, with potential applications in communication, education, and technology. To support ongoing, scalable data collection, we built a web platform that lets anonymous contributors record speech and supply its corresponding text translation into MSA. Future iterations of the project aim to add translations into multiple international languages. The long-term aim is to compile the largest and most diverse corpus of Moroccan Darija speech paired with text translations in several languages, providing a resource for translating Moroccan Darija speech into multiple languages and contributing to research in speech recognition and translation.

Keywords
Moroccan Darija, Speech corpus, Automatic speech recognition, Speech-to-text translation, Crowdsourcing, Modern Standard Arabic, Multilingual translation, Speech dataset, Language resources, DARIJA-C.

