Building Text and Speech Datasets for Low Resourced Languages: A Case of Languages in East Africa
Publication Date
2022
Author
Claire Babirye, Joyce Nakatumba-Nabende, Andrew Katumba, Ronald Ogwang, Jeremy Tusubira Francis, Jonathan Mukiibi, Medadi Ssentanda, Lilian D Wanzare, Davis David
Abstract/ Overview
Africa has over 2000 languages; however, those languages are not well represented in the existing Natural Language Processing ecosystem. African languages
lack the essential digital resources needed to benefit from advancing language technologies. This growing gap has motivated researchers to
build resources for African languages so that existing Natural Language
Processing methods can be transferred to them. This paper discusses the process we
took to create, curate and annotate text and speech datasets for low-resourced languages in East Africa. The paper focuses on five languages:
four that are widely spoken in Uganda (Luganda, Runyankore-Rukiga, Acholi, and Lumasaaba), and Kiswahili, which is widely spoken across
East Africa. We ran baseline models: machine translation models on the English-Luganda
dataset from the parallel text corpora, and Automatic Speech Recognition
(ASR) models on the Luganda speech dataset. We recorded a BiLingual Evaluation Understudy (BLEU) score of 37 for the English-Luganda model and a BLEU
score of 36.8 for the Luganda-English model. For the ASR experiments, we obtained a Word Error Rate (WER) of 33%.
Speech, Text, Luganda, Common Voice, ASR, Swahili
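
For readers unfamiliar with the ASR metric reported in the abstract, the following is a minimal sketch of how a Word Error Rate is conventionally computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a model hypothesis, divided by the number of reference words. It is not the authors' evaluation code; the function name word_error_rate, the whitespace tokenization, and the example sentences are assumptions made purely for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenization is plain whitespace splitting, which is
    an assumption and may differ from the normalization used in the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of four reference words.
example_wer = word_error_rate("the cat sat down", "the cat sit down")
print(f"WER = {example_wer:.2%}")
```

Under this definition, a WER of 33% means that roughly one in three reference words would need to be substituted, deleted, or inserted to turn the model output into the reference transcript. Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.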