The Kenya Language Corpus Initiative (v1.0.0)

Kenya is one of Africa's most linguistically diverse nations. With an estimated 68 or more distinct languages spoken across its territory, the country represents a remarkable concentration of the Nilotic, Bantu, Cushitic, and other language families.

This is the structured repository of the Kenyan Language Corpus Initiative. It will serve as a central digital archive for linguistic materials, including audio recordings, grammatical descriptions, lexical databases, transcribed texts, and sociolinguistic surveys, documenting Kenya's linguistic heritage for researchers, communities, and future generations.

Communities in the Repository

Select a community to browse its collections.

Arabic Heritage Languages Community
Repository community for Arabic heritage language varieties of Kenya: Modern Standard Arabic, Hadrami Arabic, Omani Arabic (a heritage variety whose historical influence underlies the Arabic-derived stratum of Swahili), Kinubi (Nubi Creole Arabic, spoken by the Kibera Nubi community), and Classical/Quranic Arabic as used in Kenyan Islamic scholarly tradition and liturgical contexts.
Asian Heritage Languages
Repository community for South Asian heritage languages of Kenya: Gujarati, Kutchi, Punjabi, Tamil, Sindhi, and related varieties. Covers community oral histories, devotional texts, folk songs, and linguistic documentation.
Bantu Languages Community
Repository community for Bantu-family languages indigenous to Kenya, including Kiswahili, Gikuyu, Kikamba, Ekegusii, Kimiiru, Kiembu, Dawida, and the nine Mijikenda varieties. Materials cover linguistics, oral literature, and language documentation across dialects and regional varieties.
Cushitic Languages Community
Repository community for Cushitic-family languages of Kenya: Somali, Oromo (Borana/Gabra), Rendille, El Molo, Dahalo, Orma, Burji, and others. Covers approximately 4 million speakers across about 10 languages, several of which are endangered or critically endangered.
Nilotic Languages Community
Repository community for Nilotic-family languages of Kenya: Dholuo, the Kalenjin cluster, Maa (Maasai and Samburu), Turkana, and Teso. Approximately 30 languages and 14 million speakers.

Recent Submissions

Gikuyu
(Deutscher Wortschatz, 2017-06-01) Deutscher Wortschatz
This item record provides access to the Gikuyu (Kikuyu) corpus contributed to the Leipzig Corpora Collection (LCC) via the CURL (Crawling Under-Resourced Languages) initiative at the University of Leipzig. The corpus consists of randomly selected Gikuyu sentences harvested from web-crawled sources and processed through the standard LCC pipeline: sentence segmentation, removal of non-sentences and foreign-language material, and random reordering of sentences, which destroys the original document structure to comply with copyright requirements. Sentence-level word co-occurrence statistics have been precomputed and are included with the data. Limitations: As a web-crawled corpus, the resource may be skewed toward domains with a stronger digital presence in Gikuyu (religious texts, political commentary, diaspora media). Quality filtering removes obvious non-Gikuyu content but does not guarantee the absence of code-switching with Kiswahili or English. ISO 639-3: kik. Glottolog: kiku1240. Source: web texts submitted via the CURL community URL-contribution platform. Licence: Creative Commons Attribution (CC BY).
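As a rough illustration of two of the steps named in this record (random reordering of sentences and sentence-level co-occurrence counting), here is a minimal Python sketch. It is not the LCC implementation: the function names, the whitespace tokenisation, the lower-casing, and the placeholder sentences are all assumptions made for the example.

```python
import random
from collections import Counter
from itertools import combinations


def shuffle_sentences(sentences, seed=0):
    """Return the sentences in random order (hypothetical helper).

    Reordering discards the original document structure; the fixed seed
    only makes this example reproducible.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def sentence_cooccurrences(sentences):
    """Count how many sentences each unordered pair of word types shares.

    Assumption: tokens are separated by whitespace and compared in lower
    case. The tokenisation actually used by the LCC pipeline is not
    described in the record.
    """
    counts = Counter()
    for sentence in sentences:
        # Each word type counts once per sentence; sorting gives every
        # pair a canonical (alphabetical) order.
        types = sorted(set(sentence.lower().split()))
        counts.update(combinations(types, 2))
    return counts


# Placeholder tokens stand in for real corpus sentences.
corpus = shuffle_sentences(["a b c", "a b", "b d"])
pair_counts = sentence_cooccurrences(corpus)
```

Because each pair is stored in sorted order, `pair_counts[("a", "b")]` is the number of sentences containing both tokens, regardless of the order in which the sentences were shuffled.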
Kidaw'ida
(Mozilla Foundation, 2025-01-19) Mbogho, A., Awuor, Q., Kipkebut, A., Wanzare, L., & Oloo, V.
This datasheet is for cv-corpus-25.0-2026-03-09 of the Mozilla Common Voice Scripted Speech dataset for Kidaw'ida [dav - dav]. The dataset contains 49,630 clips representing 55.95 hours of recorded speech (9.31 hours validated) from 24 speakers, recorded from a text corpus of 31,892 sentences.
Common Voice Scripted Speech 25.0 - Kalenjin
(Mozilla Foundation, 2025-01-19) Mbogho, A., Awuor, Q., Kipkebut, A., Wanzare, L., & Oloo, V.
This datasheet is for cv-corpus-25.0-2026-03-09 of the Mozilla Common Voice Scripted Speech dataset for Kalenjin [kln - kln]. The dataset contains 70,042 clips representing 88.12 hours of recorded speech (40.65 hours validated) from 41 speakers, recorded from a text corpus of 29,961 sentences.
DhoNam
(Mozilla Foundation, 2025-12-20) Dr. Lilian D.A Wanzare
DhoNam: Dholuo Speech dataset is a speech corpus designed to support Automatic Speech Recognition (ASR) and other speech technologies for Dholuo, one of Kenya’s major indigenous languages. It contains native-speaker audio recordings collected through a platform on which users read a displayed sentence aloud, together with the prompt sentence corresponding to each recording.
Common Voice Scripted Speech 25.0 - Swahili
(Mozilla Foundation, 2026-03-23) Mozilla Foundation
A collection of read speech recordings in Swahili (Kiswahili). This datasheet is for cv-corpus-25.0-2026-03-09 of the Mozilla Common Voice Scripted Speech dataset for Swahili [Kiswahili - sw]. The dataset contains 730,187 clips representing 1,064.02 hours of recorded speech (392.11 hours validated) from 1,518 speakers, recorded from a text corpus of 140,486 sentences. This dataset is released under the Creative Commons Zero (CC-0) licence. By downloading this data you agree not to determine the identity of speakers in the dataset.
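The clip counts and hour totals quoted in the Common Voice datasheets above can be compared with a few lines of arithmetic. The Python sketch below derives the validated share of recorded hours and the mean clip length for each dataset. The `Datasheet` class and its field names are hypothetical, introduced only for this example; the figures are copied from the three records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Datasheet:
    """Figures copied from a Common Voice datasheet record (hypothetical type)."""
    language: str
    clips: int
    total_hours: float
    validated_hours: float
    speakers: int

    @property
    def validated_share(self) -> float:
        # Fraction of the recorded hours that have been validated.
        return self.validated_hours / self.total_hours

    @property
    def mean_clip_seconds(self) -> float:
        # Average clip length in seconds, over all recorded clips.
        return self.total_hours * 3600 / self.clips


SHEETS = [
    Datasheet("Kidaw'ida", 49630, 55.95, 9.31, 24),
    Datasheet("Kalenjin", 70042, 88.12, 40.65, 41),
    Datasheet("Swahili", 730187, 1064.02, 392.11, 1518),
]

for sheet in SHEETS:
    print(
        f"{sheet.language}: {sheet.validated_share:.1%} of hours validated, "
        f"mean clip length {sheet.mean_clip_seconds:.2f} s"
    )
```

These ratios describe only the published totals; they say nothing about how hours are distributed across individual speakers.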