Artificial Intelligence and the Underrepresentation of African Languages

Much has been written about artificial intelligence (AI), mainly in English, including by OER Africa.[1] English is the predominant language on the Internet, in research and publications, and in education. African languages are vastly underrepresented in the global knowledge pool, even though scholars at Harvard University estimate that Africa, with between 1,000 and 2,000 languages, is home to about one third of the world’s languages.[2]

AI can play an important role in mitigating these language challenges. International search engines such as Google already play a large role in using AI to translate between English and African languages. These efforts are constrained, however, by the paucity of documents on the web written in most African languages. In response, networks of African researchers are actively looking for ways to increase the amount of data on the web in African languages, including by documenting scientific terms in languages where no such terms currently exist. Such data will then be available for AI systems to use, improving access to African languages. Importantly, these researchers are also trying to grow the field of African AI researchers by building networks and developing AI language technology solutions.

Many of us think of Google Translate when we want to understand something written in a language that we do not know. Google Translate now supports 25 African languages: Afrikaans, Amharic, Arabic, Bambara, Chichewa, Ewe, Hausa, Igbo, Kinyarwanda, Krio, Lingala, Luganda, Malagasy, Oromo, Sepedi, Swahili, Sesotho, Shona, Somali, Tigrinya, Tsonga, Twi, Xhosa, Yoruba, and Zulu. Several of these languages are spoken across borders. The good news is that the number of supported languages keeps increasing. The bad news is that there does not seem to be any one place to ascertain which African languages are covered; this can only be determined by searching within Google Translate. Furthermore, Google Translate relies on machine translation, which is mostly, but not entirely, accurate.

The Nigerian linguist Aremu Adeola offers an interesting example of why context matters in many languages, including Yoruba:[3]

"Most translations done by machines render some words wrong, especially words that are culturally nuanced. For example, Yorùbá words ayaba and obabìnrin have their meanings situated in a cultural context. Most machines translate both words as ‘queen.’ However, from a traditional-cum-cultural vantage point, it is essential to note that the meanings of ayaba and obabìnrin are different: Ọbabìnrin means ‘queen’ in English while ayaba is ‘wife of the king.’"

Using AI as a translation tool is not straightforward. Most AI tools:[4]

"Rely on a field of AI called natural language processing, a technology that enables computers to understand human languages. Computers can master a language through training, where they pick up on patterns in speech and text data. However, they fail when data in a particular language is scarce, as seen in African languages."

The South African science journalist Sibusiso Biyela gives an excellent example of just how difficult it can be to make scientific discoveries understandable and relatable in African languages such as isiZulu. Biyela was assigned to write, in isiZulu, about the discovery of a new species of dinosaur, Ledumahadi mafube. He explained:[5]

"But there’s no word for “dinosaur” in Zulu. Nor are there words for “Jurassic,” “fossilization,” or “evolution.” Despite the fact that Zulu—or isiZulu, as the language is called in South Africa—is spoken by some 10 million people, it simply doesn’t have the words for communicating science.

So my news piece wasn’t just a news piece. It was an attempt to tell a science story in a language that science overlooked—to help right a societal wrong. It was a small contribution among an increasing number that aim to help decolonize South African science writing. And it was rife with more pitfalls than I could have imagined. The task of describing science clearly, concisely, and accurately—already challenging in English—became exponentially more difficult in my native tongue."

At the end of his article, Biyela gives a lexicon of some of the English-isiZulu scientific terms that he used. He combines technology with his expertise in science to convey scientific terms from English in isiZulu. He was one of the partners in Masakhane, which is discussed below.[6]

The underrepresentation of African languages online makes it more difficult to use AI as a translation tool because there are few datasets for computers to work with. Several organizations are trying to mitigate this challenge, among them the Masakhane Research Foundation. Masakhane is collaborating with the African scientific preprint server, AfricArXiv,[7] to find a way to translate the papers that AfricArXiv receives into African languages.

Masakhane is a grassroots natural language processing (NLP) network that was formed for NLP research in African languages, for Africans, by Africans. The Masakhane community consists of:[8]

 ">1000 participants from 30 African countries with diverse educations and occupations, and >3 countries outside Africa. As of February 2020, over 49 translation results for over 38 African languages have been published by over 35 contributors on GitHub."

Masakhane has a trial translation page, but its results do not always match those of Google Translate. For example, Google Translate renders ‘fossil’ in Swahili as ‘kisukuku’, whereas Masakhane gives ‘mabaki ya wanyama’. (Most online translations use ‘kisukuku’.)

Figure 1: What is the correct translation?

These efforts are just getting started. If Africa is going to join the global knowledge pool, its languages must be represented too. Both AfricArXiv and Masakhane welcome volunteers, and other organizations doing similar work would also appreciate assistance.

For those who are interested in the interrelationship between AI and library and information studies, the African Library and Information Associations and Institutions (AfLIA) will host a webinar on this topic on 25 October 2023. Details are available on the webinar’s information page.


References and attribution

[3] Lost in Translation: Why Google Translate Often Gets Yorùbá-and Other Languages-Wrong. Aremu Adeola. Rising Voices. 20 November 2020. https://rising.globalvoices.org/blog/2020/11/20/lost-in-translation-why-google-translate-often-gets-yoruba-and-other-languages-wrong/

[4] A roadmap to help AI technologies speak African languages. ScienceDaily. 11 August 2023. https://www.sciencedaily.com/releases/2023/08/230811115430.htm

[5] Decolonizing Science Writing in South Africa. Sibusiso Biyela. The Open Notebook. 12 February 2019. https://www.theopennotebook.com/2019/02/12/decolonizing-science-writing-in-south-africa/

Image at the top of the article courtesy of albyantoniazzi, Flickr, CC BY-NC-SA