Low-resourced Languages and Online Knowledge Repositories: A Need-Finding Study

Hellina Hailu Nigatu, John Canny, Sarah E. Chasins

TL;DR

This study investigates the challenges that contributors face when using Online Knowledge Repositories (OKRs) in three low-resourced Ethiopian languages: Afan Oromo, Amharic, and Tigrinya. It combines two empirical methods, a forum analysis of Wikipedia Talk Pages and a contextual inquiry with 14 novice contributors, to uncover how language scripts, limited resources, and socio-political factors impede content creation. Key findings show that contributors struggle with non-Latin input, misspellings, poor translation quality, scarce scholarly sources, and interface barriers, all of which constrain article quantity and quality. The work identifies design opportunities for Wikipedia interfaces, information retrieval, machine translation, and input modalities, with an emphasis on preserving linguistic and cultural agency. Overall, the paper argues for decolonial, community-centered technology design that empowers speakers of low-resourced languages to preserve and share knowledge in their own languages.

Abstract

Online Knowledge Repositories (OKRs) like Wikipedia offer communities a way to share and preserve information about themselves and their ways of living. However, for communities with low-resourced languages, including most African communities, the quality and volume of available content are often inadequate. One reason for this could be that many OKRs embody Western ways of knowledge preservation and sharing, requiring many low-resourced language communities to adapt to new interactions. To understand the challenges faced by low-resourced language contributors on the popular OKR Wikipedia, we conducted (1) a thematic analysis of Wikipedia forum discussions and (2) a contextual inquiry study with 14 novice contributors. We focused on three Ethiopian languages: Afan Oromo, Amharic, and Tigrinya. Our analysis revealed several recurring themes. For example, contributors struggle to find resources in low-resourced languages to corroborate their articles, and language technology support, such as translation systems and spellcheck, produces errors that waste contributors' time. We hope our study will support designers in making online knowledge repositories accessible to low-resourced language speakers.

Paper Structure (44 sections, 5 figures, 6 tables)

This paper contains 44 sections, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: Plot showing the distribution of sentence count per article for Amharic, Tigrinya, Afan Oromo, and Arabic Wikipedia. For Amharic, Tigrinya, and Afan Oromo, the distributions spike at the beginning, with most articles having only one or two sentences. In contrast, the distribution for Arabic spikes at 10 sentences. We also observe a concentration in the middle of the distributions, with some articles having 20 to 200 sentences for Afan Oromo and Amharic and 10 to 30 sentences for Tigrinya. For Arabic, the distribution is shifted to the right, with a dense concentration of articles whose sentence counts run into the thousands. All four distributions plateau at the high end, where each of the larger sentence counts corresponds to only a single article.
  • Figure 2: Bar plots showing the types of articles that links on the main page lead to for each of the three Wikipedias. For Afan Oromo, around 35% of the links on the main page lead to a one-paragraph definition of the topic with no further links. Additionally, 4.44% of the links lead to political posts that have nothing to do with the topic at hand, and another 4.44% lead to non-existent pages. For Amharic Wikipedia, 40% of the links lead to another category, while 9.9% of the articles talk about COVID-19, unrelated to the category at hand; in one case, a link leads to an actual research paper published on Wikipedia. Tigrinya Wikipedia had the lowest percentage of links leading to a relevant, full article (3.33%). A majority of the links on the Tigrinya Wikipedia main page lead to other categories. Moreover, 13.33% of the articles were stubs containing only single words in another language or a one-sentence definition of something related to the topic. Lastly, 20% of the links led to pages that do not exist.
  • Figure 3: P9's timeline showing the challenges he faced while trying to create an article in Tigrinya. He first tries to create an account on Tigrinya Wikipedia and notices that the keyboard is in Latin. Once he has an account, he follows one of the links on the main page to find an article to edit but gets a "Page not found" notice. He tries to create a new page in that category but cannot get the Tigrinya keyboard to work on Wikipedia. He then decides to restart his computer, which takes half an hour before he can log back in to Wikipedia. Now, when he tries to create a new page for the category, he is told that his IP address is blocked. He then disconnects his VPN, which also ends up disconnecting him from the internet. Once he reconnects, he looks for a different category to edit and finds a stub article. He tries to edit the stub but gets a notice that his IP address is still blocked. At this point, he decides to write the article in Microsoft Word and copy it to Wikipedia. After writing his article, he tries to copy it to Wikipedia but is again told that his IP address is blocked. P9 gives up at this point, saving the article on his local computer and saying he will try again some other time. (See the appendix for larger screenshots.)
  • Figure 4: Screenshots of P12's attempt to find references for an article they wanted to edit on Wikipedia about a town in the Tigray region of Ethiopia. The search returned contextually irrelevant results and was skewed by recent events. In the end, the participant changed the topic they wanted to write about because they could not find resources. All searches were conducted on google.com after the participant disconnected from the VPN.
  • Figure 5: Screenshots from the session with P9 and P4. In the page-does-not-exist screenshot, P9 and P4 followed the main-page link to "qwanqwa", which means "language", and ended up at the interface shown, which tells them the page does not exist and allows them to edit it. We have redacted our participant's username in the top right corner. Note that the interface contains a mix of English and Tigrinya. In the IP-blocked screenshot, P9 received a notice showing that their IP is blocked. In this case, the error message in Tigrinya says "You don't have permission to create this page due to the following reasons:" and continues to state the error in English. Wikipedia policy [noauthor_helpi_2023] states that reasons for being blocked could be "Using a VPN or other anonymizing proxy service". The policy further states that one can submit an appeal or request an IP block exemption.