AfriHG: News headline generation for African Languages

Toyib Ogunremi, Serah Akojenu, Anthony Soronnadi, Olubayo Adekanmbi, David Ifeoluwa Adelani

TL;DR

AfriHG addresses the scarcity of African-language headline-generation data by merging XL-Sum and MasakhaNEWS to cover 16 languages. It evaluates an Africa-centric seq2seq model (AfriTeVa V2) and an LLM (Aya-101), finding that AfriTeVa V2 generally yields higher ROUGE scores and that Aya-101 remains competitive on many languages but struggles with non-Latin scripts. A key result is that fine-tuning AfriTeVa V2 (~313M parameters) can rival prompting Aya-101 (~13B parameters), highlighting the value of targeted pretraining and supervised fine-tuning for low-resource languages. The dataset and code are released to foster further research and practical headline generation across African languages, with future work exploring additional LLMs such as GPT-4, Llama, and Gemma.

Abstract

This paper introduces AfriHG, a news headline generation dataset created by combining the XL-Sum and MasakhaNEWS datasets, focusing on 16 languages widely spoken in Africa. We experimented with two seq2seq models (mT5-base and AfriTeVa V2) and the Aya-101 LLM. Our results show that Africa-centric seq2seq models such as AfriTeVa V2 outperform the massively multilingual mT5-base model. Finally, we show that fine-tuning AfriTeVa V2, with 313M parameters, is competitive with prompting the Aya-101 LLM, which has more than 13B parameters.
