TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu

Gopichand Kanumolu, Lokesh Madasu, Nirmal Surange, Manish Shrivastava

TL;DR

The paper addresses the lack of labeled data for relevance-based Telugu headline classification by introducing TeClass, the first human-annotated Telugu dataset for this task, with 26,178 article–headline pairs and 78,534 annotations (an average of three per pair). It benchmarks traditional feature-based ML models against BERT-based transformer models and finds that the transformers, especially mDeBERTa, achieve the best classification performance (weighted F1 ≈ 0.63, macro F1 ≈ 0.64). The authors also show that fine-tuning headline generation models on highly relevant article–headline pairs yields about a 5-point improvement in ROUGE-L, and they provide guidance on class-aware fine-tuning. They state that the dataset and annotation guidelines will be made publicly available, to support future work on Telugu NLP and on generation and classification tasks in other low-resource languages.

Abstract

News headline generation is a crucial task in increasing productivity for both the readers and producers of news. This task can easily be aided by automated news headline-generation models. However, the presence of irrelevant headlines in scraped news articles results in sub-optimal performance of generation models. We propose that relevance-based headline classification can greatly aid the task of generating relevant headlines. Relevance-based headline classification involves categorizing news headlines based on their relevance to the corresponding news articles. While this task is well-established in English, it remains under-explored in low-resource languages like Telugu due to a lack of annotated data. To address this gap, we present TeClass, the first-ever human-annotated Telugu news headline classification dataset, containing 78,534 annotations across 26,178 article-headline pairs. We experiment with various baseline models and provide a comprehensive analysis of their results. We further demonstrate the impact of this work by fine-tuning various headline generation models using the TeClass dataset. The headlines generated by the models fine-tuned on highly relevant article-headline pairs showed about a 5-point increase in ROUGE-L scores. To encourage future research, the annotated dataset as well as the annotation guidelines will be made publicly available.

Paper Structure

This paper contains 13 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Category distribution in TeClass. HREL: Highly Related, MREL: Moderately Related, LREL: Least Related
  • Figure 2: Examples of relevance-based headline classification for each category
  • Figure 3: News website distribution in TeClass
  • Figure 4: News domain distribution in TeClass
  • Figure 5: Confusion matrix between actual and predicted categories for the mDeBERTa model