Headline-Guided Extractive Summarization for Thai News Articles

Pimpitchaya Kositcharoensuk, Nakarin Sritrakool, Ploy N. Pratanwanich

TL;DR

The paper tackles extractive summarization of Thai news by incorporating headline information through the CHIMA framework, which chains an embedding layer, a BERT-based encoder, a sentence-selection layer, and a headline-guided reranking module. It introduces two aggregation strategies, CHIMA-SA (simple average) and CHIMA-HM (harmonic mean), to fuse body-based predictions with headline-body semantic similarity. Evaluated on the ThaiSum dataset, the CHIMA variants outperform the baseline models across ROUGE, BLEU, and F1 metrics, with notable gains in recalling summary sentences located in the middle or at the end of the article. The headline-guided approach offers a practical solution for Thai, a low-resource language, and suggests that headline information is valuable for extractive summarization more broadly.

Abstract

Text summarization is a process of condensing lengthy texts while preserving their essential information. Previous studies have predominantly focused on high-resource languages, while low-resource languages like Thai have received less attention. Furthermore, earlier extractive summarization models for Thai texts have primarily relied on the article's body, without considering the headline. This omission can result in the exclusion of key sentences from the summary. To address these limitations, we propose CHIMA, an extractive summarization model that incorporates the contextual information of the headline for Thai news articles. Our model utilizes a pre-trained language model to capture complex language semantics and assigns a probability to each sentence to be included in the summary. By leveraging the headline to guide sentence selection, CHIMA enhances the model's ability to recover important sentences and discount irrelevant ones. Additionally, we introduce two strategies for aggregating headline-body similarities, simple average and harmonic mean, providing flexibility in sentence selection to accommodate varying writing styles. Experiments on publicly available Thai news datasets demonstrate that CHIMA outperforms baseline models across ROUGE, BLEU, and F1 scores. These results highlight the effectiveness of incorporating the headline-body similarities as model guidance. The results also indicate an enhancement in the model's ability to recall critical sentences, even those scattered throughout the middle or end of the article. With this potential, headline-guided extractive summarization offers a promising approach to improve the quality and relevance of summaries for Thai news articles.
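
The two aggregation strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the clamping of negative cosine similarities to zero, the top-k selection rule, and the toy scores are all assumptions made for illustration, and the paper's exact formulas may differ. The sketch only shows how a simple average and a harmonic mean of a sentence's selection probability and its headline-body similarity can rank sentences differently.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def simple_average(p, s):
    """Assumed form of the simple-average strategy: arithmetic mean of the two scores."""
    return (p + s) / 2.0


def harmonic_mean(p, s):
    """Assumed form of the harmonic-mean strategy: 2ps / (p + s), with 0 when both are 0."""
    if p + s == 0.0:
        return 0.0
    return 2.0 * p * s / (p + s)


def rerank(selection_probs, similarities, aggregate, k):
    """Return the indices of the k highest-scoring sentences, in document order.

    selection_probs: body-based probability per sentence (hypothetical model output).
    similarities:    headline-body cosine similarity per sentence.
    Negative similarities are clamped to 0 here, which is an assumption of this sketch.
    """
    scores = [aggregate(p, max(s, 0.0)) for p, s in zip(selection_probs, similarities)]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy scores for a four-sentence article (made up for illustration only).
selection_probs = [0.95, 0.2, 0.6, 0.5]
similarities = [0.3, 0.9, 0.55, 0.8]

picked_sa = rerank(selection_probs, similarities, simple_average, k=2)
picked_hm = rerank(selection_probs, similarities, harmonic_mean, k=2)
print(picked_sa)  # [0, 3]
print(picked_hm)  # [2, 3]
```

In this toy case the simple average keeps sentence 0, whose high selection probability compensates for its low similarity to the headline, while the harmonic mean prefers sentence 2, whose two scores are more balanced. This is one way to read the abstract's remark that the two strategies provide flexibility across writing styles.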

Paper Structure

This paper contains 28 sections, 19 equations, 9 figures, 4 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Illustration of the proposed CHIMA model.
  • Figure 2: A sample from the ThaiSum dataset.
  • Figure 3: Statistics of the ThaiSum dataset. Distributions of the number of tokens in the headline, summary, and article's body parts (a). Distribution of the number of sentences (b), the body-summary compression percentages (c), and the sentence indices of all summary labels from Oracle (d).
  • Figure 4: Headline-body cosine similarities for summary and non-summary sentences.
  • Figure 5: Contour plots of summarization probabilities based on varying selection and similarity scores.
  • ...and 4 more figures