Automatic Text Summarization (ATS) for Research Documents in Sorani Kurdish

Rondik Hadi Abdulrahman, Hossein Hassani

TL;DR

This work addresses Automatic Text Summarization for Sorani Kurdish scientific documents by constructing a 231-document dataset from four departments and applying a sentence-weighted extractive approach using TF-IDF. Two experiments compare the effect of including versus excluding the conclusions; both are evaluated with ROUGE metrics and manual expert judgments, and the best ROUGE-1 score is 19.58%. The study demonstrates the feasibility of Sorani Kurdish ATS, provides a dataset and baseline methodology, and outlines concrete directions for improving preprocessing, expanding domain coverage, and pursuing abstractive approaches. These contributions add to Kurdish NLP resources and offer a practical foundation for researchers and developers working on language technologies for Sorani Kurdish.

Abstract

Extracting concise information from scientific documents aids learners, researchers, and practitioners. Automatic Text Summarization (ATS), a key Natural Language Processing (NLP) application, automates this process. While ATS methods exist for many languages, Kurdish remains underdeveloped due to limited resources. This study develops a dataset and language model based on 231 scientific papers in Sorani Kurdish, collected from four academic departments in two universities in the Kurdistan Region of Iraq (KRI), averaging 26 pages per document. Using Sentence Weighting and Term Frequency-Inverse Document Frequency (TF-IDF) algorithms, two experiments were conducted, differing in whether the conclusions were included. The average word count was 5,492.3 in the first experiment and 5,266.96 in the second. Results were evaluated manually and automatically using the ROUGE-1, ROUGE-2, and ROUGE-L metrics, with the best score reaching 19.58%. Six experts conducted manual evaluations using three criteria, with results varying by document. This research provides resources for Kurdish NLP researchers to advance ATS and related fields.
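
To make the pipeline named in the abstract concrete, the sketch below scores sentences by TF-IDF, keeps the top-scoring ones as an extractive summary, and computes a unigram-overlap ROUGE-1 F1 score. It is an illustration under stated assumptions, not the authors' implementation: the function names (`tokenize`, `summarize`, `rouge1_f1`), the choice to treat each sentence as a "document" for the IDF term, the length-normalised term frequency, and the naive regex tokenizer are all hypothetical. Real Sorani text would likely need script-aware preprocessing that this sketch does not attempt.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): runs of Unicode word characters, lowercased.
    return re.findall(r"\w+", text.lower())


def summarize(sentences, k=2):
    """Return the k highest-weighted sentences, kept in their original order."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Document frequency: in how many sentences each term occurs (assumption:
    # each sentence is treated as one "document" for the IDF term).
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        # Sentence weight = sum of TF * IDF over its terms,
        # with TF normalised by sentence length.
        weight = sum((c / len(toks)) * math.log(n / df[t]) for t, c in tf.items())
        scores.append(weight)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a candidate and a reference summary."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    doc = ["the cat sat", "the dog sat", "quantum physics is hard"]
    summary = summarize(doc, k=1)
    print(summary)
    print(rouge1_f1(" ".join(summary), "physics is hard"))
```

Because sentences built from rarer terms receive higher weights, the third sentence is selected in the toy input. Reported ROUGE figures depend heavily on tokenization and on the reference summaries used, so scores from a sketch like this are not comparable to the 19.58% reported above.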

Paper Structure

This paper contains 20 sections, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Types of Text Summarization Approaches, adapted and modified from TextSummarization.
  • Figure 2: Visual representation of Experiment 1 results.
  • Figure 3: Visual representation of Experiment 2 results.
  • Figure 4: Comparison of ROUGE-1 results from both experiments.
  • Figure 5: Comparison of Evaluator Feedback for the Kurdish Language Department Document.
  • ...and 6 more figures