Authorship Attribution in Bangla Literature (AABL) via Transfer Learning using ULMFiT

Aisha Khatun, Anisur Rahman, Md Saiful Islam, Hemayet Ahmed Chowdhury, Ayesha Tasnim

TL;DR

This work tackles Authorship Attribution in Bangla Literature (AABL) by applying transfer learning with an AWD-LSTM in a three-stage pipeline: pre-training a language model on large Bangla corpora, fine-tuning it on the attribution data, and training a classifier on top. It introduces BAAD16, described as the largest Bangla authorship attribution dataset to date, and compares word, sub-word, and character tokenization, finding sub-word tokenization the most robust when combined with pre-training on news text. The study reports state-of-the-art performance (up to 99.8% accuracy on BAAD16) and stable results as the number of authors grows, surpassing existing Bangla authorship attribution approaches. It also releases several pre-trained Bangla language models and highlights practical benefits for low-resource Bangla NLP tasks and for downstream applications in security, plagiarism detection, and literary analysis.

Abstract

Authorship Attribution is the task of creating an appropriate characterization of text that captures an author's writing style in order to identify the original author of a given piece of text. With increased anonymity on the internet, this task has become increasingly crucial in various security and plagiarism detection fields. Despite significant advancements in other languages such as English, Spanish, and Chinese, Bangla lacks comprehensive research in this field due to its complex linguistic features and sentence structure. Moreover, existing systems do not scale when the number of authors increases, and their performance drops when there are only a small number of samples per author. In this paper, we propose the use of the Averaged Stochastic Gradient Descent Weight-Dropped Long Short-Term Memory (AWD-LSTM) architecture and an effective transfer learning approach that addresses the problems of complex linguistic feature extraction and scalability for Authorship Attribution in Bangla Literature (AABL). We analyze the effect of different tokenizations, namely word, sub-word, and character level tokenization, and demonstrate the effectiveness of each in the proposed model. Moreover, we introduce the publicly available Bangla Authorship Attribution Dataset of 16 authors (BAAD16), containing 17,966 sample texts and 13.4+ million words, to address the scarcity of standard datasets, and we release six variations of pre-trained language models for use in any Bangla NLP downstream task. For evaluation, we used our BAAD16 dataset as well as other publicly available datasets. Empirically, our proposed model outperformed state-of-the-art models and achieved 99.8% accuracy on the BAAD16 dataset. Furthermore, we showed that the proposed system scales much better as the number of authors increases, and that performance remains steady even with few training samples.
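
As a rough illustration of the three tokenization levels the abstract compares, the sketch below splits one sentence at the word, sub-word, and character level. It is not the paper's tokenizer: the function names, the sample sentence, and the tiny sub-word vocabulary are all assumptions made for this example, the sub-word step is a simple greedy longest-match rather than a learned model, and the character step splits on Unicode code points.

```python
# Toy comparison of word-, sub-word-, and character-level tokenization.
# All names and the vocabulary here are hypothetical, chosen for illustration.

def word_tokens(text):
    """Split on whitespace: one token per word."""
    return text.split()


def char_tokens(text):
    """One token per Unicode code point (spaces included)."""
    return list(text)


def subword_tokens(text, vocab):
    """Greedy longest-match segmentation of each word against `vocab`.

    Any stretch that matches no vocabulary entry falls back to
    single-character pieces, so every word can always be segmented.
    """
    tokens = []
    for word in text.split():
        start = 0
        while start < len(word):
            end = len(word)
            # Shrink the candidate piece until it is in the vocabulary
            # or only one character is left.
            while end > start + 1 and word[start:end] not in vocab:
                end -= 1
            tokens.append(word[start:end])
            start = end
    return tokens


# Hypothetical sample sentence and sub-word vocabulary.
sample = "আমি বাংলায় গান গাই"
toy_vocab = {"বাংলা", "গান"}

print(word_tokens(sample))
print(subword_tokens(sample, toy_vocab))
print(char_tokens(sample))
```

The trade-off this makes visible is vocabulary size against sequence length: word tokens give short sequences but an open vocabulary, character tokens give a tiny vocabulary but long sequences, and sub-word pieces sit in between.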

Paper Structure (47 sections, 12 figures, 11 tables)

This paper contains 47 sections, 12 figures, 11 tables.

Figures (12)

  • Figure 1: Schematic diagram of the proposed system. The first two steps are language modelling tasks, and the last step is authorship attribution. The embedding layer and the AWD-LSTM base remain the same in all steps; only the final dense layer changes. The trained weights of the shared parts are passed from pre-training to fine-tuning to classification and are updated in each step.
  • Figure 2: BAAD16 dataset: Distribution per author. Authors are indicated by their index number; see the indices in the BAAD16 corpus table.
  • Figure 3: Example of three kinds of tokenization of a sample text. Punctuation added for demonstration purposes.
  • Figure 4: Simplified diagram of the architectures. The base of the model (embedding and LSTM layers) remains the same so that learned weights can be transferred, while the classifier part is replaced when the task changes.
  • Figure 5: DropConnect network.
  • ...and 7 more figures
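
The captions of Figures 1 and 4 describe a shared base whose weights are carried from stage to stage while the final dense layer is replaced. A minimal sketch of that hand-off follows, under stated assumptions: `StageModel`, `next_stage`, and `fake_train` are hypothetical names, the weights are plain lists of floats rather than real embedding or LSTM parameters, and `fake_train` merely nudges every weight to stand in for actual training. Only the count of 16 authors comes from the source.

```python
import copy
import random
from dataclasses import dataclass


@dataclass
class StageModel:
    base: dict  # stands in for the embedding + AWD-LSTM weights (shared)
    head: list  # stands in for the task-specific final dense layer


def fresh_head(n_outputs, rng):
    """Randomly initialise a new final layer with `n_outputs` units."""
    return [rng.uniform(-0.1, 0.1) for _ in range(n_outputs)]


def next_stage(previous, n_outputs, rng):
    """Carry the trained base forward and attach a newly initialised head."""
    return StageModel(base=copy.deepcopy(previous.base),
                      head=fresh_head(n_outputs, rng))


def fake_train(model, step=0.01):
    """Placeholder for training: shifts every weight in base and head."""
    model.base = {name: [w + step for w in weights]
                  for name, weights in model.base.items()}
    model.head = [w + step for w in model.head]
    return model


rng = random.Random(0)
VOCAB_SIZE = 8   # illustrative only; real vocabularies are far larger
N_AUTHORS = 16   # BAAD16 covers 16 authors

# Stage 1: language-model pre-training on a large general corpus.
pretrained = fake_train(StageModel(
    base={"embedding": [0.0] * 4, "lstm": [0.0] * 4},
    head=fresh_head(VOCAB_SIZE, rng)))

# Stage 2: language-model fine-tuning on the attribution texts.
fine_tuned = fake_train(next_stage(pretrained, VOCAB_SIZE, rng))

# Stage 3: authorship classifier, with one output unit per author.
classifier = fake_train(next_stage(fine_tuned, N_AUTHORS, rng))

print(len(classifier.head), classifier.base["lstm"][0])
```

Copying the base at each hand-off keeps the earlier checkpoints intact, so a single pre-trained model can seed several downstream tasks.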