Automated Cognate Detection as a Supervised Link Prediction Task with Cognate Transformer

V. S. D. S. Mahesh Akavarapu, Arnab Bhattacharya

TL;DR

The paper reframes automated cognate detection as a supervised link-prediction problem and introduces a Cognate Transformer that takes multiple sequence alignments as input and directly predicts pairwise cognacy probabilities. By combining outer product mean representations with triangle-based transitivity updates, the model improves steadily as supervision increases and runs considerably faster than pairwise methods. Across diverse language-family datasets, CogTran2 outperforms traditional LexStat-based and other supervised baselines, particularly when labeled data are available, and it also supports transfer learning to unseen data. Limitations include lagging performance on certain datasets and difficulty with partial cognacy and borrowings, pointing to future work on refining subword cognates and extending applications to phylogenetic reconstruction.

Abstract

Identification of cognates across related languages is one of the primary problems in historical linguistics. Automated cognate identification is helpful for several downstream tasks, including identifying sound correspondences, proto-language reconstruction, and phylogenetic classification. Previous state-of-the-art methods for cognate identification are mostly based on distributions of phonemes computed across multilingual wordlists and make little use of the cognacy labels that define links among cognate clusters. In this paper, we present a transformer-based architecture inspired by computational biology for the task of automated cognate detection. Beyond a certain amount of supervision, this method performs better than the existing methods and shows steady improvement with further increase in supervision, thereby demonstrating the efficacy of utilizing the labeled information. We also demonstrate that accepting multiple sequence alignments as input and having an end-to-end architecture with a link prediction head saves much computation time while simultaneously yielding superior performance.
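
To make the idea of a link prediction head over multiple sequence alignments more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes that each word in an aligned concept slot already has a small embedding per alignment column, that the outer product of two words' embeddings is averaged over columns to give one feature vector per word pair, and that a symmetrized logistic layer turns those features into a cognacy probability. The function names, the toy embeddings, and the hand-picked weights are all hypothetical.

```python
import math


def outer_product_mean(msa):
    """msa[s][i] holds C floats for word s at alignment column i (assumed layout).

    Returns pair[a][b]: a flat list of C*C features, the outer product of the
    two words' column embeddings averaged over all alignment columns.
    """
    n_words, n_cols, n_ch = len(msa), len(msa[0]), len(msa[0][0])
    pair = [[[0.0] * (n_ch * n_ch) for _ in range(n_words)] for _ in range(n_words)]
    for a in range(n_words):
        for b in range(n_words):
            for i in range(n_cols):
                for c in range(n_ch):
                    for d in range(n_ch):
                        pair[a][b][c * n_ch + d] += msa[a][i][c] * msa[b][i][d] / n_cols
    return pair


def link_probabilities(pair, weights, bias=0.0):
    """Symmetrized logistic head: one cognacy probability per word pair."""
    n = len(pair)

    def logit(a, b):
        return sum(w * x for w, x in zip(weights, pair[a][b])) + bias

    probs = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            z = 0.5 * (logit(a, b) + logit(b, a))  # enforce p(a,b) == p(b,a)
            probs[a][b] = 1.0 / (1.0 + math.exp(-z))
    return probs


# Toy input: 3 words, 2 alignment columns, 2 channels (made-up values).
# Words 0 and 1 share the same embeddings; word 2 differs in every column.
msa = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
# Hand-picked weights that reward matching channels and penalize mismatches;
# a trained model would learn these from cognacy labels.
weights = [1.0, -1.0, -1.0, 1.0]

pair = outer_product_mean(msa)
probs = link_probabilities(pair, weights)
print(round(probs[0][1], 3), round(probs[0][2], 3))
```

In this toy setup the pair (0, 1) receives a probability above 0.5 and the pair (0, 2) a probability below 0.5. Because all word pairs in a concept slot are scored from a single alignment, one pass covers every pair at once, which is the kind of saving the abstract attributes to the end-to-end design.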

Paper Structure

This paper contains 27 sections, 6 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Architecture of Cognate Transformer with Triangle Multiplication and Attention modules
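
The triangle multiplication named in the figure caption relates to transitivity: if words a and b are both linked to a third word k, that is evidence that a and b are linked to each other. The sketch below is a deliberately simplified, scalar version of an "outgoing" triangle update and is an assumption for illustration, not the module described in the paper. Evoformer-style modules in computational biology typically add learned projections, gating, and normalization, none of which is shown here. The function name and the toy scores are hypothetical.

```python
def triangle_multiply_outgoing(pair):
    """Simplified scalar triangle update (illustrative assumption).

    pair[a][b] is the current evidence in [0, 1] that words a and b are cognate.
    Returns update[a][b] = mean over k of pair[a][k] * pair[b][k], so a pair
    gains support when both members are strongly linked to the same third word.
    """
    n = len(pair)
    return [
        [sum(pair[a][k] * pair[b][k] for k in range(n)) / n for b in range(n)]
        for a in range(n)
    ]


# Toy scores: words 0 and 1 are weakly linked to each other directly (0.2),
# but both are strongly linked to word 2 (0.9).
pair_scores = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.9],
    [0.9, 0.9, 1.0],
]
update = triangle_multiply_outgoing(pair_scores)
print(update[0][1] > pair_scores[0][1])
```

Here the shared link through word 2 raises the evidence for the pair (0, 1) above its direct score, which is the transitivity effect in miniature.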