IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?

Akhilesh Aravapalli; Mounika Marreddy; Radhika Mamidi; Manish Gupta; Subba Reddy Oota

TL;DR

We study how multilingual Transformer models encode linguistic properties of Indic languages. We introduce IndicSentEval, a benchmark of approximately 47K sentences, and evaluate 9 multilingual Transformer models on 8 probing tasks across 6 Indic languages. We systematically analyze 13 input perturbations to gauge robustness, using a lightweight frozen-probe setup across layer representations. Results show that Indic-specific models (IndicBERT, MuRIL) excel at encoding Indic linguistic properties, while universal models are more robust to perturbations, with decoder-based universal models performing particularly well on several tasks. A downstream-correlation analysis with IndicGLUE suggests that probing signals predict performance on real-world tasks for morphologically rich Indic languages, underscoring the value of targeted probing and perturbation analyses for multilingual model design.
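The layerwise frozen-probe idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: a nearest-centroid classifier stands in for whatever trained probe the authors use, the features are synthetic rather than real model activations, and every name here (`fit_centroids`, `probe_accuracy`, `layerwise_probe`) is hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def fit_centroids(X: Sequence[Vector], y: Sequence[int]) -> Dict[int, Vector]:
    """Average the frozen feature vectors of each class (no model weights change)."""
    sums: Dict[int, Vector] = {}
    counts: Dict[int, int] = {}
    for vec, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids: Dict[int, Vector], vec: Vector) -> int:
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def sqdist(a: Vector, b: Vector) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda c: sqdist(centroids[c], vec))


def probe_accuracy(train: Tuple[Sequence[Vector], Sequence[int]],
                   test: Tuple[Sequence[Vector], Sequence[int]]) -> float:
    """Fit the probe on the training split and score it on the held-out split."""
    centroids = fit_centroids(*train)
    X_test, y_test = test
    hits = sum(predict(centroids, v) == t for v, t in zip(X_test, y_test))
    return hits / len(y_test)


def layerwise_probe(features_by_layer: Dict[int, List[Vector]],
                    labels: List[int], n_train: int) -> Dict[int, float]:
    """Run one independent probe per layer; features_by_layer[k] aligns with labels."""
    scores = {}
    for layer, feats in features_by_layer.items():
        train = (feats[:n_train], labels[:n_train])
        test = (feats[n_train:], labels[n_train:])
        scores[layer] = probe_accuracy(train, test)
    return scores


# Synthetic stand-in for sentence representations (assumption: 2-d toy vectors).
rng = random.Random(0)
labels = [i % 2 for i in range(200)]
informative = [[y * 4.0 + rng.gauss(0, 1), rng.gauss(0, 1)] for y in labels]
uninformative = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in labels]
scores = layerwise_probe({0: uninformative, 1: informative}, labels, n_train=150)
```

A layer whose representations separate the classes should score well above a layer that carries no signal for the property, which is the comparison a layerwise accuracy plot makes.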

Abstract

Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately 47K sentences. Surprisingly, our probing analysis of surface, syntactic, and semantic properties reveals that while almost all multilingual models demonstrate consistent encoding performance for English, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models are broadly more robust than Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into the probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages. We make our code and dataset publicly available at https://github.com/aforakhilesh/IndicBertology.
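Three of the perturbations named above (dropping both nouns and verbs, dropping only verbs, keeping only nouns) can be sketched as simple filters over POS-tagged tokens. This is an illustrative sketch only: the tag sets below use Universal Dependencies labels as an assumption, the actual dataset may use a different tagset, and the function names and the transliterated toy sentence are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Token = Tuple[str, str]  # (word, POS tag)

# Assumption: UD-style tags; the benchmark's own annotation scheme may differ.
NOUN_TAGS: Set[str] = {"NOUN", "PROPN"}
VERB_TAGS: Set[str] = {"VERB", "AUX"}


def drop_tags(tokens: List[Token], tags: Set[str]) -> List[str]:
    """Remove every word whose tag is in `tags`, preserving word order."""
    return [word for word, tag in tokens if tag not in tags]


def keep_tags(tokens: List[Token], tags: Set[str]) -> List[str]:
    """Keep only the words whose tag is in `tags`, preserving word order."""
    return [word for word, tag in tokens if tag in tags]


PERTURBATIONS: Dict[str, Callable[[List[Token]], List[str]]] = {
    "drop_nouns_and_verbs": lambda toks: drop_tags(toks, NOUN_TAGS | VERB_TAGS),
    "drop_verbs": lambda toks: drop_tags(toks, VERB_TAGS),
    "keep_nouns": lambda toks: keep_tags(toks, NOUN_TAGS),
}


def perturb(tokens: List[Token], name: str) -> str:
    """Apply the named perturbation and return the perturbed sentence."""
    return " ".join(PERTURBATIONS[name](tokens))


# Toy transliterated Hindi sentence ("Ram read the book"), tagged by hand.
sentence: List[Token] = [
    ("Ram", "PROPN"), ("ne", "ADP"), ("kitab", "NOUN"), ("padhi", "VERB"),
]
perturbed = {name: perturb(sentence, name) for name in PERTURBATIONS}
```

Robustness is then a matter of comparing probe accuracy on the original sentences with accuracy on each perturbed variant.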

Paper Structure (20 sections, 18 figures, 21 tables)

Introduction
IndicSentEval Dataset
Text Perturbation Analysis
Methodology
Experimental Results
Probing Results
Perturbation Results
Correlation Analysis of Probing with Downstream Tasks
Discussion and Conclusion
Limitations
Ethics Statement
Related Work
SSF format
INDICSENTEVAL dataset statistics
Probing Tasks
...and 5 more sections

Figures (18)

Figure 1: We evaluate 9 multilingual Transformer models on 8 probing tasks in 6 Indic languages using our IndicSentEval dataset. We analyze the effects of 13 perturbations on the performance of these models.
Figure 2: Probing task results: Layerwise accuracy comparisons between various multilingual representations on surface (top row) and syntactic (bottom two rows) probing tasks. We report the layerwise probing accuracies for individual multilingual models in Figs. \ref{fig:hi_probing_tasks} to \ref{fig:ur_probing_tasks} in Appendix \ref{probing_results}.
Figure 3: Probing task results: Layerwise accuracy comparisons between various multilingual representations on semantic probing tasks. For Malayalam, there is an absence of SSF data for the VerbGen, VerbPer, and VerbNum tasks. We report the layerwise probing accuracies for individual multilingual models in Figs. \ref{fig:hi_probing_tasks} to \ref{fig:ur_probing_tasks} in Appendix \ref{probing_results}.
Figure 4: A sample of an SSF formatted sentence in Hindi language.
Figure 5: Hindi language probing task results: Layerwise accuracy comparisons between various multilingual representations on 8 probing tasks.
...and 13 more figures
