
Indic-TunedLens: Interpreting Multilingual Models in Indian Languages

Mihir Panchal, Deeksha Varshney, Mamta, Asif Ekbal

TL;DR

Indic-TunedLens addresses the challenge of interpreting multilingual LLMs deployed in India by introducing affine transformations tailored to Indian languages that align intermediate hidden states with the model's final output distributions. Building on the Tuned Lens framework, it trains a shared affine translator across 10 Indian languages and evaluates it on a multilingual MMLU subset using the Sarvam-1 model. The results show improved layer-wise interpretability, lower and more smoothly decreasing entropy across layers, and better token ranking across languages. The findings show that interpretability is not language-agnostic: language-aware projections uncover distinct processing patterns in morphologically rich languages and offer a practical tool for more equitable multilingual AI. This work advances multilingual interpretability with a concrete, language-sensitive method that shows gains in early-layer decoding fidelity and supports cross-language transfer analysis, with potential relevance to other communities that speak morphologically rich languages.
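For reference, the two decoders and the translator's training objective can be written as below. The formulation follows the original Tuned Lens (Belrose et al., 2023), which this work builds on. It is a sketch, not the paper's own equations. The symbols are as follows: $h_\ell(x)$ is the layer-$\ell$ hidden state for input $x$, $\mathrm{LN}$ is the model's final layer norm, $W_U$ is the unembedding matrix, and $p_L(x)$ is the model's final output distribution. Reading "shared" as one translator $(A_\ell, b_\ell)$ per layer, trained on data $\mathcal{D}$ pooled across the 10 languages, is an assumption.

```latex
\begin{aligned}
% Logit Lens: decode the intermediate hidden state directly
p^{\mathrm{logit}}_\ell(x) &= \operatorname{softmax}\big(W_U\,\mathrm{LN}(h_\ell(x))\big) \\
% Tuned Lens: apply a learned affine translator before decoding
p^{\mathrm{tuned}}_\ell(x) &= \operatorname{softmax}\big(W_U\,\mathrm{LN}(A_\ell h_\ell(x) + b_\ell)\big) \\
% Training: match the final output distribution under forward KL
(A_\ell, b_\ell) &= \arg\min_{A,\,b}\ \mathbb{E}_{x \sim \mathcal{D}}\Big[D_{\mathrm{KL}}\big(p_L(x)\,\big\|\,\operatorname{softmax}(W_U\,\mathrm{LN}(A\,h_\ell(x) + b))\big)\Big]
\end{aligned}
```

With $A_\ell = I$ and $b_\ell = 0$, the tuned lens reduces to the Logit Lens. This makes the identity a natural starting point for training.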

Abstract

Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions like India, yet most interpretability tools remain tailored to English. Prior work shows that LLMs often operate in English-centric representation spaces, making cross-lingual interpretability a pressing concern. We introduce Indic-TunedLens, a novel interpretability framework designed specifically for Indian languages that learns shared affine transformations. Unlike the standard Logit Lens, which decodes intermediate activations directly, Indic-TunedLens adjusts hidden states for each target language, aligning them with the target output distributions to enable more faithful decoding of model representations. We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly outperforms state-of-the-art interpretability methods, especially for morphologically rich, low-resource languages. Our results provide crucial insights into the layer-wise semantic encoding of multilingual transformers. Our model is available at https://huggingface.co/spaces/MihirRajeshPanchal/IndicTunedLens. Our code is available at https://github.com/MihirRajeshPanchal/IndicTunedLens.
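The contrast between the Logit Lens and an affine-translated (tuned) lens can be sketched in plain Python. This is a toy illustration, not the authors' code. Several choices are assumptions made so that the effect is easy to see:

- the parameter-free layer norm;
- the 4-dimensional hidden states and the 3-token vocabulary `W_U`;
- the hand-picked translator `A`, which simply undoes a swap of two coordinates.

In a real model, `A` and `b` would be learned per layer by minimizing the KL divergence from the final-layer distribution.

```python
import math


def layer_norm(h, eps=1e-5):
    """Parameter-free layer norm (no learned gain/bias: a simplifying assumption)."""
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]


def matvec(M, v):
    """Matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(h, W_U):
    """Decode a hidden state directly with the final norm and unembedding."""
    return softmax(matvec(W_U, layer_norm(h)))


def tuned_lens(h, A, b, W_U):
    """Apply an affine translator A h + b, then decode like the Logit Lens."""
    translated = [a + bi for a, bi in zip(matvec(A, h), b)]
    return logit_lens(translated, W_U)


def kl_divergence(p, q):
    """D_KL(p || q), with p the final-layer distribution (the translator's training loss)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy setup (all values hypothetical): 4-dim model, 3-token vocabulary.
W_U = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]]
h_final = [2.0, -1.0, 0.5, 0.5]
h_mid = [-1.0, 2.0, 0.5, 0.5]  # intermediate state with the first two dims swapped
A = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # undoes the swap
b = [0.0, 0.0, 0.0, 0.0]

p_final = logit_lens(h_final, W_U)  # the final layer is decoded this way in the model
print(kl_divergence(p_final, logit_lens(h_mid, W_U)))        # about 2.04
print(kl_divergence(p_final, tuned_lens(h_mid, A, b, W_U)))  # 0.0
```

Because this toy translator exactly inverts the distortion, the tuned lens recovers the final distribution. A learned translator would only approximate it. With `A` set to the identity and `b` to zero, `tuned_lens` returns the same distribution as `logit_lens`.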

Paper Structure (21 sections, 4 equations, 27 figures, 3 tables)

Figures (27)

  • Figure 1: This figure shows the entropy heatmap of the standard Tuned Lens, which was developed for English-centric models. The high and irregular entropy across layers suggests unstable intermediate representations and weak alignment for Indian languages, with predictions biased toward English tokens.
  • Figure 2: Entropy heatmap for the Indic-TunedLens. Entropy decreases more smoothly across layers, indicating progressive information consolidation and improved semantic alignment, with intermediate predictions increasingly generating meaningful Hindi tokens.
  • Figure 3: Input Question from MMLU Hindi Dataset
  • Figure 4: Layer-wise Improvement Patterns
  • Figure 5: Layer-wise Accuracy Comparison
  • ...and 22 more figures
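The entropy heatmaps in Figures 1 and 2 and the token-ranking comparison rest on two per-layer quantities. The sketch below computes both for a single token position. It assumes each layer's lens output is already available as a probability distribution. The four distributions are made-up illustrative values, not results from the paper, and the function names are hypothetical. Measuring rank as the rank of the model's final top-1 token is also an assumption about how ranking is defined.

```python
import math


def entropy(p):
    """Shannon entropy in nats; lower means the lens is more confident."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def token_rank(p, token_id):
    """1-based rank of token_id under p; ties with the top share rank 1."""
    return 1 + sum(1 for q in p if q > p[token_id])


def layer_profile(layer_dists, final_token):
    """Per-layer (entropy, rank of final token): one column of an entropy heatmap."""
    return [(entropy(p), token_rank(p, final_token)) for p in layer_dists]


# Hypothetical lens outputs for one position over four layers (3-token vocabulary).
layer_dists = [
    [1 / 3, 1 / 3, 1 / 3],
    [0.5, 0.3, 0.2],
    [0.2, 0.7, 0.1],
    [0.1, 0.85, 0.05],
]
final_token = max(range(3), key=lambda i: layer_dists[-1][i])

for layer, (ent, rank) in enumerate(layer_profile(layer_dists, final_token)):
    print(f"layer {layer}: entropy={ent:.3f} nats, rank={rank}")
```

Stacking such columns for every token position yields a heatmap like the ones described in Figures 1 and 2. A smooth decline in entropy, together with the final token reaching rank 1 in earlier layers, corresponds to the behaviour the TL;DR attributes to Indic-TunedLens.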