Neuron Specialization: Leveraging intrinsic task modularity for multilingual machine translation

Shaomu Tan, Di Wu, Christof Monz

TL;DR

The paper tackles negative interference in unified multilingual translation models by uncovering intrinsic modularity within feed-forward network neurons. It analyzes language-specific activation patterns and language-proximity overlaps in FFN layers, revealing structured, layer-dependent modularity. Building on these insights, it introduces Neuron Specialization, which identifies specialized neurons and updates only a sparse subset of FFN parameters per task, reducing interference and boosting cross-lingual transfer. Across IWSLT and EC30 benchmarks, the approach yields consistent gains with lower parameter overhead and demonstrates the value of leveraging intrinsic modular signals for scalable multilingual learning.

Abstract

Training a unified multilingual model promotes knowledge transfer but inevitably introduces negative interference. Language-specific modeling methods show promise in reducing interference. However, they often rely on heuristics to distribute capacity and struggle to foster cross-lingual transfer via isolated modules. In this paper, we explore intrinsic task modularity within multilingual networks and leverage these observations to circumvent interference under multilingual translation. We show that neurons in the feed-forward layers tend to be activated in a language-specific manner. Meanwhile, these specialized neurons exhibit structural overlaps that reflect language proximity, which progress across layers. Based on these findings, we propose Neuron Specialization, an approach that identifies specialized neurons to modularize feed-forward layers and then continuously updates them through sparse networks. Extensive experiments show that our approach achieves consistent performance gains over strong baselines with additional analyses demonstrating reduced interference and increased knowledge transfer.
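The two steps named in the abstract, identifying specialized neurons and then updating only a sparse part of the feed-forward layer, can be illustrated with a minimal sketch. The selection rule used here (keep the most active neurons until they cover k percent of the total activation mass) is an assumption made for illustration, since this summary does not spell out the paper's exact criterion. The function names `select_specialized_neurons` and `masked_update` and the toy numbers are hypothetical.

```python
from typing import List, Set


def select_specialized_neurons(activation_counts: List[float], k: float) -> Set[int]:
    """Pick the most active neurons until they cover k percent of the total
    activation mass. This selection rule is an assumption for illustration."""
    total = sum(activation_counts)
    if total == 0:
        return set()
    # Visit neurons from most to least active.
    order = sorted(range(len(activation_counts)),
                   key=lambda i: activation_counts[i], reverse=True)
    threshold = total * k / 100.0
    selected: Set[int] = set()
    covered = 0.0
    for idx in order:
        if covered >= threshold:
            break
        selected.add(idx)
        covered += activation_counts[idx]
    return selected


def masked_update(weights: List[List[float]], grads: List[List[float]],
                  active_rows: Set[int], lr: float) -> List[List[float]]:
    """Plain SGD step applied only to rows (neurons) in active_rows.
    Rows outside the mask are copied unchanged."""
    return [
        [w - lr * g for w, g in zip(w_row, g_row)] if i in active_rows else list(w_row)
        for i, (w_row, g_row) in enumerate(zip(weights, grads))
    ]


# Toy example: four neurons, hypothetical activation counts for one task.
counts = [10, 1, 5, 4]
task_neurons = select_specialized_neurons(counts, k=75)
sparsity = 1 - len(task_neurons) / len(counts)
```

In this toy setting, a larger k selects more neurons and therefore gives lower sparsity, so each task touches a larger share of the layer during training.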

Paper Structure

This paper contains 46 sections, 3 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Pairwise Intersection over Union (IoU) scores for specialized neurons extracted from the first decoder FFN layer across all out-of-English translation directions to measure the degree of overlap. Darker cells indicate stronger overlaps, with the color threshold set from 40 to 80 to improve visibility.
  • Figure 2: Progression of the distribution of IoU scores for specialized neurons across layers on the EC30 dataset. The scores are measured across different source languages in the Encoder and across different target languages in the Decoder.
  • Figure 3: BLEU gains of shallower models over mT-small on IWSLT show improved X-En performance at the expense of En-X. Applying Neuron Specialization reduces En-X degradation and amplifies X-En gains.
  • Figure 4: Improvements of the Neuron Specialization method over the mT-large baseline on EC30. The x-axis indicates the factor $k$ and the dynamic sparsity of the fc1 layer, with displayed values ranging from the minimum to the maximum sparsity achieved. The y-axis indicates the SacreBLEU improvements over the mT-large model.
  • Figure 5: Sparsity progression of Neuron Specialization when $k=95$ on EC30. We observe that the sparsity becomes smaller in the Encoder and then goes up in the Decoder. Note that this figure is based on the natural signals extracted from the untouched pre-trained model, which are leveraged later during Neuron Specialization training. This intrinsic pattern follows the intuition that specialized neurons progress from language-specific to language-agnostic in the Encoder, and vice versa in the Decoder.
  • ...and 3 more figures
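
The pairwise Intersection over Union (IoU) scores described for Figures 1 and 2 can be computed as in the sketch below. It assumes each translation direction is represented by a set of specialized neuron indices and reports scores on a 0 to 100 scale, which is a guess based on the 40 to 80 color range in the Figure 1 caption. The direction labels and neuron sets are made up for illustration.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def iou(a: Set[int], b: Set[int]) -> float:
    """Intersection over Union of two neuron index sets, as a percentage."""
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a & b) / len(union)


def pairwise_iou(neurons_by_task: Dict[str, Set[int]]) -> Dict[Tuple[str, str], float]:
    """IoU for every unordered pair of tasks (translation directions)."""
    return {
        (s, t): iou(neurons_by_task[s], neurons_by_task[t])
        for s, t in combinations(sorted(neurons_by_task), 2)
    }


# Hypothetical specialized-neuron sets for three out-of-English directions.
toy_neurons = {
    "en-de": {0, 1, 2, 3},
    "en-nl": {1, 2, 3, 4},
    "en-zh": {3, 7, 8, 9},
}
scores = pairwise_iou(toy_neurons)
for pair, score in scores.items():
    print(pair, round(score, 1))
```

In this toy data the two directions that share most of their neurons score much higher than either does against the third, which is the kind of proximity pattern the heatmap in Figure 1 is meant to expose.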