Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck

Hongbin Zhang, Kehai Chen, Xuefen Bai, Youcheng Pan, Yang Xiang, Jinpeng Wang, Min Zhang

TL;DR

DIBJudge is a robust fine-tuning framework that learns a minimally sufficient, judgment-critical representation via variational information compression while explicitly isolating spurious factors into a dedicated bias branch, thereby encouraging effective disentanglement.

Abstract

Large language models (LLMs) have become a standard for multilingual evaluation, yet they exhibit a severe, systematic translationese bias. We characterize translationese bias as the tendency of LLM judges to favor machine-translated text over human-authored references, particularly in low-resource languages. We attribute this bias to spurious correlations with (i) latent manifold alignment with English and (ii) cross-lingual predictability. To mitigate it, we propose DIBJudge, a robust fine-tuning framework that learns a minimally sufficient, judgment-critical representation via variational information compression while explicitly isolating spurious factors into a dedicated bias branch. Furthermore, we incorporate a cross-covariance penalty that explicitly suppresses statistical dependence between the robust and bias representations, thereby encouraging effective disentanglement. Extensive evaluations on multilingual reward modeling benchmarks and a dedicated translationese bias evaluation suite demonstrate that DIBJudge consistently outperforms strong baselines and substantially mitigates translationese bias.

Paper Structure (52 sections, 3 theorems, 52 equations, 8 figures, 21 tables)


Key Result

Proposition 3.1

Let $Z_r$ be a continuous random variable, with variational posterior $q_{\phi}(Z_r|X)$ and fixed prior $p(Z_r)$. Then $I(X; Z_r) \leq \mathbb{E}_{x \sim p(X)} \left[ D_{\mathrm{KL}}(q_{\phi}(Z_r|x) \| p(Z_r)) \right].$
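
The bound in Proposition 3.1 follows from a standard decomposition of the expected KL term. The sketch below is one common way to see it and may differ from the paper's own proof, which is not reproduced here. It assumes mutual information is taken under the joint $p(x)\,q_{\phi}(z|x)$ and introduces the aggregated posterior $q_{\phi}(z) = \mathbb{E}_{x \sim p(X)}[q_{\phi}(z|x)]$, a symbol that does not appear in the statement above.

```latex
\begin{align*}
\mathbb{E}_{x \sim p(X)} \left[ D_{\mathrm{KL}}(q_{\phi}(Z_r|x) \| p(Z_r)) \right]
&= \mathbb{E}_{x \sim p(X)} \mathbb{E}_{z \sim q_{\phi}(Z_r|x)}
   \left[ \log \frac{q_{\phi}(z|x)}{q_{\phi}(z)} + \log \frac{q_{\phi}(z)}{p(z)} \right] \\
&= I(X; Z_r) + D_{\mathrm{KL}}(q_{\phi}(Z_r) \| p(Z_r)) \\
&\geq I(X; Z_r).
\end{align*}
```

The last step uses only the non-negativity of the KL divergence, so the bound is tight exactly when the aggregated posterior matches the prior.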

Figures (8)

  • Figure 1: Translationese Bias Severity of GPT-4o across languages. Languages are sorted by resource availability from low (top) to high (bottom). The trend line illustrates the inverse relationship between resource availability and translationese bias.
  • Figure 2: Correlation analysis of judge preference with confounding factors. (a) Machine win rate decreases monotonically as CAD increases, indicating that judge preference spuriously tracks latent manifold isomorphism with English. (b) SSR distributions exhibit a clear drift between human-win and machine-win cases, showing that the judge systematically favors higher-likelihood outputs. (c) ROC curves confirm that both CAD and SSR reliably predict judge outcomes, reinforcing the attribution that translationese bias is mediated by latent manifold isomorphism with English and high predictive confidence.
  • Figure 3: Overview of DIBJudge, which is grounded in Equation \ref{eq:dib_objective}. The framework (1) employs a robust encoder $g_{\phi_r}$ and a bias encoder $g_{\phi_b}$ to separate the input $X$ into robust representations $\mathbf{Z}_r$ and bias representations $\mathbf{Z}_b$; (2) introduces a variational bottleneck to minimize the mutual information $I(X; Z_r)$; (3) passes the compressed $\mathbf{Z}_r$ to the LLM judge, fine-tuned with LoRA \cite{hu2022lora}, to generate the final output $Y$ by maximizing $I(Y; Z_r)$; (4) enforces feature independence by minimizing the dependence $I(Z_r; Z_b)$ between the robust and bias branches; and (5) explicitly captures spurious attributes $S$ within $\mathbf{Z}_b$ by maximizing $I(S; Z_b)$ through two proxy tasks: (i) cross-lingual alignment contrastive learning and (ii) predictive confidence estimation via log-probability bin classification.
  • Figure 4: Bias severity across resource tiers. $\mathcal{S}_{\text{bias}}$ (lower is better) on Belebele, Aya, and XL-Sum. DIBJudge reduces bias across all tiers, with average reductions of $80\%$, $56\%$, and $75\%$, and the strongest improvements in Low-Resource settings. Error bars show std over 3 runs; Avg $\Delta$ is relative to Vanilla SFT.
  • Figure 5: Bias--utility Pareto Frontier. Trade-off between Bias Severity ($\downarrow$; x-axis) and m-RewardBench accuracy ($\uparrow$; y-axis). Each point corresponds to a bottleneck strength $\beta$ (log-scaled, color-coded). The resulting Pareto frontiers are traced by DIBJudge (solid) and the Vanilla IB baseline (dashed). DIBJudge consistently achieves higher accuracy at comparable bias levels across $\beta$, yielding a uniformly superior bias--utility trade-off. Markers indicate representative SOTA models, which DIBJudge outperforms in terms of lower bias and higher accuracy.
  • ...and 3 more figures
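
Two of the regularizers named in the Figure 3 caption and the abstract, the variational bottleneck on $Z_r$ (step 2) and the cross-covariance penalty between $Z_r$ and $Z_b$ (step 4), can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a diagonal-Gaussian posterior with a standard-normal prior for the bottleneck, and a squared Frobenius norm of the batch cross-covariance matrix for the penalty; the excerpt does not specify either form. The function names are hypothetical, and plain Python lists stand in for tensors.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def gaussian_kl_to_standard_normal(mu: Vector, log_var: Vector) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats.

    Assumed closed form for the bottleneck term; averaging it over inputs
    gives the upper bound on I(X; Z_r) stated in Proposition 3.1.
    """
    if len(mu) != len(log_var):
        raise ValueError("mu and log_var must have the same length")
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def cross_covariance_penalty(
    z_r: Sequence[Vector], z_b: Sequence[Vector]
) -> float:
    """Squared Frobenius norm of the batch cross-covariance of z_r and z_b.

    z_r has shape (n, d_r) and z_b has shape (n, d_b), one row per example.
    This specific form is an assumption made for illustration.
    """
    n = len(z_r)
    if n != len(z_b) or n < 2:
        raise ValueError("need two batches of equal size with n >= 2")
    d_r, d_b = len(z_r[0]), len(z_b[0])
    # Per-dimension batch means, used to center both representations.
    mean_r = [sum(row[i] for row in z_r) / n for i in range(d_r)]
    mean_b = [sum(row[j] for row in z_b) / n for j in range(d_b)]
    penalty = 0.0
    for i in range(d_r):
        for j in range(d_b):
            # Unbiased sample covariance between dimension i and dimension j.
            cov_ij = sum(
                (z_r[k][i] - mean_r[i]) * (z_b[k][j] - mean_b[j])
                for k in range(n)
            ) / (n - 1)
            penalty += cov_ij ** 2
    return penalty
```

A zero penalty only rules out linear dependence between the two branches, which is weaker than the $I(Z_r; Z_b) = 0$ target in the caption; the penalty is best read as a tractable surrogate for that term.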

Theorems & Definitions (6)

  • Proposition 3.1
  • Proposition 3.2
  • Proposition 3.3
  • proof
  • proof
  • proof