N2C2: Nearest Neighbor Enhanced Confidence Calibration for Cross-Lingual In-Context Learning

Jie He, Simon Yu, Deyi Xiong, Víctor Gutiérrez-Basulto, Jeff Z. Pan

TL;DR

This work addresses the reliability gap in cross-lingual in-context learning by introducing N2C2, a $k$-NN augmented calibration framework that uses a source-language datastore to improve both accuracy and confidence estimation in multilingual sentiment classification. N2C2 combines a semantically aware retrieval representation, a confidence-aware distribution, and adaptive merging across multiple retrieval sizes to robustly leverage limited demonstrations. Empirical results on MARC and CLS show consistent accuracy gains and substantial reductions in expected calibration error compared to strong baselines and classical calibration methods, with further gains observed on larger model variants. The approach demonstrates practical impact for deploying cross-lingual ICL in real-world multilingual settings where prediction reliability is crucial, while highlighting avenues for extension to other tasks and decoder-based models.

Abstract

Recent advancements of in-context learning (ICL) show language models can significantly improve their performance when demonstrations are provided. However, little attention has been paid to model calibration and prediction confidence of ICL in cross-lingual scenarios. To bridge this gap, we conduct a thorough analysis of ICL for cross-lingual sentiment classification. Our findings suggest that ICL performs poorly in cross-lingual scenarios, exhibiting low accuracy and presenting high calibration errors. In response, we propose a novel approach, N2C2, which employs a $k$-nearest neighbors augmented classifier for prediction confidence calibration. N2C2 narrows the prediction gap by leveraging a datastore of cached few-shot instances. Specifically, N2C2 integrates the predictions from the datastore and incorporates confidence-aware distribution, semantically consistent retrieval representation, and adaptive neighbor combination modules to effectively utilize the limited number of supporting instances. Evaluation on two multilingual sentiment classification datasets demonstrates that N2C2 outperforms traditional ICL. It surpasses fine tuning, prompt tuning and recent state-of-the-art methods in terms of accuracy and calibration errors.
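
As a rough illustration of the nearest-neighbor idea described in the abstract, the sketch below turns the $k$ closest cached instances into a label distribution. It is a generic $k$-NN classifier under stated assumptions, not the N2C2 implementation: the function name `knn_label_distribution`, the squared Euclidean distance, the `temperature` parameter, and the toy vectors are all hypothetical, and the confidence-aware, retrieval-representation, and adaptive-combination modules are not modeled.

```python
import math
from collections import defaultdict


def knn_label_distribution(query, datastore, k=3, temperature=1.0):
    """Return a label distribution built from the k nearest cached instances.

    query:     list of floats, a hypothetical representation of the test example.
    datastore: list of (vector, label) pairs cached from few-shot instances.
    Assumption: squared Euclidean distance and a softmax over negative
    distances; the paper may use a different metric or weighting.
    """
    # Distance from the query to every cached key.
    scored = [
        (sum((q - v) ** 2 for q, v in zip(query, vec)), label)
        for vec, label in datastore
    ]
    # Keep only the k closest entries.
    neighbors = sorted(scored, key=lambda pair: pair[0])[:k]

    # Closer neighbors receive exponentially more weight.
    weights = [math.exp(-dist / temperature) for dist, _ in neighbors]
    total = sum(weights)

    # Aggregate normalized weights per label.
    distribution = defaultdict(float)
    for (_, label), weight in zip(neighbors, weights):
        distribution[label] += weight / total
    return dict(distribution)


# Toy datastore with two sentiment labels (illustrative values only).
toy_datastore = [
    ([0.0, 0.0], "negative"),
    ([0.0, 1.0], "negative"),
    ([5.0, 5.0], "positive"),
    ([5.0, 6.0], "positive"),
    ([6.0, 5.0], "positive"),
]
toy_result = knn_label_distribution([5.0, 5.0], toy_datastore, k=3)
```

In this toy setup the three nearest neighbors of the query all carry the same label, so the returned distribution places all of its mass on that label. A datastore-based classifier of this kind is usually combined with the language model's own prediction; how N2C2 does so is described in the paper itself.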

Paper Structure

This paper contains 21 sections, 9 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Illustration of cross-lingual nearest neighbor inference with $k$ = 3, which makes a prediction from 5 candidate words.
  • Figure 2: Comparison of the performance of cross-lingual ICL and Contextual Calibration (ICL+CC) across 4 languages, with $\textbf{Accuracy}\uparrow(\%)$ (left) and $\textbf{ECE}\downarrow(\%)$ (right).
  • Figure 3: Diagram of N2C2 with $k$ = 16. N2C2 first reconstructs $\text{h}_{\text{[mask]}}$ (§4.2) for the test example in the target language and selects neighbors for it (§4.1). It then considers confidence to generate multiple distributions (§4.3). These distributions are summed to form the final predicted distribution (adaptive neighbor combination).
  • Figure 4: Top-$k$ Ablation Study
  • Figure 5: Comparison of different model sizes and architectures on MARC. We report the average scores over all languages.
  • ...and 2 more figures
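
Several of the figures above report expected calibration error (ECE). For readers unfamiliar with the metric, the following is a minimal sketch of the commonly used equal-width binned ECE. The function name `expected_calibration_error`, the default of 10 bins, and the sample values are assumptions for illustration; the paper's exact binning scheme is not stated in this summary.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE.

    confidences: predicted confidence of each example, in [0, 1].
    correct:     booleans, True where the prediction was right.
    Assumption: equal-width bins; other binning schemes exist.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each example to a confidence bin; confidence 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    # Weighted average of |accuracy - mean confidence| over non-empty bins.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Four predictions at 0.9 confidence, three of them correct.
sample_ece = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, True, True, False]
)
```

Here the model claims 90% confidence but is right 75% of the time, so the gap in that single bin is 0.15. Lower values indicate that stated confidence tracks actual accuracy more closely.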