$C^3$: Confidence Calibration Model Cascade for Inference-Efficient Cross-Lingual Natural Language Understanding

Taixi Lu, Haoyu Wang, Huajie Shao, Jing Gao, Huaxiu Yao

TL;DR

The paper tackles the challenge of deploying cross-lingual NLU with resource-intensive multilingual PLMs by addressing confidence miscalibration in cascade inference. It introduces the Confidence Calibration Cascade ($C^3$), a lightweight plug-in calibration step for model cascades that applies logit normalization during training and temperature scaling at inference. This stabilizes model confidences across languages and model sizes and makes model selection more reliable. The approach covers both encoder-only LMs and large language models (via prompting and entropy-based calibration for generation), and extensive experiments on five cross-lingual benchmarks show state-of-the-art efficiency-accuracy trade-offs. Calibration analysis demonstrates substantial reductions in expected calibration error and robustness across languages, with only modest sensitivity to the hyper-parameter $\tau$. Overall, $C^3$ significantly improves inference efficiency for cross-lingual NLU with little loss in accuracy, supporting practical deployment in real-world, multilingual settings.
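
The two calibration steps named above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the commonly used form of logit normalization, in which the logit vector is divided by its L2 norm times $\tau$ before the cross-entropy loss, and standard temperature scaling, in which logits are divided by a scalar before the softmax. The function names, the `eps` constant, and the numeric values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_norm_loss(logits, label, tau, eps=1e-7):
    """Cross-entropy on L2-normalized logits (assumed formulation).

    Dividing by the norm removes the logit magnitude, so the loss cannot be
    reduced simply by making the logits larger; tau controls the sharpness.
    """
    norm = math.sqrt(sum(z * z for z in logits)) + eps
    scaled = [z / (norm * tau) for z in logits]
    return -math.log(softmax(scaled)[label])


def temperature_scaled_confidence(logits, temperature):
    """Confidence (max softmax probability) after dividing logits by a temperature."""
    return max(softmax([z / temperature for z in logits]))


# Illustrative values only: scaling the logits by 10 leaves the loss unchanged.
print(logit_norm_loss([2.0, 1.0, 0.0], label=1, tau=0.05))
print(logit_norm_loss([20.0, 10.0, 0.0], label=1, tau=0.05))
# A temperature above 1 softens the confidence.
print(temperature_scaled_confidence([4.0, 1.0, 0.0], 1.0))
print(temperature_scaled_confidence([4.0, 1.0, 0.0], 2.0))
```

The scale invariance of the normalized loss is the property that makes it a plausible remedy for overconfidence: the model cannot inflate its confidence just by growing the logit norm.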

Abstract

Cross-lingual natural language understanding (NLU) is a critical task in natural language processing (NLP). Recent advancements have seen multilingual pre-trained language models (mPLMs) significantly enhance the performance of these tasks. However, mPLMs necessitate substantial resources and incur high computational costs during inference, posing challenges for deployment in real-world and real-time systems. Existing model cascade methods seek to enhance inference efficiency by greedily selecting the lightest model capable of processing the current input from a variety of models, based on model confidence scores. Nonetheless, deep models tend to exhibit overconfidence, and confidence distributions vary across languages. This leads to the emission of confident but incorrect predictions by smaller models, hindering their ability to generalize effectively across test languages. In this study, we introduce a confidence calibration model cascade ($C^3$) method. This approach, simple yet effective, involves calibration prior to cascade inference, thereby enhancing cascade accuracy through more reliable predictions. Extensive experiments conducted on three cross-lingual benchmarks demonstrate that $C^3$ significantly outperforms all state-of-the-art baselines.
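
The greedy cascade described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions rather than the paper's algorithm: models are ordered from lightest to heaviest, each is a plain callable returning logits, confidence is the maximum temperature-scaled softmax probability, and a single shared threshold decides when to stop. The names `cascade_predict`, `small`, and `large`, as well as the toy logits and the threshold value, are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cascade_predict(x, models, temperatures, threshold):
    """Return (label, confidence, model_index) from the first confident model.

    models are ordered lightest to heaviest; the last model always answers.
    """
    if not models:
        raise ValueError("the cascade needs at least one model")
    for i, (model, temperature) in enumerate(zip(models, temperatures)):
        probs = softmax(model(x), temperature)
        confidence = max(probs)
        label = probs.index(confidence)
        if confidence >= threshold or i == len(models) - 1:
            return label, confidence, i


# Toy stand-ins for a small and a large model (hypothetical logits).
def small(x):
    return [6.0, 0.0] if x == "easy" else [0.2, 0.1]


def large(x):
    return [0.0, 5.0]


print(cascade_predict("easy", [small, large], [1.0, 1.0], threshold=0.9))
print(cascade_predict("hard", [small, large], [1.0, 1.0], threshold=0.9))
```

In this sketch the "easy" input is answered by the small model alone, while the "hard" input falls through to the large one. If the small model were overconfident on the hard input, it would stop the cascade early with a wrong answer, which is the failure mode that calibration is meant to reduce.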

Paper Structure (28 sections, 7 equations, 7 figures, 6 tables, 1 algorithm)

Figures (7)

  • Figure 1: An illustration of our $\textbf{C}^{3}$ framework (for classification tasks), which speeds up natural language inference while retaining most of the accuracy, especially on OOD data. We leverage Logit Normalization at training time and Temperature Scaling at inference time to calibrate each model so that it yields more reliable confidence scores for cascade decisions. For Large Language Model (e.g., GPT-4, Llama) inference, where no training is involved, we simply remove the training module. $\lambda$ denotes the confidence score.
  • Figure 2: Example $\textbf{C}^{3}$ one-shot prompt on Llama-2 for the PAWS-X task. The answer candidate set is {Yes, No}, and the embedding set is {y, n}, where y is the Llama embedding of the word "Yes" and n is the Llama embedding of the word "No."
  • Figure 3: ECE comparison between Cascade (in orange) and $\textbf{C}^{3}$ (in blue). In each subfigure, the top two panels show the ECE of the smallest model under the two methods, and the bottom two panels show the overall ECE.
  • Figure 4: $\textbf{C}^{3}$ accuracy on PAWS-X w.r.t. $\tau$ values.
  • Figure 5: Case study of five languages.
  • ...and 2 more figures
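
The expected calibration error (ECE) reported in Figure 3 is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence per bin, weighted by bin size. The sketch below shows that standard equal-width-bin estimate; the bin count and the exact variant used in the paper are not stated in this summary, so `n_bins=10` and the function name are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|.

    confidences: predicted confidence per example, each in [0, 1].
    correct: 1 (or True) if the prediction was right, else 0 (or False).
    Bins are half-open [lo, hi); a confidence of exactly 1.0 goes in the last bin.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        counts[idx] += 1
        conf_sums[idx] += conf
        hits[idx] += int(ok)
    ece = 0.0
    for count, conf_sum, hit in zip(counts, conf_sums, hits):
        if count:
            ece += (count / n) * abs(hit / count - conf_sum / count)
    return ece


# Four predictions at 0.9 confidence with three correct: accuracy 0.75 vs confidence 0.9.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A lower value means that stated confidence tracks actual accuracy more closely, which is what a confidence-thresholded cascade relies on.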