
Gatekeeper: Improving Model Cascades Through Confidence Tuning

Stephan Rabanser, Nathalie Rauschmayr, Achin Kulshrestha, Petra Poklukar, Wittawat Jitkrittum, Sean Augenstein, Congchao Wang, Federico Tombari

TL;DR

Gatekeeper introduces a confidence-calibration loss to improve cascaded inference between a small, cost-efficient model and a larger, more capable model. It optimizes a hybrid objective $\mathcal{L} = \alpha \mathcal{L}_{\text{corr}} + (1-\alpha) \mathcal{L}_{\text{incorr}}$, with $\mathcal{L}_{\text{corr}} = \frac{1}{N} \sum_i \mathds{1}\{ y_i = \hat{y}_i \} \text{CE}(p_i(\mathbf{x}_i), y_i)$ and $\mathcal{L}_{\text{incorr}} = \frac{1}{N} \sum_i \mathds{1}\{ y_i \neq \hat{y}_i \} \text{KL}(p_i(\mathbf{x}_i) \parallel \mathcal{U})$, and then defers uncertain cases to the large model using standard confidence- or entropy-based gates. The approach is architecture-agnostic and applies to encoder-only, decoder-only, and encoder-decoder settings across image classification, language modeling, and vision-language tasks, yielding substantial deferral-performance gains with a tunable trade-off controlled by $\alpha$. The paper situates Gatekeeper relative to related uncertainty-quantification and cascading methods, discusses limitations and ethical considerations, and provides a reproducible evaluation on multiple datasets. Overall, Gatekeeper advances cost-efficient, uncertainty-aware cascade deployment without changing model architectures.
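As a rough illustration of the hybrid objective above, the following pure-Python sketch evaluates $\mathcal{L}$ on a small batch of logits. The function names `log_softmax` and `gatekeeper_loss`, the use of raw logits as input, and the choice of $\hat{y}_i$ as the argmax of the small model's predictive distribution are assumptions made for illustration, not details confirmed by the text; $\mathcal{U}$ is taken to be the uniform distribution over the $K$ classes.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def gatekeeper_loss(batch_logits, labels, alpha):
    """Sketch of L = alpha * L_corr + (1 - alpha) * L_incorr.

    Assumption: y_hat is the argmax of the predicted distribution.
    Correctly predicted samples contribute cross-entropy; incorrectly
    predicted samples contribute KL(p || U), with U uniform over K classes.
    """
    n = len(labels)
    l_corr = 0.0
    l_incorr = 0.0
    for logits, y in zip(batch_logits, labels):
        log_p = log_softmax(logits)
        k = len(log_p)
        y_hat = max(range(k), key=lambda c: log_p[c])
        if y_hat == y:
            # CE(p, y) = -log p_y
            l_corr += -log_p[y]
        else:
            # KL(p || U) = sum_c p_c * (log p_c + log K)
            l_incorr += sum(math.exp(lp) * (lp + math.log(k)) for lp in log_p)
    return alpha * l_corr / n + (1.0 - alpha) * l_incorr / n
```

In this sketch, $\alpha = 1$ leaves only the cross-entropy term on correctly predicted samples, while $\alpha = 0$ leaves only the term that pushes mispredicted samples toward a uniform (maximally uncertain) output. In practice the same computation would be written in an autodiff framework so the loss can be backpropagated during fine-tuning.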

Abstract

Large-scale machine learning models deliver strong performance across a wide range of tasks but come with significant computational and resource constraints. To mitigate these challenges, local smaller models are often deployed alongside larger models, relying on routing and deferral mechanisms to offload complex tasks. However, existing approaches inadequately balance the capabilities of these models, often resulting in unnecessary deferrals or sub-optimal resource usage. In this work we introduce a novel loss function called Gatekeeper for calibrating smaller models in cascade setups. Our approach fine-tunes the smaller model to confidently handle tasks it can perform correctly while deferring complex tasks to the larger model. Moreover, it incorporates a mechanism for managing the trade-off between model performance and deferral accuracy, and is broadly applicable across various tasks and domains without any architectural changes. We evaluate our method on encoder-only, decoder-only, and encoder-decoder architectures. Experiments across image classification, language modeling, and vision-language tasks show that our approach substantially improves deferral performance.
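The deferral mechanism described above can be sketched as a simple confidence gate: the small model answers when its maximum softmax probability clears a threshold, and the input is otherwise handed to the large model. The names `max_softmax_confidence` and `cascade_predict`, the callable model interfaces, and the threshold value in the usage lines are hypothetical and serve only to illustrate the idea.

```python
import math


def max_softmax_confidence(logits):
    """Maximum softmax probability, used here as the confidence score."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)


def cascade_predict(x, small_model, large_model, threshold):
    """Return (prediction, source) for input x.

    Assumptions: small_model(x) returns a list of class logits and
    large_model(x) returns a class index directly.
    """
    logits = small_model(x)
    if max_softmax_confidence(logits) >= threshold:
        prediction = max(range(len(logits)), key=lambda c: logits[c])
        return prediction, "small"
    # Confidence too low: defer to the larger model.
    return large_model(x), "large"


# Toy stand-ins for the two models (purely illustrative).
def toy_small(x):
    return [4.0, 0.0] if x == "easy" else [0.1, 0.0]


def toy_large(x):
    return 1


easy_result = cascade_predict("easy", toy_small, toy_large, threshold=0.9)
hard_result = cascade_predict("hard", toy_small, toy_large, threshold=0.9)
```

Raising the threshold sends more inputs to the large model and lowering it keeps more on the small model; an entropy-based gate would follow the same pattern with the comparison reversed, deferring when predictive entropy exceeds a threshold.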


Paper Structure

This paper contains 1 section, 1 figure.

Table of Contents

  1. Introduction
