Generative Score Inference for Multimodal Data

Xinyu Tian, Xiaotong Shen

Abstract

Accurate uncertainty quantification is crucial for making reliable decisions in various supervised learning scenarios, particularly when dealing with complex, multimodal data such as images and text. Current approaches often face notable limitations, including rigid assumptions and limited generalizability, constraining their effectiveness across diverse supervised learning tasks. To overcome these limitations, we introduce Generative Score Inference (GSI), a flexible inference framework capable of constructing statistically valid and informative prediction and confidence sets across a wide range of multimodal learning problems. GSI utilizes synthetic samples generated by deep generative models to approximate conditional score distributions, facilitating precise uncertainty quantification without imposing restrictive assumptions about the data or tasks. We empirically validate GSI's capabilities through two representative scenarios: hallucination detection in large language models and uncertainty estimation in image captioning. Our method achieves state-of-the-art performance in hallucination detection and robust predictive uncertainty in image captioning, and its performance is positively influenced by the quality of the underlying generative model. These findings underscore the potential of GSI as a versatile inference framework, significantly enhancing uncertainty quantification and trustworthiness in multimodal learning.

Paper Structure

This paper contains 18 sections, 2 theorems, 25 equations, 4 figures, and 5 tables.

Key Result

Theorem 3.2

Under Assumption A_generator, for any Monte Carlo tolerance error $\varepsilon > 0$ and generation tolerance error $\tau > 0$, the conditional coverage error of the prediction set given $\bm x_{\text{new}}$ is bounded with probability at least $1 - 2\exp(-2m\varepsilon^2) - \beta(\tau, n_s)$. Consequently, the nominal coverage holds with probability tending to one.
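The Hoeffding-type term $2\exp(-2m\varepsilon^2)$ suggests that the prediction set is built from an empirical quantile of $m$ Monte Carlo scores drawn from the generative model. A minimal sketch of that construction follows; `generate` and `score` are hypothetical stand-ins for the paper's generative model and score function, not its actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def gsi_prediction_set(generate, score, x_new, m=1000, alpha=0.1):
    """Sketch: approximate the conditional score distribution at x_new with
    m generator samples, then threshold at the (1 - alpha) empirical quantile.
    The Monte Carlo quantile error decays like exp(-2 m eps^2) in m (Hoeffding)."""
    samples = [generate(x_new) for _ in range(m)]
    scores = np.array([score(x_new, y) for y in samples])
    threshold = np.quantile(scores, 1 - alpha)
    # Membership test: a candidate y belongs to the prediction set
    # iff its score does not exceed the empirical threshold.
    return lambda y: score(x_new, y) <= threshold
```

As a toy check, with a Gaussian "generator" around `x_new` and an absolute-error score, the resulting set covers fresh draws at roughly the nominal $1-\alpha$ rate.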

Figures (4)

  • Figure 1: Overview of (a) the QA system and (b) the image captioning model.
  • Figure 2: Pipeline of GSI method.
  • Figure 3: Type I error and power comparison for SE (Farquhar et al., 2024), CA (Gui et al., 2024), and GSI across varying $\alpha \in (0,1)$. Smaller $\alpha$ indicates higher confidence $(1 - \alpha)$. The diagonal line $y = x$ represents ideal Type I error control.
  • Figure 4: FDR and power for CA (Gui et al., 2024) and GSI across target FDR levels; the line $y = x$ represents perfect FDR control.
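FDR control across target levels, as in Figure 4, is conventionally achieved with a Benjamini-Hochberg-style step-up procedure over per-example p-values; whether GSI uses exactly this rule is an assumption here. A minimal sketch of the standard procedure:

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.1):
    """Benjamini-Hochberg step-up: reject the k smallest p-values, where k is
    the largest index with sorted p_(k) <= k * q / n, controlling FDR at q."""
    p = np.asarray(p_values)
    n = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, n + 1) / n
    passed = p[order] <= thresholds
    # Step-up: take the largest k whose sorted p-value clears its threshold.
    k = int(np.max(np.nonzero(passed)[0])) + 1 if passed.any() else 0
    rejected = np.zeros(n, dtype=bool)
    rejected[order[:k]] = True
    return rejected
```

For example, with p-values `[0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]` and target level `q=0.05`, only the first two hypotheses are rejected.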

Theorems & Definitions (4)

  • Theorem 3.2: GSI's conditional coverage
  • Proof of Theorem 3.2
  • Theorem B.2: Generation error of diffusion models
  • Proof of Theorem B.2