Conformal Prediction and MLLM aided Uncertainty Quantification in Scene Graph Generation

Sayak Nag, Udita Ghosh, Calvin-Khang Ta, Sarosij Bose, Jiachen Li, Amit K Roy Chowdhury

TL;DR

This work introduces PC-SGG, a post-hoc, model-agnostic Conformal Prediction framework for uncertainty quantification in Scene Graph Generation. It builds class-conditional prediction sets for objects and predicates, combines them into triplet sets with formal coverage guarantees, and further refines these sets using an MLLM-based plausibility filter via MCQA prompts and in-context learning. Empirical evaluation on VG150 across five SGG backbones shows that PC-SGG provides calibrated uncertainty estimates, improves tail-class recall, and substantially reduces set sizes with minimal impact on overall coverage. The combination enables generation of diverse, plausible scene graphs with safety guarantees suitable for downstream tasks, including robotics and multimodal reasoning.

Abstract

Scene Graph Generation (SGG) aims to represent visual scenes by identifying objects and their pairwise relationships, providing a structured understanding of image content. However, inherent challenges like long-tailed class distributions and prediction variability necessitate uncertainty quantification in SGG for its practical viability. In this paper, we introduce a novel Conformal Prediction (CP) based framework, adaptive to any existing SGG method, for quantifying their predictive uncertainty by constructing well-calibrated prediction sets over their generated scene graphs. These scene graph prediction sets are designed to achieve statistically rigorous coverage guarantees. Additionally, to ensure these prediction sets contain the most practically interpretable scene graphs, we design an effective MLLM-based post-processing strategy for selecting the most visually and semantically plausible scene graphs within these prediction sets. We show that our proposed approach can produce diverse possible scene graphs from an image, assess the reliability of SGG methods, and improve overall SGG performance.
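
As a rough illustration of how class-conditional prediction sets might be built and combined into triplet sets, here is a minimal sketch using standard split conformal prediction. The nonconformity score (one minus the softmax probability of the true class), the function names, and the toy calibration data are assumptions made for illustration; the score function actually used by PC-SGG is not stated in this summary.

```python
import math
from itertools import product

import numpy as np


def class_conditional_quantiles(probs, labels, alpha):
    """Per-class conformal quantiles from calibration data.

    probs:  (n, num_classes) softmax outputs on the calibration set
    labels: (n,) ground-truth class indices
    Assumed score: 1 - probability assigned to the true class.
    """
    num_classes = probs.shape[1]
    # Default of 1.0 is conservative: the class is always included
    # when it has too few calibration samples for a valid quantile.
    quantiles = np.ones(num_classes)
    scores = 1.0 - probs[np.arange(len(labels)), labels]
    for c in range(num_classes):
        class_scores = np.sort(scores[labels == c])
        n = len(class_scores)
        k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
        if 1 <= k <= n:
            quantiles[c] = class_scores[k - 1]
    return quantiles


def prediction_set(p, quantiles):
    """Classes whose score 1 - p[c] does not exceed that class's quantile."""
    return [c for c in range(len(p)) if 1.0 - p[c] <= quantiles[c]]


def triplet_set(subject_set, predicate_set, object_set):
    """All (subject, predicate, object) combinations of the three sets."""
    return list(product(subject_set, predicate_set, object_set))


# Toy calibration data with two classes (illustrative values only).
cal_probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.7, 0.3],
                      [0.2, 0.8], [0.4, 0.6], [0.1, 0.9]])
cal_labels = np.array([0, 0, 0, 1, 1, 1])
Q = class_conditional_quantiles(cal_probs, cal_labels, alpha=0.5)
candidates = triplet_set(prediction_set(np.array([0.75, 0.25]), Q), [0, 1], [1])
```

In this sketch the same routine would be run once on object probabilities and once on predicate probabilities, giving separate quantile vectors for each. A class with too few calibration samples defaults to always being included, which keeps the set conservative at the cost of a larger size.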

Paper Structure

This paper contains 28 sections, 4 theorems, 11 equations, 9 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Given the ground truth class of the $k^{th}$ triplet is denoted as $y_k^t=[y_{k}^s,y_{k}^r,y_{k}^o] \in \mathbb{R}^3$ where $y_{k}^s,y_{k}^o \in \mathcal{Y}_o$ and $y_{k}^r\in \mathcal{Y}_r$, the triplet coverage guarantee is given as $P(y_k^t \in \hat{\mathcal{C}}_t(X_{n+1}^r)) = P(Y_{n+1}^o \in \h

Figures (9)

  • Figure 1: Distinction between standard and conformal SGG. The upper half shows how a standard SGG method generates a single prediction for a triplet in the image's scene graph. The lower half shows how, by adding conformal prediction blocks on top of an SGG model, we can generate prediction sets for each triplet in the scene graph. These sets quantify the underlying model's uncertainty and improve the chances of covering the actual ground truth.
  • Figure 2: Overview of the PC-SGG pipeline. For each test image, a pre-trained SGG model, $\phi$, is used to obtain object bounding boxes $\hat{\mathbf{b}}_o$ (using $f_{bbox}$), object classification probabilities $\Pi_o$ (using $f_{o}$), and pairwise predicate classification probabilities $\Pi_r$ (using $f_{r}$). Using quantiles ($Q_o, Q_r$) derived from calibration data, we construct class-conditional conformal sets for both objects ($\hat{\mathcal{C}}_o(X)$) and predicates ($\hat{\mathcal{C}}_r(X)$). These conformal sets are then combinatorially combined ($\oplus$) to generate a triplet prediction set, $\hat{\mathcal{C}}_t(X)$. To assess the plausibility of each entry of $\hat{\mathcal{C}}_t(X)$, we leverage an MLLM-based post-processing unit. The entries of $\hat{\mathcal{C}}_t(X)$ are converted into textual descriptions, which, along with the cropped portion of the test image for the triplet set (cropped using the union bounding box of the triplet's object pair), are assembled into an input prompt for the MLLM. The MLLM then predicts the truncated prediction set of the most plausible triplets as a next-token prediction problem.
  • Figure 3: Prompting strategy for plausibility assessment. First, a system prompt outlines the task for the MLLM. Then, an example prompt is created with a randomly sampled image from the calibration set and a hand-crafted text description, framing plausibility assessment as an MCQA problem. During inference, entries from a test image's triplet prediction set are processed in groups of $5$ to build MCQ text prompts similar to the example. The vision part of the prompt is the cropped portion of the test image linked to the detected triplet. The MLLM’s token likelihoods are thresholded by $\tau$ to identify the most plausible choices for the scene. The ground-truth triplet is highlighted in green for both the example and inference scenarios.
  • Figure 4: R@50 and cR@50 for the $5$ tail classes of VG150 with the fewest samples. The results are for the BGNN model. The triplet prediction sets from BGNN+PC-SGG significantly improve the detection of tail-class predicates compared to the standalone model.
  • Figure 5: Comparison of $Cov_T$ and $AvgSize$ as functions of token threshold, $\tau$, for BGNN+PC-SGG.
  • ...and 4 more figures
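
The plausibility filter described in the Figure 3 caption (groups of $5$, MCQA prompts, token likelihoods thresholded by $\tau$) can be sketched roughly as follows. This is a text-only illustration: `score_options` is a hypothetical stand-in for the MLLM call, which in the actual pipeline would also receive the cropped image, and the prompt wording, the `>=` comparison, and `toy_scorer` are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Triplet = Tuple[str, str, str]
LETTERS = "ABCDE"  # one option letter per triplet in a group of 5


def build_mcq(group: List[Triplet]) -> str:
    """Format up to five triplets as a multiple-choice question (hypothetical wording)."""
    lines = ["Which of the following relationships are plausible in the image?"]
    for letter, (subj, pred, obj) in zip(LETTERS, group):
        lines.append(f"{letter}. {subj} {pred} {obj}")
    return "\n".join(lines)


def filter_plausible(
    triplets: List[Triplet],
    score_options: Callable[[str], Dict[str, float]],
    tau: float,
    group_size: int = 5,
) -> List[Triplet]:
    """Keep triplets whose option-letter likelihood is at least tau."""
    if not 1 <= group_size <= len(LETTERS):
        raise ValueError("group_size must be between 1 and 5")
    kept = []
    for start in range(0, len(triplets), group_size):
        group = triplets[start:start + group_size]
        likelihoods = score_options(build_mcq(group))
        for letter, triplet in zip(LETTERS, group):
            if likelihoods.get(letter, 0.0) >= tau:
                kept.append(triplet)
    return kept


def toy_scorer(prompt: str) -> Dict[str, float]:
    """Fake scorer for demonstration only: favors options using the predicate 'on'."""
    scores = {}
    for line in prompt.splitlines()[1:]:
        letter, text = line.split(". ", 1)
        scores[letter] = 0.9 if " on " in f" {text} " else 0.1
    return scores


candidates = [
    ("cup", "on", "table"), ("cup", "eating", "table"),
    ("cup", "wearing", "table"), ("cup", "under", "table"),
    ("cup", "riding", "table"), ("plate", "on", "table"),
]
plausible = filter_plausible(candidates, toy_scorer, tau=0.5)
```

Raising `tau` in this sketch shrinks the retained set, which mirrors the trade-off between $Cov_T$ and $AvgSize$ plotted in Figure 5.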

Theorems & Definitions (6)

  • Theorem 1
  • Corollary 1
  • Theorem 1
  • proof
  • Corollary 1
  • proof