Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models

Adyasha Maharana; Amita Kamath; Christopher Clark; Mohit Bansal; Aniruddha Kembhavi

TL;DR

This work demonstrates that unified vision-language models exhibit substantial cross-task inconsistency across heterogeneous tasks, challenging the expectation of a single semantic backbone. It introduces CocoCon, a cross-task contrast-set benchmark spanning captioning, VQA, localization, and text-to-image generation, to quantify consistency via likelihood-based comparisons and ranking-based metrics. The authors propose a consistency-based training objective using soft rank correlation to align cross-task output spaces, achieving improved cross-task consistency with minimal or no loss to task accuracy. The findings reveal that cross-task consistency can be meaningfully improved through auxiliary training, offering a path toward more trustworthy, reliable multi-task vision-language systems suitable for integration into larger pipelines.

Abstract

As general-purpose vision models become increasingly effective at a wide set of tasks, it is imperative that they be consistent across the tasks they support. Inconsistent AI models are considered brittle and untrustworthy by human users and are more challenging to incorporate into larger systems that depend on their outputs. Measuring consistency between very heterogeneous tasks, which might include outputs in different modalities, is challenging since it is difficult to determine whether the predictions are consistent with one another. As a solution, we introduce a benchmark dataset, CocoCon, in which we create contrast sets by modifying test instances for multiple tasks in small but semantically meaningful ways that change the gold label, and we outline metrics for measuring whether a model is consistent by ranking the original and perturbed instances across tasks. We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks, especially for more heterogeneous tasks. To alleviate this issue, we propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets, that improves the multi-task consistency of large unified models while retaining their original accuracy on downstream tasks.
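
The ranking idea in the abstract can be illustrated with a minimal sketch. It assumes that, for each task, the model exposes a log-likelihood for the original output and for one perturbed (contrast) output, and that a pair of tasks counts as consistent when both tasks prefer the same side. The function names and the numbers are hypothetical and are not taken from the paper; the benchmark's actual metric may differ in its details.

```python
from typing import Sequence, Tuple

# (log-likelihood of original output, log-likelihood of perturbed output)
Scores = Tuple[float, float]


def prefers_original(scores: Scores) -> bool:
    """True if the model ranks the original output above the perturbed one."""
    ll_original, ll_perturbed = scores
    return ll_original > ll_perturbed


def is_consistent(task_a: Scores, task_b: Scores) -> bool:
    """Assumed criterion: both tasks prefer the same side of the contrast."""
    return prefers_original(task_a) == prefers_original(task_b)


def percent_consistency(pairs: Sequence[Tuple[Scores, Scores]]) -> float:
    """Share of task pairs (in %) that satisfy the assumed criterion."""
    if not pairs:
        raise ValueError("need at least one pair of task scores")
    hits = sum(is_consistent(a, b) for a, b in pairs)
    return 100.0 * hits / len(pairs)


# Made-up log-likelihoods for one contrast set.
captioning = (-1.2, -3.4)    # prefers the original caption
vqa = (-0.5, -0.9)           # prefers the original answer
localization = (-2.0, -1.1)  # prefers the perturbed box

pairs = [(captioning, vqa), (captioning, localization)]
score = percent_consistency(pairs)
```

Here captioning agrees with VQA but not with localization, so one of the two pairs is consistent. Comparing rankings rather than raw likelihoods avoids having to put scores from different output modalities on a common scale.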

Paper Structure (27 sections, 4 equations, 14 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Contrast Sets for Cross-Task Consistency
The CocoCon Benchmark
Dataset Construction
Dataset Categories & Statistics
Evaluation
Consistency-based Training
Experimental Setup
Results
Evaluation of Pretrained Models
Calibration of Model Likelihoods for Text-to-image Generation
Common Failure Modes
Consistency-based Training
Conclusion
...and 12 more sections

Figures (14)

Figure 1: Examples of consistent and inconsistent predictions from Unified-IO$_{XL}$ (Lu et al., 2022).
Figure 2: Illustration of our method for probing inconsistencies across tasks. We build candidate answers for multiple tasks that correspond to different semantic understandings of an image (e.g., keyboard vs. laptop), and check if the model's preferred answers across tasks match the same semantic understanding.
Figure 3: Step-by-step demonstration of the automated pipeline for generating contrast sets. Contrast sets generated from this pipeline are manually filtered to prepare the CocoCon benchmark.
Figure 4: Examples of contrast sets used in CocoCon. For each example, we show the relevant image (left), the ground truth caption, VQA question, or image generation prompt for the image with the perturbed concept in green (middle), the set of perturbations used to generate alternative answers and predictions from Unified-IO$_{XL}$ for VQA (V), image generation (G) and localization (L) (right columns). ✓ and $\times$ indicate scenarios where the model predictions for captioning and the corresponding task for that particular contrast set are consistent and inconsistent respectively. '-' denotes a lack of localization annotations for the sample.
Figure 5: Results from evaluation on the CocoCon benchmark. (a) % consistency of Unified-IO$_{XL}$, OFA$_{HUGE}$, Kosmos-2 and GILL models for varying difficulty ($k$) and all tasks in CocoCon, (b) comparison of % accuracy with % consistency ($k$=1) values for all models evaluated in this paper and our OFA$_{Con}$ model (see the Consistency-based Training section), and (c) % consistency ($k$=1) values for different sizes of Unified-IO models.
...and 9 more figures
