The Telephone Game: Evaluating Semantic Drift in Unified Models

Sabbir Mollah, Rohit Gupta, Sirnam Swetha, Qingyang Liu, Ahnaf Munir, Mubarak Shah

TL;DR

Unified Visual-Language Models often drift in meaning when alternating between T2I and I2T, a failure not captured by traditional single-pass benchmarks. The authors introduce the Semantic Drift Protocol (SDP), which uses Text-First and Image-First generation chains to measure semantic retention via Mean Cumulative Drift (MCD) and Multi-Generation GenEval (MGG) on the Nocaps+Docci400 generalization set across seven recent models from three architectural families. Results show substantial cross-modal drift with clear between-model variation (e.g., BAGEL vs. VILA-U/Janus), and strong alignment between automated metrics and human judgments. The work argues that cyclic evaluation is essential for reliably assessing cross-consistency in unified models and provides code for replicability and further study.

Abstract

Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair T2I and I2T. Existing evaluation benchmarks consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME and MMBench for I2T. These isolated single-pass metrics do not reveal cross-consistency: whether a model that "understands" a concept can also "render" it, or whether semantic meaning is preserved when cycling between image and text modalities. To address this, we introduce the Semantic Drift Protocol (SDP) for Unified Models, a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. We propose two metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic drift; and (ii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond the COCO dataset, which is widely used in training, we create a new benchmark, Nocaps+Docci400, sampled from NoCaps and DOCCI, and evaluate seven recent models on it. SDP reveals substantial variation in cross-modal stability: some models like BAGEL maintain semantic meaning over many alternations, whereas others like VILA-U drift quickly despite strong single-pass scores. Our results highlight SDP as a necessary complement to standard I2T and T2I evaluations. Code is available at https://github.com/mollahsabbir/Semantic-Drift-in-Unified-Models
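
To make the cyclic idea concrete, the following is a minimal sketch of a Text-First chain together with a simplified drift score. It is not the authors' implementation: the abstract does not give the exact MCD formula, so the score here is only the mean cosine similarity of each generation to the initial input, and it assumes a single embedding function that handles both text and images. The names `text_first_chain`, `mean_similarity_to_start`, and the toy `t2i`, `i2t`, and `embed` callables are all hypothetical stand-ins for a real unified model and encoder.

```python
import math
from typing import Any, Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def text_first_chain(prompt: str,
                     t2i: Callable[[str], Any],
                     i2t: Callable[[Any], str],
                     generations: int) -> List[Any]:
    """Alternate T2I and I2T, starting from text.

    Returns [T0, I1, T2, I3, ...] with generations + 1 entries.
    """
    outputs: List[Any] = [prompt]
    current: Any = prompt
    for g in range(1, generations + 1):
        # Odd generations render an image; even generations caption it.
        current = t2i(current) if g % 2 == 1 else i2t(current)
        outputs.append(current)
    return outputs


def mean_similarity_to_start(outputs: List[Any],
                             embed: Callable[[Any], Sequence[float]]) -> float:
    """Simplified stand-in for a drift score (assumption, not the paper's MCD):
    average similarity of every later generation to the initial input."""
    start = embed(outputs[0])
    sims = [cosine(start, embed(o)) for o in outputs[1:]]
    return sum(sims) / len(sims)


# --- Toy stand-ins (hypothetical): a lossy "model" that forgets one word
# --- every time it renders an image, and a bag-of-words "encoder".
VOCAB = ["red", "truck", "road", "tree"]


def toy_t2i(text: str) -> tuple:
    words = text.split()
    kept = words[:-1] if len(words) > 1 else words
    return ("img", tuple(kept))


def toy_i2t(image: tuple) -> str:
    return " ".join(image[1])


def toy_embed(item: Any) -> List[float]:
    words = item.split() if isinstance(item, str) else item[1]
    return [1.0 if w in words else 0.0 for w in VOCAB]


chain = text_first_chain("red truck road tree", toy_t2i, toy_i2t, generations=4)
score = mean_similarity_to_start(chain, toy_embed)
```

In this toy setup a lower score means more content was lost over the chain; a model that preserved every word would score 1.0. An Image-First chain would follow the same pattern with the two callables swapped and an image as the starting input.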

Paper Structure

This paper contains 20 sections, 2 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: (a) Illustrates image generation and image understanding functionalities of a unified model. (b) Telephone Game: We propose a new form of evaluation consisting of alternating T2I and I2T steps. Here, the unified model starts from a textual prompt $T^{(0)}$ about a suitcase and a banana. At every step we observe semantic drift. For example, in the 5th generation, the model fails to generate a convincing suitcase, which also hints at cross-inconsistency. These phenomena are magnified under the multi-generation telephone game evaluation, allowing it to capture more subtle performance differences between models.
  • Figure 2: An example of cross-consistency in the BAGEL unified model. Given an image of a chess board along with a question (top), BAGEL performs I2T, correctly answering "white side wins". By creating another caption for the T2I prompt (bottom), BAGEL should generate a chess board image consistent with the same semantic predicate (white winning side). However, the model generates a generic, mismatched chessboard image. This exposes a unified model inconsistency: BAGEL’s correct visual reasoning (I2T) does not carry over to generation (T2I) for the concept "winning side in chess".
  • Figure 3: On the left, a single model handles both understanding and generation. In the middle, the architecture partially shares weights, with a decoder capable of generating text and visual features; the visual features are passed to a separate image generation model. On the right, the understanding and generation processes are fully decoupled, using separate models for each task.
  • Figure 4: Semantic Drift Protocol (SDP). We alternate between text-to-image (T2I) and image-to-text (I2T) generations in two setups: Text-First-Chain (a) and Image-First-Chain (b). Blue arrows denote I2T; purple arrows denote T2I; dashed black arrows indicate similarities computed back to the initial input in both same‑ and cross‑modality directions used for MCD. Across generations, concepts drift despite plausible single steps: a "red F-450 truck" evolves into a semi‑truck with changing attachments and positions; in the image‑first chain, group size inflates and new objects (e.g., a sports ball) appear. The proposed cyclic evaluation reveals cross‑modal concept drift that single‑pass metrics overlook, enabling direct comparison of unified models' semantic stability.
  • Figure 5: Information can be lost in different ways during cyclic inference. In the first row, the model ignores the position of the clock, which is a crucial detail. In the second row, the model changes a baseball bat into a spoon. A model can also change the style from realistic to cartoon, as shown in the third row. In the fourth row, the model loses count of four clocks and generates many clocks instead. In the fifth row, a whole city is hallucinated around an empty road. In the sixth row, the model changes a brown bus into a yellow bus.
  • ...and 8 more figures