Asymmetric Idiosyncrasies in Multimodal Models

Muzi Tao, Chufan Shi, Huijuan Wang, Shengbang Tong, Xuezhe Ma

TL;DR

A classification-based framework provides a novel methodology for quantifying both the stylistic idiosyncrasies of caption models and the prompt-following ability of text-to-image systems.

Abstract

In this work, we study idiosyncrasies in caption models and their downstream impact on text-to-image models. We design a systematic analysis: given either a generated caption or the image synthesized from it, we train neural networks to predict the originating caption model. Our results show that text classification yields very high accuracy (99.70%), indicating that captioning models embed distinctive stylistic signatures. In contrast, these signatures largely disappear in the generated images, with classification accuracy dropping to at most 50% even for the state-of-the-art Flux model. To better understand this cross-modal discrepancy, we further analyze the data and find that the generated images fail to preserve key variations present in captions, such as differences in the level of detail, emphasis on color and texture, and the distribution of objects within a scene. Overall, our classification-based framework provides a novel methodology for quantifying both the stylistic idiosyncrasies of caption models and the prompt-following ability of text-to-image systems.
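
The source-attribution setup described in the abstract can be illustrated with a small text-only sketch. The paper trains neural networks; the sketch below swaps in a bag-of-words multinomial naive Bayes classifier purely to keep the example dependency-free, so it is not the authors' method. The source names (`model_a`, `model_b`, `model_c`) and all captions are hypothetical placeholders, not data from the paper.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on whitespace; punctuation stays attached to tokens.
    return text.lower().split()


class NaiveBayesSourceClassifier:
    """Multinomial naive Bayes stand-in for the paper's neural text classifier."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing strength
        self.word_counts = defaultdict(Counter)   # source -> token counts
        self.doc_counts = Counter()               # source -> number of captions
        self.vocab = set()

    def fit(self, captions, sources):
        for text, source in zip(captions, sources):
            tokens = tokenize(text)
            self.word_counts[source].update(tokens)
            self.doc_counts[source] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        tokens = tokenize(text)
        best_source, best_score = None, -math.inf
        for source in sorted(self.doc_counts):
            counts = self.word_counts[source]
            total_words = sum(counts.values())
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[source] / total_docs)
            for token in tokens:
                score += math.log(
                    (counts[token] + self.alpha)
                    / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_source, best_score = source, score
        return best_source


def accuracy(classifier, captions, sources):
    # Fraction of captions attributed to the correct source.
    hits = sum(classifier.predict(c) == s for c, s in zip(captions, sources))
    return hits / len(captions)


# Hypothetical captions: each placeholder source has its own phrasing habit.
TRAIN_CAPTIONS = [
    "The image shows a dog on grass.",
    "The image shows a red car on a road.",
    "A close-up, eye-level shot of a dog sitting on grass.",
    "A high-angle shot of a red car parked on a road.",
    "This is a photograph featuring a dog in a grassy field.",
    "This is a photograph featuring a red car on a street.",
]
TRAIN_SOURCES = ["model_a", "model_a", "model_b", "model_b", "model_c", "model_c"]

HELD_OUT_CAPTIONS = [
    "The image shows a cat on a sofa.",
    "A low-angle shot of a cat.",
    "This is a photograph featuring a cat.",
]
HELD_OUT_SOURCES = ["model_a", "model_b", "model_c"]

if __name__ == "__main__":
    clf = NaiveBayesSourceClassifier().fit(TRAIN_CAPTIONS, TRAIN_SOURCES)
    for caption in HELD_OUT_CAPTIONS:
        print(caption, "->", clf.predict(caption))
    print("held-out accuracy:", accuracy(clf, HELD_OUT_CAPTIONS, HELD_OUT_SOURCES))
```

The held-out captions describe content unseen in training, so any correct attribution here comes from phrasing habits rather than subject matter, which is the kind of stylistic signature the abstract refers to. The image-side experiment would follow the same pattern with an image classifier applied to the synthesized images.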

Paper Structure

This paper contains 50 sections, 1 equation, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Overview of our pipeline. Multiple MLLMs generate captions for the same image, and a text classifier reliably attributes each caption to its source model. These captions are then used to synthesize new images, on which an image classifier performs the same source-identification task. While text-based attribution is highly accurate, image-based attribution fails, which reveals a clear mismatch between caption-space and image-space model signatures.
  • Figure 2: Word clouds of model-generated captions
  • Figure 3: Classification performance on generated images. The test accuracy of image classification on generated images is 49.85% with the SOTA model FLUX.1-schnell, while classification on a natural image dataset of the same scale [liu2025datasetbias] using the same network achieves 76.7%. Random guessing yields 33.3%.
  • Figure 4: Comparison of captions generated by Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o on the same images. Each row shows the original image on the left and, on the right, the captions generated by the three models and the images synthesized from those captions. The model outputs reveal several systematic failures across different attribute types. (i) In the first row, a simple descriptive color term, such as blue, without any accompanying texture specification, leads all models to produce images with broadly similar color–texture effects. However, none of the models reproduces the true, darker color in the original image. (ii) In the second row, even though captions include explicit view descriptions, the generated viewpoints remain inconsistent: a caption describing a high-angle view yields an eye-level rendering, while an eye-level description produces a low-angle output. (iii) In the third row, different color terms used to describe the acorn result in images that are still visually similar across models, and none of them matches the actual color or appearance of the original image.
  • Figure 5: Detail-level rankings of both captions and generated images. Left: distribution of most-, moderately-, and least-detailed captions across the three models. Right: corresponding rankings assigned to the generated images.
  • ...and 2 more figures
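
The accuracies quoted in the Figure 3 caption are easier to compare once the chance level is subtracted. The short sketch below does that arithmetic, assuming a balanced three-way task (consistent with the stated 33.3% random-guess rate); the helper names are illustrative and do not come from the paper.

```python
def chance_accuracy(num_classes):
    # Expected accuracy of uniform random guessing on balanced classes.
    return 1.0 / num_classes


def margin_over_chance(accuracy, num_classes):
    # How far a classifier sits above random guessing, as a fraction.
    return accuracy - chance_accuracy(num_classes)


NUM_SOURCES = 3  # assumption: three caption models, matching the 33.3% chance rate

# Accuracies as quoted in the Figure 3 caption.
REPORTED = {
    "generated images (FLUX.1-schnell)": 0.4985,
    "natural image dataset": 0.767,
}

if __name__ == "__main__":
    for name, acc in REPORTED.items():
        points = round(margin_over_chance(acc, NUM_SOURCES) * 100, 1)
        print(f"{name}: {points} points above chance")
```

Read this way, the classifier on generated images sits roughly 16.5 percentage points above chance, against roughly 43.4 points for the same network on natural images.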