From Drop-off to Recovery: A Mechanistic Analysis of Segmentation in MLLMs

Boyong Wu, Sanghwan Kim, Zeynep Akata

Abstract

Multimodal Large Language Models (MLLMs) are increasingly applied to pixel-level vision tasks, yet their intrinsic capacity for spatial understanding remains poorly understood. We investigate segmentation capacity through a layerwise linear probing evaluation across the entire MLLM pipeline: vision encoder, adapter, and LLM. We further conduct an intervention-based attention knockout analysis to test whether cross-token attention progressively refines visual representations, and an evaluation of bidirectional attention among image tokens on spatial consistency. Our analysis reveals that the adapter introduces a segmentation representation drop-off, but LLM layers progressively recover through attention-mediated refinement, where correctly classified tokens steer misclassified neighbors toward the correct label. At early image token positions, this recovery is bounded by causal attention, which bidirectional attention among image tokens alleviates. These findings provide a mechanistic account of how MLLMs process visual information for segmentation, informing the design of future segmentation-capable models.
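
The two attention interventions described in the abstract can be illustrated with boolean masks. The following is a minimal sketch, not the authors' implementation: the function names, the toy token layout, and the choice to keep self-attention during knockout (so that no query row is left without a key) are all assumptions made for illustration.

```python
import numpy as np

def build_attention_mask(num_tokens, image_start, image_end, bidirectional_image=False):
    """Boolean mask where mask[q, k] is True if query q may attend to key k.

    Starts from a standard causal (lower-triangular) mask. If requested,
    the image-to-image block [image_start, image_end) is opened up so that
    image tokens attend to each other in both directions; all other token
    pairs stay causal.
    """
    mask = np.tril(np.ones((num_tokens, num_tokens), dtype=bool))
    if bidirectional_image:
        mask[image_start:image_end, image_start:image_end] = True
    return mask

def knock_out_keys(mask, blocked_keys):
    """Return a copy of mask in which no query can attend to blocked_keys.

    Assumption: each token keeps attention to itself so every softmax row
    still has at least one allowed key.
    """
    out = mask.copy()
    out[:, list(blocked_keys)] = False
    diag = np.arange(mask.shape[0])
    out[diag, diag] = True
    return out

# Hypothetical toy sequence: 1 prefix token, 4 image tokens (positions 1..4),
# 2 text tokens (positions 5..6).
causal = build_attention_mask(7, 1, 5)
mixed = build_attention_mask(7, 1, 5, bidirectional_image=True)
knocked = knock_out_keys(causal, blocked_keys=[2, 3])

# Number of image tokens visible to the first image token under each mask.
print(causal[1, 1:5].sum(), mixed[1, 1:5].sum())
```

Under the causal mask the first image token sees only itself among the image tokens, whereas the mixed mask exposes all four, which is the sense in which early positions lack context under causal attention.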

Paper Structure (31 sections, 2 equations, 13 figures, 3 tables)

Figures (13)

  • Figure 1: Overview of main findings. (a) Layerwise linear probing on ADE20K: the adapter introduces a representation drop-off, but LLM layers progressively recover segmentation quality. (b) Attention knockout on conflicting class pairs: knocking out attention from correctly classified tokens degrades segmentation, confirming that cross-token self-refinement is driven by semantic anchors. (c) Per-token pixel accuracy at an intermediate LLM layer: causal attention starves early position tokens of semantic anchors, while bidirectional attention among image tokens alleviates this bottleneck.
  • Figure 2: Overview of the three analysis methods. (a) Layerwise linear probing: given an input image, the vision encoder produces patch token embeddings (brown), which are projected by the adapter into the LLM's embedding space. Inside the LLM, image tokens are processed jointly with text prompt tokens (yellow). At each layer $\ell$, we extract only the image token representations and train an independent linear probe to predict per-patch semantic classes, reassembled into a 2D segmentation map. (b) Attention knockout: we selectively block attention to incorrectly classified tokens (left) or correctly classified tokens (right) across all LLM layers, testing whether cross-token attention drives self-refinement. (c) Bidirectional attention mask: image tokens attend to each other bidirectionally while all other token pairs retain causal masking, alleviating context starvation at early image positions.
  • Figure 3: Layerwise linear probing results across the MLLM stack. mIoU on ADE20K for CLIP, DINOv2, and SigLIP encoders paired with Vicuna-7B, measured at the vision encoder output, adapter output, and each LLM layer. All three encoders exhibit a drop at the adapter followed by progressive recovery across LLM layers. Dashed lines mark the best-performing layer; values on the right indicate the total mIoU improvement across LLM layers.
  • Figure 4: Qualitative segmentation predictions across the MLLM stack. From left to right: input image, ground truth, and linear probe predictions at the vision encoder output, the adapter output, and an intermediate LLM layer. The representation drop-off is visible at the adapter output, but deeper layers appear to resolve class confusions (e.g., wall vs. bed) and produce more spatially coherent predictions.
  • Figure 5: UMAP projections of patch-level hidden states across the MLLM stack. Each point represents one of 576 image patch tokens, colored by semantic class. At the adapter output, classes are interleaved; by layer 20, same-class patches form distinct clusters, illustrating the progressive emergence of semantic structure through the LLM layers.
  • ...and 8 more figures
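
The per-layer probing and mIoU evaluation referred to in the captions above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's training setup: the probe is fit in closed form by least squares on one-hot targets (the captions do not specify how the probes are trained), the features are synthetic stand-ins for hidden states, and the feature dimension, class count, and function names are hypothetical. Only the 576-patch count comes from the Figure 5 caption; the 24 x 24 grid is its square factorization.

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes):
    """Fit a linear map (with bias) from patch features to one-hot classes.

    Assumption: least squares stands in for whatever optimizer is used to
    train the real probes.
    """
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    Y = np.eye(num_classes)[labels]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def predict(W, features):
    """Predict one class index per patch token."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return (X @ W).argmax(axis=1)

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        union = np.logical_or(pred == c, target == c).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        inter = np.logical_and(pred == c, target == c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

# Synthetic stand-in for one layer's image-token hidden states.
rng = np.random.default_rng(0)
num_patches, dim, num_classes = 576, 16, 3
labels = rng.integers(0, num_classes, size=num_patches)
centers = rng.normal(size=(num_classes, dim)) * 5.0
features = centers[labels] + 0.05 * rng.normal(size=(num_patches, dim))

W = fit_linear_probe(features, labels, num_classes)
pred = predict(W, features)
seg_map = pred.reshape(24, 24)  # reassemble patch predictions into a 2D map
score = mean_iou(pred, labels, num_classes)
```

Repeating the fit independently on the features taken from each stage (vision encoder output, adapter output, every LLM layer) would give one score per stage, which is the kind of curve the Figure 3 caption describes. In practice the probe would be evaluated on held-out images rather than on its own training patches as in this toy example.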