Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision

Shengcao Cao, Liang-Yan Gui, Yu-Xiong Wang

TL;DR

This work reveals that grounding capabilities can emerge in large multimodal models even without explicit grounding supervision. It introduces attend-and-segment, a training-free method that converts LMM attention into pixel-level grounding by prompting a segmentation model, and DiffLMM, a diffusion-based visual encoder that strengthens grounding while preserving general vision-language performance. Across grounded conversation generation, VQA, and other grounding tasks (RES, PNG), the approach achieves competitive or superior grounding metrics without biased grounding data, addressing generalizability and scalability concerns. The results demonstrate practical impact in creating more capable, generalist AI systems that can ground language to visuals without costly, task-specific supervision.

Abstract

Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an "attend-and-segment" method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DiffLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a 44.2 grounding mask recall on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM. Project page: https://GroundLMM-ICCV.github.io.

Paper Structure

This paper contains 18 sections, 4 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Grounded conversations with GLaMM [rasheed2024glamm] vs. our approach, DiffLMM + attend-and-segment. Left: As a state-of-the-art grounding LMM, GLaMM is trained to relate text phrases with segmentation masks while generating a response. However, due to limitations induced by the grounding supervision, it often fails to precisely follow the human user's instructions (e.g., describing the image in detail, answering the correct color). Middle: Our approach unlocks and enhances the grounding ability implicitly learned by LMMs without explicit grounding supervision, which leads to visually grounded responses while preserving the general vision-language conversation ability of LMMs. More examples are shown in the supplementary material. Right: Previous methods train grounding LMMs for visual grounding tasks at the cost of general visual question answering (VQA) performance. Our approach unlocks the implicit grounding ability in generalist LMMs without further training and preserves their conversation ability.
  • Figure 2: Meta-architecture of LMMs and the attend-and-segment strategy. In a standard LMM, an image encoder $M_V$ extracts visual features from an input image, and the features are transformed into visual tokens by a lightweight projector $M_{V\mapsto L}$. A large language model $M_L$ generates outputs in an auto-regressive manner. When generating a new token (e.g., "cat") which requires grounding, we capture the attention between the new token and the input visual tokens. Then a segmentation model (e.g., SAM [kirillov2023segment]) is prompted by the point with the highest normalized attention value to produce a segmentation mask (e.g., cat in the image).
  • Figure 3: Visual encoding in DiffLMM. We perform one denoising step with a pre-trained diffusion model (DM) [ho2020denoising, rombach2022high], and extract visual features from an intermediate block of the U-Net. The learnable implicit captioner [xu2023open] produces text-like conditioning and improves visual feature extraction in the U-Net. We combine both DM features and CLIP features, and add learnable positional encodings to them. The final visual features are projected into the language feature space via a learnable projector, and fed into the LLM along with other text tokens. The DM and CLIP visual encoder are pre-trained and frozen. This diffusion-based visual encoder does not significantly influence the overall efficiency, as the major computation happens in the LLM.
  • Figure A: Comparison of model responses to challenging visual questions. 1) Unusual image contents: The model is requested to analyze the unusual aspect of a given image. Compared with GLaMM, our approach provides a more detailed and accurate answer with grounding. 2) Adversarial questions: The model is asked about something that does not exist in the image. GLaMM insists on segmenting the bike behind the bench in this example. 3) Rare visual concepts: The image contains objects of less frequent categories. In this example, GLaMM does not recognize the llama but describes it in a general manner, while our approach provides a more accurate description. 4) Shifted image domain: An image from a new domain is given to the model. Interestingly, our approach seems to be making the decision based on the texture and style in the painting. For visual clarity, we only show the beginning parts of our model responses if they are too long. These challenging examples demonstrate the better generalizability of our approach.
  • Figure B: Qualitative results for grounded conversation generation. For visual clarity, we only display the best four non-overlapping segmentation masks per image.
  • ...and 4 more figures
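
The point-selection step described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `attention_to_point_prompt`, the square grid layout of visual tokens, and the min-max normalization are all assumptions made for illustration, and the normalization the paper actually applies is not specified in the text above. The sketch takes the attention weights from one generated token to the visual tokens, finds the most attended grid cell, and maps its center to pixel coordinates that could then serve as a point prompt for a segmentation model such as SAM.

```python
import numpy as np

def attention_to_point_prompt(attn, grid_hw, image_hw):
    """Hypothetical helper: map token-to-visual-token attention to an (x, y) pixel point.

    attn     -- 1D sequence of attention weights, one per visual token (row-major grid)
    grid_hw  -- (rows, cols) of the visual token grid (assumed layout)
    image_hw -- (height, width) of the input image in pixels
    """
    grid_h, grid_w = grid_hw
    img_h, img_w = image_hw
    attn_map = np.asarray(attn, dtype=np.float64).reshape(grid_h, grid_w)

    # Assumption: simple min-max normalization of a single map. This does not
    # change which cell is the maximum; the paper's normalization may differ.
    span = attn_map.max() - attn_map.min()
    if span > 0:
        norm = (attn_map - attn_map.min()) / span
    else:
        norm = np.zeros_like(attn_map)

    # Grid cell with the highest normalized attention value.
    row, col = np.unravel_index(np.argmax(norm), norm.shape)

    # Center of that cell, expressed in image pixel coordinates.
    x = (col + 0.5) * img_w / grid_w
    y = (row + 0.5) * img_h / grid_h
    return float(x), float(y)

# Toy example: a 4x4 token grid where token 6 (row 1, col 2) dominates.
toy_attn = np.zeros(16)
toy_attn[6] = 1.0
point = attention_to_point_prompt(toy_attn, grid_hw=(4, 4), image_hw=(400, 400))
print(point)  # (250.0, 150.0)
```

In a full pipeline, the returned point would be passed as a positive point prompt to the segmentation model, which produces the mask for the grounded phrase; that call is omitted here to avoid assuming a specific API.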