F-LMM: Grounding Frozen Large Multimodal Models

Size Wu, Sheng Jin, Wenwei Zhang, Lumin Xu, Wentao Liu, Wei Li, Chen Change Loy

TL;DR

This work tackles the problem of grounding large multimodal models (LMMs) without sacrificing their general conversational abilities. It introduces F-LMM, which keeps off-the-shelf LMMs frozen and uses their word-image attention maps as segmentation priors: a lightweight CNN-based mask head translates the attention maps into mask logits, a SAM-based mask refiner sharpens the resulting masks, and a simple keyword selector determines which words are grounded. On the RefCOCO(+/g) referring expression segmentation benchmarks and the PNG (panoptic narrative grounding) benchmark, F-LMM achieves competitive segmentation performance while preserving the instruction-following and world-knowledge capabilities measured by standard QA benchmarks, and it can be applied directly to complex tasks such as reasoning segmentation and grounded conversation. The approach offers a practical, resource-efficient path to deploying visually grounded yet chat-capable AI systems.

Abstract

Endowing Large Multimodal Models (LMMs) with visual grounding capability can significantly enhance AIs' understanding of the visual world and their interaction with humans. However, existing methods typically fine-tune the parameters of LMMs to learn additional segmentation tokens and overfit grounding and segmentation datasets. Such a design would inevitably cause a catastrophic diminution in the indispensable conversational capability of general AI assistants. In this paper, we comprehensively evaluate state-of-the-art grounding LMMs across a suite of multimodal question-answering benchmarks, observing drastic performance drops that indicate vanishing general knowledge comprehension and weakened instruction following ability. To address this issue, we present F-LMM -- grounding frozen off-the-shelf LMMs in human-AI conversations -- a straightforward yet effective design based on the fact that word-pixel correspondences conducive to visual grounding inherently exist in the attention mechanism of well-trained LMMs. Using only a few trainable CNN layers, we can translate word-pixel attention weights to mask logits, which a SAM-based mask refiner can further optimise. Our F-LMM neither learns special segmentation tokens nor utilises high-quality grounded instruction-tuning data, but achieves competitive performance on referring expression segmentation and panoptic narrative grounding benchmarks while completely preserving LMMs' original conversational ability. Additionally, with instruction-following ability preserved and grounding ability obtained, F-LMM can be directly applied to complex tasks like reasoning segmentation, grounded conversation generation and visual chain-of-thought reasoning. Our code can be found at https://github.com/wusize/F-LMM.
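
The attention-to-mask idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the attention tensor layout, the function names `word_attention_maps` and `mask_logits_1x1`, the token positions, and the random weights are all assumptions made for illustration, and a single 1x1 convolution stands in for the paper's trainable CNN layers. The SAM-based refinement step is omitted.

```python
import numpy as np


def word_attention_maps(attn, word_token_idx, image_token_idx, grid_hw):
    """Collect word-image attention maps for one word.

    attn: array of shape (M, N, T, T) holding attention weights for
          M layers, N heads and T tokens (hypothetical layout).
    Returns an array of shape (M * N, H, W): one map per (layer, head).
    """
    h, w = grid_hw
    # Queries are the word's tokens, keys are the image tokens.
    sub = attn[:, :, word_token_idx]        # (M, N, n_word_tokens, T)
    sub = sub[..., image_token_idx]         # (M, N, n_word_tokens, H * W)
    # Average over the tokens that make up the word.
    sub = sub.mean(axis=2)                  # (M, N, H * W)
    return sub.reshape(-1, h, w)            # (M * N, H, W)


def mask_logits_1x1(maps, weights, bias):
    """Per-pixel linear combination of the attention channels (a 1x1 conv)."""
    return np.tensordot(weights, maps, axes=([0], [0])) + bias  # (H, W)


# Toy example with random "attention" in place of a real frozen LMM.
rng = np.random.default_rng(0)
M, N, H, W = 2, 4, 3, 3
T = H * W + 5                               # image tokens followed by 5 text tokens
attn = rng.random((M, N, T, T))
attn /= attn.sum(axis=-1, keepdims=True)    # each row sums to 1, like a softmax output

image_idx = np.arange(H * W)                # assumption: image tokens come first
word_idx = np.array([H * W + 2, H * W + 3]) # assumption: the word spans two tokens

maps = word_attention_maps(attn, word_idx, image_idx, (H, W))
weights = rng.normal(size=M * N)            # stand-in for trained parameters
logits = mask_logits_1x1(maps, weights, 0.0)
mask = logits > 0                           # binarise the logits into a coarse mask
```

In this sketch the only trainable quantities would be `weights` and the bias; the attention tensor comes from the frozen model unchanged, which is why the conversational behaviour of the LMM is unaffected.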

Paper Structure (20 sections, 3 equations, 13 figures, 12 tables)

Figures (13)

  • Figure 1: An example of user-AI conversation around an image. Left: The current state-of-the-art grounding model GLaMM [hanoona2023GLaMM] is effective for grounded conversation when prompted by "answer with interleaved masks", but fails to follow the user's instruction to answer with a single word (yes or no) and misunderstands the question as a referring segmentation prompt. Right: Our F-LMM preserves instruction-following ability while being able to perform visual grounding.
  • Figure 2: (a) Geometric and spatial cues conducive to visual grounding are observed in the visualisations of word-image attention maps in frozen LMMs. (b) Existing grounding LMMs are fine-tuned to generate a special mask token (e.g., [SEG]) for visual grounding purposes, which ruins the original conversational ability. (c) Our F-LMM translates word-image attention maps from frozen LMMs to grounding masks, while fully preserving the general-purpose chat capability.
  • Figure 3:
  • Figure 4: Visualisations of word-image attention maps. The letters $m$ and $n$ indicate that the attention map is derived from the $n$-th attention head of the $m$-th transformer layer.
  • Figure 5: Ablation study of the mask decoder.
  • ...and 8 more figures