DeiSAM: Segment Anything with Deictic Prompting

Hikaru Shindo, Manuel Brack, Gopika Sudhakaran, Devendra Singh Dhami, Patrick Schramowski, Kristian Kersting

TL;DR

DeiSAM addresses the challenge of deictic, context-dependent image segmentation by integrating large language models (LLMs) for logic-rule generation with differentiable forward reasoning over scene graphs, followed by grounding via segmentation. The approach introduces a modular neuro-symbolic pipeline that unifies terminology and uses a differentiable reasoning function $f_{\text{reason}}$ to identify target objects, enabling end-to-end training. To evaluate reasoning-heavy prompts, the authors present the Deictic Visual Genome (DeiVG) benchmark and a deictic extension of RefCOCO/RefCOCO+, demonstrating substantial gains over purely neural baselines. The work highlights the potential of combining LLM-driven symbolic reasoning with neural vision components to handle abstract, relational prompts in segmentation tasks, and provides avenues for training and improving scene-graph generators in an end-to-end fashion.

Abstract

Large-scale, pre-trained neural networks have demonstrated strong capabilities in various tasks, including zero-shot image segmentation. To identify concrete objects in complex scenes, humans instinctively rely on deictic descriptions in natural language, i.e., referring to something depending on the context, such as "The object that is on the desk and behind the cup." However, deep learning approaches cannot reliably interpret such deictic representations due to their lack of reasoning capabilities in complex scenarios. To remedy this issue, we propose DeiSAM -- a combination of large pre-trained neural networks with differentiable logic reasoners -- for deictic promptable segmentation. Given a complex, textual segmentation description, DeiSAM leverages Large Language Models (LLMs) to generate first-order logic rules and performs differentiable forward reasoning on generated scene graphs. Subsequently, DeiSAM segments objects by matching them to the logically inferred image regions. As part of our evaluation, we propose the Deictic Visual Genome (DeiVG) dataset, containing paired visual input and complex, deictic textual prompts. Our empirical results demonstrate that DeiSAM is a substantial improvement over purely data-driven baselines for deictic promptable segmentation.
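
To make the reasoning step concrete, the following is a minimal sketch of soft forward reasoning for the example prompt "The object that is on the desk and behind the cup." It is an illustration under stated assumptions, not the paper's implementation: the scene graph, its confidence values, the helper names (`fact`, `target_score`), and the use of a product as a smooth conjunction are all hypothetical choices made here.

```python
# Hypothetical scene graph: (subject, relation, object) -> confidence in [0, 1].
# The triples and confidences are invented for illustration only.
scene_graph = {
    ("book", "on", "desk"): 0.9,
    ("book", "behind", "cup"): 0.8,
    ("pen", "on", "desk"): 0.95,
    ("pen", "behind", "cup"): 0.1,
    ("cup", "on", "desk"): 0.9,
}


def fact(subj, rel, obj):
    """Confidence of a scene-graph triple; absent triples count as 0."""
    return scene_graph.get((subj, rel, obj), 0.0)


def target_score(x):
    """Soft truth value of the rule: target(X) :- on(X, desk), behind(X, cup).

    Conjunction is modelled as a product, one common smooth choice.
    This is an assumption of the sketch, not necessarily the paper's operator.
    """
    return fact(x, "on", "desk") * fact(x, "behind", "cup")


# Score every object that appears as a subject, then pick the best candidate.
objects = sorted({subj for (subj, _, _) in scene_graph})
scores = {x: target_score(x) for x in objects}
best = max(scores, key=scores.get)

print(best, scores)
```

Because each score is a product of confidences, it varies smoothly with the scene-graph confidences, which is the property that lets gradients flow back to the component producing them.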

Paper Structure (27 sections, 6 equations, 12 figures, 11 tables)

This paper contains 27 sections, 6 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: DeiSAM segments objects with deictic prompting. Shown are segmentation masks with an input textual prompt. DeiSAM (right) correctly segments the people on the boat holding umbrellas, whereas the neural baselines (left) incorrectly segment the boat instead (Best viewed in color).
  • Figure 2: DeiSAM architecture. An image paired with a deictic prompt is given as input. We parse the image into a scene graph (1) and generate logic rules (2) corresponding to the deictic prompt using a large language model. The generated scene graph and rules are fed to the Semantic Unifier module (3), where synonymous terms are unified. For example, $\texttt{barge}$ in the scene graph and $\texttt{boat}$ in the generated rules will be interpreted as the same term. Next, the forward reasoner (4) infers target objects specified by the textual deictic prompt. Lastly, we perform object segmentation (5) on extracted cropped image regions of the target objects. Since the forward reasoner is differentiable (Shindo et al., 2023), gradients can be passed through the entire pipeline (Best viewed in color).
  • Figure 3: An example from Deictic Visual Genome (DeiVG$_2$).
  • Figure 4: DeiSAM handles ambiguous prompts. Results with prompts (top) and scene graphs (bottom).
  • Figure 5: DeiSAM segments objects with deictic prompts. Segmentation results on the DeiVG dataset using DeiSAM and baselines are shown with deictic prompts. DeiSAM correctly identifies and segments objects given deictic prompts (left-most column), while the baselines often segment a wrong object. More results are available in the appendix (Best viewed in color).
  • ...and 7 more figures
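
The Semantic Unifier step described in the Figure 2 caption, where `barge` in the scene graph and `boat` in the generated rules are treated as the same term, can be sketched as a nearest-neighbour lookup over term embeddings. This is a hedged illustration only: the three-dimensional vectors, the `unify` helper, and the use of cosine similarity are assumptions made here, and the paper's module may work differently.

```python
import math

# Toy 3-d "embeddings"; the values are hypothetical and chosen for illustration.
embeddings = {
    "boat": (0.9, 0.1, 0.0),
    "barge": (0.8, 0.2, 0.1),
    "umbrella": (0.0, 0.9, 0.3),
    "person": (0.1, 0.2, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def unify(rule_term, graph_terms):
    """Return the scene-graph term most similar to a term used in the rules."""
    return max(
        graph_terms,
        key=lambda term: cosine(embeddings[rule_term], embeddings[term]),
    )


# Terms available in the (hypothetical) scene graph.
graph_terms = ["barge", "umbrella", "person"]
unified = unify("boat", graph_terms)

print(unified)
```

After unification, a rule written in terms of `boat` can be evaluated against scene-graph facts that mention `barge`.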

Theorems & Definitions (1)

  • Definition A.1