Re-Aligning Language to Visual Objects with an Agentic Workflow

Yuming Chen, Jiangyan Feng, Haodong Zhang, Lijun Gong, Feng Zhu, Rui Zhao, Qibin Hou, Ming-Ming Cheng, Yibing Song

TL;DR

The paper tackles misalignment between language expressions and visual objects in language-based object detection (LOD) caused by VLM hallucinations during automatic expression generation. It introduces Real-LOD, an agentic workflow powered by LLMs that plans, uses tools (VLM/LLM), and reflects to iteratively re-align language to object regions, producing Real-Data for training. With Real-Data (0.18M images and ~1.346M language-object pairs) and a Real-Model trained on this data, the approach achieves substantial performance gains (around 50% relative improvement) over state-of-the-art methods on standard benchmarks, validated by thorough ablations and cost analysis. The results suggest that preserving data quality while scaling data through cyclic, agentic refinement can significantly enhance VL alignment in LOD and offers a scalable paradigm for refining training data in vision-language tasks.

Abstract

Language-based object detection (LOD) aims to align visual objects with language expressions. Large amounts of paired data are used to improve the generalization of LOD models. To scale up training data, recent studies leverage vision-language models (VLMs) to automatically generate human-like expressions for visual objects. In this process, we observe that VLM hallucinations introduce inaccurate object descriptions (e.g., wrong object name, color, or shape), which degrade VL alignment quality. To reduce VLM hallucinations, we propose an agentic workflow controlled by an LLM that re-aligns language to visual objects by adaptively adjusting image and text prompts. We name this workflow Real-LOD; it includes planning, tool use, and reflection steps. Given an image with detected objects and raw VLM language expressions, Real-LOD automatically reasons about its state and arranges an action based on our neural symbolic designs (i.e., planning). The action adaptively adjusts the image and text prompts and sends them to VLMs for object re-description (i.e., tool use). Then, we use another LLM to analyze these refined expressions for feedback (i.e., reflection). These steps are conducted cyclically to gradually improve the language descriptions so that they re-align with the visual objects. We construct a dataset of only 0.18M images with re-aligned language expressions and train a prevalent LOD model that surpasses existing LOD methods by around 50% on the standard benchmarks. Our Real-LOD workflow, with automatic VL refinement, shows the potential to preserve data quality while scaling up data quantity, which further improves LOD performance from a data-alignment perspective.
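The cyclic planning, tool use, and reflection steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `State` class, the `realign` function, the action names, and the toy `plan`/`use_tool`/`reflect` stand-ins are all hypothetical, and in the real workflow these roles would be played by an LLM agent, a VLM/LLM tool, and a second LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class State:
    """Hypothetical state tracked across refinement cycles."""
    expression: str                                   # current language expression
    history: List[str] = field(default_factory=list)  # feedback from reflection
    aligned: bool = False                             # verdict of the last reflection


def realign(
    raw_expression: str,
    plan: Callable[[State], str],
    use_tool: Callable[[str, State], str],
    reflect: Callable[[str], Tuple[bool, str]],
    max_cycles: int = 3,
) -> State:
    """Run plan -> tool use -> reflect until aligned or the cycle budget ends."""
    state = State(expression=raw_expression)
    for _ in range(max_cycles):
        action = plan(state)                  # planning: choose the next action
        if action == "stop":
            break
        candidate = use_tool(action, state)   # tool use: re-describe the object
        aligned, feedback = reflect(candidate)  # reflection: judge the result
        state.expression = candidate
        state.aligned = aligned
        state.history.append(feedback)
        if aligned:
            break
    return state


# Toy stand-ins (assumptions for illustration only, not real model calls).
def toy_plan(state: State) -> str:
    if state.aligned:
        return "stop"
    return "crop" if not state.history else "specific_prompt"


def toy_tool(action: str, state: State) -> str:
    outputs = {
        "crop": "a blue mug on the table",
        "specific_prompt": "a red mug on the table",
    }
    return outputs[action]


def toy_reflect(candidate: str) -> Tuple[bool, str]:
    ok = "red" in candidate  # pretend the object is known to be red
    return ok, ("ok" if ok else "color mismatch")


result = realign("a green mug", toy_plan, toy_tool, toy_reflect)
print(result.expression, result.history)
```

In this toy run the first cycle fails reflection, so the planner switches to a different action in the second cycle. This mirrors, in a very simplified way, the feedback-driven adjustment the abstract describes.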

Paper Structure

This paper contains 36 sections, 21 figures, 17 tables, 1 algorithm.

Figures (21)

  • Figure 1: Examples of how adaptive image and prompt modifications refine language expressions. For a small object in (a), the VLM produces erroneous content marked in red. In (b), we crop the local region of (a) and obtain refined content marked in green. Another example is in (c), where a general prompt leads to erroneous content while a specific prompt in (d) does not.
  • Figure 2: Glimpse of our Real-LOD. It takes image captions with detected objects and raw expressions as inputs. It gradually re-aligns expressions to match objects well. By using better-aligned training data pairs, we improve the performance of LOD.
  • Figure 3: Overview of a general LOD framework. The paired VL data are independently encoded and then interacted to decode results.
  • Figure 4: Overview of our agentic workflow. The inputs are images with captions, detected objects, and raw expressions. Our Real-Agent reasons the state and arranges the action (i.e., planning). During action execution, our Real-Agent uses VLM and LLM to re-perceive visual content and refine expressions (i.e., tool use). Then, the output results are analyzed by an LLM (i.e., reflection). The feedback is provided to Real-Agent for planning in the next cycle.
  • Figure 5: An example of how Real-LOD re-aligns one raw expression to the given image. Based on the input image, caption, and detected objects, Real-LOD performs planning, tool use, and reflection in a cyclic workflow for state reasoning, action execution, and result feedback. The image and prompt are adaptively adjusted for tool models to supplement customized object descriptions, which benefit expression re-alignment.
  • ...and 16 more figures
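
The image adjustment mentioned in the Figure 1 caption, cropping the local region around a small object before re-describing it, could look roughly like the sketch below. The function name `crop_box`, the `(x0, y0, x1, y1)` box convention, and the `pad_ratio` parameter are assumptions for illustration. The source does not specify how the crop region is chosen.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels, assumed convention


def crop_box(bbox: Box, image_size: Tuple[int, int], pad_ratio: float = 0.5) -> Box:
    """Expand a detected box by a margin and clamp it to the image bounds.

    pad_ratio is a hypothetical knob: the margin on each side, as a fraction
    of the box width/height, so the crop keeps some surrounding context.
    """
    x0, y0, x1, y1 = bbox
    width, height = image_size
    pad_x = (x1 - x0) * pad_ratio
    pad_y = (y1 - y0) * pad_ratio
    return (
        max(0, int(x0 - pad_x)),
        max(0, int(y0 - pad_y)),
        min(width, int(x1 + pad_x)),
        min(height, int(y1 + pad_y)),
    )


print(crop_box((10, 20, 30, 40), (100, 100)))  # (0, 10, 40, 50)
```

The resulting box could then be passed to any image library's crop routine before the cropped region is sent to the VLM.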