Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin

TL;DR

This paper addresses the weak spatial reasoning of current visual-LLMs (V-LLMs) by introducing LocVLM, a unified framework that injects image-space coordinates into the language model through instruction fine-tuning. It proposes three objectives (LocPred, NegPred, and RevLoc) and compares several coordinate representations, finding integer-valued binning effective. Pseudo-data generation (Localize-Instruct-200K and PRefCOCO-100K) scales up training, and video-domain variants (LocVLM-Vid-B/B+) extend the approach to video. The resulting models achieve state-of-the-art results on image VQA benchmarks (GQA, VQAv2) and competitive or superior results on video VQA, with reduced object hallucination and improved region description; the evaluation covers 5 vision-language tasks and 14 datasets. The work shows that textual localization, combined with data-efficient pseudo-labeling, improves spatial understanding in V-LLMs and makes their visual reasoning and descriptions more reliable.
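
To make the idea of integer-valued binning concrete, the following is a minimal sketch of how a pixel-space bounding box could be turned into a short textual coordinate string and placed in a localization-style question. It is an illustration under stated assumptions, not the authors' implementation: the bin count (`num_bins=100`), the function names, the bracketed output format, and the prompt wording are all hypothetical.

```python
from typing import Tuple

# (x1, y1, x2, y2) in pixels; this box layout is an assumption for the sketch.
Box = Tuple[float, float, float, float]


def box_to_bins(box: Box, width: int, height: int,
                num_bins: int = 100) -> Tuple[int, int, int, int]:
    """Map pixel coordinates to integer bin indices in [0, num_bins - 1].

    num_bins=100 is a hypothetical default, not a value taken from the paper.
    """
    x1, y1, x2, y2 = box

    def to_bin(value: float, size: int) -> int:
        # Normalize by the image size, scale to the bin count,
        # then clamp so that value == size still maps to the last bin.
        idx = int(value / size * num_bins)
        return min(max(idx, 0), num_bins - 1)

    return (to_bin(x1, width), to_bin(y1, height),
            to_bin(x2, width), to_bin(y2, height))


def bins_to_text(bins: Tuple[int, int, int, int]) -> str:
    """Render the bin indices as plain text that a language model can read or emit."""
    return "[" + ", ".join(str(b) for b in bins) + "]"


def locpred_question(category: str) -> str:
    """Hypothetical wording for a query that asks where a category is located."""
    return f"Where is the {category} located in the image?"


if __name__ == "__main__":
    bins = box_to_bins((64, 48, 320, 240), width=640, height=480)
    print(locpred_question("dog"))
    print(bins_to_text(bins))
```

Because the coordinates are normalized by image size before binning, the same textual format applies to images of any resolution; the bin count trades localization precision against the number of distinct integers the language model has to handle.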

Abstract

Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs right location. In this work, we explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.

Paper Structure (25 sections, 5 figures, 14 tables)

Figures (5)

  • Figure 1: We illustrate one unique ability of our model: contextual region description (top). Note the contextual information used in describing the selected region in each image. Explicitly teaching localization to Visual-LLMs also improves spatial awareness in VQA settings (bottom). Colored boxes are for illustration purposes only.
  • Figure 2: Architecture: We present the overall model architecture of our framework, which is inspired by LLaVA (Liu et al., 2023).
  • Figure 3: Visualizing Spatial Reasoning: We illustrate example images on which we perform our toy experiment for spatial reasoning (described in the supplementary material). Success cases are on the top row (green) and failure cases on the bottom row (red).
  • Figure 4: Visualizing Region Description: Our framework possesses the unique ability to generate representative descriptions for a selected region of an image, which is given to the model as textual coordinates. We illustrate 3 example images with a bounding box (green) denoting the queried region. The responses generated by our model are shown underneath each image, with invalid outputs highlighted in red.
  • Figure 5: Visualization of LocPred Objective: We illustrate the bounding box locations generated by our framework (blue) when queried with a category label (top of each image) and compare with the ground-truth bounding boxes (green). Success cases on top and failure cases on bottom.