Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis

Jianing Li; Xi Nan; Ming Lu; Li Du; Shanghang Zhang

TL;DR

Proximity QA addresses a gap in the geometric understanding of multi-modal large language models (MLLMs) by teaching them to infer object depth and proximity from images. It introduces a two-stage framework: the model first perceives relative object depths in $[0,1]$, then reasons about proximity relations from those depths, using a CLIP-based vision encoder and a LLaVA-based LLM fine-tuned with LoRA. A new dataset, Proximity-110K, augments VQA conversations with depth and proximity instructions. Models trained on it are reported to outperform state-of-the-art MLLMs in depth perception and proximity analysis on converted GQA and Make3D benchmarks. The work offers a practical way to integrate semantic and geometric scene understanding in MLLMs, with code and dataset made available for research reuse.

Abstract

Multi-modal large language models (MLLMs) have demonstrated remarkable vision-language capabilities, primarily due to the exceptional in-context understanding and multi-task learning strengths of large language models (LLMs). The advent of visual instruction tuning has further enhanced MLLMs' performance in vision-language understanding. However, while existing MLLMs adeptly recognize what objects are in an image, they still face challenges in effectively discerning where these objects are, particularly along the distance (scene depth) axis. To overcome this limitation in MLLMs, we introduce Proximity Question Answering (Proximity QA), a novel framework designed to enable MLLMs to infer the proximity relationship between objects in images. The framework operates in two phases: the first phase focuses on guiding the models to understand the relative depth of objects, and the second phase further encourages the models to infer the proximity relationships between objects based on their depth perceptions. We also propose a VQA dataset called Proximity-110K, containing additional instructions that incorporate depth information and the proximity relationships of objects. We have conducted extensive experiments to validate Proximity QA's superior ability in depth perception and proximity analysis, outperforming other state-of-the-art MLLMs. Code and dataset will be released at https://github.com/NorthSummer/ProximityQA.git.
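To make the two phases concrete, the following is a minimal sketch of how per-object depths could be turned into a proximity question and answer. It is not the authors' implementation: the function names `normalize_depths` and `proximity_qa`, the min-max scaling, the question template, and the convention that a smaller relative depth means "closer to the camera" are all assumptions made for illustration.

```python
from typing import Dict, Tuple


def normalize_depths(raw: Dict[str, float]) -> Dict[str, float]:
    """Min-max scale raw per-object depths to relative depths in [0, 1].

    Assumption: the paper only states that relative depths lie in [0, 1];
    min-max scaling is one simple way to obtain such values.
    """
    lo, hi = min(raw.values()), max(raw.values())
    if hi == lo:
        # All objects share one depth; map them all to 0.0.
        return {name: 0.0 for name in raw}
    return {name: (d - lo) / (hi - lo) for name, d in raw.items()}


def proximity_qa(depths: Dict[str, float], a: str, b: str) -> Tuple[str, str]:
    """Build a (question, answer) pair comparing two objects by depth.

    Assumption: smaller relative depth means closer to the camera.
    The wording of the template is hypothetical.
    """
    question = f"Which is closer to the camera, the {a} or the {b}?"
    if depths[a] == depths[b]:
        answer = f"The {a} and the {b} are at about the same depth."
    else:
        closer = a if depths[a] < depths[b] else b
        answer = f"The {closer} is closer."
    return question, answer


if __name__ == "__main__":
    rel = normalize_depths({"chair": 2.0, "table": 4.0, "window": 6.0})
    print(rel)
    print(proximity_qa(rel, "table", "window"))
```

The first function corresponds loosely to the depth-perception phase (producing a relative depth per object), and the second to the proximity-reasoning phase (comparing two of those depths). In the paper both steps are performed by the MLLM itself through instruction tuning, so this sketch is closer to how ground-truth answers might be derived than to the model's inference.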

Paper Structure (23 sections, 2 equations, 3 figures, 8 tables)

Introduction
Related Work
Multimodal Large Language Models
Visual Question Answering
Proximity Question and Answering
Problem Definition
Framework Architecture and Training Scheme
Dataset: Proximity-110K
Data Source
Conversation Generation
Statistics and Analysis
Experiments
Settings
Qualitative Results
Quantitative Results
...and 8 more sections

Figures (3)

Figure 1: Deep vision models can derive dense geometric information of a scene by estimating accurate depth maps, but humans often understand scenes with both semantic and geometric information. We enable MLLMs to achieve this integrated understanding of semantic and geometric information through multi-modal instructions, thus creating a perception pattern that more closely aligns with human intuition.
Figure 2: Network architecture of Proximity QA in part (a) and the construction pipeline of Proximity-110K in part (b). We adopted a two-stage visual instruction tuning approach to achieve proximity relationship analysis of objects in the image. In the generation of Proximity-110K, we incorporate depth information into the original conversations and build new instructions.
Figure 3: Distribution of objects by depth in the Proximity-110K dataset, shown as a histogram. The horizontal axis denotes depth intervals, and the vertical axis indicates the number of objects in each interval.
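As an illustration of the statistic described in Figure 3, the sketch below counts objects per depth interval. It assumes relative depths in $[0,1]$ and equal-width intervals. The function name `depth_histogram` and the default of 10 bins are hypothetical and not taken from the paper.

```python
from typing import List


def depth_histogram(depths: List[float], num_bins: int = 10) -> List[int]:
    """Count objects falling into equal-width depth intervals over [0, 1].

    Bin i covers [i / num_bins, (i + 1) / num_bins); the last bin also
    includes depth 1.0 so that every valid depth is counted.
    """
    counts = [0] * num_bins
    for d in depths:
        if not 0.0 <= d <= 1.0:
            raise ValueError(f"relative depth out of range: {d}")
        # Clamp so that d == 1.0 lands in the final bin.
        idx = min(int(d * num_bins), num_bins - 1)
        counts[idx] += 1
    return counts


if __name__ == "__main__":
    print(depth_histogram([0.0, 0.05, 0.5, 1.0], num_bins=4))
```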
