Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning

Chashi Mahiul Islam, Oteo Mamo, Samuel Jacob Chacko, Xiuwen Liu, Weikuan Yu

TL;DR

The paper tackles the limited spatial reasoning of vision-language models in 3D scenes by introducing SpatialViLT and MaskedSpatialViLT, which are trained with multitask spatial supervision to predict depth maps, 3D coordinates, and edge maps, thereby enriching multimodal embeddings. A SpatialEnsemble is proposed to fuse predictions from specialized spatial experts, achieving state-of-the-art results on the Visual Spatial Reasoning (VSR) dataset, particularly in directional, topological, proximity, and unallocated relations. Despite strong overall gains, the work reveals a generalization gap in ensemble orientation performance and suggests dynamic weighting and pose/trajectory cues as future directions. Overall, the framework advances spatial priors in vision-language modeling and points toward more robust 3D-aware multimodal understanding with practical implications for real-world reasoning tasks.

Abstract

Vision-language models (VLMs) have advanced multimodal reasoning but still face challenges in spatial reasoning for 3D scenes and complex object configurations. To address this, we introduce SpatialViLT, an enhanced VLM that integrates spatial features like depth maps, 3D coordinates, and edge maps through a multi-task learning framework. This approach enriches multimodal embeddings with spatial understanding. We propose two variants: SpatialViLT and MaskedSpatialViLT, focusing on full and masked object regions, respectively. Additionally, SpatialEnsemble combines both approaches, achieving state-of-the-art accuracy. Our models excel in spatial reasoning categories such as directional, topological, and proximity relations, as demonstrated on the challenging Visual Spatial Reasoning (VSR) dataset. This work represents a significant step in enhancing the spatial intelligence of AI systems, crucial for advanced multimodal understanding and real-world applications.
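The multi-task setup described above can be sketched as a main classification loss plus weighted auxiliary reconstruction losses. This is a minimal sketch, not the authors' implementation. It assumes a binary true/false caption label, mean squared error for each auxiliary map, and placeholder weights. The names `multitask_loss` and `AUX_WEIGHTS` are hypothetical, and the abstract does not specify the loss forms or the weight values.

```python
import math


def mse(pred, target):
    """Mean squared error over two equally sized flat sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def binary_cross_entropy(prob, label, eps=1e-7):
    """Cross-entropy for one predicted probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


# Hypothetical auxiliary-task weights; the source does not state any values.
AUX_WEIGHTS = {"depth": 0.5, "coords_3d": 0.5, "edges": 0.5}


def multitask_loss(cls_prob, cls_label, aux_preds, aux_targets, weights=AUX_WEIGHTS):
    """Classification loss plus a weighted sum of auxiliary spatial losses.

    aux_preds and aux_targets map a task name to a flattened map
    (depth, 3D coordinates, or edges). MSE is an assumption here.
    """
    total = binary_cross_entropy(cls_prob, cls_label)
    for task, weight in weights.items():
        total += weight * mse(aux_preds[task], aux_targets[task])
    return total
```

Because the auxiliary terms only add to the classification loss, gradients from the depth, 3D, and edge decoders flow back into the shared multimodal embedding. That is the mechanism the abstract credits for the added spatial understanding.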

Paper Structure

This paper contains 16 sections, 2 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Challenging spatial reasoning examples from the VSR dataset that highlight current limitations in vision-language models. These cases demonstrate failures in orientation understanding (airplane-truck), proximity detection in complex scenes (bus-cup), topological reasoning (cat-vase), semantic consistency (cake-dog), contact detection (plant-teddy bear), and precise spatial positioning (bed-bench).
  • Figure 2: Framework for training SpatialViLT and MaskedSpatialViLT. Captions and images are processed using ViLT (Base), with depth, 3D, and edge features extracted and decoded to calculate corresponding losses. These losses are backpropagated to improve the multimodal embeddings with spatial priors for enhanced spatial reasoning and classification.
  • Figure 3: Feature extraction pipeline demonstrating the multi-modal spatial feature generation process: (a) Original RGB image, (b) CLIPSeg-generated object masks for "dog" and "truck" extracted from caption text, (c) MiDaS depth map showing relative distances, (d) Canny edge map highlighting object boundaries, and (e) 3D coordinate map derived from depth information.
  • Figure 4: Accuracy and F1 Score of Different Models on the Evaluation Set.
  • Figure 5: Accuracy of Different Models for Orientation Meta-Category on the Evaluation Set.
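
The Figure 3 caption mentions a 3D coordinate map derived from depth information. One common way to obtain such a map is pinhole back-projection, sketched below. This is an illustration under stated assumptions, not the paper's confirmed procedure: the function name `depth_to_coords` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical, and the caption does not say which camera model or parameters the authors use.

```python
def depth_to_coords(depth, fx, fy, cx, cy):
    """Back-project a depth map into per-pixel (X, Y, Z) coordinates.

    depth is a list of rows of depth values. fx and fy are focal lengths
    in pixels and (cx, cy) is the principal point. All four intrinsics
    are assumed values, not taken from the source.
    """
    coords = []
    for v, row in enumerate(depth):          # v: pixel row index
        coords.append([
            ((u - cx) * z / fx, (v - cy) * z / fy, z)
            for u, z in enumerate(row)       # u: pixel column index
        ])
    return coords
```

Since the caption describes the depth map as showing relative distances, coordinates produced this way would also be relative rather than metric.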