Do Multimodal Language Models Really Understand Direction? A Benchmark for Compass Direction Reasoning

Hang Yin, Zhifeng Lin, Xin Liu, Bin Sun, Kan Li

TL;DR

This work introduces the Compass Direction Reasoning (CDR) benchmark to evaluate multimodal language models on both spatial and compass directions, addressing a gap where compass reasoning under real-world rules is underexplored. CDR leverages simple, low-ambiguity images across icon, letter, and number modalities with balanced direction distributions and a diverse set of QA types to probe absolute and relative direction understanding. Experiments across multiple models show that while object classification remains robust, absolute and relative direction reasoning remains challenging; mixdata and CoT-based fine-tuning notably improve compass reasoning, with the best configurations approaching but not reaching human-level performance. The findings highlight the value of structured, reasoning-focused fine-tuning for real-world direction cognition and point to future work incorporating more complex, map-like imagery and 3D directional reasoning to further close the gap to human capabilities.

Abstract

Direction reasoning is essential for intelligent systems to understand the real world. While existing work focuses primarily on spatial reasoning, compass direction reasoning remains underexplored. To address this, we propose the Compass Direction Reasoning (CDR) benchmark, designed to evaluate the direction reasoning capabilities of multimodal language models (MLMs). CDR includes three types of images to test spatial (up, down, left, right) and compass (north, south, east, west) directions. Our evaluation reveals that most MLMs struggle with direction reasoning, often performing at random-guessing levels. Experiments show that training directly on CDR data yields limited improvements, because the task requires an understanding of real-world physical rules. We explore the impact of mixdata and CoT fine-tuning methods, which significantly enhance MLM performance in compass direction reasoning by incorporating diverse data and step-by-step reasoning, improving the model's ability to understand direction relationships.
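
To make the distinction between spatial and compass directions concrete, the following is a minimal sketch of the kind of rule a relative compass question relies on. It is not the benchmark's own code. It assumes the common map convention that north is up and east is right, that object positions are given as pixel coordinates with y growing downward, and that diagonal answers such as "northeast" are allowed; the function name `relative_compass_direction` is hypothetical.

```python
# Illustrative sketch only; names and conventions here are assumptions,
# not part of the CDR benchmark.
# Assumed convention: north is up, east is right (standard map layout).
# Assumed coordinates: (x, y) in pixels, x grows rightward, y grows downward.

def relative_compass_direction(reference, target):
    """Return the compass direction of `target` as seen from `reference`."""
    dx = target[0] - reference[0]   # positive means the target lies to the east
    dy = reference[1] - target[1]   # flipped so positive means the target lies to the north
    north_south = "north" if dy > 0 else "south" if dy < 0 else ""
    east_west = "east" if dx > 0 else "west" if dx < 0 else ""
    # Concatenate the two components, e.g. "south" + "west" gives "southwest".
    return (north_south + east_west) or "same position"


if __name__ == "__main__":
    # An object directly to the right of the reference is to its east.
    print(relative_compass_direction((0, 0), (10, 0)))
    # An object below and to the left of the reference is to its southwest.
    print(relative_compass_direction((5, 5), (0, 10)))
```

The sketch shows why the spatial answer ("right") and the compass answer ("east") coincide only once the north-up rule is applied; a model that has not learned that rule has no basis for the second answer.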

Paper Structure

This paper contains 12 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Accuracy of multimodal language models on the Relative Compass Reasoning Task. Fine-tuning LLaVA-7B on CDR significantly improves its accuracy, to 53.43%.
  • Figure 2: Distribution of CDR answers in spatial and compass reasoning tasks.