Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning

Xuejun Zhang, Aditi Tiwari, Zhenhailong Wang, Heng Ji

TL;DR

The paper introduces CamCue, a pose-aware framework for multi-view spatial reasoning that grounds language-described viewpoints to explicit camera poses and optionally synthesizes a target-view image conditioned on the predicted pose. CamCue injects per-view pose into visual tokens and uses a pose adapter to predict the target pose from language, which makes cross-view reasoning more coherent; a pose-conditioned imagination step further improves performance on perspective-shift tasks. The authors curate CamCue-Data (27,668 training and 508 test instances) and report gains over strong baselines across multiple backbones, including a 9.06% improvement in overall accuracy. Target poses predicted from language descriptions reach over 90% rotation accuracy within 20°, and inference time drops from 256.6s to 1.45s per example compared with test-time search-and-match methods. Ablations show that explicit pose grounding and pose-conditioned image synthesis both matter for faithful, efficient reasoning, and the authors acknowledge that imagined views are not reliable in every scenario. Overall, CamCue couples language grounding, pose estimation, and image synthesis in a single geometry-aware multimodal framework.

Abstract

Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we curate CAMCUE-DATA with 27,668 training and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. We also include human-annotated viewpoint descriptions in the test split to evaluate generalization to human language. CAMCUE improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.
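
The pose-accuracy figures above (rotation within 20°, translation within a 0.5 error threshold) can be illustrated with a minimal sketch. The abstract does not define the metrics, so this assumes the rotation error is the geodesic angle between predicted and ground-truth rotation matrices and the translation error is the Euclidean distance between translation vectors; the function names and the `rot_z` helper are hypothetical and not taken from the paper.

```python
import math


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(R_pred^T @ R_gt) equals the sum of element-wise products.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # trace = 1 + 2*cos(theta); clamp to guard against round-off outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def translation_error(t_pred, t_gt):
    """Euclidean distance between two translation vectors (assumed metric)."""
    return math.dist(t_pred, t_gt)


def pose_accuracy(preds, gts, rot_thresh_deg=20.0, trans_thresh=0.5):
    """Fraction of examples whose errors fall within each threshold.

    preds and gts are equal-length lists of (R, t) pairs. The default
    thresholds mirror the 20 degree and 0.5 values quoted in the abstract.
    """
    rot_hits = 0
    trans_hits = 0
    for (R_p, t_p), (R_g, t_g) in zip(preds, gts):
        rot_hits += rotation_error_deg(R_p, R_g) <= rot_thresh_deg
        trans_hits += translation_error(t_p, t_g) <= trans_thresh
    n = len(gts)
    return {"rot_acc": rot_hits / n, "trans_acc": trans_hits / n}


def rot_z(deg):
    """Hypothetical helper: rotation about the z-axis by `deg` degrees."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Two toy examples: one inside both thresholds, one outside both.
gts = [(rot_z(0.0), (0.0, 0.0, 0.0)), (rot_z(0.0), (0.0, 0.0, 0.0))]
preds = [(rot_z(10.0), (0.3, 0.0, 0.0)), (rot_z(30.0), (0.0, 0.0, 0.7))]
result = pose_accuracy(preds, gts)
```

Whether the paper normalizes scene scale before applying the 0.5 translation threshold is not stated here, so the translation units in this sketch are left unspecified.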

Paper Structure (40 sections, 8 equations, 6 figures, 9 tables, 2 algorithms)

This paper contains 40 sections, 8 equations, 6 figures, 9 tables, 2 algorithms.

Figures (6)

  • Figure 1: Perspective-shift reasoning with CamCue. Given multi-view context images, CamCue maps a natural-language viewpoint description to an explicit target camera pose and synthesizes the corresponding target view for reliable spatial reasoning.
  • Figure 2: Given multiple contextual images with their camera poses and a natural-language target-viewpoint description plus question, CamCue encodes visual content and pixel-aligned camera pose features, fuses them into pose-aware visual tokens, and uses an MLLM with a pose adapter to jointly generate the answer and predict the target camera pose. The predicted pose can further condition an image decoder to synthesize an imagined target view, which is fed back as additional evidence for answering.
  • Figure 3: QA type distribution in training and test splits.
  • Figure 4: Qualitative comparison of imagined target views.
  • Figure 5: Data samples from CamCue.
  • ...and 1 more figure