Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning
Xuejun Zhang, Aditi Tiwari, Zhenhailong Wang, Heng Ji
TL;DR
The paper introduces CAMCUE, a pose-aware framework for multi-view spatial reasoning that grounds language-described viewpoints to explicit camera poses and optionally synthesizes a target-view image conditioned on the predicted pose. CAMCUE injects per-view pose into visual tokens and uses a pose adapter to predict the target pose from language, which yields more coherent cross-view reasoning and, through a pose-conditioned imagination step, better performance on perspective-shift tasks. The authors curate CAMCUE-DATA (27,668 training and 508 test instances) and report gains over strong baselines across multiple backbones, with overall accuracy improving by 9.06%. Target poses predicted from language descriptions reach over 90% rotation accuracy within 20°, and direct grounding cuts inference time from 256.6s to 1.45s per example relative to test-time search-and-match. Ablations indicate that both explicit pose grounding and pose-conditioned image synthesis contribute to faithful, efficient reasoning, and the authors acknowledge that imagined views are not reliable in every scenario. Overall, CAMCUE couples language grounding, pose estimation, and image synthesis in a single framework for geometry-aware multimodal reasoning.
Abstract
Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we curate CAMCUE-DATA with 27,668 training and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. We also include human-annotated viewpoint descriptions in the test split to evaluate generalization to human language. CAMCUE improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.
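The pose-accuracy figures above (rotation within 20°, translation within a 0.5 threshold) can be made concrete with a minimal sketch. It assumes, without confirmation from the abstract, that rotation error is the geodesic angle between predicted and ground-truth rotation matrices and that translation error is the Euclidean distance between translation vectors in the dataset's own units. The function names and the `rot_z` helper are hypothetical and are not taken from the paper.

```python
import math

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(R_pred^T @ R_gt) equals the sum of elementwise products.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def translation_error(t_pred, t_gt):
    """Euclidean distance between two translation vectors (assumed metric)."""
    return math.dist(t_pred, t_gt)

def pose_accuracy(preds, gts, rot_thresh_deg=20.0, trans_thresh=0.5):
    """Fractions of examples whose rotation / translation error is within threshold.

    preds and gts are equal-length lists of (R, t) pairs. The two accuracies
    are reported separately; whether the paper combines them is not stated.
    """
    n = len(preds)
    rot_hits = sum(
        rotation_error_deg(Rp, Rg) <= rot_thresh_deg
        for (Rp, _), (Rg, _) in zip(preds, gts)
    )
    trans_hits = sum(
        translation_error(tp, tg) <= trans_thresh
        for (_, tp), (_, tg) in zip(preds, gts)
    )
    return rot_hits / n, trans_hits / n

def rot_z(deg):
    """Hypothetical helper: rotation about the z-axis by `deg` degrees."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

if __name__ == "__main__":
    gts = [(rot_z(0), (0.0, 0.0, 0.0)), (rot_z(0), (0.0, 0.0, 0.0))]
    preds = [(rot_z(10), (0.1, 0.0, 0.0)), (rot_z(45), (1.0, 0.0, 0.0))]
    print(pose_accuracy(preds, gts))
```

In this toy usage the first prediction falls inside both thresholds and the second falls outside both, so each accuracy is one half. The thresholds default to the values quoted in the abstract, but the error definitions behind them remain an assumption here.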
