LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight

Yunze Man; Shihao Wang; Guowen Zhang; Johan Bjorck; Zhiqi Li; Liang-Yan Gui; Jim Fan; Jan Kautz; Yu-Xiong Wang; Zhiding Yu

TL;DR

LocateAnything3D presents a VLM-native approach to monocular 3D detection by casting 3D inference as a next-token prediction task using a Chain-of-Sight (CoS) decoding scheme. By grounding each 3D prediction in an explicit 2D detection, ordering detections from near to far, and factorizing each box into center, size, and rotation, the method achieves state-of-the-art results on Omni3D with $\text{AP}_{3D}=49.89$ and strong zero-shot generalization. The authors curate a large, camera-centric training corpus and couple 2D grounding pretraining with end-to-end CoS training, yielding data-efficient learning and robust performance across indoor and outdoor scenes. The work demonstrates a practical, open-vocabulary pathway to unify semantic understanding and metric 3D perception inside a single vision-language model, with potential extensions to video and embodied planning.

Abstract

To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how humans reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a factorization into center-from-camera, dimensions, and rotation ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results with 49.89 $\text{AP}_{3D}$, surpassing the previous best by an absolute +15.51, even when the baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models to perceive in 3D.
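
To make the ordering described in the abstract concrete, the following minimal sketch serializes a set of annotated objects into a Chain-of-Sight-style target string. It is an illustration under stated assumptions, not the authors' implementation: the `Obj` record, the `chain_of_sight` function, the tag names (`<obj>`, `<box2d>`, `<center>`, `<dims>`, `<rot>`), and the numeric formatting are all hypothetical. It also assumes that "near to far" means Euclidean distance from the camera origin, that rotation is a single yaw angle, and that each object's 2D box is emitted immediately before its 3D fields; the abstract does not fix any of these details.

```python
import math
from dataclasses import dataclass


@dataclass
class Obj:
    """Hypothetical annotation record (not taken from the paper)."""
    label: str
    box2d: tuple   # (x1, y1, x2, y2) in pixels
    center: tuple  # (x, y, z) in the camera frame, meters
    dims: tuple    # three box extents in meters
    yaw: float     # assumption: rotation reduced to one angle, radians


def chain_of_sight(objs):
    """Serialize objects near-to-far; per object: 2D box, center, dims, rotation."""
    # Across objects: sort by distance from the camera origin (assumed metric).
    ordered = sorted(objs, key=lambda o: math.dist((0.0, 0.0, 0.0), o.center))
    parts = []
    for o in ordered:
        parts.append(f"<obj>{o.label}")
        # 2D detection first, acting as the visual chain-of-thought step.
        parts.append("<box2d>" + ",".join(f"{v:.0f}" for v in o.box2d))
        # Within an object: center, then dimensions, then rotation.
        parts.append("<center>" + ",".join(f"{v:.2f}" for v in o.center))
        parts.append("<dims>" + ",".join(f"{v:.2f}" for v in o.dims))
        parts.append(f"<rot>{o.yaw:.2f}")
    return " ".join(parts)


if __name__ == "__main__":
    scene = [
        Obj("car", (10.0, 20.0, 110.0, 90.0), (2.0, 0.5, 12.0), (1.8, 1.5, 4.2), 0.1),
        Obj("chair", (200.0, 150.0, 260.0, 240.0), (0.5, 0.2, 2.0), (0.5, 0.9, 0.5), 1.57),
    ]
    print(chain_of_sight(scene))
```

In this sketch the chair, being closer to the camera, is serialized before the car regardless of input order, so a decoder trained on such targets would commit to nearby, less ambiguous objects first.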
