LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight

Yunze Man, Shihao Wang, Guowen Zhang, Johan Bjorck, Zhiqi Li, Liang-Yan Gui, Jim Fan, Jan Kautz, Yu-Xiong Wang, Zhiding Yu

TL;DR

LocateAnything3D presents a VLM-native approach to monocular 3D detection by casting 3D inference as a next-token prediction task using a Chain-of-Sight (CoS) decoding scheme. By grounding each 3D prediction in an explicit 2D detection, ordering detections from near to far, and factorizing each box into center, size, and rotation, the method achieves state-of-the-art results on Omni3D with $\mathrm{AP_{3D}} = 49.89$ and strong zero-shot generalization. The authors curate a large, camera-centric training corpus and couple 2D grounding pretraining with end-to-end CoS training, yielding data-efficient learning and robust performance across indoor and outdoor scenes. The work demonstrates a practical, open-vocabulary pathway to unify semantic understanding and metric 3D perception inside a single vision-language model, with potential extensions to video and embodied planning.

Abstract

To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how humans reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a factorization into center-from-camera, dimensions, and rotation ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results with 49.89 $\mathrm{AP_{3D}}$, surpassing the previous best by 15.51 points absolute even when the baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models to perceive in 3D.
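To make the Chain-of-Sight ordering concrete, the following is a minimal sketch of how a training target string could be assembled from annotated objects. It is an illustration under stated assumptions, not the authors' tokenization: the `Object3D` class, the tag names (`<obj>`, `<box2d>`, `<center>`, `<dims>`, `<rot>`), the number formatting, the per-object interleaving of 2D and 3D fields, and the use of Euclidean distance from the camera as the near-to-far key are all hypothetical choices.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Object3D:
    """Hypothetical annotation record; field layout is an assumption."""
    label: str
    box2d: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    center: Tuple[float, float, float]        # (x, y, z) in the camera frame, metres
    dims: Tuple[float, float, float]          # box dimensions in metres (order assumed)
    rot: Tuple[float, ...]                    # rotation parameters (encoding assumed)


def fmt(values, ndigits: int = 2) -> str:
    """Render numbers as space-separated fixed-precision text."""
    return " ".join(f"{v:.{ndigits}f}" for v in values)


def distance_from_camera(obj: Object3D) -> float:
    """Euclidean distance of the box center from the camera origin."""
    x, y, z = obj.center
    return math.sqrt(x * x + y * y + z * z)


def chain_of_sight_sequence(objects: List[Object3D]) -> str:
    """Serialize objects near to far; each object lists 2D evidence before 3D."""
    ordered = sorted(objects, key=distance_from_camera)  # inter-object curriculum
    parts = []
    for o in ordered:
        parts.append(
            f"<obj> {o.label} "
            f"<box2d> {fmt(o.box2d, 0)} "   # 2D detection comes first
            f"<center> {fmt(o.center)} "    # then center from camera
            f"<dims> {fmt(o.dims)} "        # then dimensions
            f"<rot> {fmt(o.rot)} </obj>"    # rotation last
        )
    return " ".join(parts)
```

Given a chair at 3 m and a table at 1.5 m, `chain_of_sight_sequence` emits the table first, and within each object the 2D box precedes the center, dimensions, and rotation, mirroring the easy-to-hard order described in the abstract.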

Paper Structure

This paper contains 32 sections, 4 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: LocateAnything3D unifies 3D detection and grounding in a single vision-language model. It supports open-world categories with free-form text guidance and flexible visual prompts (e.g., drag boxes, click points). All examples are zero-shot, highlighting strong out-of-domain generalizability. The bar chart (right) shows that LocateAnything3D achieves state-of-the-art $\mathrm{AP_{3D}}$ on the Omni3D benchmark.
  • Figure 2: Architecture of LocateAnything3D. (1) Model input: a single RGB image with text and optional visual prompts (boxes/clicks). (2) Chain-of-Sight (CoS) decoding: a VLM decoder first emits 2D detections as explicit visual evidence, then continues the sequence to 3D. Decoding follows three layers of design: an inter-object curriculum that orders detections from near to far; an intra-object factorization that uses 2D as CoS to robustly infer 3D; and an intra-3D tokenization that outputs center, size, and rotation. (3) We output calibrated multi-object 3D boxes with open-vocabulary categories and flexible prompting, yielding strong results on Omni3D. We use a turbo colormap for boxes to indicate their depth, where reddish and bluish colors indicate closer and farther objects, respectively.
  • Figure 3: Qualitative results of LocateAnything3D. For each example, the left sub-figure overlays the projected 3D bounding boxes on the input image, while the right sub-figure shows the corresponding bird's-eye view with a 1 m $\times$ 1 m grid as the background. We use a turbo colormap based on depth, where reddish colors indicate objects closer to the camera and bluish colors indicate objects farther away.
  • Figure 4: Data efficiency and training dynamics analysis. (1) The left figure shows data efficiency: We report $\rm AP_{3D}$ vs. percentage of training data used. Our Chain-of-Sight (CoS) formulation (blue) consistently outperforms direct 3D prediction (purple), achieving competitive performance with only 10% of the data. (2) The right figure shows training dynamics: We compare training curves with and without 2D detection pretraining. 2D pretraining (green) accelerates convergence significantly, surpassing the previous state of the art (dashed line) almost immediately, whereas training from scratch (orange) is slower and yields lower final accuracy.
  • Figure 5: Visualization of failure cases. We show several failure cases of our model. Due to the lack of diverse 3D annotations, our model, like the baselines (citation keys detany3d, ov3d), struggles with scenes that exhibit very different focal lengths, spatial layouts, and textural details.
  • ...and 1 more figure
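The overlay and bird's-eye views described in the Figure 3 caption can be reproduced from a predicted center, size, and rotation with a standard pinhole projection. The sketch below is illustrative only and makes several assumptions not stated in the source: camera axes are x right, y down, z forward; the box rotates only by a yaw angle about the y axis; `dims` is ordered (width, height, length) along x, y, z; and the intrinsics `fx`, `fy`, `u0`, `v0` are hypothetical values supplied by the caller.

```python
import math
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]
Point2 = Tuple[float, float]


def box3d_corners(center: Sequence[float], dims: Sequence[float], yaw: float) -> List[Point3]:
    """Return the 8 corners of a yaw-rotated box in camera coordinates."""
    cx, cy, cz = center
    w, h, l = dims  # assumed order: extent along x, y, z before rotation
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for sx in (1, -1):
        for sy in (1, -1):
            for sz in (1, -1):
                x, y, z = sx * w / 2, sy * h / 2, sz * l / 2
                # Rotate about the y axis, then translate to the box center.
                corners.append((c * x + s * z + cx, y + cy, -s * x + c * z + cz))
    return corners


def project_pinhole(points: Sequence[Point3], fx: float, fy: float,
                    u0: float, v0: float) -> List[Point2]:
    """Project camera-frame points to pixels; every point must have z > 0."""
    return [(fx * x / z + u0, fy * y / z + v0) for x, y, z in points]


def birds_eye_view(points: Sequence[Point3]) -> List[Point2]:
    """Drop the vertical axis to get (x, z) ground-plane coordinates in metres."""
    return [(x, z) for x, _, z in points]
```

Connecting the projected corners with line segments gives the image overlay, and plotting the `birds_eye_view` points over a 1 m grid gives the top-down panel. A full 3D rotation would need a general rotation matrix in place of the yaw-only one used here.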