Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners

Chun Feng, Joy Hsu, Weiyu Liu, Jiajun Wu

TL;DR

This work tackles 3D visual grounding under natural supervision by removing the need for dense object-level labels and introduces the Language-Regularized Concept Learner (LARC). LARC builds on neuro-symbolic NS3D and injects language-derived constraints distilled from large language models into regularization terms for structured intermediate representations, enabling better generalization, data efficiency, and transfer. The approach yields substantial gains over prior indirect-supervision methods, including strong zero-shot composition capabilities and cross-dataset transfer to ScanRefer, underscoring the practical potential of language priors for regularizing structured visual reasoning. Overall, LARC offers a general, data-efficient pathway to regularize neuro-symbolic 3D grounding with language priors, advancing naturally supervised learning in complex visual reasoning tasks.

Abstract

3D visual grounding is a challenging task that often requires direct and dense supervision, notably the semantic label for each object in the scene. In this paper, we instead study the naturally supervised setting, which learns only from 3D scene and QA pairs and in which prior works underperform. We propose the Language-Regularized Concept Learner (LARC), which uses constraints from language as regularization to significantly improve the accuracy of neuro-symbolic concept learners in the naturally supervised setting. Our approach is based on two core insights: the first is that language constraints (e.g., a word's relation to another) can serve as effective regularization for structured representations in neuro-symbolic models; the second is that we can query large language models to distill such constraints from language properties. We show that LARC improves on the performance of prior works in naturally supervised 3D visual grounding, and demonstrates a wide range of 3D visual reasoning capabilities, from zero-shot composition to data efficiency and transferability. Our method represents a promising step towards regularizing structured visual reasoning frameworks with language-based priors, for learning in settings without dense supervision.
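
To make the idea of language constraints as regularization more concrete, the sketch below shows one way such penalties could be computed over pairwise relation scores. It is an illustration under stated assumptions, not the paper's loss: the function names, the squared-difference and product forms, and the example concepts ("near", "left", "right") are all hypothetical. The only points taken from the source are that the constraints concern symmetric and exclusive concepts, and that each concept is represented as a matrix of likelihoods over pairs of objects.

```python
# Hypothetical sketch: penalties that encode language-derived constraints
# on pairwise relation scores. Names and loss forms are assumptions and are
# not taken from the paper.

def symmetry_penalty(rel):
    """Mean squared gap between rel[i][j] and rel[j][i] over pairs i < j.

    For a symmetric concept (for example "near"), the score that object i
    relates to object j should match the score that j relates to i.
    """
    n = len(rel)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum((rel[i][j] - rel[j][i]) ** 2 for i, j in pairs) / len(pairs)


def exclusivity_penalty(rel_a, rel_b):
    """Mean product of two relation scores over ordered pairs i != j.

    For mutually exclusive concepts (for example "left" and "right"), both
    scores being high for the same ordered pair is penalized.
    """
    n = len(rel_a)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    if not pairs:
        return 0.0
    return sum(rel_a[i][j] * rel_b[i][j] for i, j in pairs) / len(pairs)


# Toy relation matrices over two objects (values are made up).
near = [[0.0, 0.9], [0.1, 0.0]]   # asymmetric, so the penalty is positive
left = [[0.0, 1.0], [0.0, 0.0]]
right = [[0.0, 0.0], [1.0, 0.0]]  # never overlaps with `left`

reg = symmetry_penalty(near) + exclusivity_penalty(left, right)
```

In a training loop, a term like `reg` would be added to the task loss with a weighting coefficient, so that the learned relation features are pushed towards the constraints distilled from the language model.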

Paper Structure (43 sections, 6 equations, 7 figures, 7 tables)


Figures (7)

  • Figure 1: Compared to prior works, LARC conducts 3D visual grounding in the naturally supervised setting, by training neuro-symbolic concept learners with language regularization.
  • Figure 2: LARC distills constraints from large language models, and injects these rules as regularization into the learning process of structured neuro-symbolic concept learners.
  • Figure 3: Visualizations of LARC's and NS3D's learned features for symmetric (left two columns) and exclusive (right two columns) concepts; each matrix represents the likelihood that the relations between pairs of objects adhere to the given concept. LARC's features encode constraints from language priors significantly more effectively than those of the NS3D baseline.
  • Figure 4: LARC's performance compared to prior works in the naturally supervised setting; each column shows every model's prediction for a given instruction.
  • Figure 5: LARC's neuro-symbolic framework executes symbolic programs hierarchically to retrieve the target answers.
  • ...and 2 more figures