See It All: Contextualized Late Aggregation for 3D Dense Captioning

Minjung Kim, Hyung Suk Lim, Seung Hwan Kim, Soonyoung Lee, Bumsoo Kim, Gunhee Kim

TL;DR

This work tackles 3D dense captioning, where a model must localize objects and generate descriptive sentences that may reference both object attributes and relationships within a scene. It proposes See-It-All (SIA), a transformer-based pipeline that decouples caption generation into two parallel streams—context query for relational/global context and instance query for localization and attributes—and then fuses their outputs with a novel TGI-Aggregator to produce fully informed captions. The key contributions are the late aggregation paradigm and the TGI-Aggregator, which jointly leverage conText, Global, and Instance cues to improve both localization metrics and caption quality, validated on ScanRefer and Nr3D with state-of-the-art results. This approach advances 3D dense captioning by enabling richer, context-aware descriptions that better describe objects in relation to their surroundings and the overall scene, with practical impact for robotics, AR, and scene understanding.

Abstract

3D dense captioning is the task of localizing objects in a 3D scene and generating a descriptive sentence for each object. Recent approaches have adopted transformer encoder-decoder frameworks from object detection to build an end-to-end pipeline without hand-crafted components. However, these approaches struggle with conflicting objectives: the attention of a single query has to cover both the tightly localized object region and the contextual environment at the same time. To overcome this challenge, we introduce SIA (See-It-All), a transformer pipeline that performs 3D dense captioning with a novel paradigm called late aggregation. SIA simultaneously decodes two sets of queries: context queries and instance queries. The instance query focuses on localization and object attribute descriptions, while the context query flexibly captures the region of interest for relationships between multiple objects or with the global scene. The two outputs are then aggregated afterwards (i.e., late aggregation) via simple distance-based measures. To further enhance the quality of contextualized caption generation, we design a novel aggregator that generates a fully informed caption based on the surrounding context, the global environment, and object instances. Extensive experiments on two of the most widely used 3D dense captioning datasets demonstrate that our proposed method achieves a significant improvement over prior methods.
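
As a rough illustration of what distance-based late aggregation could look like, the sketch below attaches each context caption to the instance whose predicted center is closest. The data layout, the function name `late_aggregate`, the `max_dist` threshold, and the use of Euclidean distance between centers are all assumptions made for this example; the abstract only states that aggregation relies on simple distance-based measures.

```python
import math

def late_aggregate(instances, contexts, max_dist=None):
    """Attach each context caption to the nearest instance by center distance.

    instances: list of (center_xyz, attribute_caption) from instance queries
    contexts:  list of (center_xyz, relational_caption) from context queries
    Hypothetical layout, not the paper's actual interface.
    """
    merged = [{"center": c, "captions": [cap]} for c, cap in instances]
    for ctx_center, ctx_caption in contexts:
        if not merged:
            break
        # Index of the instance whose center is closest to this context query.
        nearest = min(
            range(len(merged)),
            key=lambda i: math.dist(merged[i]["center"], ctx_center),
        )
        d = math.dist(merged[nearest]["center"], ctx_center)
        # Optionally ignore context captions that are too far from any instance.
        if max_dist is None or d <= max_dist:
            merged[nearest]["captions"].append(ctx_caption)
    return merged

instances = [
    ((0.0, 0.0, 0.0), "a brown wooden chair"),
    ((2.0, 0.0, 0.0), "a white rectangular table"),
]
contexts = [
    ((0.2, 0.1, 0.0), "it is next to the table"),
    ((1.9, 0.0, 0.1), "it is in the middle of the room"),
]
result = late_aggregate(instances, contexts)
```

Because the two query sets are decoded independently, the merge step is the only point where they interact, which is what makes the aggregation "late".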

Paper Structure (34 sections, 9 equations, 4 figures, 7 tables)

This paper contains 34 sections, 9 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Schematic diagrams illustrating paradigms of 3D dense captioning: (a) features are extracted from object detectors, and their relations are further aggregated to enhance the features [cai20223djcg]; (b) proposals are generated by voting, then the local-context features are aggregated with transformer attention [chen2023vote2capdetr]; (c) our proposed SIA separately encodes features with local boundaries and context features without such boundaries, and afterward aggregates the generated captions that involve the same object (i.e., late aggregation); (d) SIA with further enhanced contextual features generated by our novel TGI-Aggregator ($f^{TGI}$), which aggregates local, context, and global features for more contextualized caption generation.
  • Figure 2: Overall architecture of SIA for 3D dense captioning. The caption queries are divided between the Instance Query Decoder and the Context Query Decoder. In the Instance Query Decoder, captions based on tightly localized areas are generated along with object detection. In the Context Query Decoder, captions that require a view beyond single-object localization are generated, such as captions describing relations between multiple objects or relations with the scene. The feature for this Unlocalized Caption Generation is further enhanced with our novel TGI-Aggregator, which contextualizes the feature using conText regions, the Global scene, and Instances.
  • Figure 3: Conceptual illustration of our TGI-Aggregator. The Global Aggregator $G(\cdot)$ aggregates the decoded instance query $V^o$ and context query $V^c$ to construct a global descriptor $V^g$. Then, the instance feature $V_i^o$, its nearest-neighbor feature in $V^c$, and the global descriptor $V^g$ are concatenated to construct $V^a$.
  • Figure 4: Qualitative results on ScanRefer [chen2020scanrefer]. The yellow-highlighted sections show information specific to the object itself, the green-highlighted sections describe the relationships between objects, and the blue-highlighted sections depict the spatial position of the object in the 3D scene. Captions underlined in red indicate incorrect descriptions. FTG. represents failures in caption generation due to low IoU.
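
The concatenation described in the Figure 3 caption can be sketched in a few lines, with $V^o$ as instance features and $V^c$ as context features. This is only a minimal illustration under stated assumptions: the global aggregator $G(\cdot)$ is replaced here by plain mean pooling, the nearest neighbor is chosen by Euclidean distance between hypothetical predicted centers, and the function names are invented for the example. The actual model may use learned layers and a different neighbor criterion.

```python
import math

def mean_pool(vectors):
    """Stand-in for the global aggregator G(.): element-wise mean (assumption)."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def tgi_aggregate(inst_feats, inst_centers, ctx_feats, ctx_centers):
    """Build V^a by concatenating [V_i^o, nearest V^c, V^g] for each instance i."""
    # Global descriptor V^g pooled over all instance and context features.
    v_g = mean_pool(inst_feats + ctx_feats)
    v_a = []
    for feat, center in zip(inst_feats, inst_centers):
        # Nearest context query, chosen by center distance (assumption).
        j = min(
            range(len(ctx_feats)),
            key=lambda k: math.dist(center, ctx_centers[k]),
        )
        # List concatenation: instance + nearest context + global.
        v_a.append(feat + ctx_feats[j] + v_g)
    return v_a

inst_feats = [[1.0, 0.0], [0.0, 1.0]]
inst_centers = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
ctx_feats = [[2.0, 2.0], [4.0, 4.0]]
ctx_centers = [(1.8, 0.0, 0.0), (0.1, 0.0, 0.0)]

v_a = tgi_aggregate(inst_feats, inst_centers, ctx_feats, ctx_centers)
# Each row of v_a has length 3 * feature_dim (instance + context + global).
```

Each aggregated feature is three times the width of a single query feature, so a caption head consuming $V^a$ would need its input dimension sized accordingly.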