Bi-directional Contextual Attention for 3D Dense Captioning

Minjung Kim, Hyung Suk Lim, Soonyoung Lee, Bumsoo Kim, Gunhee Kim

TL;DR

BiCA tackles 3D dense captioning by decoupling localization from global contextual reasoning. It introduces parallel Instance Query and Context Query streams and bi-directional attention (O4C and C4O) to compute object-aware contexts $V^c_a$ and context-aware objects $V^o_a$, enabling captions that exploit global scene structure without sacrificing localization. The method achieves state-of-the-art results on ScanRefer and Nr3D across multiple captioning and localization metrics, demonstrating improved object localization and richer descriptions. By decoupling localization from contextual aggregation, BiCA provides a robust framework for object-level understanding in complex 3D environments with practical implications for vision-language systems in real-world scenes.

Abstract

3D dense captioning is the task of localizing objects in a 3D scene and generating a description for each of them. Recent approaches have attempted to incorporate contextual information by modeling relationships between object pairs or by aggregating the nearest-neighbor features of an object. However, the contextual information constructed in this way is limited in two respects. First, objects have multiple positional relationships that span the entire global scene, not only the area near the object itself. Second, these approaches face conflicting objectives: localization and attribute descriptions benefit from tightly localized features, while descriptions involving global positional relations benefit from contextualized features of the global scene. To overcome this challenge, we introduce BiCA, a transformer encoder-decoder pipeline that performs 3D dense captioning for each object with Bi-directional Contextual Attention. Using instance queries for objects and context queries for non-object contexts, decoded in parallel, BiCA generates object-aware contexts, in which the contexts relevant to each object are summarized, and context-aware objects, in which the objects relevant to the summarized object-aware contexts are aggregated. This design frees the model from the conflicting objectives that constrain previous methods: it improves localization while also enabling the aggregation of contextual features throughout the global scene, and thereby improves caption generation. Extensive experiments on two of the most widely used 3D dense captioning datasets demonstrate that our proposed method achieves a significant improvement over prior methods.

Paper Structure (36 sections, 9 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Conceptual illustration of the multi-stage pipeline of BiCA (best viewed in color).
  • Figure 2: The overall pipeline of BiCA. Two sets of queries, the Instance Query and the Context Query, are generated and decoded in parallel; they encode the instance features and the non-object context features throughout the global scene, respectively. For each object, the object-aware context is computed as a weighted sum of the context queries, where the weights come from the attention between the decoded instance query and the context queries. Then, using the object-aware context feature, the context-aware object feature is obtained as a weighted sum of the instances, where the weights come from the attention with the object-aware contexts.
  • Figure 3:
  • Figure 10: Qualitative results on ScanRefer (Chen et al., 2020). The yellow-highlighted sections show information specific to the object itself, the green-highlighted sections describe the relationships between objects, and the blue-highlighted sections depict the spatial position of the object in the 3D scene. Captions underlined in red indicate incorrect descriptions. "FTG." marks failures in caption generation due to low IoU.
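
The two weighted sums described in the Figure 2 caption can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the softmax normalization, the 1/sqrt(d) scaling, the absence of learned projections, and all function and variable names are assumptions made here for clarity. The two outputs correspond to the object-aware contexts $V^c_a$ and the context-aware objects $V^o_a$ mentioned in the TL;DR.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bidirectional_contextual_attention(instance_q, context_q):
    """Hypothetical sketch of the two attention steps in Figure 2.

    instance_q: (num_objects, d) decoded instance queries
    context_q:  (num_contexts, d) decoded context queries
    Returns (object_aware_contexts, context_aware_objects), both (num_objects, d).
    """
    d = instance_q.shape[-1]

    # Step 1: for each object, attend over the context queries and take
    # their weighted sum (assumed: scaled dot-product attention).
    w_obj_to_ctx = softmax(instance_q @ context_q.T / np.sqrt(d), axis=-1)
    object_aware_contexts = w_obj_to_ctx @ context_q

    # Step 2: use each object-aware context to attend over the instances
    # and take their weighted sum.
    w_ctx_to_obj = softmax(object_aware_contexts @ instance_q.T / np.sqrt(d), axis=-1)
    context_aware_objects = w_ctx_to_obj @ instance_q

    return object_aware_contexts, context_aware_objects


# Toy example with random features: 4 objects, 6 context queries, d = 8.
rng = np.random.default_rng(0)
instance_q = rng.standard_normal((4, 8))
context_q = rng.standard_normal((6, 8))
v_c, v_o = bidirectional_contextual_attention(instance_q, context_q)
print(v_c.shape, v_o.shape)
```

Because each output row is a convex combination of the attended features, the localization-oriented instance queries are left untouched while the caption head can draw on scene-wide context; how the two outputs are combined for captioning is not specified in this section and is not sketched here.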