Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA

Wentao Mo, Yang Liu

TL;DR

BridgeQA addresses the data scarcity and concept generalization challenges in 3D-VQA by introducing question-conditioned 2D view selection and a two-branch Twin-Transformer that fuses 2D Vision-Language knowledge with 3D cues. The method preserves pretrained 2D VL capabilities while enabling compact cross-modal interaction through Twin-Transformer cross-attention, and uses a BLIP-based decoder to generate free-form answers. Empirical results on ScanQA and SQA show state-of-the-art EM@1 performance and improved text-similarity metrics, with ablations confirming the effectiveness of 2D/3D fusion, view selection strategy, and decoder choice. This approach demonstrates a practical, scalable way to incorporate rich 2D VL pretraining into 3D-VQA, enhancing generalization to novel 3D concepts.

Abstract

In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and limited visual content diversity hamper generalization to novel scenes and 3D concepts (e.g., only around 800 scenes are used in the ScanQA and SQA datasets). Current approaches resort to supplementing 3D reasoning with 2D information. However, these methods face challenges: either they use top-down 2D views that introduce overly complex and sometimes question-irrelevant visual clues, or they rely on globally aggregated scene/image-level representations from 2D VLMs, losing the fine-grained vision-language correlations. To overcome these limitations, our approach uses a question-conditional 2D view selection procedure that pinpoints semantically relevant 2D inputs for crucial visual clues. We then integrate this 2D knowledge into the 3D-VQA system via a two-branch Transformer structure. This structure, featuring a Twin-Transformer design, compactly combines the 2D and 3D modalities and captures fine-grained correlations between them, allowing the two modalities to augment each other. Integrating the mechanisms above, we present BridgeQA, which offers a fresh perspective on multi-modal transformer-based architectures for 3D-VQA. Experiments validate that BridgeQA achieves state-of-the-art results on 3D-VQA datasets and significantly outperforms existing solutions. Code is available at https://github.com/matthewdm0816/BridgeQA.
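
The question-conditional view selection described in the abstract can be illustrated with a minimal sketch. It assumes that the question and every candidate 2D view have already been embedded into a shared vector space by some 2D vision-language model; the embedding step, the cosine-similarity scoring, the function names (`cosine`, `select_views`), and the toy vectors are all assumptions made for illustration and are not taken from the paper or its code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_views(question_emb, view_embs, k=1):
    """Return the indices of the k views whose embeddings best match the question.

    Hypothetical helper: ranks candidate views by cosine similarity to the
    question embedding and keeps the top k.
    """
    ranked = sorted(
        range(len(view_embs)),
        key=lambda i: cosine(question_emb, view_embs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (made up): view 1 points almost in the same direction as the question.
question_emb = [0.9, 0.1, 0.0]
view_embs = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.1],
    [0.5, 0.5, 0.5],
]
top_views = select_views(question_emb, view_embs, k=2)
print(top_views)
```

In this toy setup the second view ranks first because its embedding is nearly parallel to the question embedding; a real system would replace the hand-written vectors with outputs of a pretrained text and image encoder.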

Paper Structure

This paper contains 14 sections, 6 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Caveats of current 3D-VQA methods. (a) Current 3D-VQA methods exhibit a large generalization gap due to 3D data scarcity. Some questions involve visual concepts (e.g., "television cabinet") that never appear in any question, answer, or object type annotation during training, and these are hard for 3D-VQA models to generalize to. (b) Current 3D-VQA methods that explicitly incorporate 2D VLMs use top-down views of 3D scenes, which may be too complex, containing many irrelevant visual clues, and may miss relevant visual clues for some questions. In BridgeQA, we use question-related views instead, to capture visual context potentially relevant to the question.
  • Figure 2: Overview of BridgeQA. (a) In question-conditional view selection, we identify semantically related 2D views by retrieving images that align with the question's declarative form. This method captures relevant visual cues to enhance the 2D-3D question answering model. (b) Our 2D-3D VQA framework uses a Twin-Transformer structure comprising two branches: a 2D vision-language model (VLM) and a 3D branch of similar structure. We apply a lightweight 2D-3D fusion operation. This integration infuses 2D visual context into 3D VQA without modifying the underlying 2D VLM architecture, preserving the pre-trained 2D VL knowledge while allowing for compact fusion of intermediate representations.
  • Figure 3: Qualitative results showing how the 2D and 3D branches help each other to correct the answer. The dotted circle in the 3D scene corresponds to the 2D view.
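
The lightweight 2D-3D fusion mentioned in the Figure 2 caption can be sketched as a bidirectional cross-attention step between the two branches. The captions do not specify the exact operation, so everything below is an assumption for illustration: a single attention head, no learned projections, a residual add, and the hypothetical names `cross_attend` and `twin_fusion_step`. It is not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head cross-attention with a residual connection.

    Assumption: no learned query/key/value projections, so the tokens of
    one branch attend directly to the raw tokens of the other branch.
    """
    d = len(queries[0])
    fused = []
    for q in queries:
        # Scaled dot-product score of this query against every key token.
        scores = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(d) for kv in keys_values]
        weights = softmax(scores)
        # Attention-weighted average of the other branch's tokens.
        context = [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)]
        # Residual add keeps the original token content of this branch.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


def twin_fusion_step(tokens_2d, tokens_3d):
    """Hypothetical fusion step: each branch attends to the other, both from the pre-fusion tokens."""
    new_2d = cross_attend(tokens_2d, tokens_3d)
    new_3d = cross_attend(tokens_3d, tokens_2d)
    return new_2d, new_3d


# Toy tokens (made up): two 2D tokens and three 3D tokens of width 2.
tokens_2d = [[1.0, 0.0], [0.0, 1.0]]
tokens_3d = [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]
fused_2d, fused_3d = twin_fusion_step(tokens_2d, tokens_3d)
print(len(fused_2d), len(fused_3d))
```

Because the fusion only adds a context vector to each branch's existing tokens, the token counts and widths of both branches are unchanged, which is one way a fusion step could leave the surrounding 2D VLM architecture unmodified.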