Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios

Shantanu Jaiswal, Debaditya Roy, Basura Fernando, Cheston Tan

TL;DR

IPRM is a lightweight, fully-differentiable neural module that can be applied to both transformer and non-transformer vision-language backbones. It notably outperforms prior task-specific methods and transformer-based attention modules across various image and video VQA benchmarks that test distinct complex reasoning capabilities.

Abstract

Complex visual reasoning and question answering (VQA) is a challenging task that requires compositional multi-step processing and higher-level reasoning capabilities beyond the immediate recognition and localization of objects and events. Here, we introduce a fully neural Iterative and Parallel Reasoning Mechanism (IPRM) that combines two distinct forms of computation -- iterative and parallel -- to better address complex VQA scenarios. Specifically, IPRM's "iterative" computation facilitates compositional step-by-step reasoning for scenarios wherein individual operations need to be computed, stored, and recalled dynamically (e.g. when computing the query "determine the color of pen to the left of the child in red t-shirt sitting at the white table"). Meanwhile, its "parallel" computation allows for the simultaneous exploration of different reasoning paths and benefits more robust and efficient execution of operations that are mutually independent (e.g. when counting individual colors for the query: "determine the maximum occurring color amongst all t-shirts"). We design IPRM as a lightweight and fully-differentiable neural module that can be conveniently applied to both transformer and non-transformer vision-language backbones. It notably outperforms prior task-specific methods and transformer-based attention modules across various image and video VQA benchmarks testing distinct complex reasoning capabilities such as compositional spatiotemporal reasoning (AGQA), situational reasoning (STAR), multi-hop reasoning generalization (CLEVR-Humans) and causal event linking (CLEVRER-Humans). Further, IPRM's internal computations can be visualized across reasoning steps, aiding interpretability and diagnosis of its errors.

Paper Structure

This paper contains 21 sections, 6 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: Complex VQA scenarios (CLEVR-Humans johnson2017inferring, GQA hudson2019gqa, CLEVRER-Humans mao2022clevrer, AGQA grunde2021agqa and STAR wu2021star) wherein a combination of iterative (step-by-step) computation (blue phrases) and parallel computation (orange phrases) can be beneficial for reasoning.
  • Figure 2: IPRM's computation flow diagram. First, a new set of N parallel latent operations $\mathbf{Z_{op}}$ is retrieved from language features $\mathbf{X_{L}}$ conditioned on prior operation states $\mathbf{M_{op}}$. Then, visual features $\mathbf{X_{V}}$ are queried conditioned on both $\mathbf{Z_{op}}$ and prior result states $\mathbf{M_{res}}$ to form the new results $\mathbf{Z_{res}}$. Finally, both $\mathbf{Z_{res}}$ and $\mathbf{Z_{op}}$ are passed to the Operation Composition Unit (see \ref{sec:op_compose}), the output of which becomes the new memory state $\mathbf{M}$.
  • Figure 3: Operation Composition Unit
  • Figure 4: Accuracy of IPRM (blue) across program lengths for CLEVR (left) and STAR (right). IPRM has significantly higher accuracy at longer program lengths.
  • Figure 5: IPRM performance on CLEVR-Humans at different training data ratios of Cross- and Concat-Att.
  • ...and 9 more figures
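
The computation flow described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes plain scaled dot-product attention for both retrieval steps, uses unlearned features (no projection weights), and replaces the Operation Composition Unit with a simple averaging placeholder. The names `attend`, `reasoning_step` and `run_sketch` are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Scaled dot-product attention of one query over a list of feature vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)]


def reasoning_step(x_lang, x_vis, m_op, m_res):
    """One iterative step; the N slots in m_op / m_res are processed independently."""
    # 1. Retrieve N latent operations Z_op from language features X_L,
    #    each conditioned on that slot's prior operation state in M_op.
    z_op = [attend(prev_op, x_lang) for prev_op in m_op]
    # 2. Query visual features X_V conditioned on Z_op and the prior result
    #    state in M_res (ASSUMPTION: conditioning is a simple vector sum).
    z_res = [
        attend([o + r for o, r in zip(op, prev_res)], x_vis)
        for op, prev_res in zip(z_op, m_res)
    ]
    # 3. PLACEHOLDER for the Operation Composition Unit: average the new
    #    values with the previous memory. The paper's unit is not reproduced here.
    new_m_op = [[0.5 * (a + b) for a, b in zip(new, old)] for new, old in zip(z_op, m_op)]
    new_m_res = [[0.5 * (a + b) for a, b in zip(new, old)] for new, old in zip(z_res, m_res)]
    return new_m_op, new_m_res


def run_sketch(x_lang, x_vis, num_parallel, num_steps, seed=0):
    """Run num_steps iterative steps over num_parallel parallel slots."""
    rng = random.Random(seed)
    dim = len(x_lang[0])
    # ASSUMPTION: random initial operation states, zero initial result states.
    m_op = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_parallel)]
    m_res = [[0.0] * dim for _ in range(num_parallel)]
    for _ in range(num_steps):
        m_op, m_res = reasoning_step(x_lang, x_vis, m_op, m_res)
    return m_op, m_res


if __name__ == "__main__":
    x_lang = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
    x_vis = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
    m_op, m_res = run_sketch(x_lang, x_vis, num_parallel=3, num_steps=2)
    print(len(m_op), len(m_res))
```

In this sketch the outer loop over `num_steps` plays the role of the iterative computation, while the per-slot list comprehensions stand in for the parallel computation; a real implementation would presumably batch the slots as tensors and use learned projections.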