Inferring and Executing Programs for Visual Reasoning

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick

TL;DR

This work tackles visual reasoning by shifting from end-to-end input-output mappings to explicit, compositional reasoning. It introduces a learnable program generator that constructs a plan from a fixed function vocabulary and a neural execution engine that implements the plan as a dynamic neural module network, trained with backpropagation and REINFORCE. On CLEVR, the approach achieves strong results with limited program supervision and demonstrates strong generalization to novel question types, attribute combinations, and human-posed queries, including when only a fraction of programs are available. The results highlight the value of explicit, trainable programs for visual question answering and point to future work in expanding the module set and enabling automatic discovery of new modules for broader reasoning tasks.

Abstract

Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes. As a result, these black-box models often learn to exploit biases in the data rather than learning to perform visual reasoning. Inspired by module networks, this paper proposes a model for visual reasoning that consists of a program generator that constructs an explicit representation of the reasoning process to be performed, and an execution engine that executes the resulting program to produce an answer. Both the program generator and the execution engine are implemented by neural networks, and are trained using a combination of backpropagation and REINFORCE. Using the CLEVR benchmark for visual reasoning, we show that our model significantly outperforms strong baselines and generalizes better in a variety of settings.

Paper Structure

This paper contains 25 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Compositional reasoning is a critical component needed for understanding the complex visual scenes encountered in applications such as robotic navigation, autonomous driving, and surveillance. Current models fail to do such reasoning [johnson2017clevr].
  • Figure 2: System overview. The program generator is a sequence-to-sequence model which inputs the question as a sequence of words and outputs a program as a sequence of functions, where the sequence is interpreted as a prefix traversal of the program's abstract syntax tree. The execution engine executes the program on the image by assembling a neural module network [andreas2016neural] mirroring the structure of the predicted program.
  • Figure 3: Visualizations of the norm of the gradient of the sum of the predicted answer scores with respect to the final feature map. From left to right, each question adds a module to the program; the new module is underlined in the question. The visualizations illustrate which objects the model attends to when performing the reasoning steps for question answering. Images are from the validation set.
  • Figure 4: Accuracy of predicted programs (left) and answers (right) as we vary the number of ground-truth programs. Blue and green give accuracy before and after joint finetuning; the dashed line shows accuracy of our strongly-supervised model.
  • Figure 5: Question answering accuracy on the CLEVR-CoGenT dataset (higher is better). Top: We train models on Condition A, then test them on both Condition A and Condition B. We then finetune these models on Condition B using 3K images and 30K questions, and again test on both Conditions. Our model uses 18K programs during training on Condition A, and does not use any programs during finetuning on Condition B. Bottom: We investigate the effects of using different amounts of data when finetuning on Condition B. We show overall accuracy as well as accuracy on color-query and shape-query questions.
  • ...and 2 more figures
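
The Figure 2 caption says the generated function sequence is read as a prefix traversal of the program's abstract syntax tree. The sketch below illustrates how such a flat sequence could be turned back into a tree when each function's arity is known. It is a minimal illustration, not the authors' implementation: the function names, the `ARITY` table, and the helpers `decode_prefix` and `to_expr` are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical arity table. The names are illustrative only and do not
# reproduce the paper's full function vocabulary.
ARITY: Dict[str, int] = {
    "scene": 0,
    "filter_color[red]": 1,
    "filter_shape[cube]": 1,
    "count": 1,
    "greater_than": 2,
}


@dataclass
class Node:
    fn: str
    children: List["Node"] = field(default_factory=list)


def decode_prefix(tokens: List[str]) -> Node:
    """Rebuild a program tree from its prefix (pre-order) traversal."""
    pos = 0

    def build() -> Node:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("program ended before all arguments were filled")
        fn = tokens[pos]
        pos += 1
        # Each function consumes as many sub-programs as its arity.
        return Node(fn, [build() for _ in range(ARITY[fn])])

    root = build()
    if pos != len(tokens):
        raise ValueError("trailing tokens after a complete program")
    return root


def to_expr(node: Node) -> str:
    """Render the tree as a nested call expression for inspection."""
    if not node.children:
        return node.fn
    return f"{node.fn}({', '.join(to_expr(c) for c in node.children)})"


# Assumed program for a question such as
# "Are there more red things than cubes?"
program = [
    "greater_than",
    "count", "filter_color[red]", "scene",
    "count", "filter_shape[cube]", "scene",
]
tree = decode_prefix(program)
print(to_expr(tree))
```

In this reading, an execution engine would walk the same tree and replace each node with the neural module for that function, so the network's wiring mirrors the program's structure. A sequence that ends early or leaves tokens over does not correspond to a valid tree, which is why the sketch raises an error in both cases.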