Inferring and Executing Programs for Visual Reasoning
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick
TL;DR
This work tackles visual reasoning by replacing end-to-end input-output mappings with explicit, compositional reasoning. It introduces a learnable program generator that constructs a plan from a fixed function vocabulary, and a neural execution engine that implements the plan as a dynamically assembled neural module network. Both components are trained with a combination of backpropagation and REINFORCE. On CLEVR, the approach achieves strong results even when ground-truth programs are available for only a fraction of the questions, and it generalizes to novel question types, attribute combinations, and human-posed queries. The results highlight the value of explicit, trainable programs for visual question answering and point to future work on expanding the module set and automatically discovering new modules for broader reasoning tasks.
Abstract
Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes. As a result, these black-box models often learn to exploit biases in the data rather than learning to perform visual reasoning. Inspired by module networks, this paper proposes a model for visual reasoning that consists of a program generator that constructs an explicit representation of the reasoning process to be performed, and an execution engine that executes the resulting program to produce an answer. Both the program generator and the execution engine are implemented by neural networks, and are trained using a combination of backpropagation and REINFORCE. Using the CLEVR benchmark for visual reasoning, we show that our model significantly outperforms strong baselines and generalizes better in a variety of settings.
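To make the generator/executor split concrete, the following is a minimal sketch of what "executing a program" can mean. It is a toy, purely symbolic stand-in and not the paper's model: the paper's execution engine is built from neural modules operating on image features, and its program generator is learned. The scene format, the function names in `VOCABULARY`, and the linear (chain-shaped) program representation are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List

# Hypothetical scene format: one attribute dictionary per object.
Scene = List[Dict[str, str]]


def make_filter(attribute: str, value: str) -> Callable[[Scene], Scene]:
    """Build a function that keeps only objects whose attribute equals value."""
    def apply(objects: Scene) -> Scene:
        return [obj for obj in objects if obj[attribute] == value]
    return apply


# Illustrative fixed function vocabulary (names are assumptions, not the paper's).
# In the paper each entry would be a neural module rather than a Python function.
VOCABULARY: Dict[str, Callable[[Any], Any]] = {
    "filter_color[red]": make_filter("color", "red"),
    "filter_shape[cube]": make_filter("shape", "cube"),
    "count": len,
    "exist": lambda objects: len(objects) > 0,
}


def execute(program: List[str], scene: Scene) -> Any:
    """Run a chain-shaped program: each function consumes the previous output."""
    state: Any = scene
    for name in program:
        state = VOCABULARY[name](state)
    return state


scene: Scene = [
    {"color": "red", "shape": "cube"},
    {"color": "red", "shape": "sphere"},
    {"color": "blue", "shape": "cube"},
]

# "How many red cubes are there?" expressed as a program over the vocabulary.
answer = execute(["filter_color[red]", "filter_shape[cube]", "count"], scene)
```

In this sketch the program is written by hand; in the proposed model it would be predicted from the question by the program generator, and the answer would come from composing the corresponding neural modules.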
