Learning Abstract Visual Reasoning via Task Decomposition: A Case Study in Raven Progressive Matrices

Jakub Kwiatkowski, Krzysztof Krawiec

TL;DR

This study proposes a transformer-based deep learning architecture that, instead of choosing an answer directly, addresses the subgoal of predicting the visual properties of individual objects and their arrangements. It compares several ways of parsing the visual input into tokens and several regimes of masking parts of the input in self-supervised training.

Abstract

Learning to perform abstract reasoning often requires decomposing the task in question into intermediate subgoals that are not specified upfront, but need to be autonomously devised by the learner. In Raven Progressive Matrices (RPM), the task is to choose one of the available answers given a context, where both the context and answers are composite images featuring multiple objects in various spatial arrangements. As this high-level goal is the only guidance available, learning to solve RPMs is challenging. In this study, we propose a deep learning architecture based on the transformer blueprint which, rather than directly making the above choice, addresses the subgoal of predicting the visual properties of individual objects and their arrangements. The multidimensional predictions obtained in this way are then directly juxtaposed to choose the answer. We consider a few ways in which the model parses the visual input into tokens and several regimes of masking parts of the input in self-supervised training. In experimental assessment, the models not only outperform state-of-the-art methods but also provide interesting insights and partial explanations about the inference. The design of the method also makes it immune to biases that are known to be present in some RPM benchmarks.
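
The answer-selection step mentioned in the abstract (juxtaposing the predicted property vector with the property vectors of the answer panels) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the numeric encoding of properties, and the mean-absolute-difference measure are all assumptions made for the example.

```python
from typing import Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally long property vectors.

    The distance measure is an assumption of this sketch; the abstract only
    says that predictions are "directly juxtaposed".
    """
    if len(a) != len(b):
        raise ValueError("property vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def choose_answer(predicted: Sequence[float],
                  candidates: Sequence[Sequence[float]]) -> int:
    """Return the index of the answer panel whose properties are closest
    to the properties predicted for the missing panel."""
    distances = [mean_abs_diff(predicted, c) for c in candidates]
    return min(range(len(distances)), key=distances.__getitem__)


# Hypothetical property vectors (e.g. shape id, size id, color id).
predicted = [2.0, 5.0, 1.0]
candidates = [
    [2.0, 4.0, 3.0],
    [2.0, 5.0, 1.0],
    [0.0, 0.0, 0.0],
]
best = choose_answer(predicted, candidates)
```

Because the choice depends only on how well each answer panel matches the prediction made from the context, a scheme of this kind does not exploit regularities among the answer panels themselves, which is consistent with the abstract's claim about robustness to benchmark biases.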

Paper Structure (21 sections, 3 equations, 10 figures, 11 tables)

Figures (10)

  • Figure 1: An example of an RPM task.
  • Figure 2: The architecture of the model (yellow boxes) and its training process, guided by the loss function that compares the predicted and actual properties of RPM panels. The model learns from completed RPM tasks, with one of the panels (context or query panel) masked out, and predicts the properties of all panels.
  • Figure 3: Solving two RPM tasks (a and b) with Task and Row models (both trained in Combined masking mode). Left: the task. Middle: the correct answer and renderings of the models' predictions $p$ (Step 1 of DCM) for the Row and Task models. Right: answer panels and the renderings of classifications $p_i$ generated by the Row and Task models (Step 2 of DCM). The panels corresponding to the most similar property vectors are marked with thicker borders. Predictions and classifications are rendered from property vectors produced by the model while fixing rotation angles, as the angle was irrelevant in these tasks. To facilitate analysis, we render the color property using pseudocoloring (it is conventionally rendered in grayscale). See Figs. \ref{fig:additional-vis-1}--\ref{fig:additional-vis-3} for more examples.
  • Figure SM1: Histograms of absolute difference errors committed by the models on object properties. Analogous distributions for untrained (random) models (not shown here for clarity) are very close to uniform.
  • Figure SM2: The learning curve in terms of the Correct and AvgProp metrics for the Task tokenizer model trained in Combined masking mode, with the Random masking phase terminated by the early stopping condition in the 199th epoch and the Query masking phase terminated after 12 epochs.
  • ...and 5 more figures
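
The captions of Figure 2 and Figure SM2 mention masking one panel of a completed task, either a random panel or the query panel, with Combined mode running a Random phase followed by a Query phase. A rough sketch of how such masking could be set up is shown below. The 3x3 layout with the query in the last position, the `MASK` placeholder, and the function name `mask_panels` are assumptions made for illustration and are not taken from the paper.

```python
import random
from typing import List, Optional, Sequence, Tuple

NUM_PANELS = 9   # assumed: 8 context panels plus the query panel, query last
MASK = None      # hypothetical placeholder standing in for a masked panel


def mask_panels(panels: Sequence[object], mode: str,
                rng: random.Random) -> Tuple[List[Optional[object]], int]:
    """Mask exactly one panel of a completed task.

    mode == "random": any of the panels may be masked.
    mode == "query":  the last (query) panel is always masked.
    Returns the masked copy and the index of the masked panel.
    """
    if len(panels) != NUM_PANELS:
        raise ValueError(f"expected {NUM_PANELS} panels")
    if mode == "random":
        idx = rng.randrange(NUM_PANELS)
    elif mode == "query":
        idx = NUM_PANELS - 1
    else:
        raise ValueError(f"unknown masking mode: {mode}")
    masked = list(panels)
    masked[idx] = MASK
    return masked, idx


rng = random.Random(0)
panels = list("abcdefghi")  # stand-ins for panel images
random_masked, random_idx = mask_panels(panels, "random", rng)
query_masked, query_idx = mask_panels(panels, "query", rng)
```

Under these assumptions, a Combined schedule would simply call `mask_panels` with `"random"` during the first training phase and with `"query"` during the second.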