A Unified View of Abstract Visual Reasoning Problems

Mikołaj Małkiński, Jacek Mańdziuk

TL;DR

This work reframes Abstract Visual Reasoning (AVR) by moving from panel-based, task-specific inputs to a unified representation where each problem instance is rendered as a single image. It introduces UMAVR, a MetaFormer-inspired architecture with TokenMixer and ChannelMixer components designed to reason over unified AVR inputs and output a fixed number of candidate answers. Across multiple AVR datasets, single-task learning (STL) and transfer learning experiments show that standard large vision models struggle in the unified setting, while UMAVR achieves competitive performance and reuses knowledge effectively under transfer and curriculum learning. The results argue for developing universal AVR solvers that leverage shared representations and cross-task learning to advance relational and abstract reasoning in vision systems.

Abstract

The field of Abstract Visual Reasoning (AVR) encompasses a wide range of problems, many of which are inspired by human IQ tests. The variety of AVR tasks has resulted in state-of-the-art AVR methods being task-specific approaches. Furthermore, contemporary methods consider each AVR problem instance not as a whole, but in the form of a set of individual panels with particular locations and roles (context vs. answer panels) pre-assigned according to the task-specific arrangements. While these highly specialized approaches have recently led to significant progress in solving particular AVR tasks, considering each task in isolation hinders the development of universal learning systems in this domain. In this paper, we introduce a unified view of AVR tasks, where each problem instance is rendered as a single image, with no a priori assumptions about the number of panels, their location, or role. The main advantage of the proposed unified view is the ability to develop universal learning models applicable to various AVR tasks. What is more, the proposed approach inherently facilitates transfer learning in the AVR domain, as various types of problems share a common representation. The experiments conducted on four AVR datasets with Raven's Progressive Matrices and Visual Analogy Problems, and one real-world visual analogy dataset show that the proposed unified representation of AVR tasks poses a challenge to state-of-the-art Deep Learning (DL) AVR models and, more broadly, contemporary DL image recognition methods. In order to address this challenge, we introduce the Unified Model for Abstract Visual Reasoning (UMAVR) capable of dealing with various types of AVR problems in a unified manner. UMAVR outperforms existing AVR methods in selected single-task learning experiments, and demonstrates effective knowledge reuse in transfer learning and curriculum learning setups.
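
To make the idea of rendering a problem instance as a single image concrete, the following is a minimal sketch, not the authors' rendering code. The function names `tile` and `render_unified`, the grid layout (context panels on top, answer panels below), the padding, and the fill value are all assumptions chosen for illustration.

```python
import numpy as np


def tile(panels, cols, pad, fill):
    """Place equally sized 2D panels on a padded grid with `cols` columns."""
    h, w = panels[0].shape
    rows = -(-len(panels) // cols)  # ceiling division
    canvas = np.full((pad + rows * (h + pad), pad + cols * (w + pad)),
                     fill, dtype=np.float32)
    for i, panel in enumerate(panels):
        r, c = divmod(i, cols)
        top, left = pad + r * (h + pad), pad + c * (w + pad)
        canvas[top:top + h, left:left + w] = panel
    return canvas


def render_unified(context, answers, context_cols=3, answer_cols=4,
                   pad=2, fill=1.0):
    """Render context and answer panels as one image (hypothetical layout).

    The model receiving this image gets no explicit information about how
    many panels there are, where they sit, or which role each one plays.
    """
    top = tile(context, context_cols, pad, fill)
    bottom = tile(answers, answer_cols, pad, fill)
    width = max(top.shape[1], bottom.shape[1])

    def widen(img):
        # Right-pad the narrower grid so both grids share the same width.
        out = np.full((img.shape[0], width), fill, dtype=img.dtype)
        out[:, :img.shape[1]] = img
        return out

    return np.concatenate([widen(top), widen(bottom)], axis=0)
```

With eight context panels and eight answer panels, as in a Raven's Progressive Matrix, the ninth context slot simply stays blank. A different task would only change the panel lists and column counts, while the output remains a single 2D array.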

Paper Structure (34 sections, 4 equations, 11 figures, 5 tables, 1 algorithm)

Figures (11)

  • Figure 1: Disjoint vs. unified perspective. Contemporary literature considers each AVR problem instance as a set of separate images (a), which leads to task-specific methods with limited applicability to other, even similar, tasks. In contrast, we propose the unified view (b) in which the problem instance is rendered as a single image (c). This viewpoint facilitates the development of general AVR solving models that are inherently capable of incorporating advances from broader CV research.
  • Figure 2: PGM embeddings. The embeddings of PGM matrices ($n_a=2$) from the test split of the Neutral regime, visualized with t-SNE van2008visualizing. For the sake of interpretability, the figure considers matrices with a single rule applied to Shape objects.
  • Figure 3: UMAVR architecture. The left side demonstrates UMAVR's processing of the unified input matrix $\chi^{\text{RPM}}$ resulting in the predicted index of the correct answer $\widehat{y}$, as well as the predicted rule representation $\widehat{r}$. The right side illustrates the architecture of the TokenMixer and ChannelMixer modules. The einops notation rogozhnikov2022einops is used to denote tensor transformations in the Rearrange module, where b, h, w, d denote the batch, height, width, and feature dimension, respectively.
  • Figure 4: PGM embeddings. The embeddings of PGM matrices ($n_a=2$) from the test split of the Neutral regime, visualized with t-SNE van2008visualizing. For the sake of interpretability, the figure considers matrices with a single rule applied to Shape objects.
  • Figure 5: I-RAVEN embeddings. The embeddings of I-RAVEN matrices ($n_a=2$) from the test split of the Center-Single configuration, visualized with t-SNE van2008visualizing. For the sake of interpretability, we consider matrices in which all attributes but one are governed by the Constant rule.
  • ...and 6 more figures
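
The Figure 3 caption mentions token mixing, channel mixing, and a rearrangement over the b, h, w, d axes. The sketch below shows a generic MetaFormer/MLP-Mixer-style block in NumPy to illustrate that pattern. It is not UMAVR's actual TokenMixer or ChannelMixer: the MLP mixers, the ReLU activation, the parameter-free layer normalization, and all function and parameter names are assumptions.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize over the last axis (no learned scale or shift, for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def mlp(x, w1, w2):
    """Two-layer perceptron with ReLU, applied along the last axis."""
    return np.maximum(x @ w1, 0.0) @ w2


def mixer_block(x, params):
    """One block: token mixing, then channel mixing, each with a residual."""
    b, h, w, d = x.shape
    tokens = x.reshape(b, h * w, d)  # einops: "b h w d -> b (h w) d"

    # Token mixing: swap axes so the MLP acts across spatial positions.
    y = layer_norm(tokens).transpose(0, 2, 1)            # (b, d, n)
    y = mlp(y, params["tok_w1"], params["tok_w2"])       # (b, d, n)
    tokens = tokens + y.transpose(0, 2, 1)               # (b, n, d)

    # Channel mixing: the MLP acts across features at each position.
    tokens = tokens + mlp(layer_norm(tokens),
                          params["ch_w1"], params["ch_w2"])
    return tokens.reshape(b, h, w, d)  # einops: "b (h w) d -> b h w d"


def init_params(n_tokens, d, hidden, rng, scale=0.02):
    """Small random weights for the two hypothetical mixer MLPs."""
    return {
        "tok_w1": rng.normal(0.0, scale, (n_tokens, hidden)),
        "tok_w2": rng.normal(0.0, scale, (hidden, n_tokens)),
        "ch_w1": rng.normal(0.0, scale, (d, hidden)),
        "ch_w2": rng.normal(0.0, scale, (hidden, d)),
    }
```

The block preserves the input shape, so several such blocks can be stacked before a prediction head. Because of the residual connections, setting every weight to zero turns the block into the identity map, which gives a quick sanity check.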