On the generalization capacity of neural networks during generic multimodal reasoning

Takuya Ito; Soham Dan; Mattia Rigotti; James Kozloski; Murray Campbell

TL;DR

This work tackles multimodal generalization by evaluating base neural architectures (RNNs, GRUs, Transformers, and Perceivers) on the configurable gCOG benchmark, which tests distractor, systematic, and productive OOD generalization. It demonstrates that cross-attention across modalities and deeper attention layers improve distractor and systematic generalization, while productive compositional generalization remains beyond reach for encoder-only models, even with scaling. The study provides a principled, reproducible evaluation framework and shows that while scale and cross-modal integration help in some OOD settings, new architectural ideas are needed for productive generalization in multimodal reasoning. The gCOG benchmark and findings offer a concrete direction for future research aiming to enhance neural multimodal reasoning capabilities with scalable, controllable evaluations.

Abstract

The advent of the Transformer has led to the development of large language models (LLMs), which appear to demonstrate human-like capabilities. To assess how well this class of models, and a variety of other base neural network architectures, extends to multimodal domains, we evaluated and compared their capacity for multimodal generalization. We introduce a multimodal question-answer benchmark to evaluate three specific types of out-of-distribution (OOD) generalization performance: distractor generalization (generalization in the presence of distractors), systematic compositional generalization (generalization to new task permutations), and productive compositional generalization (generalization to more complex task structures). We found that across model architectures (e.g., RNNs, Transformers, Perceivers), models with multiple attention layers, or models that leveraged cross-attention mechanisms between input domains, fared better. Our positive results demonstrate that for multimodal distractor and systematic generalization, either cross-modal attention or deeper attention layers are key architectural features required to integrate multimodal inputs. On the other hand, neither of these architectural features led to productive generalization, suggesting fundamental limitations of existing architectures for specific types of multimodal generalization. These results demonstrate the strengths and limitations of specific architectural components underlying modern neural models for multimodal reasoning. Finally, we provide Generic COG (gCOG), a configurable benchmark with several multimodal generalization splits, for future studies to explore.
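
As an illustration of the cross-attention mechanism mentioned above, the following is a minimal single-head sketch in plain Python. It follows the arrangement described in the Figure 2 caption, where the task-rule stream supplies the query and the stimulus stream supplies the key and value. It is not the authors' implementation: the token counts, the model width `d_model`, the random weights, and the helper names (`matmul`, `softmax`, `cross_attention`, `random_matrix`) are all assumptions chosen for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(rule_tokens, stim_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: rule tokens attend over stimulus tokens."""
    q = matmul(rule_tokens, w_q)  # queries from the rule stream, (n_rule, d)
    k = matmul(stim_tokens, w_k)  # keys from the stimulus stream, (n_stim, d)
    v = matmul(stim_tokens, w_v)  # values from the stimulus stream, (n_stim, d)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose keys to (d, n_stim)
    scores = matmul(q, k_t)  # (n_rule, n_stim)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned embeddings and projection weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model = 8  # assumed width, chosen only to keep the example small
rule_tokens = random_matrix(3, d_model, rng)  # 3 rule tokens (assumed)
stim_tokens = random_matrix(5, d_model, rng)  # 5 stimulus tokens (assumed)
w_q, w_k, w_v = (random_matrix(d_model, d_model, rng) for _ in range(3))

out, weights = cross_attention(rule_tokens, stim_tokens, w_q, w_k, w_v)
print(len(out), len(out[0]))  # one output vector per rule token
```

Each rule token receives a weighted mixture of stimulus values, so the rule stream decides which parts of the stimulus are read out. This is one plausible reading of why such a mechanism could help when many distractors are present, not a claim taken from the paper.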

Paper Structure (19 sections, 7 equations, 11 figures)

Introduction
Related work
Contributions
Experimental design
gCOG for multimodal and compositional evaluation
Model architectures
Results
Distractor generalization
Systematic compositional generalization
Productive compositional generalization
Impact of layer depth and attention heads on generalization
Conclusion
Appendix
Representation analysis of model architectures
Additional details on task design
...and 4 more sections

Figures (11)

Figure 1: gCOG task. We adapted the previously developed COG task (Yang et al., 2018). Our modifications to COG included different task operators, categorical tokens that permit generic testing of multimodal reasoning, and arbitrarily long task instructions that permit the evaluation of compositional productivity. A) Task operators and objects serve as the core units of the gCOG task. A task operator (e.g., Exist) is paired with a specific feature combination (e.g., orange + "t"). Feature categories correspond to shape (i.e., letters "a" through "z") and color (i.e., 10 discretely coded colors), but can be naturally extended. B) At minimum, a task must comprise one specific task operator (e.g., Exist) and a feature combination (e.g., orange "t"). An arbitrary number of stimuli (e.g., images) can be constructed on-the-fly to satisfy this task instruction (i.e., produce a TRUE or FALSE response). C) Tasks can be combined with a conditional operator (e.g., an IF-THEN-ELSE conditional) to increase the task complexity. This enables the construction of arbitrarily complex tasks. While the original COG task explored only task trees of depth 3 (i.e., a single conditional), we relaxed this constraint to allow for arbitrarily long task trees. Dataset: https://github.com/IBM/gcog
Figure 2: Model architectures. We evaluated generalization across six base neural network architectures. A) RNNs and GRUs with 512 hidden units. B) Single Stream Transformer (SSTfmr), which processes task rules and stimuli in a single Transformer, applying self-attention in the Transformer block. C) Dual Stream Transformer (DSTfmr). In contrast to the SSTfmr, two parallel Transformer blocks process rule and image tokens separately, and then process them together in a shared MLP. D) Transformers with cross-attention (CrossAttn). The outputs of two parallel Transformer blocks are processed with a cross-attention mechanism, where the output of the task rule Transformer produces a query, and the stimulus Transformer block produces a key and value matrix. E) A Perceiver-like architecture, which integrates both task and stimulus output information in a latent Transformer through cross-attention (Jaegle et al., 2021). F) The number of parameters for each model.
Figure 3: Distractor generalization. A) Experimental evaluation for distractor generalization. We trained models on individual task operators (e.g., "Exist red d") on stimuli that included 1 to 5 distractors, and then evaluated OOD generalization performance on stimuli with 10, 20, 30, and 40 distractors. B) Loss and C) accuracy trajectories during training for all models. All models converged to greater than 94% accuracy. D) Distractor generalization performance for each model. We assessed IID distractor generalization (novel stimuli, but with 1 or 5 distractors), and OOD distractor generalization (10, 20, 30, or 40 distractors). For most models, performance declined as the number of distractors increased. E) We directly compared IID vs. OOD distractor generalization by averaging performance over the IID and OOD splits. Models incorporating a cross-attention mechanism (CrossAttn and Perceiver) clearly exhibited the best performance.
Figure 4: A) Systematicity on individual task operators, where specific objects (e.g., a blue "a") are trained on a subset of operators, and then tested on a distinct set of operators. This evaluates whether the model can generalize to new operator and object combinations. B) Training trajectories. C) CrossAttn and Perceiver-like models exhibit excellent systematic generalization, while other models performed worse. D) Another test of systematicity is to train on task trees of depth 3, and then test on novel combinations of task trees of depth 3. E) Training trajectories. All models were able to efficiently learn this task variant. (Note that periodic spikes in the loss function are due to a resampling of the training dataset due to model checkpointing and/or disruption to a compute job.) F) While overall generalization performance is lower (even on IID generalization), cross-attention models still perform systematic compositional generalization well above chance.
Figure 5: Productive compositional generalization performance. A) OOD productivity performance of all models to novel tasks of greater complexity (i.e., deeper task trees). We trained models on task trees of depth 1 and depth 3, and then tested generalization to task trees of depth 5 and 7. While the B) training loss and C) training accuracy converged for all models, D) all models failed to perform OOD productive compositional generalization to more complex task trees. (Note that periodic spikes in the loss function are due to a resampling of the training dataset due to model checkpointing and/or disruption to a compute job.)
...and 6 more figures
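
To make the task structure in the captions above concrete, the following is a minimal sketch of a gCOG-style task tree in Python. It is not the code from the dataset repository. It models only the Exist operator and the IF-THEN-ELSE conditional, treats a stimulus as a flat list of (color, shape) objects with no spatial layout, and counts depth so that a single operator has depth 1 and each conditional along the deepest path adds 2 (so a single conditional has depth 3, matching the Figure 1 caption). The color names, the class and function names, and the depth convention beyond depth 3 are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple, Union

# 10 discretely coded colors; these names are hypothetical placeholders.
COLORS = ["red", "orange", "yellow", "green", "blue",
          "purple", "pink", "brown", "gray", "black"]
SHAPES = [chr(c) for c in range(ord("a"), ord("z") + 1)]  # letters a-z

Stimulus = List[Tuple[str, str]]  # each object is a (color, shape) pair


@dataclass(frozen=True)
class Exist:
    color: str
    shape: str


@dataclass(frozen=True)
class IfThenElse:
    condition: "Task"
    then_task: "Task"
    else_task: "Task"


Task = Union[Exist, IfThenElse]


def evaluate(task: Task, stimulus: Stimulus) -> bool:
    """Return the TRUE/FALSE answer of a task tree on one stimulus."""
    if isinstance(task, Exist):
        return (task.color, task.shape) in stimulus
    branch = task.then_task if evaluate(task.condition, stimulus) else task.else_task
    return evaluate(branch, stimulus)


def depth(task: Task) -> int:
    """Assumed convention: operator = 1, each nested conditional adds 2."""
    if isinstance(task, Exist):
        return 1
    return 2 + max(depth(task.condition), depth(task.then_task), depth(task.else_task))


def make_stimulus(target, n_distractors, include_target, rng) -> Stimulus:
    """Build a stimulus with n_distractors non-target objects."""
    objects = [target] if include_target else []
    while len(objects) < n_distractors + int(include_target):
        candidate = (rng.choice(COLORS), rng.choice(SHAPES))
        if candidate != target:  # distractors never match the target
            objects.append(candidate)
    return objects


def productivity_split(task: Task) -> str:
    """Depths 1 and 3 for training, depths 5 and 7 for OOD testing (Figure 5)."""
    d = depth(task)
    if d in (1, 3):
        return "train"
    if d in (5, 7):
        return "ood_productive"
    return "unused"


rng = random.Random(0)
target = ("orange", "t")
simple = Exist(*target)
nested = IfThenElse(Exist("red", "d"), simple, Exist("blue", "a"))
deeper = IfThenElse(nested, simple, Exist("green", "q"))
stim = make_stimulus(target, n_distractors=40, include_target=True, rng=rng)
print(evaluate(simple, stim), depth(simple), depth(nested), depth(deeper))
# prints: True 1 3 5
```

Under these assumptions, the distractor split corresponds to varying `n_distractors` (1 to 5 in training, 10 to 40 at test), and the productivity split corresponds to holding out task trees by `depth`.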
