In-Context Compositional Learning via Sparse Coding Transformer

Wei Chen, Jingxi Yu, Zichen Miao, Qiang Qiu

TL;DR

This paper tackles in-context compositional learning by rethinking the Transformer attention through sparse coding. It introduces two input-dependent dictionaries, an encoding $\phi(\cdot)$ and a decoding $\psi(\cdot)$, and enforces sparsity on the resulting coefficients $\boldsymbol{\alpha}$ to reveal and preserve compositional structure. Target task coefficients are estimated as a linear combination of context-task coefficients via a lifting-inspired scheme with learnable weights $\lambda_i$, enabling transfer of learned rules across tasks. Empirical results on S-RAVEN and RAVEN show that sparse-coding attention markedly improves compositional generalization and reconstruction quality, outperforming standard Transformers and showing robustness where dense attention fails. The approach also demonstrates potential benefits for language-model reasoning tasks and offers a parameter-efficient path to integrating structured inductive bias into pre-trained architectures, albeit with limitations in scaling to very large models.

Abstract

Transformer architectures have achieved remarkable success across language, vision, and multimodal tasks, and there is growing demand for them to address in-context compositional learning tasks. In these tasks, models solve the target problems by inferring compositional rules from context examples, which are composed of basic components structured by underlying rules. However, some of these tasks remain challenging for Transformers, which are not inherently designed to handle compositional tasks and offer limited structural inductive bias. In this work, inspired by the principle of sparse coding, we propose a reformulation of the attention to enhance its capability for compositional tasks. In sparse coding, data are represented as sparse combinations of dictionary atoms with coefficients that capture their compositional rules. Specifically, we reinterpret the attention block as a mapping of inputs into outputs through projections onto two sets of learned dictionary atoms: an encoding dictionary and a decoding dictionary. The encoding dictionary decomposes the input into a set of coefficients, which represent the compositional structure of the input. To enhance structured representations, we impose sparsity on these coefficients. The sparse coefficients are then used to linearly combine the decoding dictionary atoms to generate the output. Furthermore, to assist compositional generalization tasks, we propose estimating the coefficients of the target problem as a linear combination of the coefficients obtained from the context examples. We demonstrate the effectiveness of our approach on the S-RAVEN and RAVEN datasets. For certain compositional generalization tasks, our method maintains performance even when standard Transformers fail, owing to its ability to learn and apply compositional rules.
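The reformulation described in the abstract can be illustrated with a minimal NumPy sketch. It assumes that both dictionaries are linear projections of the input (`W_phi`, `W_psi`), that coefficients come from a scaled inner product between a query projection (`W_q`) and the encoding atoms, and that sparsity is imposed by soft-thresholding with threshold `tau`. All of these names and choices are illustrative assumptions; the abstract does not specify the projections or the sparsifying function.

```python
import numpy as np

def soft_threshold(a, tau):
    """Shrink entries toward zero; entries with |a| <= tau become exactly 0."""
    return np.sign(a) * np.maximum(np.abs(a) - tau, 0.0)

def sparse_coding_attention(X, W_q, W_phi, W_psi, tau=0.1):
    """Hypothetical sketch: encode over phi(X), sparsify, decode over psi(X)."""
    phi = X @ W_phi                                  # encoding dictionary atoms, (n, d)
    psi = X @ W_psi                                  # decoding dictionary atoms, (n, d)
    scores = (X @ W_q) @ phi.T / np.sqrt(phi.shape[1])
    alpha = soft_threshold(scores, tau)              # sparse coefficients, (n, n)
    Z = alpha @ psi                                  # sparse combination of decoding atoms
    return Z, alpha

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.standard_normal((n, d))
W_q, W_phi, W_psi = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
Z, alpha = sparse_coding_attention(X, W_q, W_phi, W_psi, tau=0.5)
```

Compared with a softmax attention map, which assigns a nonzero weight to every atom, the thresholded `alpha` can contain exact zeros, so each output row depends on only a subset of decoding atoms.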

Paper Structure

This paper contains 46 sections, 1 theorem, 23 equations, 6 figures, 4 tables.

Key Result

Proposition 9.4

There exists a set of weights $\lambda_1, \dots, \lambda_{L-1}$ such that
$$\boldsymbol{\alpha}_L = \sum_{i=1}^{L-1} \lambda_i \boldsymbol{\alpha}_i,$$
and $\boldsymbol{\alpha}_L$ reconstructs $\mathbf{Z}_L$ using only elements in $\psi(\mathbf{X})$.
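As a toy illustration of the idea behind the proposition, the sketch below forms target coefficients as a weighted sum of context-task coefficients, as the TL;DR describes, and decodes them with a small decoding dictionary. The function name `estimate_target_coefficients`, the array shapes, and the hand-picked weights are assumptions for illustration; in the paper the weights $\lambda_i$ are learnable.

```python
import numpy as np

def estimate_target_coefficients(context_alphas, lambdas):
    """Return sum_i lambda_i * alpha_i over the context tasks."""
    context_alphas = np.asarray(context_alphas, dtype=float)  # (L-1, m, k)
    lambdas = np.asarray(lambdas, dtype=float)                # (L-1,)
    return np.tensordot(lambdas, context_alphas, axes=1)      # (m, k)

# Toy setup (hypothetical values): k = 3 decoding atoms of dimension 2,
# and L - 1 = 2 context tasks with one coefficient row each.
psi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
alpha_1 = np.array([[1.0, 0.0, 0.0]])
alpha_2 = np.array([[0.0, 0.0, 1.0]])
lambdas = [0.5, 0.5]  # fixed by hand here; learnable in the paper

alpha_L = estimate_target_coefficients([alpha_1, alpha_2], lambdas)
Z_L = alpha_L @ psi  # target output built only from atoms in psi
```

Here `alpha_L` is `[[0.5, 0.0, 0.5]]` and `Z_L` is `[[1.0, 0.5]]`: the target output uses only the atoms that the context coefficients already select.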

Figures (6)

  • Figure 1: Illustration of the in-context compositional learning task. The input data includes both the context tasks and the target task. The goal is to solve the target task by inferring and applying the compositional rule observed in the context tasks. (a) Applying the principles of sparse coding to represent the data. Given a dictionary, the input data can be sparsely represented using a set of coefficients that encode underlying compositional rules. Encoding/decoding data: An example of one task is composed of four elements from the dictionary, with indices "6, 0, 0, 4." After one-hot embedding, we obtain a $4 \times 10$ matrix, where each nonzero entry corresponds to a specific element in the dictionary. By stacking all 9 examples, we obtain a $36 \times 10$ matrix representing the coefficients. Compositional rules: Each row of the input data follows an underlying pattern. If the first two shapes are constructed as $(A, \emptyset, \emptyset, B)$ and $(B, \emptyset, \emptyset, C)$, where $A$, $B$, and $C$ correspond to unique elements in the dictionary and $\emptyset$ means an empty shape, then the third shape should be $(C, \emptyset, \emptyset, A)$. (b) Representing the compositional rules as coefficients provides an effective way to estimate the coefficients of the target task from those of the context tasks. Once inferred, these coefficients can be decoded into the final output using the dictionary. Details of this task are described in Section \ref{sec:alys}.
  • Figure 2: (a) The attention block produces the output as a linear combination of the value matrix, weighted by the attention map. (b) Our framework reformulates the attention mechanism: Outputs are constructed as sparse combinations of learned dictionary atoms, i.e., the decoding dictionary $\psi(\mathbf{X})$, and their coefficients $\boldsymbol{\alpha}$ represent compositional rules. (c) Details of our method: The coefficients $\boldsymbol{\alpha}$ are obtained by decomposing the input features over the encoding dictionary $\phi(\mathbf{X})$, and then achieving sparse representations with a nonlinear function $\sigma(\cdot)$. Since the coefficients of the target task only provide partial information about its compositional rule due to limited observations, we propose to estimate the coefficients of the target task $\boldsymbol{\alpha}_L$ as a simple linear combination of the context task coefficients, i.e., $\boldsymbol{\alpha}'=g(\boldsymbol{\alpha})$. Further details are provided in Section \ref{subsec:method}.
  • Figure 3: The effectiveness of sparse coefficients (attention map). Models are trained on setting (a) and tested on both setting (a) and novel setting (b), which has a different compositional rule. The baseline method, Transformer with standard MHA, produces blurry outputs due to dense coefficients, which lead to mixed and entangled results. In contrast, our sparse coefficients prevent this blurring and effectively transfer the construction rule from the context tasks to the target task. Further details are in Section \ref{sec:alys}.
  • Figure 4: (Table) Accuracy comparison between our method and baseline methods on the Symbolic RAVEN (S-RAVEN) dataset. Our method consistently achieves higher accuracy than baselines. (Plot) Results on the RAVEN dataset. It shows the percentage of test samples with PSNR values exceeding a given threshold. At lower PSNR levels, the baseline method performs similarly to ours. However, for PSNR values above 40, the baseline achieves nearly 0 coverage, whereas our method retains over 30% of the samples.
  • Figure 5: Example results of RAVEN. The model predicts the 9th panel based on the first 8 panels. We compare our method with and without $g(\boldsymbol{\alpha})$, the coefficient estimation for the target task, alongside the baseline method. The baseline often yields blurry images with incorrect layouts, whereas our method preserves structure and improves compositional accuracy. However, all models occasionally fail on the most challenging cases, e.g., (d).
  • ...and 1 more figure
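The encoding described in the Figure 1 caption can be sketched in a few lines of NumPy. The example with indices "6, 0, 0, 4" and the $4 \times 10$ and $36 \times 10$ shapes come from the caption; treating index 0 as the empty shape, the helper names, and the specific $(A, B, C)$ values for the three rows are assumptions made for illustration.

```python
import numpy as np

DICT_SIZE = 10  # the caption's one-hot matrices have 10 columns
EMPTY = 0       # assumption: index 0 denotes the empty shape

def one_hot_task(indices, dict_size=DICT_SIZE):
    """One-hot encode a 4-element shape into a (4, dict_size) coefficient matrix."""
    indices = np.asarray(indices)
    coeffs = np.zeros((len(indices), dict_size))
    coeffs[np.arange(len(indices)), indices] = 1.0
    return coeffs

def row_from_rule(a, b, c):
    """Three shapes following (A,0,0,B), (B,0,0,C) -> (C,0,0,A)."""
    return [(a, EMPTY, EMPTY, b), (b, EMPTY, EMPTY, c), (c, EMPTY, EMPTY, a)]

example = one_hot_task([6, 0, 0, 4])           # the caption's example, shape (4, 10)

# Hypothetical (A, B, C) choices for the three rows of the puzzle.
rows = [row_from_rule(6, 4, 2), row_from_rule(3, 7, 5), row_from_rule(1, 8, 9)]
shapes = [shape for row in rows for shape in row]            # 9 shapes in total
stacked = np.vstack([one_hot_task(s) for s in shapes])       # shape (36, 10)

# Decoding: the position of the nonzero entry recovers each dictionary index.
decoded_last = stacked[-4:].argmax(axis=1).tolist()
```

With these values `decoded_last` is `[9, 0, 0, 1]`, the $(C, \emptyset, \emptyset, A)$ shape of the third row, showing that the one-hot coefficients carry the rule and can be decoded back to dictionary indices.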

Theorems & Definitions (2)

  • Proposition 9.4
  • proof