DeepOSets: Non-Autoregressive In-Context Learning with Permutation-Invariance Inductive Bias

Shao-Ting Chiu, Junyuan Hong, Ulisses Braga-Neto

TL;DR

This work shows that in-context learning (ICL) for regression can arise in a non-autoregressive, permutation-invariant architecture called DeepOSets, which fuses set learning (DeepSets) with operator learning (DeepONets). It proves a representation theorem: a permutation-invariant ICL operator $\Phi_n$ can be decomposed into a continuous encoder and a continuous decoding operator. It also proves that DeepOSets are universal approximators of such operators. Empirically, DeepOSets achieve accurate ICL on linear, shallow neural network, and polynomial regression tasks with an order of magnitude fewer parameters and faster training than a comparable autoregressive transformer. A Set Transformer variant (DeepOSets-T) offers higher accuracy in high-dimensional settings at the cost of increased complexity, which is mitigated by inducing-point techniques (DeepOSets-TI). The results highlight efficient, parallelizable ICL and potential auto-ML capabilities for in-prompt model selection, with practical implications for scalable, robust meta-learning and operator learning in regression problems.

Abstract

In-context learning (ICL) is the remarkable ability displayed by some machine learning models to learn from examples provided in a user prompt without any model parameter updates. ICL was first observed in the domain of large language models, and it has been widely assumed that it is a product of the attention mechanism in autoregressive transformers. In this paper, using stylized regression learning tasks, we demonstrate that ICL can emerge in a non-autoregressive neural architecture with a hard-coded permutation-invariance inductive bias. This novel architecture, called DeepOSets, combines the set learning properties of the DeepSets architecture with the operator learning capabilities of Deep Operator Networks (DeepONets). We provide a representation theorem for permutation-invariant regression learning operators and prove that DeepOSets are universal approximators of this class of operators. We performed comprehensive numerical experiments to evaluate the capabilities of DeepOSets in learning linear, polynomial, and shallow neural network regression, under varying noise levels, dimensionalities, and sample sizes. In the high-dimensional regime, accuracy was enhanced by replacing the DeepSets layer with a Set Transformer. Our results show that DeepOSets deliver accurate and fast results with an order of magnitude fewer parameters than a comparable transformer-based alternative.
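
To make the combination of DeepSets and DeepONets concrete, here is a minimal, untrained sketch of a forward pass in plain Python. It is an illustration under assumptions, not the authors' implementation: the layer sizes, the tanh activations, the mean pooling, and the names `init_mlp`, `mlp`, and `deeposets_predict` are all hypothetical. A DeepSets stage embeds each in-context pair independently and pools the embeddings, and a DeepONet stage combines a branch network (fed the pooled code) with a trunk network (fed the query point) through a dot product.

```python
import math
import random

def init_mlp(rng, sizes):
    """Random, untrained weights for an MLP with the given layer sizes."""
    return [
        ([[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
          for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def mlp(params, x):
    """Forward pass: tanh on hidden layers, linear output layer."""
    for k, (weights, biases) in enumerate(params):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if k < len(params) - 1:
            x = [math.tanh(v) for v in x]
    return x

def deeposets_predict(model, context, x_query):
    """Predict y at x_query from a set of (x, y) pairs in one forward pass."""
    # DeepSets stage: embed every (x_i, y_i) pair independently ...
    embedded = [mlp(model["phi"], list(x) + [y]) for x, y in context]
    # ... then mean-pool, which discards the order of the examples.
    pooled = [math.fsum(col) / len(embedded) for col in zip(*embedded)]
    # DeepONet stage: the branch net reads the pooled code,
    # the trunk net reads the query point, and a dot product combines them.
    branch = mlp(model["branch"], pooled)
    trunk = mlp(model["trunk"], list(x_query))
    return sum(b * t for b, t in zip(branch, trunk))

rng = random.Random(0)
d, latent, basis = 1, 8, 6  # hypothetical sizes, chosen only for illustration
model = {
    "phi": init_mlp(rng, [d + 1, 16, latent]),
    "branch": init_mlp(rng, [latent, 16, basis]),
    "trunk": init_mlp(rng, [d, 16, basis]),
}

# Hypothetical prompt: 13 noisy samples of y = 2x - 1.
context = [((x,), 2.0 * x - 1.0 + rng.gauss(0.0, 0.1))
           for x in (rng.uniform(-1.0, 1.0) for _ in range(13))]
shuffled = context[:]
rng.shuffle(shuffled)

y_a = deeposets_predict(model, context, (0.3,))
y_b = deeposets_predict(model, shuffled, (0.3,))
# Reordering the in-context examples should not change the prediction.
print(math.isclose(y_a, y_b, abs_tol=1e-12))
```

Because the weights are random, the predicted value itself is meaningless; the point of the sketch is that the whole prompt is processed in a single pass and that the output does not depend on the order of the examples. A trained model would learn the three networks from many sampled regression tasks.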

Paper Structure

This paper contains 14 sections, 2 theorems, 9 equations, 5 figures, 3 tables.

Key Result

Theorem 1

A continuous operator $\Phi_n: (\mathcal{X} \times \mathcal{Y})^n \rightarrow \mathcal{H}$ is permutation-invariant if and only if it is continuously sum-decomposable through $\mathbb{R}^{\binom{n+d+p}{n}}$.
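
To get a feel for the latent dimension $\binom{n+d+p}{n}$ in Theorem 1, the short sketch below evaluates it for a few prompt sizes. This excerpt does not define $d$ and $p$; the sketch assumes they are the dimensions of $\mathcal{X}$ and $\mathcal{Y}$. The function name `latent_dim` and the specific $(n, d, p)$ triples are illustrative choices: the values of $n$ echo the figure captions, while $d = 1$ for the first case and $p = 1$ throughout are assumptions.

```python
from math import comb

def latent_dim(n, d, p):
    """Dimension binom(n + d + p, n) appearing in Theorem 1.

    Assumes n is the number of in-context examples and d, p are the
    input and output dimensions (an assumption, not stated in the excerpt).
    """
    return comb(n + d + p, n)

# Hypothetical settings loosely mirroring the prompt sizes in the captions.
for n, d, p in [(13, 1, 1), (41, 20, 1), (101, 20, 1)]:
    print(f"n={n}, d={d}, p={p}: latent dimension {latent_dim(n, d, p)}")
```

The dimension grows combinatorially with the prompt size and the data dimension. It is the size that the representation theorem guarantees is sufficient, which need not match the latent size a trained model uses in practice.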

Figures (5)

  • Figure 1: DeepOSets architecture for in-context learning of regression with built-in permutation invariance inductive bias.
  • Figure 2: Learning linear regression with DeepOSets with $n=13$ in-context examples in the training set. The black dots represent 10 in-context test examples corrupted by Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma^2 = 0.1)$.
  • Figure 3: Performance comparison of DeepOSets-T (with a Set Transformer), its variant with inducing points (DeepOSets-TI), and a transformer baseline on 20-dimensional linear regression. The vertical line denotes the training set size of in-context examples (41), and $m$ indicates the number of inducing points used in DeepOSets-TI.
  • Figure 4: (a) Performance of DeepOSets-T on 20-dimensional shallow neural network regression. The vertical line indicates the size of the training set (101). (b) The same model also performs well on linear regression tasks, achieving results comparable to standard neural network training. These experiments further highlight the consistency of DeepOSets-T when handling long in-context sequences.
  • Figure 5: Polynomial regression results. The blue line represents the leave-one-out polyfit baseline, and the orange line represents the DeepOSets model. Results are shown across 30 randomly generated functions.

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2