MotherNet: Fast Training and Inference via Hyper-Network Transformers

Andreas Müller; Carlo Curino; Raghu Ramakrishnan

TL;DR

The paper tackles the challenge of applying foundation-model-style learning to tabular data by introducing MotherNet, a transformer-based hypernetwork that generates a compact, two-hidden-layer MLP in a single forward pass via in-context learning, after meta-training on synthetic data. Because no dataset-specific gradient descent is needed, MotherNet produces a ready-to-use model quickly and reaches competitive accuracy on small tabular datasets, outperforming neural networks trained with gradient descent and offering significant speedups over tuned AutoML approaches. The approach relies on a low-rank weight decomposition to keep the generated child network compact, while meta-training on synthetic tasks provides generalization across diverse datasets. Compared to TabPFN and HyperFast, MotherNet delivers faster inference with comparable or superior performance on small datasets, suggesting a practical path toward fast, tuning-free foundation models for tabular domains.
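
To make the low-rank child network more concrete, the following is a minimal NumPy sketch of how a flat vector $\phi$ could be reshaped into the weights of a two-hidden-layer MLP and then used for prediction. It is an illustration under stated assumptions, not the authors' implementation: the function names, the parameter layout inside $\phi$, the choice to factor both hidden layers as a product of two thin matrices, and the ReLU activation are all hypothetical. In MotherNet, $\phi$ would come from the transformer; here any vector of the right length can stand in for it.

```python
import numpy as np


def child_param_shapes(n_features, hidden, rank, n_classes):
    """Hypothetical parameter layout for a two-hidden-layer MLP whose
    hidden-layer weights are stored in low-rank form W = U @ V."""
    return [
        ("U1", (n_features, rank)), ("V1", (rank, hidden)), ("b1", (hidden,)),
        ("U2", (hidden, rank)), ("V2", (rank, hidden)), ("b2", (hidden,)),
        ("W_out", (hidden, n_classes)), ("b_out", (n_classes,)),
    ]


def child_param_count(n_features, hidden, rank, n_classes):
    """Length that the flat vector phi must have for the layout above."""
    shapes = child_param_shapes(n_features, hidden, rank, n_classes)
    return sum(int(np.prod(shape)) for _, shape in shapes)


def unpack_child_mlp(phi, n_features, hidden, rank, n_classes):
    """Slice the flat vector phi into named weight arrays."""
    params, offset = {}, 0
    for name, shape in child_param_shapes(n_features, hidden, rank, n_classes):
        size = int(np.prod(shape))
        params[name] = phi[offset:offset + size].reshape(shape)
        offset += size
    if offset != phi.size:
        raise ValueError(f"phi has {phi.size} entries, expected {offset}")
    return params


def child_predict_proba(params, X):
    """Forward pass of the child MLP: two low-rank ReLU layers, then softmax."""
    h = np.maximum(X @ params["U1"] @ params["V1"] + params["b1"], 0.0)
    h = np.maximum(h @ params["U2"] @ params["V2"] + params["b2"], 0.0)
    logits = h @ params["W_out"] + params["b_out"]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)
```

With a rank much smaller than the hidden width, the factored layers need far fewer entries in $\phi$ than full weight matrices would, which is the point of the decomposition. Prediction with the child network is then only a handful of small matrix products and does not involve the transformer at all.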

Abstract

Foundation models are transforming machine learning across many modalities, with in-context learning replacing classical model training. Recent work on tabular data hints at a similar opportunity to build foundation models for classification on numerical data. However, existing meta-learning approaches cannot compete with tree-based methods in terms of inference time. In this paper, we propose MotherNet, a hypernetwork architecture trained on synthetic classification tasks that, once prompted with a never-seen-before training set, generates the weights of a trained "child" neural network by in-context learning using a single forward pass. In contrast to most existing hypernetworks, which are usually trained for relatively constrained multi-task settings, MotherNet can create models for multiclass classification on arbitrary tabular datasets without any dataset-specific gradient descent. The child network generated by MotherNet outperforms neural networks trained using gradient descent on small datasets, and is comparable to predictions by TabPFN and standard ML methods like Gradient Boosting. Unlike a direct application of TabPFN, MotherNet-generated networks are highly efficient at inference time. We also demonstrate that HyperFast is unable to perform effective in-context learning on small datasets, and relies heavily on dataset-specific fine-tuning and hyper-parameter tuning, while MotherNet requires no fine-tuning or per-dataset hyper-parameters.

Paper Structure (20 sections, 2 equations, 8 figures, 11 tables)

Introduction
Related Work
LLMs for Tabular Data
TabPFN
Learning to Learn and Meta-learning
Methodology
Limitations of TabPFN
MotherNet: Generating Model Weights
In-Context Learning with MotherNet
Distillation Baseline
Experimental Evaluation
OpenML CC18 (small)
TabZilla
Limitations and Future Work
Conclusion
...and 5 more sections

Figures (8)

Figure 1: MotherNet architecture. Given training data $(x_1, y_1),\dots,(x_r, y_r)$, the transformer produces a vector $\phi$, which is reshaped to weight matrices of an MLP with low-rank weight structure. Green blocks are individual data points and their activations, orange layers are activations created during in-context learning or in the forward-pass during meta-learning, and light blue layers are learned during meta-training.
Figure 2: Comparison of TabPFN, HyperFast, MLP-distill and MotherNet with tuned baselines over the test datasets of Hollmann et al. (2022), listed in Table \ref{datasets}. † means 1h of HPO on CPU, ‡ means 1h of HPO on GPU. Left: Comparison of normalized mean ROC AUC. Right: Critical Difference Diagram (Demšar, 2006).
Figure 3: Comparing decision boundaries on synthetic toy datasets, adapted from the scikit-learn (Pedregosa et al., 2011) documentation. MotherNet decision boundaries closely resemble those of TabPFN and, to a lesser degree, those of a traditional neural network.
Figure 4: Comparison of TabPFN, HyperFast, MLP-distill and MotherNet with tuned baselines over the validation datasets of Hollmann et al. (2022), listed in Table \ref{validation_datasets}. † means 1h of HPO on CPU, ‡ means 1h of HPO on GPU. Left: Comparison of normalized mean ROC AUC. Right: Critical Difference Diagram (Demšar, 2006). Compare with Figure 2 for the test set. We did not include HyperFast, MLP and ResNet because of the extreme computational cost of tuning these models for each dataset.
Figure 5: Left: Runtime vs. normalized AUC trade-off, based on the numbers in Table \ref{tab:cc18results}, not considering tuning times for the competing algorithms. Algorithms in the bottom right are both fast and accurate. Note that the y-axis uses a log scale. Right: Critical Difference Diagram on the TabZilla benchmark using ROC AUC, corresponding to the results in Table 2.
...and 3 more figures
