Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights

Yulu Gan, Phillip Isola

Abstract

Pretraining produces a learned parameter vector that is typically treated as a starting point for further iterative adaptation. In this work, we instead view the outcome of pretraining as a distribution over parameter vectors, whose support already contains task-specific experts. We show that in small models such expert solutions occupy a negligible fraction of the volume of this distribution, making their discovery reliant on structured optimization methods such as gradient descent. In contrast, in large, well-pretrained models the density of task-experts increases dramatically, so that diverse, task-improving specialists populate a substantial fraction of the neighborhood around the pretrained weights. Motivated by this perspective, we explore a simple, fully parallel post-training method that samples $N$ parameter perturbations at random, selects the top $K$, and ensembles predictions via majority vote. Despite its simplicity, this approach is competitive with standard post-training methods such as PPO, GRPO, and ES for contemporary large-scale models.
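
The sample, select, and vote procedure described in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: the linear classifier, the isotropic Gaussian noise with scale `sigma`, the accuracy-based selection score, and all names (`randopt`, `predict`, `accuracy`, `theta0`) are assumptions made for illustration.

```python
import numpy as np

def predict(theta, x):
    # Toy "model": a linear classifier with integer labels in {0, 1}.
    return (x @ theta > 0).astype(int)

def accuracy(theta, x, y):
    return float((predict(theta, x) == y).mean())

def randopt(theta0, x_train, y_train, x_test, n=200, k=5, sigma=0.3, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Sample n perturbations (isotropic Gaussian noise is an assumption).
    candidates = theta0 + sigma * rng.standard_normal((n, theta0.size))
    # 2) Score each candidate independently; this loop is trivially parallel.
    scores = np.array([accuracy(c, x_train, y_train) for c in candidates])
    # 3) Keep the k highest-scoring candidates.
    top = candidates[np.argsort(scores)[-k:]]
    # 4) Majority vote over the k candidates' predictions.
    votes = np.stack([predict(c, x_test) for c in top])  # shape (k, n_test)
    counts = np.stack([(votes == label).sum(axis=0) for label in (0, 1)])
    return counts.argmax(axis=0), scores

# Synthetic data; theta0 is a noisy stand-in for "pretrained" weights.
rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5])
x_train, x_test = rng.standard_normal((300, 3)), rng.standard_normal((200, 3))
y_train, y_test = predict(w_true, x_train), predict(w_true, x_test)
theta0 = w_true + 0.8 * rng.standard_normal(3)

preds, scores = randopt(theta0, x_train, y_train, x_test)
print("base accuracy:    ", accuracy(theta0, x_test, y_test))
print("ensemble accuracy:", float((preds == y_test).mean()))
```

No candidate depends on any other, so the scoring step needs no sequential updates. An odd `k` avoids ties in the binary vote; with an even `k`, `argmax` would break ties toward label 0.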

Paper Structure (77 sections, 1 theorem, 14 equations, 14 figures, 8 tables, 1 algorithm)

Key Result

Proposition H.1

For any valid correlation matrix $\mathbf{C} \in \mathbb{R}^{M \times M}$, the Spectral Discordance $\mathcal{D} = 1 - \frac{1}{M(M-1)} \sum_{j \neq k} \mathbf{C}_{jk}$ is bounded by:

Figures (14)

  • Figure 1: (a) Schematic of the main effects we observe (see Fig. \ref{fig:density} for a version with real data). Left: Small models live in a needle-in-a-haystack regime, where good solutions to downstream tasks occupy a tiny fraction of the surrounding weights. In this regime, it is important to have a smart search algorithm, such as gradient descent or other forms of iterative optimization. Right: Large models are surrounded by a veritable thicket of task-specific solutions. In this regime, random sampling is sufficient to quickly land on promising adaptations, which can then be ensembled to yield strong behavior, an approach we call RandOpt. (b) Solution density -- i.e., density of task-improving weights in a Gaussian neighborhood of the pretrained weights -- scales with model size. (c) RandOpt is $\mathcal{O}(1)$ in training steps, FLOP-efficient, and competitive in converged accuracy with GRPO and ES. Results are shown on the Countdown task with Olmo-3-7B-Instruct; RandOpt uses 5000 random weight guesses and ensembles the top $K$; K-pass baselines use Test-time Majority Vote (TT-MV). More results are shown in Fig. \ref{fig:acc_v2} and Table \ref{appendix::tab-acc}.
  • Figure 2: Accuracy landscapes in weight space across model scales and reasoning tasks. We perturb the pretrained Qwen2.5 models (from 0.5B to 32B) with 1000 random weight perturbations and project the perturbed models into 2D using random projection. Colors show relative accuracy change $(\mathrm{acc} - \mathrm{base}) / \mathrm{base} \times 100$ (blue: degraded, white: equivalent, red: improved). Dashed circles indicate the mean perturbation distance and stars mark the best-performing perturbations. Larger models have warmer landscapes, with more high-performing neighborhoods. The last column shows an RGB visualization where GSM8K, Olympiad, and Countdown accuracies are mapped to the R, G, B channels; richer colors indicate more task experts.
  • Figure 3: Scaling laws of solution density and diversity (using Qwen2.5 instruction-tuned models). (a) Solution density increases with model scale, showing that larger models have a higher fraction of good solutions. (b) Spectral discordance across model scales, measuring solution diversity. Together, these results demonstrate that larger models have both denser and more diverse solution landscapes in the neighborhood around their pretrained weights.
  • Figure 4: Performance spectra and clustering of random seeds. Sampled vectors possess diverse areas of expertise, with individual seeds specializing in specific tasks. (Left) Performance of 100 random seeds across seven evaluation datasets. Each line represents a specific seed, with four lines highlighted as examples. (Right) PCA visualization of these performance vectors, where seeds with similar behavior cluster together into different groups.
  • Figure 5: Pretraining a model of 1D signals, then probing the local neighborhood around the pretrained weights by randomly guessing $N=1000$ Gaussian perturbations. The plot shows the autoregressive predictions of a particular linear function (dashed blue line), given an observed context (solid blue line). Gray lines: random $f_{\theta}$'s; red lines: top-$K$ $f_{\theta}$'s. The figure shows three regimes: (a) no pretraining leads to needle-in-the-haystack search, (b) pretraining on several signal types leads to a thicket, (c) pretraining on just linear functions achieves nearly perfect predictions at pretraining time, hence post-training is at ceiling.
  • ...and 9 more figures
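
The two quantities used in the captions above, relative accuracy change (Figure 2) and solution density (Figures 1 and 3), can be computed from a list of perturbed-model accuracies. The sketch below is illustrative only: the exact improvement threshold in Definition 2.1 is not given in this summary, so the `margin` parameter is an assumption, and the accuracy values are made up.

```python
import numpy as np

def relative_change(acc, base):
    # Relative accuracy change in percent: (acc - base) / base * 100.
    return (acc - base) / base * 100.0

def solution_density(accs, base, margin=0.0):
    # Fraction of sampled perturbations whose accuracy exceeds the base
    # model's by more than `margin` (the threshold is an assumption).
    accs = np.asarray(accs, dtype=float)
    return float((accs > base + margin).mean())

accs = [0.40, 0.55, 0.50, 0.62, 0.48]  # made-up accuracies of 5 perturbed models
base = 0.50                            # made-up accuracy of the unperturbed model

print(solution_density(accs, base))          # 2 of the 5 samples beat the base
print(relative_change(np.array(accs), base)) # negative values mean degraded
```

In the needle-in-a-haystack regime this fraction would be close to zero, which is why random sampling alone is not expected to work there.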

Theorems & Definitions (4)

  • Definition 2.1: Solution Density
  • Definition 2.2: Spectral Discordance
  • Proposition H.1
  • proof
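
The Spectral Discordance formula quoted in Proposition H.1, $\mathcal{D} = 1 - \frac{1}{M(M-1)} \sum_{j \neq k} \mathbf{C}_{jk}$, is one minus the mean off-diagonal correlation. The sketch below evaluates it directly. This summary does not say what the $M$ rows of $\mathbf{C}$ index, so building $\mathbf{C}$ from the rows of a hypothetical score matrix `perf` is an assumption made only for the example.

```python
import numpy as np

def spectral_discordance(C):
    # D = 1 - (1 / (M (M - 1))) * sum over j != k of C[j, k]
    C = np.asarray(C, dtype=float)
    M = C.shape[0]
    off_diag_sum = C.sum() - np.trace(C)
    return 1.0 - off_diag_sum / (M * (M - 1))

# Hypothetical input: each row holds one model's scores on three tasks.
perf = np.array([[0.9, 0.2, 0.4],
                 [0.8, 0.3, 0.5],
                 [0.1, 0.9, 0.6],
                 [0.2, 0.8, 0.1]])
C = np.corrcoef(perf)  # 4 x 4 Pearson correlation between rows

print(spectral_discordance(C))
print(spectral_discordance(np.ones((4, 4))))  # perfectly correlated: D = 0
print(spectral_discordance(np.eye(4)))        # uncorrelated: D = 1
```

The two limiting cases show how to read the value: identical behavior gives zero discordance, and mutually uncorrelated behavior gives one.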