Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks

Yaxin Luo, Zhiqiang Shen

Abstract

The ratio of outlier parameters differs significantly between language pre-trained and vision pre-trained models, making cross-modality adaptation (language to vision) inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge the language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks because their parameter spaces are so disparate. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random label bridge training, which requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.
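
Random label bridge training lends itself to a very short implementation. The PyTorch sketch below is a minimal illustration rather than the authors' released code: the wrapper name `BridgeClassifier`, the patch size, the mean-pooling head, and the optimizer settings are assumptions; only the core recipe, feeding image patches into a language-pretrained GPT-2 and training on uniformly random labels, comes from the abstract.

```python
# Minimal sketch of random label bridge training (illustrative; not the
# authors' exact implementation). A language-pretrained GPT-2 receives
# image patches as token embeddings and is trained on randomly assigned
# labels, so no manual annotation is required.
import torch
import torch.nn as nn
from transformers import GPT2Model

class BridgeClassifier(nn.Module):  # hypothetical wrapper name
    def __init__(self, num_classes=10, patch=4):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")   # language-pretrained weights
        d = self.backbone.config.n_embd                      # 768 for GPT-2 small
        self.patch_embed = nn.Conv2d(3, d, kernel_size=patch, stride=patch)
        self.head = nn.Linear(d, num_classes)

    def forward(self, images):
        x = self.patch_embed(images)             # (B, d, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)         # (B, num_patches, d) token sequence
        h = self.backbone(inputs_embeds=x).last_hidden_state
        return self.head(h.mean(dim=1))          # mean-pool tokens, then classify

def bridge_train(model, images, num_classes=10, steps=100, lr=1e-4):
    """One pass of bridge training: labels are drawn uniformly at random."""
    random_labels = torch.randint(0, num_classes, (images.size(0),))
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(images), random_labels).backward()
        opt.step()
    return model
```

For partial bridge training, one would presumably freeze the layers that already transfer well (e.g., setting `param.requires_grad = False` on a subset of `self.backbone` blocks) and update only the remainder; the exact layer selection is not specified in this summary.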

Paper Structure

This paper contains 20 sections, 1 theorem, 12 equations, 14 figures, 15 tables.

Key Result

Theorem 4.1

Suppose the inputs $x$ are drawn i.i.d. from a distribution with covariance $\Sigma_x$, and the initial weights $w$ in the first layer are drawn from an isotropic (e.g., Gaussian) distribution. Then, when we train a neural network on these inputs $x$ using random labels (under typical conditions such as SGD training), the covariance of the learned first-layer weights aligns with the input covariance $\Sigma_x$.
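
The alignment claim can be probed with a small numerical experiment. The following toy sketch is our own illustration rather than the paper's setup (the dimensions, architecture, and SGD hyperparameters are arbitrary): it trains a two-layer MLP on purely random labels over anisotropic Gaussian inputs and checks how strongly the learned first-layer weights pick up the dominant direction of $\Sigma_x$.

```python
# Toy check of covariance alignment under random-label training
# (a sketch, not the paper's proof or experimental protocol).
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n, num_classes = 20, 4096, 10

# Anisotropic inputs: one dominant direction in Sigma_x.
scales = torch.ones(d)
scales[0] = 5.0
x = torch.randn(n, d) * scales
y = torch.randint(0, num_classes, (n,))   # purely random labels

model = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, num_classes))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()
for _ in range(2000):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

# Compare the top eigenvector of Sigma_x with the top right-singular
# vector of the learned first-layer weight matrix.
sigma_x = (x.T @ x) / n
top_input_dir = torch.linalg.eigh(sigma_x).eigenvectors[:, -1]
w = model[0].weight.detach()               # shape (64, d)
top_weight_dir = torch.linalg.svd(w).Vh[0]
print("|cos| between directions:", torch.abs(top_input_dir @ top_weight_dir).item())
```

A cosine close to 1 would indicate that random-label training has pulled the first-layer weights toward the high-variance input directions, which is the qualitative behavior the theorem describes.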

Figures (14)

  • Figure 1: In cross-domain adaptation, the data type remains the same, though domains may vary in style or distribution. In contrast, cross-modality adaptation involves fundamentally different feature spaces, alongside variations in style or distribution.
  • Figure 2: Outlier parameters and weight distributions in models trained on different modalities. The language-pretrained GPT-2 shows a markedly heavier-tailed distribution with numerous large-magnitude "outlier" weights, whereas the vision-pretrained ViT and a GPT-2-structured model trained on images exhibit fewer outliers and narrower spreads.
  • Figure 3: Training and test accuracy curves for pretrained vs. scratch GPT-2 on CIFAR-10 with varying proportions of random labels (0%, 15%, 30%, 100%). Pretrained models consistently show higher accuracy and faster convergence, indicating enhanced robustness to label noise.
  • Figure 4: Loss landscape on CIFAR-10. We visualize a 2D cross-section of the high-dimensional loss surface by plotting $L(\theta_0+\alpha d_1+\beta d_2)$, where $d_1=\theta_T-\theta_0$ is the training direction and $d_2$ is a random direction orthogonal to $d_1$ (both per-layer normalized); a minimal sketch of this computation follows the figure list. The X-axis and Y-axis correspond to the coefficients $\alpha$ and $\beta$, respectively (height/color indicates loss). Top: Correct Labels Training. Bottom: 100% Random Labels Training. Left: Scratch GPT-2. Right: Pretrained GPT-2.
  • Figure 5: t-SNE embeddings for pretrained (top) and scratch-initialized (bottom) GPT-2 models on CIFAR-10 under 100% random labels. Left: Before training, both appear partly disordered. Right: After training, the pretrained model achieves far tighter, more coherent clusters, indicating superior feature separability compared to the scratch model.
  • ...and 9 more figures
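
Figure 4's cross-section can be reproduced with a short routine. The sketch below is an assumed implementation of the caption's recipe (the paper's exact normalization and orthogonalization details may differ): it forms the training direction $d_1=\theta_T-\theta_0$, a random direction $d_2$ orthogonalized against it, normalizes both per layer, and evaluates the loss on an $(\alpha,\beta)$ grid.

```python
# Sketch of the 2D loss-surface cross-section from Figure 4 (assumed
# implementation). theta_0 and theta_T are dicts of parameter tensors,
# e.g. {k: v.detach().clone() for k, v in model.named_parameters()},
# captured before and after training, respectively.
import copy
import torch

def _flat(tensors):
    return torch.cat([t.reshape(-1) for t in tensors])

def loss_surface(model, theta_0, theta_T, loss_fn, data, targets, alphas, betas):
    names = list(dict(model.named_parameters()).keys())
    d1 = {k: theta_T[k] - theta_0[k] for k in names}        # training direction
    d2 = {k: torch.randn_like(theta_0[k]) for k in names}   # random direction
    # Orthogonalize d2 against d1 globally, then normalize both per layer.
    proj = _flat(d2.values()) @ _flat(d1.values()) / _flat(d1.values()).norm() ** 2
    d2 = {k: d2[k] - proj * d1[k] for k in names}
    for k in names:
        d1[k] = d1[k] / (d1[k].norm() + 1e-12)
        d2[k] = d2[k] / (d2[k].norm() + 1e-12)
    surface = torch.zeros(len(alphas), len(betas))
    probe = copy.deepcopy(model)
    with torch.no_grad():
        for i, a in enumerate(alphas):
            for j, b in enumerate(betas):
                state = {k: theta_0[k] + a * d1[k] + b * d2[k] for k in names}
                probe.load_state_dict(state, strict=False)   # buffers left untouched
                surface[i, j] = loss_fn(probe(data), targets)
    return surface  # indexed by (alpha, beta), matching the figure's axes
```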

Theorems & Definitions (2)

  • Definition 3.1: Cross-Modality Adaptation
  • Theorem 4.1: Random Label Training Induces Covariance Alignment