Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment

Yang Hu, Runchen Wang, Stephen Chong Zhao, Xuhui Zhan, Do Hun Kim, Mark Wallace, David A. Tovar

TL;DR

This work addresses how initialization biases shape vision–language representation learning. It introduces Perceptual-Initialization (PI), a two-stage pipeline that first trains a ViT-B/32 vision encoder to reproduce human triplet embeddings from NIGHTS, then jointly pretrains on 15M image–text pairs (YFCC15M) with a standard CLIP objective, keeping the perceptual prior embedded from the start. PI yields faster and broader zero-shot gains across 29 benchmarks, including ImageNet variants, VTAB, and retrieval tasks, and exhibits stronger scaling with data than a web-only baseline. Importantly, the study shows that initializing with human perceptual structure outperforms post-hoc perceptual fine-tuning, highlighting the value of integrating human priors directly into pretraining for immediate, generalizable vision–language alignment.
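The two training stages described above can be illustrated with a minimal sketch of their objectives. This is not the authors' implementation: the hinge form of the triplet loss, the `margin` value, the `temperature` value, and all function names are assumptions made for illustration, and embeddings are plain Python lists rather than encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_hinge_loss(ref, img_a, img_b, human_prefers_a, margin=0.05):
    """Stage 1 (assumed form): penalize disagreement with a human triplet judgment.

    The loss is zero when the image the human picked is closer to the
    reference than the other image by at least `margin` (hypothetical value).
    """
    dist_a = 1.0 - cosine(ref, img_a)
    dist_b = 1.0 - cosine(ref, img_b)
    # A positive gap means the model ranks the pair the same way the human did.
    gap = (dist_b - dist_a) if human_prefers_a else (dist_a - dist_b)
    return max(0.0, margin - gap)


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Stage 2: symmetric contrastive loss over a batch of matched pairs.

    Pair k consists of image_embs[k] and text_embs[k]; every other
    combination in the batch is treated as a negative.
    """
    n = len(image_embs)
    logits = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]

    def cross_entropy(row, target):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        return log_z - row[target]

    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)
```

In this reading, the vision encoder weights obtained by minimizing the first loss on NIGHTS triplets are the starting point for minimizing the second loss on YFCC15M, whereas the baseline starts the second loss from random weights.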

Abstract

We introduce Perceptual-Initialization (PI), a paradigm shift in visual representation learning that incorporates human perceptual structure during the initialization phase rather than as a downstream fine-tuning step. By integrating human-derived triplet embeddings from the NIGHTS dataset to initialize a CLIP vision encoder, followed by self-supervised learning on YFCC15M, our approach demonstrates significant zero-shot performance improvements, without any task-specific fine-tuning, across 29 zero-shot classification and 2 retrieval benchmarks. On ImageNet-1K, zero-shot gains emerge after approximately 15 epochs of pretraining. Benefits are observed across datasets of various scales, with improvements manifesting at different stages of the pretraining process depending on dataset characteristics. Our approach consistently enhances zero-shot top-1 accuracy, top-5 accuracy, and retrieval recall (e.g., R@1, R@5) across these diverse evaluation tasks, without requiring any adaptation to target domains. These findings challenge the conventional wisdom of using human-perceptual data primarily for fine-tuning and demonstrate that embedding human perceptual structure during early representation learning yields more capable and better-aligned vision-language systems that generalize immediately to unseen tasks. Our work shows that "beginning with you", starting with human perception, provides a stronger foundation for general-purpose vision-language intelligence.
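For readers unfamiliar with how zero-shot top-1 and top-5 accuracy are obtained without fine-tuning, the following is a minimal sketch under the usual CLIP-style protocol: each class is represented by a text embedding, and an image is assigned to the classes whose embeddings it is most similar to. The function names and the toy vectors in the usage note are hypothetical, and the paper's exact prompts and evaluation code are not given here.

```python
import math


def normalize(v):
    """Scale a vector (list of floats) to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_topk_accuracy(image_embs, class_text_embs, labels, k=1):
    """Fraction of images whose true class is among the k most similar classes.

    image_embs:      one embedding per image
    class_text_embs: one embedding per class (e.g. from a class-name prompt)
    labels:          index of the true class for each image
    """
    classes = [normalize(c) for c in class_text_embs]
    correct = 0
    for emb, label in zip(image_embs, labels):
        emb = normalize(emb)
        # Cosine similarity of the image to every class embedding.
        scores = [sum(a * b for a, b in zip(emb, c)) for c in classes]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if label in ranked[:k]:
            correct += 1
    return correct / len(labels)
```

As a toy usage example, with three class embeddings `[[1, 0], [0, 1], [-1, 0]]`, three images `[[0.9, 0.1], [0.2, 0.8], [0.6, 0.5]]`, and labels `[0, 1, 1]`, the third image is closest to class 0, so it counts as a miss at k=1 but a hit at k=2.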

Paper Structure

This paper contains 22 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Perceptual-Initialization (PI) yields faster, stronger zero-shot performance. Model initialization: the image encoder is pre-biased with human triplet-similarity judgments from the NIGHTS dataset, while a control model is fully randomly initialized. Model training: both models are then trained with the same image–text contrastive objective on YFCC15M. Zero-shot evaluation: without any task-specific fine-tuning, the perceptually initialized model (blue) consistently outperforms the random baseline (gold).
  • Figure 2: Perceptual-Initialization yields consistent zero-shot gains across all benchmark families. (a) Mean Top-1 accuracy and (b) mean Top-5 accuracy after 32 epochs of YFCC15M pre-training. Perceptual-Initialization surpasses the web-only baseline for every family: ImageNet, ImageNet-OOD, VTAB, Fine-grained & Specialty, and Domain & Small. Numbers above the bars denote the average lift in percentage points (pp). Overall, PI improves performance on 23 of 29 individual classification benchmarks.
  • Figure 3: Zero-shot classification scaling results. Top-1 accuracy (top row) and Top-5 accuracy (bottom row) are shown for five benchmark families (ImageNet, ImageNet OOD, VTAB, Fine-grained & Specialty, and Misc./Domain & Small), plotted against the number of training samples seen on a log scale (10 M → 300 M) over a total of 32 training epochs. The blue curve denotes our Perceptual-Initialization pipeline (NIGHTS20k $\rightarrow$ YFCC15M) and the orange curve the web-only baseline (YFCC15M). Across all families, Perceptual-Initialization attains higher initial accuracy and exhibits larger scaling exponents $\beta$, reflecting steeper performance gains as more data are ingested.
  • Figure 4: Retrieval tasks scaling results. Recall@1 and Recall@5 are plotted against the number of image–text pairs seen (log scale) over successive epochs on YFCC15M for two retrieval directions: (a) Image → Text R@1, (b) Image → Text R@5, (c) Text → Image R@1, and (d) Text → Image R@5. The blue curves show our proposed perceptual initialization method, while the orange curves represent the conventional web-scale baseline. A performance gap between the two methods becomes apparent after just a few epochs and grows steadily as more data is ingested, underscoring the strong and increasing advantage of our approach at larger training-sample scales.
  • Figure 5: Qualitative comparison of zero-shot retrieval. (a) Image$\rightarrow$Text: For two query images, we list the ground-truth captions (left) and the top-5 captions returned by each model, together with their cosine similarity scores (higher is better). Ground-truth matches are highlighted in bold. The PI model retrieves the correct caption in every case, with higher cosine similarity scores and larger Top-1 margins ($\Delta$) compared to the baseline. (b) Text$\rightarrow$Image: For two query captions, we show the top-5 retrieved images per model, with similarity scores beneath each thumbnail. In the first example, only the PI model retrieves zebras in the top ranks and secures a significantly higher Top-1 score (0.441 vs. 0.386). In the second example, both models retrieve surfing scenes, yet the PI model still secures a better Top-1 score.
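Two quantities in the captions above, Recall@K (Figure 4) and the scaling exponent $\beta$ (Figure 3), can be made concrete with a short sketch. The captions do not define $\beta$; the code assumes a power law of the form accuracy ≈ a · N^β fitted by least squares in log-log space, which may differ from the authors' procedure. Function names are hypothetical, and the numbers in the usage note are toy values, not results from the paper.

```python
import math


def recall_at_k(similarity, k):
    """Recall@K for paired retrieval.

    similarity[i][j] is the score between query i and candidate j, and the
    correct candidate for query i is assumed to sit at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def fit_scaling_exponent(samples_seen, accuracy):
    """Slope of log(accuracy) against log(samples seen).

    Under the assumed model accuracy = a * N**beta, this slope is beta.
    """
    xs = [math.log(n) for n in samples_seen]
    ys = [math.log(a) for a in accuracy]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov_xy / var_x
```

For example, with the toy similarity matrix `[[0.9, 0.1, 0.0], [0.2, 0.3, 0.5], [0.1, 0.2, 0.7]]`, the second query ranks its own candidate second, so Recall@1 is 2/3 and Recall@2 is 1. Under the assumed power law, a larger fitted exponent for one model than for another corresponds to the steeper curve the Figure 3 caption describes.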