Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment
Yang Hu, Runchen Wang, Stephen Chong Zhao, Xuhui Zhan, Do Hun Kim, Mark Wallace, David A. Tovar
TL;DR
This work addresses how initialization biases shape vision–language representation learning. It introduces Perceptual-Initialization (PI), a two-stage pipeline: first, a ViT-B/32 vision encoder is trained to match human triplet similarity judgments from the NIGHTS dataset; second, the model is pretrained on 15M image–text pairs (YFCC15M) with a standard CLIP objective, so the perceptual prior is present from the start of vision–language training. PI yields faster and broader zero-shot gains across 29 classification benchmarks (including ImageNet variants and VTAB) and 2 retrieval benchmarks, and it scales better with data than a web-only baseline. Importantly, the study shows that initializing with human perceptual structure outperforms post-hoc perceptual fine-tuning, highlighting the value of integrating human priors directly into pretraining for immediate, generalizable vision–language alignment.
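To make the two stages concrete, here is a minimal sketch of one loss per stage in plain Python. It is illustrative only and is not the authors' implementation. The hinge form of the stage-1 loss, the margin and temperature values, and the function names (`triplet_2afc_hinge`, `clip_loss`) are all assumptions; the text above states only that stage 1 uses human triplet data from NIGHTS and that stage 2 uses a standard CLIP objective.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_2afc_hinge(ref, img_a, img_b, human_prefers_a, margin=0.05):
    """Stage 1 (assumed form): hinge loss on a two-alternative human judgment.

    The loss is zero once the image the human chose is closer to the
    reference than the other image by at least `margin` in cosine distance.
    """
    d_a = 1.0 - cosine(ref, img_a)
    d_b = 1.0 - cosine(ref, img_b)
    # A positive gap means the model agrees with the human choice.
    gap = (d_b - d_a) if human_prefers_a else (d_a - d_b)
    return max(0.0, margin - gap)


def log_softmax_at(row, index):
    """Numerically stable log-softmax of `row`, evaluated at `index`."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return row[index] - log_z


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Stage 2: symmetric contrastive loss over a batch of matched pairs.

    Pair k is (image_embs[k], text_embs[k]); every other pairing in the
    batch acts as a negative. The temperature value is an assumption.
    """
    n = len(image_embs)
    logits = [[cosine(i, t) / temperature for t in text_embs] for i in image_embs]
    # Image-to-text direction: row k should select column k.
    i2t = -sum(log_softmax_at(logits[k], k) for k in range(n)) / n
    # Text-to-image direction: column k should select row k.
    columns = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax_at(columns[k], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)
```

In this sketch, PI would correspond to minimizing the first loss over the vision encoder alone and then minimizing the second loss jointly with a text encoder, starting from the stage-1 weights rather than from a random initialization.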
Abstract
We introduce Perceptual-Initialization (PI), a paradigm shift in visual representation learning that incorporates human perceptual structure during the initialization phase rather than as a downstream fine-tuning step. By using human-derived triplet embeddings from the NIGHTS dataset to initialize a CLIP vision encoder, followed by CLIP-style pretraining on YFCC15M, our approach demonstrates significant zero-shot performance improvements, without any task-specific fine-tuning, across 29 zero-shot classification and 2 retrieval benchmarks. On ImageNet-1K, zero-shot gains emerge after approximately 15 epochs of pretraining. Benefits are observed across datasets of various scales, with improvements appearing at different stages of pretraining depending on dataset characteristics. Our approach consistently improves zero-shot top-1 accuracy, top-5 accuracy, and retrieval recall (e.g., R@1, R@5) across these evaluation tasks. These findings challenge the conventional wisdom of using human perceptual data primarily for fine-tuning and demonstrate that embedding human perceptual structure during early representation learning yields more capable, better vision-language-aligned systems that generalize immediately to unseen tasks. Our work shows that "beginning with you", that is, starting with human perception, provides a stronger foundation for general-purpose vision-language intelligence.
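The metrics named in the abstract (top-1 and top-5 accuracy, R@1, R@5) all reduce to the same check: does the correct item appear among the k highest-scoring candidates? The sketch below shows that check in plain Python. The helper name `top_k_hit_rate` and the toy scores are hypothetical, and the sketch assumes exactly one correct target per query, which real retrieval benchmarks with several captions per image do not satisfy.

```python
def top_k_hit_rate(scores, targets, k):
    """Fraction of queries whose target index is among the k best scores.

    scores[q][j] is the similarity between query q and candidate j.
    For zero-shot classification, candidates are class prompts and
    targets are class labels (top-k accuracy). For retrieval, candidates
    are gallery items and targets are the matching item (R@k).
    """
    hits = 0
    for row, target in zip(scores, targets):
        # Candidate indices sorted from highest to lowest similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += target in ranked[:k]
    return hits / len(targets)


# Toy 3x3 similarity matrix; query q matches candidate q.
similarity = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.4, 0.5],
]
targets = [0, 1, 2]

recall_at_1 = top_k_hit_rate(similarity, targets, k=1)
recall_at_3 = top_k_hit_rate(similarity, targets, k=3)
```

Here the second query ranks its match last, so only two of the three queries count as hits at k=1, while every query is a hit at k=3.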
