GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
Yasmine Omri, Connor Ding, Tsachy Weissman, Thierry Tambe
TL;DR
This work tackles the data-transfer and token-density bottlenecks of RGB-based vision–language models by proposing 2D Gaussian Splatting (2DGS) as a compact, spatially adaptive image representation. It develops a scalable 2DGS pipeline with CUDA-accelerated fitting, structured initialization, and luminance-aware pruning, and couples it with a two-stage CLIP adaptation that reuses a frozen RGB backbone to minimize trainable parameters. On a 12.8M-sample DataComp dataset, Gaussian-splat (GS) encoders achieve 90–98% of RGB zero-shot accuracy while reducing input size by 3–23.5× and speeding up data loading, suggesting that compact, semantically rich representations can sustain strong multimodal performance. Overall, the paper demonstrates that Gaussian splats are a viable substrate for scalable vision–language alignment and advocates a representation-first approach to multimodal learning that better suits edge–cloud computing constraints.
Abstract
Modern vision-language pipelines are driven by RGB vision encoders trained on massive image-text corpora. While these pipelines have enabled impressive zero-shot capabilities and strong transfer across tasks, they inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy-intensive and costly, and (ii) patch-based tokenization inflates sequence length, stressing attention budgets and context limits. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment: a compact, spatially adaptive representation that parameterizes an image as a set of colored anisotropic Gaussians. We develop a scalable 2DGS pipeline with structured initialization, luminance-aware pruning, and batched CUDA kernels, achieving over 90x faster fitting than prior implementations at about 97% GPU utilization. We further adapt contrastive language-image pre-training (CLIP) to 2DGS by reusing a frozen RGB-based transformer backbone with a lightweight splat-aware input stem and a perceiver resampler, training only 9.7% to 13.8% of the total parameters. On a 12.8M-sample dataset from DataComp, Gaussian-splat (GS) encoders yield competitive zero-shot performance on 38 datasets from the CLIP benchmark while compressing inputs 3x to 23.5x relative to pixels. Our results establish 2DGS as a viable multimodal substrate, pinpoint architectural bottlenecks, and open a path toward representations that are both semantically powerful and transmission-efficient for edge-cloud learning.
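To make the representation concrete, the following is a minimal NumPy sketch of how an image can be rendered from a set of colored anisotropic 2D Gaussians, together with a rough count-based compression ratio. It is an illustration under stated assumptions, not the paper's pipeline: the 8-value parameterization (mean, two scales, rotation angle, RGB color), the additive blending, and the names `render_gaussians` and `compression_ratio` are all hypothetical, and the paper's actual parameter layout, blending rule, and size accounting may differ.

```python
import numpy as np


def render_gaussians(means, scales, thetas, colors, height, width):
    """Render an (height, width, 3) image as a sum of colored anisotropic 2D Gaussians.

    Assumed layout (hypothetical): means (N, 2) as (x, y) pixel coordinates,
    scales (N, 2) as standard deviations along the Gaussian's own axes,
    thetas (N,) as rotation angles in radians, colors (N, 3) in [0, 1].
    """
    ys, xs = np.mgrid[0:height, 0:width]
    pix = np.stack([xs, ys], axis=-1).astype(np.float64)  # (H, W, 2) as (x, y)
    image = np.zeros((height, width, 3))
    for mu, s, th, c in zip(means, scales, thetas, colors):
        cos_t, sin_t = np.cos(th), np.sin(th)
        rot = np.array([[cos_t, -sin_t], [sin_t, cos_t]])
        # Covariance = R diag(s^2) R^T gives an ellipse rotated by theta.
        cov = rot @ np.diag(s ** 2) @ rot.T
        inv_cov = np.linalg.inv(cov)
        d = pix - mu
        # Squared Mahalanobis distance of every pixel to this Gaussian.
        maha = np.einsum("hwi,ij,hwj->hw", d, inv_cov, d)
        weight = np.exp(-0.5 * maha)
        # Additive blending is a simplifying assumption for this sketch.
        image += weight[..., None] * c
    return np.clip(image, 0.0, 1.0)


def compression_ratio(height, width, num_gaussians, params_per_gaussian=8):
    """Ratio of RGB values to Gaussian parameters, counting values rather than bytes."""
    return (height * width * 3) / (num_gaussians * params_per_gaussian)
```

Under these assumptions, a 224x224 RGB image holds 150,528 values, so representing it with 1,000 Gaussians of 8 parameters each would correspond to a ratio of about 18.8 by value count. Actual ratios depend on how parameters and pixels are quantized and stored.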
