GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting

Yasmine Omri, Connor Ding, Tsachy Weissman, Thierry Tambe

TL;DR

This work tackles the data-transfer and token-density bottlenecks of RGB-based vision–language models by proposing 2D Gaussian Splatting (2DGS) as a compact, spatially adaptive image representation. It develops a scalable 2DGS pipeline with CUDA-accelerated fitting, structured initialization, and luminance-aware pruning, and couples it with a two-stage CLIP adaptation that leverages a frozen RGB backbone to minimize trainable parameters. On a 12.8M DataComp dataset, GS encoders achieve 90–98% of RGB zero-shot accuracy while reducing input size by 3–23.5× and boosting data-loading speed, suggesting that compact, semantically rich representations can sustain strong multimodal performance. Overall, the paper demonstrates that Gaussian splats are a viable substrate for scalable vision–language alignment and advocates a representation-first approach to multimodal learning that can better suit edge–cloud computing constraints.
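The reported 3–23.5× range can be sanity-checked with simple arithmetic. The sketch below is only illustrative: it assumes 8-bit RGB inputs at 224x224 (the resolution mentioned for the fitting benchmark) and a hypothetical storage cost of 16 bytes per Gaussian. That per-Gaussian cost is not stated here; it is chosen because it reproduces the endpoints of the reported range for the 3136- and 400-Gaussian variants. The function names are likewise illustrative.

```python
def rgb_bytes(height, width, channels=3, bytes_per_value=1):
    """Size of a dense RGB image, assuming 8-bit values by default."""
    return height * width * channels * bytes_per_value


def splat_bytes(num_gaussians, bytes_per_gaussian=16):
    """Size of a splat set. 16 bytes per Gaussian is an ASSUMPTION, not a sourced figure."""
    return num_gaussians * bytes_per_gaussian


def compression_ratio(num_gaussians, height=224, width=224, bytes_per_gaussian=16):
    """Ratio of dense RGB size to splat size for one image."""
    return rgb_bytes(height, width) / splat_bytes(num_gaussians, bytes_per_gaussian)


# Gaussian budgets evaluated in the zero-shot experiments.
ratios = {n: round(compression_ratio(n), 2) for n in (3136, 1600, 900, 400)}
print(ratios)
```

Under these assumptions the ratio scales inversely with the Gaussian budget, so the smallest budget (400 points) gives the strongest compression.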

Abstract

Modern vision-language pipelines are driven by RGB vision encoders trained on massive image-text corpora. While these pipelines have enabled impressive zero-shot capabilities and strong transfer across tasks, they still inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy-intensive and costly, and (ii) patch-based tokenization inflates sequence length, stressing attention budgets and context limits. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment: a compact, spatially adaptive representation that parameterizes images by a set of colored anisotropic Gaussians. We develop a scalable 2DGS pipeline with structured initialization, luminance-aware pruning, and batched CUDA kernels, achieving over 90x faster fitting than prior implementations at about 97% GPU utilization. We further adapt contrastive language-image pre-training (CLIP) to 2DGS by reusing a frozen RGB-based transformer backbone with a lightweight splat-aware input stem and a perceiver resampler, training only 9.7% to 13.8% of the total parameters. On a 12.8M dataset from DataComp, GS encoders yield competitive zero-shot performance on 38 datasets from the CLIP benchmark while compressing inputs 3x to 23.5x relative to pixels. Our results establish 2DGS as a viable multimodal substrate, pinpoint architectural bottlenecks, and open a path toward representations that are both semantically powerful and transmission-efficient for edge-cloud learning.
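To make "a set of colored anisotropic Gaussians" concrete, the following is a minimal pure-Python sketch of how such a representation could be rendered back to pixels. It is not the paper's CUDA implementation. The parameterization (mean, two axis scales, a rotation angle, an RGB color) and the additive blending rule are assumptions chosen for clarity, and the names `Gaussian2D`, `weight`, and `render` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian2D:
    mx: float      # mean, x coordinate (pixels)
    my: float      # mean, y coordinate (pixels)
    sx: float      # standard deviation along the first principal axis
    sy: float      # standard deviation along the second principal axis
    theta: float   # rotation of the principal axes (radians)
    color: tuple   # (r, g, b) contribution at the Gaussian's center


def weight(g, x, y):
    """Unnormalized Gaussian falloff exp(-0.5 * d^T Sigma^{-1} d) at pixel (x, y)."""
    dx, dy = x - g.mx, y - g.my
    c, s = math.cos(g.theta), math.sin(g.theta)
    # Rotate the offset into the Gaussian's principal axes.
    u = c * dx + s * dy
    v = -s * dx + c * dy
    return math.exp(-0.5 * ((u / g.sx) ** 2 + (v / g.sy) ** 2))


def render(gaussians, height, width):
    """Additively blend all Gaussians (an assumed blending rule) into an H x W x 3 image."""
    image = [[[0.0, 0.0, 0.0] for _ in range(width)] for _ in range(height)]
    for g in gaussians:
        for y in range(height):
            for x in range(width):
                w = weight(g, x, y)
                for ch in range(3):
                    image[y][x][ch] += w * g.color[ch]
    return image


# One Gaussian stretched horizontally: wider along x than along y.
demo = render([Gaussian2D(4.0, 4.0, 2.0, 1.0, 0.0, (1.0, 0.5, 0.0))], 9, 9)
```

Fitting would then amount to optimizing these per-Gaussian parameters so that the rendered image matches the target. That is the step the batched CUDA kernels accelerate.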

Paper Structure

This paper contains 40 sections, 6 equations, 21 figures, 6 tables, 1 algorithm.

Figures (21)

  • Figure 1: Speedups achieved by our batch-aware CUDA kernels over the Zhu2025LIG baseline, shown for various batch sizes and Gaussian counts at an image resolution of 224x224. For a batch size of 4096 and 400 Gaussian points per image, we observe a 90.3x speedup over the baseline.
  • Figure 2: Reconstruction results for random vs. structured initialization (Ours) in 2DGS fitting with a fixed number of iterations (3000): structured initialization accelerates convergence and achieves higher perceptual quality than random initialization. This holds across compression ratios (i.e., numbers of Gaussian points per image), especially the more aggressive ones.
  • Figure 3: Trade-off between pruning ratio and reconstruction degradation for different Gaussian budgets (400--3136 points), evaluated over 100 Mini-ImageNet samples per configuration. Each marker represents a single hyperparameter setting, while the surrounding shaded KDE envelopes summarize the empirical distribution of $\Delta\mathrm{PSNR}$ for each model size: models with larger initial Gaussian budgets (1600--3136) consistently support higher pruning ratios with minimal loss, while smaller models are more sensitive to sparsification.
  • Figure 4: Visualization of Gaussian splats and reconstructed images for a 3136-point GS fit (2000 iterations). Left: $\lambda_{\mathrm{reg}}{=}0$, $\tau_{\mathrm{th}}{=}0$ (0% pruned: PSNR = 37.43). Right: $\lambda_{\mathrm{reg}}{=}10^{-6}$, $\tau_{\mathrm{th}}{=}0.05$ (23.72% pruned: PSNR = 31.1).
  • Figure 5: Zero-shot classification accuracy on 38 datasets from the CLIP Benchmark for ViT-B-16 (Small) and multiple variants of GS vision encoders (number of Gaussian points per image: 3136, 1600, 900, 400). Results are presented for 196 tokens (baseline) and 98 tokens.
  • ...and 16 more figures
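
The pruning captions above report quality as PSNR and sparsity as a pruning ratio. As a reading aid, here is a minimal sketch of both quantities using the standard PSNR definition for signals scaled to [0, peak]. The helper names are hypothetical, and the kept-Gaussian count in the example is an assumption chosen to match the 23.72% figure, not a number reported in the captions.

```python
import math


def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length flat pixel lists."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


def pruning_ratio(initial_count, kept_count):
    """Fraction of Gaussians removed by pruning."""
    return 1.0 - kept_count / initial_count


# PSNR values quoted for the 3136-point example (unpruned vs. pruned).
delta_psnr = 37.43 - 31.1
# ASSUMED kept count (2392 of 3136), consistent with roughly 23.72% pruned.
ratio = pruning_ratio(3136, 2392)
```

With these quoted values, pruning about a quarter of the Gaussians costs roughly 6.3 dB in that example.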