
Representation Selection via Cross-Model Agreement using Canonical Correlation Analysis

Dylan B. Lewis, Jens Gregor, Hector Santos-Villalobos

Abstract

Modern vision pipelines increasingly rely on pretrained image encoders whose representations are reused across tasks and models, yet these representations are often overcomplete and model-specific. We propose a simple, training-free method to improve the efficiency of image representations via a post-hoc canonical correlation analysis (CCA) operator. By leveraging the shared structure between representations produced by two pre-trained image encoders, our method finds linear projections that serve as a principled form of representation selection and dimensionality reduction, retaining shared semantic content while discarding redundant dimensions. Unlike standard dimensionality reduction techniques such as PCA, which operate on a single embedding space, our approach leverages cross-model agreement to guide representation distillation and refinement. The technique allows representations to be reduced by more than 75% in dimensionality with improved downstream performance, or enhanced at fixed dimensionality via post-hoc representation transfer from larger or fine-tuned models. Empirical results on ImageNet-1k, CIFAR-100, MNIST, and additional benchmarks show consistent improvements over both baseline and PCA-projected representations, with accuracy gains of up to 12.6%.

Paper Structure

This paper contains 14 sections, 2 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: CCA cross-model alignment pipeline. (a) A backbone encoder $M_X$ and a partner encoder $M_Y$ produce baseline embeddings from the same input image. (b) CCA projects both embeddings into a shared, maximally correlated subspace. (c) The aligned space enables dimensionality reduction while transferring shared structure across models. (d) The resulting representations are evaluated on a downstream classification task.
  • Figure 2: ImageNet-1k classification accuracy of pre-trained vision transformers and classification heads using checkpoints obtained from rw2019timm. The shaded area around each data point is proportional to the number of parameters in the vision transformer.
  • Figure 3: ImageNet-1k classification accuracy of a linear probe trained on representations of varying fractions of the ImageNet-1k training data, produced by a ViT-B and a ViT-L.
  • Figure 4: Classification accuracy of a linear probe trained on representations of increasingly imbalanced subsets of the Caltech-101 dataset. These representations are produced by a ViT-B and a ViT-L trained using the CLIP objective. The x-axis shows the maximum allowed ratio of the number of samples in the least common class to the number of samples in the most common class.
  • Figure 5: CCA improvement over the baseline as a function of the parameter ratio between backbone and partner models.

Theorems & Definitions (5)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5