Purrturbed but Stable: Human-Cat Invariant Representations Across CNNs, ViTs and Self-Supervised ViTs

Arya Shah, Vaibhav Tripathi

TL;DR

This work studies cross-species representational invariance between human and feline vision by evaluating a wide range of frozen encoders (CNNs, supervised ViTs, Swin, and self-supervised ViTs) on strictly paired human–cat frames. A biologically informed cat-vision filter stress-tests invariances, and a suite of metrics (linear and RBF CKA, RSA, Mantel tests, MMD, energy distance, and 1-Wasserstein distance) assesses both geometric alignment and distributional shift across layers. The key finding is that self-supervised ViTs, especially DINO ViT-B/16, show the strongest cross-species alignment, peaking in early layers (mean CKA-RBF ≈ 0.8144, mean CKA-linear ≈ 0.7446, mean RSA ≈ 0.6980), while CNNs and windowed transformers lag and exhibit larger distributional differences. These results suggest that self-supervision coupled with ViT inductive biases yields representational geometries that align more closely across the two species, offering testable neuroscience hypotheses and a reproducible benchmark for future cross-species representation studies.
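
The distributional-shift metrics named above can be illustrated with a minimal NumPy sketch. It assumes each domain is summarized as an `(n_samples, n_features)` array of encoder features. The function names, the biased (V-statistic) estimators, and the median-heuristic bandwidth are illustrative assumptions, not the paper's stated configuration.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a (n, d) and b (m, d)."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by round-off

def mmd2_rbf(x, y, sigma=None):
    """Biased (V-statistic) estimate of squared MMD with an RBF kernel."""
    dxx = pairwise_sq_dists(x, x)
    dyy = pairwise_sq_dists(y, y)
    dxy = pairwise_sq_dists(x, y)
    if sigma is None:
        # Assumption: median heuristic on the cross-domain distances.
        sigma = np.sqrt(np.median(dxy))
    g = 1.0 / (2.0 * sigma ** 2)
    return np.exp(-g * dxx).mean() + np.exp(-g * dyy).mean() - 2.0 * np.exp(-g * dxy).mean()

def energy_distance(x, y):
    """Squared energy distance: 2 E|X - Y| - E|X - X'| - E|Y - Y'|."""
    dxy = np.sqrt(pairwise_sq_dists(x, y)).mean()
    dxx = np.sqrt(pairwise_sq_dists(x, x)).mean()
    dyy = np.sqrt(pairwise_sq_dists(y, y)).mean()
    return 2.0 * dxy - dxx - dyy
```

Both quantities are zero when the two feature sets coincide and grow as the human and cat-filtered feature distributions drift apart, which is the sense in which they complement the geometry-focused CKA and RSA scores.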

Abstract

Cats and humans differ in ocular anatomy. Most notably, Felis catus (domestic cats) have vertically elongated pupils linked to ambush predation; yet how such specializations manifest in downstream visual representations remains incompletely understood. We present a unified, frozen-encoder benchmark that quantifies feline-human cross-species representational alignment in the wild, across convolutional networks, supervised Vision Transformers, windowed transformers, and self-supervised ViTs (DINO), using layer-wise Centered Kernel Alignment (linear and RBF) and Representational Similarity Analysis, with additional distributional and stability tests reported in the paper. Across models, DINO ViT-B/16 attains the most substantial alignment (mean CKA-RBF $\approx 0.814$, mean CKA-linear $\approx 0.745$, mean RSA $\approx 0.698$), peaking at early blocks, indicating that token-level self-supervision induces early-stage features that bridge species-specific statistics. Supervised ViTs are competitive on CKA yet show weaker geometric correspondence than DINO (e.g., ViT-B/16 RSA $\approx 0.53$ at block 8; ViT-L/16 RSA $\approx 0.47$ at block 14), revealing depth-dependent divergences between similarity and representational geometry. CNNs remain strong baselines but fall below plain ViTs on alignment, and windowed transformers underperform plain ViTs, implicating architectural inductive biases in cross-species alignment. Results indicate that self-supervision coupled with ViT inductive biases yields representational geometries that more closely align feline and human visual systems than widely used CNNs and windowed Transformers, providing testable neuroscientific hypotheses about where and how cross-species visual computations converge. We release our code and dataset for reference and reproducibility.
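
For readers unfamiliar with the alignment scores quoted above, the following is a minimal NumPy sketch of linear CKA, RBF CKA, and an RSA score. It assumes `x` and `y` hold features for the same paired frames, one row per frame, taken from one layer. The median-based RBF bandwidth, the correlation-distance RDM, and the Spearman comparison are common choices assumed here for illustration; this section does not state the paper's exact settings.

```python
import numpy as np

def _center(k):
    """Double-center a Gram matrix: H K H with H = I - 11^T / n."""
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ k @ h

def _cka_from_grams(k, l):
    """CKA = HSIC(K, L) / sqrt(HSIC(K, K) * HSIC(L, L))."""
    kc, lc = _center(k), _center(l)
    return (kc * lc).sum() / np.sqrt((kc * kc).sum() * (lc * lc).sum())

def linear_cka(x, y):
    """Linear CKA between two (n_samples, n_features) matrices."""
    return _cka_from_grams(x @ x.T, y @ y.T)

def rbf_gram(x, sigma_frac=1.0):
    """RBF Gram matrix; bandwidth is a fraction of the median pairwise distance."""
    sq = (x ** 2).sum(1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    sigma = sigma_frac * np.sqrt(np.median(d2[np.triu_indices_from(d2, k=1)]))
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rbf_cka(x, y, sigma_frac=1.0):
    """RBF-kernel CKA (bandwidth rule is an assumption of this sketch)."""
    return _cka_from_grams(rbf_gram(x, sigma_frac), rbf_gram(y, sigma_frac))

def rsa_spearman(x, y):
    """Spearman correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices(x.shape[0], k=1)
    rdm_x = (1.0 - np.corrcoef(x))[iu]  # correlation distance between samples
    rdm_y = (1.0 - np.corrcoef(y))[iu]
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)  # ignores ties
    return np.corrcoef(rank(rdm_x), rank(rdm_y))[0, 1]
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, while the RSA score depends on the rank order of pairwise dissimilarities. The two can therefore disagree at a given depth, which is consistent with the CKA versus RSA divergence reported for the supervised ViTs.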

Paper Structure

This paper contains 18 sections, 19 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Process Overview for Measuring Human-Cat Invariant Representations in CNNs, ViTs and Self-Supervised ViTs. A total of 191 cat POV videos are sourced from the internet. Our biologically informed cat vision filter is applied to individual frames to create pairs of original (human) and cat-vision-filtered frames. Each pair passes through a suite of frozen vision encoders, and the extracted features are then subjected to statistical tests.
  • Figure 2: Moving counter-clockwise, Panel (a) depicts the cat's photoreceptor spectral sensitivity curves based on biological data, Panel (b) depicts the temporal frequency response of the cat vs. human visual system, Panel (c) shows the cat's vertical slit pupil kernel in a 3:1 aspect ratio, Panel (d) shows the cat's spectral and visual acuity map based on our biologically informed implementation of the cat's pupil, and Panel ...
  • Figure 3: CNN embeddings with t-SNE (left) and UMAP (right). Colors encode domains (human vs. cat) and marker shapes encode models within the family. These panels are intended to assess domain-level overlap by visual inspection: color mixing indicates cross-domain similarity, while separated color clusters indicate stronger domain-specific structure; shape differences reveal whether such trends are consistent across CNN variants.
  • Figure 4: Transformer embeddings with t-SNE (left) and UMAP (right). Colors (domains) and marker shapes (models) follow Figure 3. Showing both t-SNE and UMAP allows a robustness check: consistent patterns across methods lend confidence, while differences may reflect method-specific neighborhood preservation.
  • Figure 5: DINO embeddings with t-SNE (left) and UMAP (right). Colors denote domains; marker shapes denote DINO variants. Self-supervised representations often yield distinct geometry; these panels enable visual examination of domain separation vs. overlap and whether patterns are consistent across DINO variants.
  • ...and 4 more figures
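
As a rough illustration of the vertical slit pupil kernel mentioned in the Figure 2 caption, the sketch below builds a normalized elliptical aperture that is three times taller than wide and applies it to a grayscale frame. Everything here is hypothetical: the names `slit_kernel` and `apply_kernel`, the elliptical shape, and the kernel size are assumptions for illustration. The paper's actual filter also models spectral sensitivity, temporal response, and acuity, none of which is reproduced here.

```python
import numpy as np

def slit_kernel(height=9, aspect=3.0):
    """Normalized elliptical aperture, `aspect` times taller than wide (hypothetical)."""
    height = height if height % 2 == 1 else height + 1  # force odd size
    width = max(1, int(round(height / aspect)))
    width = width if width % 2 == 1 else width + 1
    ys = (np.arange(height) - (height - 1) / 2) / (height / 2)
    xs = (np.arange(width) - (width - 1) / 2) / (width / 2)
    mask = (xs[None, :] ** 2 + ys[:, None] ** 2 <= 1.0).astype(float)
    return mask / mask.sum()

def apply_kernel(frame, kernel):
    """Filter a 2-D grayscale frame with an odd-sized kernel, edge-padded."""
    kh, kw = kernel.shape
    padded = np.pad(frame, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros_like(frame, dtype=float)
    rows, cols = frame.shape
    for i in range(kh):
        for j in range(kw):
            # Accumulate shifted copies of the frame weighted by the kernel.
            out += kernel[i, j] * padded[i:i + rows, j:j + cols]
    return out
```

Applied to a single bright pixel, this kernel spreads light over more rows than columns, the kind of anisotropic blur a vertically elongated aperture would produce. A paired sample in the spirit of Figure 1 would then be `(frame, apply_kernel(frame, slit_kernel()))`.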