HyFI: Hyperbolic Feature Interpolation for Brain-Vision Alignment

Sangmin Jo, Wootaek Jeong, Da-Woon Heo, Yoohwan Hwang, Heung-Il Suk

Abstract

Recent progress in artificial intelligence has encouraged numerous attempts to understand and decode the human visual system from brain signals. Prior works typically align neural activity independently with semantic and perceptual features extracted from images using pre-trained vision models. However, they fail to account for two key challenges: (1) the modality gap arising from the inherent difference in the amount of information that brain signals and images can represent, and (2) the fact that semantic and perceptual features are highly entangled within neural activity. To address these issues, we utilize hyperbolic space, which is well suited to modeling differences in the amount of information and has the geometric property that geodesics between two points naturally bend toward the origin, where the representational capacity is lower. Leveraging these properties, we propose a novel framework, Hyperbolic Feature Interpolation (HyFI), which interpolates between semantic and perceptual visual features along hyperbolic geodesics. This enables both the fusion and compression of perceptual and semantic information, reflecting the limited expressiveness of brain signals and the entangled nature of these features. As a result, it facilitates better alignment between brain and visual features. We demonstrate that HyFI achieves state-of-the-art performance in zero-shot brain-to-image retrieval, outperforming prior methods with Top-1 accuracy improvements of up to +17.3% on THINGS-EEG and +9.1% on THINGS-MEG.
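The geometric property the abstract relies on, that hyperbolic geodesics bend toward the origin, can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the Lorentz (hyperboloid) model with curvature -1, and the function names and the toy 2-D feature vectors are illustrative choices.

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + <x_space, y_space>."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def exp_map_origin(v, eps=1e-9):
    """Lift a Euclidean (tangent) vector v onto the hyperboloid at the origin."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    time = np.cosh(norm)
    space = np.sinh(norm) * v / np.maximum(norm, eps)
    return np.concatenate([time, space], axis=-1)

def distance(x, y):
    """Geodesic distance between two points on the hyperboloid."""
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

def distance_from_origin(x):
    """Distance to the origin (1, 0, ..., 0); depends only on the time coordinate."""
    return np.arccosh(np.clip(x[..., 0], 1.0, None))

def geodesic_interpolate(x, y, t, eps=1e-9):
    """Point at fraction t along the geodesic from x (t=0) to y (t=1)."""
    d = np.maximum(np.asarray(distance(x, y))[..., None], eps)
    w_x = np.sinh((1.0 - t) * d) / np.sinh(d)
    w_y = np.sinh(t * d) / np.sinh(d)
    return w_x * x + w_y * y

# Toy stand-ins for a semantic and a perceptual feature (assumed values).
v_s = np.array([2.0, 0.0])
v_p = np.array([0.0, 2.0])
z_s, z_p = exp_map_origin(v_s), exp_map_origin(v_p)

z_mid = geodesic_interpolate(z_s, z_p, 0.5)
print("endpoint distance from origin:   ", distance_from_origin(z_s))
print("Euclidean midpoint norm:         ", np.linalg.norm(0.5 * (v_s + v_p)))
print("geodesic midpoint dist to origin:", distance_from_origin(z_mid))
```

In this toy case the geodesic midpoint lies closer to the origin than either endpoint, and also closer than the straight-line average of the two tangent vectors, which is the compression effect the abstract describes.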

Paper Structure

This paper contains 56 sections, 40 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: (a) The human visual system processes perceptual and semantic information, and some degradation occurs when neural activity is recorded. (b) Previous works aligned semantic and perceptual features through separate pathways, overlooking their entanglement in brain signals. (c) In contrast, hyperbolic interpolation merges perceptual and semantic features with lower complexity, enhancing alignment with brain signals.
  • Figure 2: (a) The semantic image $\mathbf{x}_v^{s}$ and perceptual image $\mathbf{x}_v^{p}$ are encoded by CLIP and projected via a linear layer, and then lifted onto the hyperboloid via the exponential map. Using a learned weight $t$ derived from the semantic image features, the two image features are interpolated on the hyperbolic manifold. Similarly, EEG inputs are encoded and projected onto the same hyperbolic space. Contrastive learning is then performed on the hyperboloid to bring paired EEG-image representations closer. (b) A schematic view of the hyperbolic embedding space. The interpolated representation $\hat{\mathbf{z}}_v$ lies along the geodesic between the semantic feature $\mathbf{z}_v^{s}$ and the perceptual feature $\mathbf{z}_v^{p}$. Contrastive learning then pulls the EEG feature $\mathbf{z}_b$ toward the target $\hat{\mathbf{z}}_v$.
  • Figure 3: Examples of image augmentations and retrieval results. The semantic image $\mathbf{x}_v^{s}$ and perceptual image $\mathbf{x}_v^{p}$ are generated via fovea blur and Gaussian blur, respectively. Retrieval results using CLIP embedding show that semantic queries return category-relevant matches (e.g., fruits), while perceptual queries retrieve images with similar low-level attributes such as color and shape.
  • Figure 4: Qualitative comparison of image retrieval results. Our method retrieves semantically and perceptually coherent images, while the previous method often suffers from color or semantic inconsistencies.
  • Figure 5: Distributions of embedding distances from the root in (a) CLIP space and (b) hyperbolic space. Interpolated image embeddings lie closer to the root in hyperbolic space, unlike in CLIP space.
  • ...and 8 more figures
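The pipeline described in the Figure 2 caption (projection, exponential map, interpolation with a learned weight $t$, contrastive alignment) can be sketched end to end. This is a rough illustration under several assumptions that the captions do not confirm: curvature -1, a sigmoid gate on the semantic features to produce $t$, negative geodesic distance as the contrastive logit, a symmetric InfoNCE-style loss, and random arrays standing in for the projected CLIP and EEG features. All names here are hypothetical.

```python
import numpy as np

def exp_map_origin(v, eps=1e-9):
    """Lift Euclidean features of shape (B, D) onto the hyperboloid, shape (B, D+1)."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return np.concatenate(
        [np.cosh(norm), np.sinh(norm) * v / np.maximum(norm, eps)], axis=-1
    )

def geodesic_interpolate(x, y, t, eps=1e-9):
    """Row-wise geodesic interpolation; t has shape (B, 1) with values in [0, 1]."""
    inner = -x[:, :1] * y[:, :1] + np.sum(x[:, 1:] * y[:, 1:], axis=-1, keepdims=True)
    d = np.maximum(np.arccosh(np.clip(-inner, 1.0, None)), eps)
    return (np.sinh((1.0 - t) * d) * x + np.sinh(t * d) * y) / np.sinh(d)

def pairwise_distance(x, y):
    """All-pairs geodesic distances between rows of x and rows of y, shape (B, B)."""
    inner = -np.outer(x[:, 0], y[:, 0]) + x[:, 1:] @ y[:, 1:].T
    return np.arccosh(np.clip(-inner, 1.0, None))

def log_softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)
    return a - np.log(np.exp(a).sum(axis=axis, keepdims=True))

def contrastive_loss(z_b, z_v, temperature=0.1):
    """Symmetric InfoNCE with negative hyperbolic distance as similarity (assumed form)."""
    logits = -pairwise_distance(z_b, z_v) / temperature
    idx = np.arange(len(z_b))
    loss_b2v = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_v2b = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_b2v + loss_v2b)

rng = np.random.default_rng(0)
B, D = 8, 16
f_s = 0.1 * rng.normal(size=(B, D))  # stand-in: projected semantic image features
f_p = 0.1 * rng.normal(size=(B, D))  # stand-in: projected perceptual image features
f_b = 0.1 * rng.normal(size=(B, D))  # stand-in: projected EEG features

# Hypothetical gate: interpolation weight t derived from the semantic features.
w_gate = rng.normal(size=(D, 1))
t = 1.0 / (1.0 + np.exp(-(f_s @ w_gate)))  # shape (B, 1), values in (0, 1)

z_s, z_p, z_b = exp_map_origin(f_s), exp_map_origin(f_p), exp_map_origin(f_b)
z_hat = geodesic_interpolate(z_s, z_p, t)  # interpolated image target
loss = contrastive_loss(z_b, z_hat)
print("contrastive loss on random stand-in features:", loss)
```

In a trained model the projections and the gate would be learned jointly by minimizing this loss; here they are random, so the printed value only confirms that the pieces fit together.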