
Visual Fixation-Based Retinal Prosthetic Simulation

Yuli Wu, Do Dinh Tan Nguyen, Henning Konermann, Rüveyda Yilmaz, Peter Walter, Johannes Stegmaier

Abstract

This study proposes a retinal prosthetic simulation framework driven by visual fixations, inspired by the saccade mechanism, and assesses performance improvements through end-to-end optimization in a classification task. Salient patches are predicted from input images using the self-attention map of a vision transformer to mimic visual fixations. These patches are then encoded by a trainable U-Net and simulated using the pulse2percept framework to predict visual percepts. By incorporating a learnable encoder, we aim to optimize the visual information transmitted to the retinal implant, addressing both the limited resolution of the electrode array and the distortion between the input stimuli and resulting phosphenes. The predicted percepts are evaluated using the self-supervised DINOv2 foundation model, with an optional learnable linear layer for classification accuracy. On a subset of the ImageNet validation set, the fixation-based framework achieves a classification accuracy of 87.72%, using computational parameters based on a real subject's physiological data, significantly outperforming the downsampling-based accuracy of 40.59% and approaching the healthy upper bound of 92.76%. Our approach shows promising potential for producing more semantically understandable percepts with the limited resolution available in retinal prosthetics.
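The fixation step described above, keeping only the most salient patches, can be illustrated with a minimal sketch. It assumes the per-patch saliency scores are already available as a 2D grid (in the paper they come from the self-attention map of a vision transformer). The function name `select_fixation_patches`, the `ratio` parameter, and the tie-breaking rule are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import List, Tuple


def select_fixation_patches(
    attention: List[List[float]], ratio: float
) -> List[Tuple[int, int]]:
    """Return (row, col) indices of the most salient patches.

    attention: 2D grid of per-patch saliency scores (assumed to be given;
               the paper derives them from ViT self-attention).
    ratio:     fraction of patches to keep, in (0, 1].
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")

    # Flatten the grid into (score, row, col) triples.
    scored = [
        (score, r, c)
        for r, row in enumerate(attention)
        for c, score in enumerate(row)
    ]

    # Keep at least one patch, otherwise the rounded fraction of all patches.
    k = max(1, round(ratio * len(scored)))

    # Highest score first; ties are broken by row-major position (assumption).
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))

    # Return the kept positions in row-major order.
    return sorted((r, c) for _, r, c in scored[:k])
```

Calling `select_fixation_patches(scores, 0.5)` on a grid of scores keeps the top half of the patches. The remaining patches would then be dropped before encoding, which is how the ratio of retained patches can be varied.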

Paper Structure

This paper contains 14 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: (a) Original image. (b) Trace of saccades of the human eye viewing a still image ([wiki:saccade], under CC BY-SA 2.0). (c) Self-attention from a Vision Transformer in DINOv2 [oquab2024dinov].
  • Figure 2: Overview of the visual fixation-based retinal prosthetic simulation pipeline. Salient patches are extracted using a fixation predictor informed by the attention scores from a Vision Transformer (ViT) [dosovitskiy2021an], as implemented in the self-supervised DINOv2 pre-trained foundation model [caron2021emerging, oquab2024dinov]. These fixation patches are then encoded by a U-Net [ronneberger2015u] to optimize the stimulus, which is trained for classification tasks on simulated percepts generated by the pulse2percept framework [michael_beyeler-proc-scipy-2017]. The performance is evaluated using the frozen DINOv2 backbone, with the option of applying learnable linear probing.
  • Figure 3: Classification accuracy ($x$) w.r.t. the ratio of the most salient fixation patches ($y$).
  • Figure 4: A virtual retinal implant with 196 electrodes (14$\times$14) on the retinal axon map [michael_beyeler-proc-scipy-2017, beyeler2019model].
  • Figure 5: Visualization. Simulated visual fixations are preserved in (b) and (g-i), best viewed when zoomed in. Percepts refer to the predicted phosphenes generated from the original stimuli, while percepts* are those from the encoded stimuli with a U-Net [ronneberger2015u]. The Axon Map Model [beyeler2019model] is used to predict percepts with realistic parameters $\rho$ and $\lambda$.
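
To give a sense of how little resolution a 14$\times$14 electrode array (Figure 4) leaves, the following minimal sketch reduces a square grayscale image to one value per electrode by block averaging. This is only an assumed stand-in for a downsampling baseline: the source does not state which downsampling method was used, and the function name `downsample_to_grid` is hypothetical. It does not model phosphene distortion, which the paper simulates with pulse2percept.

```python
from typing import List


def downsample_to_grid(image: List[List[float]], grid: int = 14) -> List[List[float]]:
    """Average-pool a square grayscale image down to grid x grid values.

    Each output value is the mean of one non-overlapping block of pixels,
    i.e. one value per electrode of a grid x grid array (assumption:
    block averaging; the paper's exact downsampling method is not given).
    """
    size = len(image)
    if size == 0 or any(len(row) != size for row in image):
        raise ValueError("image must be square and non-empty")
    if size % grid != 0:
        raise ValueError("image side must be a multiple of grid")

    block = size // grid  # pixels per electrode along each axis
    out: List[List[float]] = []
    for gr in range(grid):
        row_out: List[float] = []
        for gc in range(grid):
            # Sum all pixels in the block assigned to electrode (gr, gc).
            total = 0.0
            for r in range(gr * block, (gr + 1) * block):
                for c in range(gc * block, (gc + 1) * block):
                    total += image[r][c]
            row_out.append(total / (block * block))
        out.append(row_out)
    return out
```

For a 224$\times$224 input, each of the 196 output values averages a 16$\times$16 pixel block, which is the kind of information loss a learned encoder on fixation patches aims to reduce.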