Alfa: Attentive Low-Rank Filter Adaptation for Structure-Aware Cross-Domain Personalized Gaze Estimation

He-Yen Hsieh, Wei-Te Mark Ting, H. T. Kung

TL;DR

Alfa achieves the lowest average gaze error across four cross-dataset gaze benchmarks, outperforming existing test-time personalization (TTP) methods and low-rank adaptation (LoRA)-based variants. Its attentive low-rank approach also extends beyond vision, for example to diffusion-based language models.

Abstract

Pre-trained gaze models learn to identify useful patterns commonly found across users, but subtle user-specific variations (e.g., eyelid shape or facial structure) can degrade model performance. Test-time personalization (TTP) adapts pre-trained models to these user-specific domain shifts using only a few unlabeled samples. Efficient fine-tuning is critical for this domain adaptation because data and computation resources can be limited, especially for on-device customization. While popular parameter-efficient fine-tuning (PEFT) methods reduce adaptation costs by updating only a small set of weights, they may not take full advantage of the structures encoded in pre-trained filters. To leverage the structures learned during pre-training more effectively, we reframe personalization as reweighting existing features rather than learning entirely new ones. We present Attentive Low-Rank Filter Adaptation (Alfa), which adapts gaze models by reweighting semantic patterns in pre-trained filters. In Alfa, singular value decomposition (SVD) extracts dominant spatial components that capture eye and facial characteristics across users. An attention mechanism then needs only a few unlabeled samples to adjust and reweight these pre-trained structures, selectively amplifying those relevant to a target user. Alfa achieves the lowest average gaze error across four cross-dataset gaze benchmarks, outperforming existing TTP methods and low-rank adaptation (LoRA)-based variants. We also show that Alfa's attentive low-rank methods can be applied beyond vision, for example to diffusion-based language models.
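The SVD step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `truncated_svd_filters` and `rank_slice`, the choice to flatten a convolution weight to shape (out_channels, in_channels × kh × kw), the rank of 8, and the random stand-in weight are all assumptions made for illustration. Only the 64 × 3 × 7 × 7 shape of the ResNet-18 `conv1` filter and the factorization $W_d = U_d S_d V_d^\top$ come from known facts or the text.

```python
import numpy as np

def truncated_svd_filters(weight, rank):
    """Flatten a conv weight to 2-D and keep its top `rank` singular triplets.

    The flattening convention (out_ch, in_ch * kh * kw) is an assumption.
    """
    out_ch = weight.shape[0]
    w2d = weight.reshape(out_ch, -1)
    u, s, vt = np.linalg.svd(w2d, full_matrices=False)
    u_d, s_d, vt_d = u[:, :rank], s[:rank], vt[:rank, :]
    # V_base = S_d V_d^T: each row is one singular direction scaled by its singular value.
    v_base = s_d[:, None] * vt_d
    return u_d, s_d, vt_d, v_base

def rank_slice(u_d, s_d, vt_d, s):
    """Rank-one component U_d[:, s] S_d[s] V_d^T[s] for slice index s."""
    return s_d[s] * np.outer(u_d[:, s], vt_d[s])

# Random stand-in with the shape of ResNet-18 conv1 (64 filters, 3 x 7 x 7 each).
rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 3, 7, 7))

u_d, s_d, vt_d, v_base = truncated_svd_filters(weight, rank=8)
w_d = u_d @ v_base  # rank-8 approximation W_d = U_d S_d V_d^T
```

Summing `rank_slice(u_d, s_d, vt_d, s)` over all kept slices reproduces `w_d`, so each slice can be inspected, or reweighted, on its own.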

Paper Structure (34 sections, 16 equations, 16 figures, 11 tables)

Figures (16)

  • Figure 1: Alfa achieves the lowest average gaze error, with the smallest model size, across four cross-dataset benchmarks: from ETH-XGaze to MPIIGaze, ETH-XGaze to EyeDiap, Gaze360 to MPIIGaze, and Gaze360 to EyeDiap. Top: Comparison with other test-time personalization (TTP) methods. Baseline refers to a ResNet-18 without fine-tuning. Bottom: Comparison with low-rank adaptation (LoRA)-based variants.
  • Figure 2: Overview of Attentive Low-Rank Filter Adaptation (Alfa). (a) The pre-trained weight matrix is approximated using truncated SVD: $W_d = U_d S_d V_d^\top$. Then, a tunable low-rank update $\Delta W$ is added for adaptation. (b) Alfa adapts gaze models by reweighting spatial structures encoded in pre-trained filters. Alfa extracts dominant spatial patterns ($V_{\text{base}} = S_d V_d^\top$) via singular value decomposition (SVD). For personalization, multi-head low-rank modules $A^\mathcal{Q}$ and $B^\mathcal{Q}$ generate query weights, and $V_{\text{base}}$ and $V_{\text{base}}^\top$ are reused as key and value matrices. Using multi-head scaled dot-product attention, Alfa identifies the spatial structures most relevant to a target user. Alfa aggregates this into a personalized update using additional low-rank modules $A^\mathcal{P}$ and $B^\mathcal{P}$, forming $V_{\text{Alfa}}$, which encodes gaze-specific adaptations informed by the pre-trained spatial structure.
  • Figure 3: Comparison with other LoRA-based variants, including LoRA, MiLoRA, Spectral Adapter$^A$, DoRA, MoSLoRA, MELoRA, and FLoRA. Alfa selectively reuses semantic patterns encoded in pre-trained weights and activates the most relevant ones during adaptation. Only blocks with red backgrounds are tunable. Best viewed in color.
  • Figure 4: Spatial patterns captured during pre-training. Visualizations use rank slices from SVD-decomposed weights of ResNet-18 (pre-trained on ETH-XGaze) from conv1 and the first block of layer3. Left column: visualization of the encoded pattern. Middle and right columns: activations for Subjects 0 and 16 from ETH-XGaze using $U_d[:, s] S_d[s] V_d^\top[s]$ for slice $s$. Red regions indicate higher activations.
  • Figure 5: Visualization of low-rank updates $\Delta W$ on the MPIIGaze test set for LoRA and Alfa using filters from conv1 and the first block of layer3 in the ResNet-18 model (pre-trained on ETH-XGaze). Red regions indicate higher activation values. With LoRA updates, the regions the model focuses on vary widely across users. In contrast, Alfa captures localized regions consistently across users. This suggests that Alfa identifies components in the semantic base dictionary that transfer well between source and target domains, and that reweighting these components allows for effective adaptation.
  • ...and 11 more figures
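
The attention step summarized in the Figure 2 caption can be sketched as follows. This is a simplified, single-head illustration, not the paper's multi-head formulation: the function name `attentive_update`, all tensor shapes, the way the low-rank pairs ($A^\mathcal{Q}$, $B^\mathcal{Q}$) and ($A^\mathcal{P}$, $B^\mathcal{P}$) are composed, the random stand-ins for $U_d$ and $V_{\text{base}}$, and the LoRA-style zero initialization of $B^\mathcal{P}$ are assumptions. The caption only states that low-rank modules produce the query, that $V_{\text{base}}$ serves as key and value, and that a second pair of low-rank modules forms $V_{\text{Alfa}}$.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_update(v_base, a_q, b_q, a_p, b_p):
    """Single-head scaled dot-product attention over the rows of V_base.

    v_base: (d, n) frozen base, one row per retained SVD component
    a_q: (r, n), b_q: (d, r)   tunable low-rank query factors (assumed shapes)
    a_p: (r, d), b_p: (d, r)   tunable low-rank output factors (assumed shapes)
    """
    d, n = v_base.shape
    query = b_q @ a_q                          # (d, n) low-rank query
    scores = query @ v_base.T / np.sqrt(n)     # (d, d) similarity to each component
    attn = softmax(scores, axis=-1)            # each row sums to 1
    mixed = attn @ v_base                      # (d, n) reweighted components
    v_alfa = (b_p @ a_p) @ mixed               # (d, n) low-rank personalized update
    return v_alfa, attn

rng = np.random.default_rng(0)
d, n, r = 8, 147, 2
v_base = rng.standard_normal((d, n))                 # stand-in for S_d V_d^T
u_d = np.linalg.qr(rng.standard_normal((64, d)))[0]  # stand-in orthonormal U_d

a_q = rng.standard_normal((r, n))
b_q = rng.standard_normal((d, r))
a_p = rng.standard_normal((r, d))
b_p = np.zeros((d, r))  # zero init so the update starts at zero (assumed convention)

v_alfa, attn = attentive_update(v_base, a_q, b_q, a_p, b_p)
delta_w = u_d @ v_alfa  # (64, n) update added to the frozen W_d
```

In this sketch only the four small factors would be trained, while `u_d` and `v_base` stay frozen, so the update can only recombine components already present in the pre-trained filters.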