Personalized Image Descriptions from Attention Sequences

Ruoyu Xue, Hieu Le, Jingyi Xu, Sounak Mondal, Abe Leite, Gregory Zelinsky, Minh Hoai, Dimitris Samaras

TL;DR

DEPER addresses a gap in personalized image description by modeling individual viewing patterns alongside linguistic style. It builds a subject representation from three components (a dual-context encoder, a trajectory-informed extractor, and a trajectory decoder) and couples it to a frozen vision-language model (VLM) through a lightweight adapter, generating subject-specific captions without gaze data at test time. Across four datasets, DEPER achieves a 24% average improvement and shows strong few-shot generalization to unseen subjects, with ablations confirming the critical role of attention dynamics. The approach highlights the value of behavior-aware representations for human alignment in multimodal systems and opens avenues for broader personalization in vision-language tasks.

Abstract

People can view the same image differently: they focus on different regions, objects, and details in varying orders and describe them in distinct linguistic styles. This leads to substantial variability in image descriptions. However, existing models for personalized image description focus on linguistic style alone, with no prior work leveraging individual viewing patterns. We address this gap by explicitly modeling personalized viewing behavior as a core factor in description generation. Our method, DEPER (DEscription-PERception persona encoder), learns a subject embedding that captures both linguistic style and viewing behavior, guided by an auxiliary attention-prediction task. A lightweight adapter aligns these embeddings with a frozen vision-language model, enabling few-shot personalization without retraining. Across four datasets spanning diverse viewing tasks and both short and detailed descriptions, DEPER achieves a 24% average improvement, showing that modeling personalized attention produces more human-aligned and high-quality descriptions. We posit that understanding how people see helps predict what they say; modeling human diversity in perception can improve both performance and human alignment in multimodal systems.

Paper Structure

This paper contains 19 sections, 5 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: People have distinct viewing habits [chen2024beyond, xue2025few] that shape how they describe an image. Subject A moves between major objects, while Subject B inspects them in detail and in a different order. Our method models these patterns to produce personalized image descriptions. The images and annotations are from [pont2020connecting].
  • Figure 2: Overview of DEPER and Caption Generation: DEPER extracts a subject embedding $\mathbf{z}_s$ from a triplet ($I, D_s, T_s$), capturing the personalized viewing patterns and linguistic style. $\mathbf{z}_s$ then conditions a VLM to produce subject-aligned image descriptions. A Dual-Context Encoder aligns perceptual and linguistic information into $\mathbf{Z}_\mathrm{dual}$. A Subject Embedding Extractor then distills $\mathbf{Z}_\mathrm{dual}$ to $\mathbf{z}_s$, yielding personalized attention–linguistic traits. $\mathbf{z}_s$ is distinctive across subjects yet consistent across images, enforced by classification and contrastive losses. A trajectory decoder further encourages $\mathbf{Z}_\mathrm{dual}$ to capture viewing dynamics, and helps $\mathbf{z}_s$ capture a subject's exploration behavior.
  • Figure 3: Qualitative results, with one example per dataset and two subject-specific descriptions each (subjects 1 and 2). From top to bottom and left to right: COCO-LN [pont2020connecting], Flickr30k-LN [pont2020connecting], Kollenda et al. [kollenda2025individual], and He et al. [he2019human]. Subject-distinct content is highlighted in red. Qwen+PT is denoted as Q+PT.
  • Figure 4: Qualitative results of DEPER on the seen and unseen splits of Flickr30k-LN (first and second rows, respectively). The first column visualizes DEPER's subject embeddings, where colors denote subjects and each point represents an image–description–trajectory triplet. The second column shows ground-truth attention trajectories with their corresponding nouns and orders; the third column shows reconstructed trajectories from the test set after Stage-2 training.
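
The Figure 2 caption states that $\mathbf{z}_s$ is kept distinctive across subjects yet consistent across images through classification and contrastive losses, but this summary does not give their exact form. As a rough illustration only, the sketch below assumes a standard supervised contrastive loss, in which embeddings from the same subject are treated as positives. The function name `subject_contrastive_loss`, the temperature value, and the use of NumPy are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def subject_contrastive_loss(z, subject_ids, temperature=0.1):
    """Supervised contrastive loss over subject embeddings (illustrative sketch).

    z:           (N, d) array, one embedding per (image, description, trajectory) triplet
    subject_ids: (N,) array, the subject label of each embedding
    temperature: softmax temperature (assumed value, not from the paper)
    """
    z = np.asarray(z, dtype=np.float64)
    subject_ids = np.asarray(subject_ids)
    n = z.shape[0]

    self_mask = np.eye(n, dtype=bool)
    # Positives: other embeddings that belong to the same subject.
    pos_mask = (subject_ids[:, None] == subject_ids[None, :]) & ~self_mask
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0
    if not valid.any():
        return 0.0  # no anchor has a positive; nothing to contrast

    # Cosine similarity scaled by the temperature.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature
    # Exclude each anchor's similarity to itself from the softmax.
    sim = np.where(self_mask, -np.inf, sim)

    # Numerically stable row-wise log-softmax.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))

    # Average log-probability of the positives for each valid anchor.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    loss_per_anchor = -pos_log_prob[valid] / pos_counts[valid]
    return float(loss_per_anchor.mean())


# Toy usage: four embeddings forming two tight clusters.
z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
clustered = subject_contrastive_loss(z, np.array([0, 0, 1, 1]))  # labels match clusters
mixed = subject_contrastive_loss(z, np.array([0, 1, 0, 1]))      # labels cut across clusters
print(clustered < mixed)
```

Under this assumed loss, embeddings that cluster by subject score lower than embeddings whose subject labels cut across clusters, which matches the "distinctive across subjects, consistent across images" property the caption describes.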