Personalized Vision via Visual In-Context Learning

Yuxin Jiang, Yuchao Gu, Yiren Song, Ivor Tsang, Mike Zheng Shou

TL;DR

PICO reframes personalized vision as visual in-context learning, leveraging diffusion priors to infer a user-defined transformation from a single exemplar and apply it to new images without retraining. A compact VisRel dataset trains a diffusion transformer to map broad visual relations into a unified latent space, while an attention-guided seed scorer stabilizes test-time inference. Across segmentation and flexible task definitions, PICO outperforms fine-tuning and synthetic-data baselines and generalizes to both recognition and generation. The approach is data-efficient, flexible, and capable of handling novel user-defined tasks at test time. Limitations include extrapolation to tasks outside the learned visual relation space and constraints of the four-panel input format.

Abstract

Modern vision models, trained on large-scale annotated datasets, excel at predefined tasks but struggle with personalized vision -- tasks defined at test time by users with customized objects or novel objectives. Existing personalization approaches rely on costly fine-tuning or synthetic data pipelines, which are inflexible and restricted to fixed task formats. Visual in-context learning (ICL) offers a promising alternative, yet prior methods are confined to narrow, in-domain tasks and fail to generalize to open-ended personalization. We introduce Personalized In-Context Operator (PICO), a simple four-panel framework that repurposes diffusion transformers as visual in-context learners. Given a single annotated exemplar, PICO infers the underlying transformation and applies it to new inputs without retraining. To enable this, we construct VisRel, a compact yet diverse tuning dataset, showing that task diversity, rather than scale, drives robust generalization. We further propose an attention-guided seed scorer that improves reliability via efficient inference scaling. Extensive experiments demonstrate that PICO (i) surpasses fine-tuning and synthetic-data baselines, (ii) flexibly adapts to novel user-defined tasks, and (iii) generalizes across both recognition and generation.
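
To make the four-panel idea concrete, the following is a minimal sketch of how an exemplar pair $(A \to A')$ and a query $B$ might be tiled into a single canvas with an empty slot for $B'$. The 2x2 layout, the boolean fill mask, and the helper names `make_four_panel` and `extract_b_prime` are assumptions made for illustration; this section does not specify the actual panel arrangement, resolution, or masking scheme used by PICO.

```python
import numpy as np


def make_four_panel(a, a_prime, b, fill=0):
    """Tile A, A', B and a blank slot into one 2x2 canvas.

    Assumed layout (hypothetical, not taken from the paper):
        top row:    A | A'
        bottom row: B | blank slot where B' would be generated
    Returns the canvas and a boolean mask marking the blank slot.
    """
    if not (a.shape == a_prime.shape == b.shape):
        raise ValueError("all panels must share the same shape")
    h, w = a.shape[:2]
    blank = np.full_like(a, fill)
    top = np.concatenate([a, a_prime], axis=1)
    bottom = np.concatenate([b, blank], axis=1)
    canvas = np.concatenate([top, bottom], axis=0)
    # True only over the bottom-right panel, i.e. the region to be filled in.
    mask = np.zeros(canvas.shape[:2], dtype=bool)
    mask[h:, w:] = True
    return canvas, mask


def extract_b_prime(canvas):
    """Crop the bottom-right panel (the assumed B' slot) from a 2x2 canvas."""
    h, w = canvas.shape[0] // 2, canvas.shape[1] // 2
    return canvas[h:, w:]


if __name__ == "__main__":
    a = np.full((4, 6, 3), 10, dtype=np.uint8)        # exemplar input A
    a_prime = np.full((4, 6, 3), 20, dtype=np.uint8)  # exemplar output A'
    b = np.full((4, 6, 3), 30, dtype=np.uint8)        # query image B
    canvas, mask = make_four_panel(a, a_prime, b)
    print(canvas.shape, int(mask.sum()))
```

Under these assumptions, a generative model would be asked to fill only the masked quadrant, and `extract_b_prime` would recover the predicted $B'$ from the completed canvas.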

Paper Structure

This paper contains 26 sections, 9 equations, 18 figures, 8 tables, 1 algorithm.

Figures (18)

  • Figure 1: Predefined vs. Personalized Vision. Top: traditional, predefined tasks. Bottom: personalized tasks enabled by PICO. Given a new pair $(A \!\to\! A')$ and a query image $B$, our model infers the task in-context and produces $B'$, adapting to novel user-defined tasks at test time.
  • Figure 2: (a) Structured Visual Relation Space. Tasks are organized by semantic complexity (low to high) and spatial locality (local to global), covering diverse task types, color-coded as: ■ restoration/enhancement, ■ physical/geometric estimation, ■ semantic perception, ■ generative manipulation. (b) Training pipeline of PICO.
  • Figure 3: Personalized segmentation with visual prompt control. Given the same query image $B$ and text prompt ("Segment"), PICO produces diverse outputs on $B$ by varying the visual exemplar $(A\!\to\!A')$, controlling task type, style, granularity, and spatial focus.
  • Figure 4: Qualitative comparisons on test-time personalized tasks. Each task is defined by a visual exemplar $(A \!\to\! A')$. We compare PICO with five representative baselines on: (a)(b) watermark removal + style transfer; (c) background-only stylization; (d) contour-only edge detection; and (e) sticker insertion.
  • Figure 4: Quantitative results with and without text prompts.
  • ...and 13 more figures