A Theory-Inspired Framework for Few-Shot Cross-Modal Sketch Person Re-Identification

Yunpeng Gong, Yongjie Hou, Jiangming Shi, Kim Long Diep, Min Jiang

TL;DR

KTCAA introduces a theory-guided meta-learning framework for few-shot cross-modal sketch-to-RGB person re-identification, addressing the RGB-to-sketch transfer gap via two modules: Alignment Augmentation (AA) to reduce domain discrepancy and Knowledge Transfer Catalyst (KTC) to enhance perturbation invariance. Grounded in a generalization bound that includes a discrepancy term $\tfrac{1}{2} d_{\mathcal{H} \Delta \mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)$ and a perturbation term $L \cdot \gamma$, the framework jointly optimizes AA, KTC, and a contrastive loss $L_C$ under a meta-learning regime to transfer RGB knowledge to sketch scenarios. Empirical results on PKU-Sketch and Market-Sketch-1K show state-of-the-art performance under data scarcity, with significant gains from the two modules and strong cross-domain generalization without relying on extensive target-domain labels. Overall, KTCAA offers a principled approach to cross-modal transfer in low-data settings, delivering practical improvements for sketch-based recognition systems.
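
To make the role of the two terms concrete, here is one plausible shape for such a bound. It is a sketch that assumes the bound extends the classical $\mathcal{H}\Delta\mathcal{H}$ domain-adaptation bound. The symbols $\epsilon_S$, $\epsilon_T$, and $\lambda$, and the reading of $L$ as a Lipschitz constant and $\gamma$ as a perturbation radius, are assumptions and are not taken from this summary; the paper's exact statement may differ.

$$
\epsilon_T(h) \;\le\; \epsilon_S(h) \;+\; \tfrac{1}{2}\, d_{\mathcal{H} \Delta \mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) \;+\; L \cdot \gamma \;+\; \lambda
$$

Under this reading, AA targets the discrepancy term by moving the source distribution toward the target, while KTC targets the perturbation term by making the model less sensitive to shifts of size $\gamma$.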

Abstract

Sketch-based person re-identification aims to match hand-drawn sketches with RGB surveillance images, but remains challenging due to significant modality gaps and limited annotated data. To address this, we introduce KTCAA, a theoretically grounded framework for few-shot cross-modal generalization. Motivated by generalization theory, we identify two key factors influencing target domain risk: (1) domain discrepancy, which quantifies the alignment difficulty between source and target distributions; and (2) perturbation invariance, which evaluates the model's robustness to modality shifts. Based on these insights, we propose two components: (1) Alignment Augmentation (AA), which applies localized sketch-style transformations to simulate target distributions and facilitate progressive alignment; and (2) Knowledge Transfer Catalyst (KTC), which enhances invariance by introducing worst-case perturbations and enforcing consistency. These modules are jointly optimized under a meta-learning paradigm that transfers alignment knowledge from data-rich RGB domains to sketch-based scenarios. Experiments on multiple benchmarks demonstrate that KTCAA achieves state-of-the-art performance, particularly in data-scarce conditions.
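
The idea of a "localized sketch-style transformation" can be illustrated with a small, dependency-free sketch. It is not the authors' AA module: the grayscale conversion stands in for whatever sketch-style transform the paper uses, and the function names, the `max_frac` parameter, and the rectangle sampling scheme are all hypothetical.

```python
import random


def to_sketch_patch(image, top, left, height, width):
    """Return a copy of `image` with one rectangle converted to grayscale.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    Grayscale is only a stand-in for a real sketch-style transform.
    """
    out = [row[:] for row in image]  # copy rows so the input is not mutated
    for y in range(top, min(top + height, len(image))):
        for x in range(left, min(left + width, len(image[0]))):
            r, g, b = image[y][x]
            gray = round(0.299 * r + 0.587 * g + 0.114 * b)  # luma weights
            out[y][x] = (gray, gray, gray)
    return out


def alignment_augmentation(image, rng, max_frac=0.5):
    """Apply the stand-in transform to one randomly placed rectangle.

    `max_frac` (hypothetical) caps the patch size as a fraction of each side.
    Returns the augmented image and the (top, left, height, width) used.
    """
    h, w = len(image), len(image[0])
    ph = rng.randint(1, max(1, int(h * max_frac)))
    pw = rng.randint(1, max(1, int(w * max_frac)))
    top = rng.randint(0, h - ph)
    left = rng.randint(0, w - pw)
    return to_sketch_patch(image, top, left, ph, pw), (top, left, ph, pw)


# Usage: augment a tiny 4x4 RGB image with a seeded generator.
rgb = [[(200, 50, 10) for _ in range(4)] for _ in range(4)]
augmented, box = alignment_augmentation(rgb, random.Random(0))
```

Because only part of the image is altered, the sample keeps RGB cues elsewhere, which is one way to read the abstract's "progressive alignment": the model sees inputs that sit between the source and target modalities.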

Paper Structure

This paper contains 21 sections, 23 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Pipeline of the proposed KTCAA framework, which operates under a meta-learning paradigm. During the meta-training phase, a single-modal RGB dataset is used. The Alignment Augmentation (AA) module applies localized sketch-style transformations to simulate target domain characteristics, modeling modality discrepancies at the image level. This guides the model to progressively align source and target distributions while preserving fine-grained semantics. The Knowledge Transfer Catalyst (KTC) module introduces adversarial perturbations to simulate cross-modal uncertainty and is jointly optimized through meta-learning. The alignment loss $L_{\text{align}}$ between features before and after perturbation, along with the adversarial classification loss $L_{\text{adv}}$, enhances the model’s robustness against detail blur and modality shifts. Additionally, the contrastive loss $L_C$ is jointly optimized with these regularization terms to enhance cross-modal representation learning. During the meta-testing phase, the base model parameters $w$ are frozen, and the updated model $w'$ is fine-tuned for few-shot sketch-based Re-ID, leveraging cross-modal knowledge for improved generalization under domain shifts.
  • Figure 2: Qualitative comparison of retrieval results between SS-reID (a) and our KTCAA (b) on the Market-Sketch-1K dataset. Each row shows top-6 results for a sketch query. Green and red boxes indicate correct and incorrect matches, respectively.
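
The KTC step described in the Figure 1 caption (perturb the input adversarially, then penalize both the adversarial classification loss $L_{\text{adv}}$ and the feature shift $L_{\text{align}}$) can be sketched on a toy linear classifier. This is a minimal illustration under strong assumptions, not the paper's implementation: the model is a single linear logit, the "feature" is that scalar logit, the perturbation is a one-step sign-gradient move inside an L-infinity ball of radius `gamma`, and every function name is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(w, b, x):
    """Linear score w . x + b, used here as a stand-in for the feature."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def bce(p, y):
    """Binary cross-entropy for a probability p and a label y in {0, 1}."""
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def worst_case_perturbation(w, b, x, y, gamma):
    """One sign-gradient step of size gamma that increases the BCE loss."""
    p = sigmoid(logit(w, b, x))
    grad_x = [(p - y) * wi for wi in w]  # d(BCE)/dx for a linear logit
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + gamma * sign(g) for xi, g in zip(x, grad_x)]


def ktc_losses(w, b, x, y, gamma):
    """Return (l_adv, l_align) for one sample.

    l_adv:   classification loss on the perturbed input.
    l_align: squared distance between clean and perturbed features.
    """
    x_adv = worst_case_perturbation(w, b, x, y, gamma)
    z, z_adv = logit(w, b, x), logit(w, b, x_adv)
    return bce(sigmoid(z_adv), y), (z - z_adv) ** 2


# Usage: one positive sample with perturbation radius 0.1.
l_adv, l_align = ktc_losses([1.0, -2.0], 0.0, [0.5, 0.5], 1, 0.1)
```

For a linear logit the sign-gradient step is the exact worst case within the L-infinity ball, which keeps the example short. A deep network would need an iterative or approximate inner maximization, and the alignment term would compare feature vectors rather than scalars.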