Learning What Helps: Task-Aligned Context Selection for Vision Tasks
Jingyu Guo, Emir Konuk, Fredrik Strand, Christos Matsoukas, Kevin Smith
TL;DR
Task-Aligned Context Selection (TACS) enables discriminative vision models to learn which contextual examples truly improve downstream performance by integrating retrieval into the learning objective. It couples a selector with a Downstream Task Network and trains them via a hybrid optimization that combines differentiable relaxation with reinforcement learning. Across 18 diverse datasets, including fine-grained natural images and medical imaging, TACS yields consistent gains over similarity-based retrieval, with notable benefits in data-limited and ambiguous settings and interpretable contextual reasoning patterns. This work reframes retrieval as a task-aware, adaptive component of vision systems, with potential implications for data curation, active learning, and domain-specific visual reasoning.
Abstract
Humans often resolve visual uncertainty by comparing an image with relevant examples, but Vision Transformers (ViTs) lack the ability to identify which examples would improve their predictions. We present Task-Aligned Context Selection (TACS), a framework that learns to select paired examples that truly improve task performance rather than those that merely appear similar. TACS jointly trains a selector network with the task model through a hybrid optimization scheme that combines gradient-based supervision with reinforcement learning, making retrieval part of the learning objective. By aligning selection with task rewards, TACS enables discriminative models to discover which contextual examples genuinely help. Across 18 datasets covering fine-grained recognition, medical image classification, and medical image segmentation, TACS consistently outperforms similarity-based retrieval, particularly in challenging or data-limited settings.
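The abstract does not give the exact losses, so the following is only a toy sketch of how the two ingredients of a hybrid scheme could look for a selector that scores a handful of candidate context examples. It assumes a Gumbel-softmax relaxation for the differentiable part and a single-sample REINFORCE estimator for the reward-driven part; both choices, and every name in the code (`gumbel_softmax`, `reinforce_grad`, `toy_reward`), are illustrative assumptions rather than the authors' implementation.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax of Gumbel-perturbed selector logits.

    Assumed stand-in for the "differentiable relaxation"; the soft weights
    could be used to mix candidate context features in a differentiable way.
    """
    noisy = []
    for x in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        noisy.append(x - math.log(-math.log(u)))  # add Gumbel(0, 1) noise
    return softmax(noisy, temperature)


def reinforce_grad(logits, reward_fn, rng, baseline=0.0):
    """Single-sample REINFORCE estimate of d(-expected reward)/d(logits).

    Assumed stand-in for the reinforcement-learning part: sample one discrete
    candidate, observe a task reward, and push its log-probability up or down.
    """
    probs = softmax(logits)
    k = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    advantage = reward_fn(k) - baseline
    # d log p_k / d logit_j = 1[j == k] - p_j
    grad = [-advantage * ((1.0 if j == k else 0.0) - p)
            for j, p in enumerate(probs)]
    return k, grad


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.0, 0.0, 0.0]  # hypothetical selector scores for 3 candidates

    def toy_reward(k):
        # Hypothetical task reward: candidate 1 helps the task model most.
        return [0.0, 1.0, 0.2][k]

    for _ in range(500):
        _, grad = reinforce_grad(logits, toy_reward, rng)
        logits = [x - 0.1 * g for x, g in zip(logits, grad)]  # descent step

    print("selection probabilities:", softmax(logits))
    print("relaxed sample:", gumbel_softmax(logits, temperature=0.5, rng=rng))
```

In a real system the reward would come from the task model's performance with the selected context, and both signals would be backpropagated through neural networks rather than a bare list of logits; the sketch only isolates the selection mechanics.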
