Open Ad-hoc Categorization with Contextualized Feature Learning
Zilin Wang, Sangwoo Mo, Stella X. Yu, Sima Behpour, Liu Ren
TL;DR
The paper addresses open ad-hoc categorization, in which context-specific categories must be discovered from a few labeled exemplars and abundant unlabeled data. It introduces OAK, a simple framework that adds learnable context tokens to a frozen CLIP backbone and trains with a joint objective combining semantic guidance (text alignment) with bottom-up visual clustering (GCD). Switching context tokens contextualizes the features, which supports accurate classification of both known and novel classes across multiple contexts, yields interpretable saliency maps, and makes it possible to name discovered clusters. Empirically, OAK outperforms semantic-only and visual-only baselines on Stanford and Clevr-4 in both per-context and Omni accuracy, demonstrating robust context switching and strong novel-category discovery. These results advance adaptive, context-aware perception for open-world and ad-hoc tasks in AI systems.
Abstract
Adaptive categorization of visual scenes is essential for AI agents to handle changing tasks. Unlike fixed common categories for plants or animals, ad-hoc categories are created dynamically to serve specific goals. We study open ad-hoc categorization: given a few labeled exemplars and abundant unlabeled data, the goal is to discover the underlying context and to expand ad-hoc categories through semantic extension and visual clustering around it. Building on the insight that ad-hoc and common categories rely on similar perceptual mechanisms, we propose OAK, a simple model that introduces a small set of learnable context tokens at the input of a frozen CLIP and optimizes with both CLIP's image-text alignment objective and GCD's visual clustering objective. On the Stanford and Clevr-4 datasets, OAK achieves state-of-the-art accuracy and concept discovery across multiple categorizations, including 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. Moreover, OAK produces interpretable saliency maps, focusing on hands for Action, faces for Mood, and backgrounds for Location, promoting transparency and trust while enabling adaptive and generalizable categorization.
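The mechanism summarized above (learnable context tokens prepended to the input of a frozen encoder, trained with a combined alignment and clustering objective) can be illustrated with a small toy sketch. This is not the authors' implementation. The frozen attention-pooling layer is a hypothetical stand-in for CLIP's image encoder, the clustering term is a generic placeholder rather than GCD's actual objective, and every name and value here (`encode`, `context_tokens`, `joint_loss`, the weight `lam`, the temperature, the dimensions) is an assumption made for illustration.

```python
import math
import random

random.seed(0)
DIM, N_PATCH, N_CTX = 8, 6, 2  # toy sizes, chosen arbitrarily

def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(m, v):
    return [dot(row, v) for row in m]

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Frozen stand-in for the CLIP image encoder (assumption): a single
# attention-pooling layer whose weights are never updated.
FROZEN_QUERY = rand_vec(DIM)
FROZEN_W_K = rand_mat(DIM, DIM)
FROZEN_W_V = rand_mat(DIM, DIM)

# One small set of tokens per context; in this sketch these would be
# the only trainable parameters.
context_tokens = {
    name: rand_mat(N_CTX, DIM) for name in ("action", "mood", "location")
}

def encode(patches, context):
    """Prepend the chosen context tokens, then run the frozen encoder."""
    tokens = context_tokens[context] + patches
    scores = [dot(FROZEN_QUERY, matvec(FROZEN_W_K, t)) / math.sqrt(DIM)
              for t in tokens]
    weights = softmax(scores)
    values = [matvec(FROZEN_W_V, t) for t in tokens]
    pooled = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(DIM)]
    return normalize(pooled)

def text_alignment_loss(feat, text_feats, label, temperature=0.07):
    """Cross-entropy over image-text similarities (labeled exemplars)."""
    logits = [dot(feat, t) / temperature for t in text_feats]
    return -math.log(softmax(logits)[label])

def clustering_loss(feat_a, feat_b):
    """Placeholder clustering term: pull two views of one image together."""
    return 1.0 - dot(feat_a, feat_b)

def joint_loss(feat, feat_aug, text_feats, label, lam=1.0):
    """Weighted sum of both terms; the weighting is an assumption."""
    return (text_alignment_loss(feat, text_feats, label)
            + lam * clustering_loss(feat, feat_aug))

# Same image, same frozen weights, different context tokens.
patches = rand_mat(N_PATCH, DIM)
f_action = encode(patches, "action")
f_mood = encode(patches, "mood")

# A lightly perturbed copy plays the role of an augmented view.
augmented = [[x + 0.05 * random.gauss(0.0, 1.0) for x in p] for p in patches]
text_feats = [normalize(rand_vec(DIM)) for _ in range(3)]
loss = joint_loss(f_mood, encode(augmented, "mood"), text_feats, label=1)
```

In this toy setup, `f_action` and `f_mood` differ even though the image and the encoder weights are identical, which is the sense in which switching context tokens contextualizes the features. In a real training loop, gradients of `joint_loss` would update only the entries of `context_tokens`.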
