Open Ad-hoc Categorization with Contextualized Feature Learning

Zilin Wang, Sangwoo Mo, Stella X. Yu, Sima Behpour, Liu Ren

TL;DR

The paper addresses open ad-hoc categorization, where context-specific categories must be discovered from few labeled exemplars and abundant unlabeled data. It introduces OAK, a simple yet effective framework that adds learnable context tokens to a frozen CLIP backbone and trains with a joint objective that blends semantic guidance (text alignment) and bottom-up visual clustering (GCD). By switching context tokens, OAK contextualizes features to support accurate classification of both known and novel classes across multiple contexts, with interpretable saliency maps and the ability to name discovered clusters. Empirically, OAK outperforms semantic-only and visual-only baselines on Stanford and Clevr-4 across per-context and Omni accuracy, demonstrating robust context switching and strong performance in novel-category discovery. These results advance adaptive, context-aware perception for open-world and ad-hoc tasks in AI systems.

Abstract

Adaptive categorization of visual scenes is essential for AI agents to handle changing tasks. Unlike fixed common categories for plants or animals, ad-hoc categories are created dynamically to serve specific goals. We study open ad-hoc categorization: Given a few labeled exemplars and abundant unlabeled data, the goal is to discover the underlying context and to expand ad-hoc categories through semantic extension and visual clustering around it. Building on the insight that ad-hoc and common categories rely on similar perceptual mechanisms, we propose OAK, a simple model that introduces a small set of learnable context tokens at the input of a frozen CLIP and optimizes with both CLIP's image-text alignment objective and GCD's visual clustering objective. On Stanford and Clevr-4 datasets, OAK achieves state-of-the-art in accuracy and concept discovery across multiple categorizations, including 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. Moreover, OAK produces interpretable saliency maps, focusing on hands for Action, faces for Mood, and backgrounds for Location, promoting transparency and trust while enabling adaptive and generalizable categorization.

Paper Structure

This paper contains 24 sections, 5 equations, 19 figures, 20 tables.

Figures (19)

  • Figure 1: We study open ad-hoc categorization, such as things to sell at a garage sale, where categories are formed to achieve a specific goal (selling unwanted items). Given the context garage sale and labeled exemplars such as shoes, we need to recognize all items in the scene that can be sold at the garage sale, including novel ones. 1) Supervised models like CLIP focus on closed-world generalization, recognizing other shoes. 2) Novel semantic categories can be discovered by contextual expansion, from shoes to hats. 3) Unsupervised methods like GCD discover novel visual clusters, identifying suitcases. Our work unifies these scenarios by discovering the latent context and expanding categories both semantically and visually around it.
  • Figure 2: Open ad-hoc categorization learns diverse categorization rules, dynamically adapting to varying user needs at hand. The same image could be recognized differently depending on the context, such as drinking for action and residential for location. We emphasize the ability to switch between multiple contexts in OAK. Specifically, given 1) a context defined by classes, 2) a few labeled images, and 3) a set of unlabeled images, OAK holistically reasons over labeled and unlabeled images, spanning both known and novel classes, to infer novel concepts and propagate labels across the entire dataset. We show the class names of labeled images in the color box and unlabeled images inside the parentheses, reflecting that the unlabeled class names are not available, only the images. The OAK setting introduces challenges beyond generalized category discovery (GCD), requiring adaptation to diverse ad-hoc categorization rules based on context.
  • Figure 3: OAK learns contextualized features while preserving the foundations of perception of CLIP by introducing context tokens that modulate the frozen ViT encoder, achieving context-aware attention. This contextualized feature learning follows two key principles: 1) top-down text guidance, which aligns visual clusters with semantic cues and refines clusters accordingly, and 2) bottom-up image clustering, which captures similarity based on visual cues and known class labels. This unified approach effectively combines the individual strengths of CLIP and GCD.
  • Figure 4: OAK attends to context-relevant regions of images, while CLIP and GCD often focus on arbitrary or less informative areas. We present saliency maps on a) the Stanford dataset (action, location, mood) and b) the Clevr-4 dataset (texture, color, shape, count). We visualize saliency maps for CLIP, GCD, and OAK using the method of Chefer et al. (2021), guided by the predicted class, except for CLIP, where an empty string is used. Predictions are color-coded as correct or incorrect. On the Stanford dataset, OAK highlights human behaviors such as hand movements for action, captures the entire scene for location, and emphasizes the human face for mood, aligning well with human intuition. While GCD produces reasonable maps for some action examples, like phoning, it fails in cases like fixing a bike, where it attends to the bike rather than the human action and confuses the image with riding a bike. CLIP, on the other hand, consistently focuses on salient objects like humans but does not adapt its attention to different contexts.
  • Figure 5: OAK can effectively contextualize and switch between diverse contexts. We show t-SNE plots of CLIP and OAK with points colored by the ground-truth class in action: known images as stars, labeled images as red-bordered stars, and novel images as dots. While CLIP shows poor clustering under action, OAK's contextualized features form well-separated clusters. Additionally, images grouped closely in OAK under action become far apart in other contexts, underscoring the context-dependent interpretation and the context-switching ability of OAK.
  • ...and 14 more figures
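
The context-token mechanism described in the abstract and in Figure 3 can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. The toy `frozen_encoder` (a single parameter-free attention read-out) only stands in for the frozen CLIP ViT, and the names `context_bank`, `make_context_tokens`, and `contextual_feature`, the token counts, the dimensions, and the choice to place context tokens right after the [CLS]-like token are all assumptions made for illustration. The sketch shows only the core idea: the encoder stays fixed, each context owns a small set of tokens, and swapping those tokens changes the feature computed for the same image.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frozen_encoder(tokens):
    """Toy stand-in for a frozen ViT (assumption, not CLIP itself):
    the first ([CLS]-like) token attends over all tokens and returns
    the attention-weighted average. It has no trainable parameters."""
    query = tokens[0]
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in tokens
    ]
    weights = softmax(scores)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]


def make_context_tokens(num_tokens, dim, seed):
    """Randomly initialised context tokens (hypothetical initialisation).
    In training, these would be the only parameters receiving gradients."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]


def contextual_feature(image_tokens, context_tokens):
    """Insert context tokens into the input sequence, then run the frozen encoder.
    The insertion position (after the first token) is an assumption."""
    cls_token, patches = image_tokens[0], image_tokens[1:]
    return frozen_encoder([cls_token] + context_tokens + patches)


DIM = 4  # illustrative embedding size

# One small token set per context; switching context means switching this entry.
context_bank = {
    "action": make_context_tokens(2, DIM, seed=0),
    "location": make_context_tokens(2, DIM, seed=1),
    "mood": make_context_tokens(2, DIM, seed=2),
}

# A fake image: one [CLS]-like token followed by four patch tokens.
rng = random.Random(42)
image_tokens = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(5)]

# Same image, same frozen encoder, different context tokens.
features = {
    name: contextual_feature(image_tokens, tokens)
    for name, tokens in context_bank.items()
}
```

In this sketch `features["action"]`, `features["location"]`, and `features["mood"]` are three different vectors for one image, which mirrors the context-switching behaviour the captions describe. The joint training objective (image-text alignment plus GCD-style clustering) is deliberately left out, since its exact form is not given in this section.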