Text2Model: Text-based Model Induction for Zero-shot Image Classification

Ohad Amosy, Tomer Volk, Eilam Shapira, Eyal Ben-David, Roi Reichart, Gal Chechik

TL;DR

Text2Model introduces a hypernetwork-based framework that generates task-specific classifiers at inference from sets of natural language descriptions. By enforcing permutation-equivariance and invariant intermediate representations, the approach produces on-demand discriminators $f(\cdot;W)$ with weights $W=\tau_{\phi}(S^k)$ that adapt to the given task and descriptions. Across images, 3D point clouds, and action sequences, T2M-HN achieves state-of-the-art performance on zero-shot tasks with varying language richness, including negative attributes, while enabling lightweight, on-device models. This work lowers the data requirements for zero-shot learning by leveraging language structure and symmetry-aware architectures, offering practical impact for multimodal and edge applications.

Abstract

We address the challenge of building task-agnostic classifiers using only text descriptions, demonstrating a unified approach to image classification, 3D point cloud classification, and action recognition from scenes. Unlike approaches that learn a fixed representation of the output classes, we generate at inference time a model tailored to a query classification task. To generate task-based zero-shot classifiers, we train a hypernetwork that receives class descriptions and outputs a multi-class model. The hypernetwork is designed to be equivariant with respect to the set of descriptions and the classification layer, thus obeying the symmetries of the problem and improving generalization. Our approach generates non-linear classifiers, handles rich textual descriptions, and may be adapted to produce lightweight models efficient enough for on-device applications. We evaluate this approach in a series of zero-shot classification tasks, for image, point-cloud, and action recognition, using a range of text descriptions, from single words to rich descriptions. Our results demonstrate strong improvements over previous approaches, showing that zero-shot learning can be applied with little training data. Furthermore, we conduct an analysis with foundational vision and language models, demonstrating that they struggle to generalize when describing what attributes the class lacks.

Paper Structure

This paper contains 27 sections, 2 theorems, 7 equations, 8 figures, and 8 tables.

Key Result

Theorem 4.1

Let $f$ be a two-layer neural network $f(x)=W^{last}\sigma(W^{pen} x)$, whose weights are predicted by $\tau$: $[W^{last}, W^{pen}] = \tau(S^k)$. If $\tau(S^k)$ is equivariant to a permutation $\mathcal{P}$ with respect to $W^{last}$, and invariant to $\mathcal{P}$ with respect to $W^{pen}$, then $f(

Figures (8)

  • Figure 1: The text-to-model (T2M) setup. (a) Classification tasks are described in rich language. (b) Traditional zero-shot methods produce static representations, shared for all tasks. (c) T2M generates task-specific representations and classifiers. This allows T2M to extract task-specific discriminative features.
  • Figure 2: The text-to-model learning problem and our architecture. Our model (yellow box) receives a set of class descriptions as input and outputs weights $w$ for a downstream on-demand model (orange). The model has two main blocks: A pretrained text encoder and a hypernetwork that obeys certain invariance and equivariance symmetries. The hypernetwork receives a set of dense descriptors to produce weights for the on-demand model.
  • Figure 3: (a) The T2M-HN architecture for equivariant-invariant hypernetwork. The input is processed by equivariant layers, followed by a prediction head for each layer of the target on-demand classifier $f$. The prediction head for $W_{last}$ is equivariant. Heads for earlier layers of $f$, $w_1, \dots, w_k$, are invariant. (b) An architecture for the equivariant layer. Every input is processed by a fully connected (FC) layer in a Siamese manner (shared weights). Inputs are also summed and processed by a second FC layer, whose output is added back to each output. (c) An architecture for an invariant layer, following a similar structure to (b).
  • Figure 4: Classifying easy and hard pairs of bird species from the CUB dataset. Easy tasks involve binary classification of bird pairs from different taxonomy families. Hard tasks classify bird pairs within the same taxonomy family. Mean accuracy is shown for images from both seen (x-axis) and unseen (y-axis) classes, averaged across all pairs.
  • Figure 5: AUC of seen and unseen classes, in a one-class task that crosses species boundaries: "Animals that have horns". Shown are averages over 53 attributes.
  • ...and 3 more figures
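
The layer described in the Figure 3(b) caption (a shared FC layer applied to every input, plus a second FC layer applied to the summed inputs and added back) can be sketched in a few lines. The following is a minimal illustration in plain Python, not the authors' implementation. The names `equivariant_layer`, `invariant_layer`, `W_each`, and `W_sum` are hypothetical, bias terms and non-linearities are omitted, and the invariant layer is assumed to be a sum-pool over the equivariant outputs, since the caption only says it follows "a similar structure".

```python
import random

def linear(W, x):
    """Apply a weight matrix W (a list of rows) to a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1]; stands in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def equivariant_layer(X, W_each, W_sum):
    """Shared FC on every set element, plus an FC on the summed inputs, added back."""
    pooled = [sum(col) for col in zip(*X)]   # sum over the set elements
    context = linear(W_sum, pooled)          # second FC, applied to the sum
    return [[a + b for a, b in zip(linear(W_each, x), context)] for x in X]

def invariant_layer(X, W_each, W_sum):
    """Assumed form: sum-pool the equivariant outputs into a single vector."""
    Y = equivariant_layer(X, W_each, W_sum)
    return [sum(col) for col in zip(*Y)]

def close(a, b, tol=1e-9):
    """Element-wise comparison with a small floating-point tolerance."""
    return all(abs(p - q) <= tol for p, q in zip(a, b))

rng = random.Random(0)
k, d_in, d_out = 4, 3, 2                     # set size and layer widths (arbitrary)
X = rand_matrix(k, d_in, rng)                # stand-in for k description embeddings
W_each = rand_matrix(d_out, d_in, rng)
W_sum = rand_matrix(d_out, d_in, rng)

perm = [2, 0, 3, 1]                          # reorder the input set
X_perm = [X[i] for i in perm]

Y = equivariant_layer(X, W_each, W_sum)
Y_perm = equivariant_layer(X_perm, W_each, W_sum)
z = invariant_layer(X, W_each, W_sum)
z_perm = invariant_layer(X_perm, W_each, W_sum)

# Equivariance: permuting the inputs permutes the outputs in the same way.
equivariant_ok = all(close(Y_perm[j], Y[perm[j]]) for j in range(k))
# Invariance: permuting the inputs leaves the output unchanged.
invariant_ok = close(z, z_perm)
print(equivariant_ok, invariant_ok)
```

The tolerance in `close` is there because summing the same numbers in a different order can change the floating-point result by a few units in the last place.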

Theorems & Definitions (3)

  • Theorem 4.1
  • Theorem D.1
  • proof
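
The premises of Theorem 4.1 can be exercised numerically. The sketch below is an illustration under stated assumptions, not the paper's T2M-HN: `toy_hypernetwork` is a hypothetical linear stand-in for $\tau$ with random matrices `A`, `B`, and `C` in place of trained parameters, and $\sigma$ is taken to be `tanh`. Its $W^{last}$ head is equivariant and its $W^{pen}$ head is invariant, matching the premises. The code checks one consequence that follows directly from those premises: reordering the class descriptions reorders the output scores of $f$ in the same way.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix W (a list of rows) by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1]; stands in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def toy_hypernetwork(S, A, B, C, hidden, d_x):
    """Hypothetical tau: maps k description embeddings S to [W_last, W_pen]."""
    pooled = [sum(col) for col in zip(*S)]   # permutation-invariant summary of S
    ctx = matvec(B, pooled)
    # Equivariant head: row c of W_last depends on description c plus shared context.
    W_last = [[a + b for a, b in zip(matvec(A, s), ctx)] for s in S]
    # Invariant head: W_pen depends only on the pooled summary.
    flat = matvec(C, pooled)
    W_pen = [flat[r * d_x:(r + 1) * d_x] for r in range(hidden)]
    return W_last, W_pen

def classifier(x, W_last, W_pen):
    """Two-layer network f(x) = W_last * sigma(W_pen * x), with sigma = tanh (assumed)."""
    h = [math.tanh(v) for v in matvec(W_pen, x)]
    return matvec(W_last, h)

rng = random.Random(0)
k, d_s, hidden, d_x = 3, 4, 5, 6             # arbitrary sizes for the illustration
S = rand_matrix(k, d_s, rng)                 # one embedding per class description
A = rand_matrix(hidden, d_s, rng)
B = rand_matrix(hidden, d_s, rng)
C = rand_matrix(hidden * d_x, d_s, rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(d_x)]   # stand-in for an input feature vector

perm = [1, 2, 0]                             # reorder the class descriptions
S_perm = [S[i] for i in perm]

scores = classifier(x, *toy_hypernetwork(S, A, B, C, hidden, d_x))
scores_perm = classifier(x, *toy_hypernetwork(S_perm, A, B, C, hidden, d_x))

# The score at position j after permuting should match the score that
# description perm[j] received before permuting.
outputs_permute = all(
    abs(scores_perm[j] - scores[perm[j]]) <= 1e-9 for j in range(k)
)
print(outputs_permute)
```

Because `W_pen` is built only from the pooled summary, the hidden features do not depend on the order of the descriptions. The ordering enters only through the rows of `W_last`, which is why the class scores move together with their descriptions.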