Text2Model: Text-based Model Induction for Zero-shot Image Classification
Ohad Amosy, Tomer Volk, Eilam Shapira, Eyal Ben-David, Roi Reichart, Gal Chechik
TL;DR
Text2Model introduces a hypernetwork-based framework that generates task-specific classifiers at inference time from sets of natural-language class descriptions. The hypernetwork $\tau_{\phi}$ is permutation-equivariant and uses invariant intermediate representations; given a set of descriptions $S^k$, it produces an on-demand discriminator $f(\cdot;W)$ with weights $W=\tau_{\phi}(S^k)$ adapted to the task at hand. Across images, 3D point clouds, and action sequences, the resulting model, T2M-HN, achieves state-of-the-art performance on zero-shot tasks with varying language richness, including descriptions that use negative attributes, while allowing lightweight, on-device models. By exploiting language structure and symmetry-aware architectures, this work lowers the data requirements for zero-shot learning, with practical impact for multimodal and edge applications.
Abstract
We address the challenge of building task-agnostic classifiers using only text descriptions, demonstrating a unified approach to image classification, 3D point cloud classification, and action recognition from scenes. Unlike approaches that learn a fixed representation of the output classes, we generate at inference time a model tailored to a query classification task. To generate task-based zero-shot classifiers, we train a hypernetwork that receives class descriptions and outputs a multi-class model. The hypernetwork is designed to be equivariant with respect to the set of descriptions and the classification layer, thus obeying the symmetries of the problem and improving generalization. Our approach generates non-linear classifiers, handles rich textual descriptions, and may be adapted to produce lightweight models efficient enough for on-device applications. We evaluate this approach in a series of zero-shot classification tasks for image, point-cloud, and action recognition, using a range of text descriptions, from single words to rich descriptions. Our results demonstrate strong improvements over previous approaches, showing that zero-shot learning can be applied with little training data. Furthermore, we conduct an analysis with foundational vision and language models, demonstrating that they struggle to generalize when a description states which attributes the class lacks.
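To make the equivariance property concrete, the following is a minimal sketch, not the authors' architecture. It assumes a hypothetical single-layer, DeepSets-style hypernetwork (`toy_hypernet`) that maps each description embedding to one row of a linear classifier, using random matrices `A` and `B` in place of trained parameters and random vectors in place of text embeddings. Reordering the descriptions reorders the generated classifier rows in the same way, which is the symmetry the hypernetwork is designed to obey.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def toy_hypernet(descs, A, B):
    """Hypothetical equivariant layer (an assumption, not the paper's model).

    Row i of the output is A @ e_i + B @ mean(e), so each class weight
    depends on its own description plus a permutation-invariant summary.
    """
    k, d = len(descs), len(descs[0])
    mean = [sum(e[j] for e in descs) / k for j in range(d)]
    shared = matvec(B, mean)  # identical for every class
    return [[a + s for a, s in zip(matvec(A, e), shared)] for e in descs]

def classify(W, x):
    """Return the index of the highest-scoring class for feature vector x."""
    scores = matvec(W, x)
    return max(range(len(scores)), key=scores.__getitem__)

def rand_matrix(rows, cols, rng):
    """Random matrix standing in for trained parameters or embeddings."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_text, d_feat, k = 4, 3, 3           # illustrative sizes only
A = rand_matrix(d_feat, d_text, rng)  # per-description map
B = rand_matrix(d_feat, d_text, rng)  # map applied to the set mean
descs = rand_matrix(k, d_text, rng)   # stand-ins for description embeddings
x = [rng.uniform(-1, 1) for _ in range(d_feat)]  # stand-in input feature

W = toy_hypernet(descs, A, B)         # classifier weights, one row per class
perm = [2, 0, 1]
W_perm = toy_hypernet([descs[i] for i in perm], A, B)

# Equivariance: row j of W_perm matches row perm[j] of W (up to rounding).
for j, i in enumerate(perm):
    assert all(math.isclose(a, b, abs_tol=1e-9)
               for a, b in zip(W_perm[j], W[i]))

print("predicted class index:", classify(W, x))
```

The sketch generates only a linear classification layer; the approach described above also produces non-linear classifiers, which this example does not attempt to reproduce.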
