TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection

Hanning Chen; Wenjun Huang; Yang Ni; Sanggeon Yun; Yezi Liu; Fei Wen; Alvaro Velasquez; Hugo Latapie; Mohsen Imani

TL;DR

TaskCLIP tackles task‑oriented object detection by decoupling general object detection from task‑driven selection, leveraging frozen vision–language models and a transformer aligner to bridge visual attributes (adjectives) with object patches. A learnable score function and a select‑by‑grouping mechanism address data imbalance and improve robustness, enabling effective generalization to new tasks via adjective attribute prompts generated by an LLM. Empirical results on COCO‑Tasks show TaskCLIP achieving competitive accuracy with substantially lower training cost than DETR‑based methods, and strong generalization on synonym tasks. The approach offers practical impact by providing a scalable, efficient way to perform task‑oriented detection without extensive end‑to‑end retraining for every new task.

Abstract

Task-oriented object detection aims to find objects suitable for accomplishing specific tasks. As a challenging task, it requires simultaneous visual data processing and reasoning under ambiguous semantics. Recent solutions are mainly all-in-one models. However, the object detection backbones are pre-trained without text supervision. Thus, to incorporate task requirements, their intricate models undergo extensive learning on a highly imbalanced and scarce dataset, resulting in capped performance, laborious training, and poor generalizability. In contrast, we propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection. Particularly for the latter, we resort to the recently successful large Vision-Language Models (VLMs) as our backbone, which provides rich semantic knowledge and a uniform embedding space for images and texts. Nevertheless, the naive application of VLMs leads to sub-optimal quality, due to the misalignment between embeddings of object images and their visual attributes, which are mainly adjective phrases. To this end, we design a transformer-based aligner after the pre-trained VLMs to re-calibrate both embeddings. Finally, we employ a trainable score function to post-process the VLM matching results for object selection. Experimental results demonstrate that our TaskCLIP outperforms the state-of-the-art DETR-based model TOIST by 3.5% and only requires a single NVIDIA RTX 4090 for both training and inference.
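
The second stage described above (matching detected objects against task attributes, then scoring them) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy embeddings stand in for frozen VLM outputs, the aligner is omitted, and the fixed `weights` and `bias` are hypothetical placeholders for the trainable score function. All function names here are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def affinity_matrix(object_embs, attribute_embs):
    """Rows are detected objects, columns are task attribute phrases."""
    return [[cosine(o, a) for a in attribute_embs] for o in object_embs]

def select_objects(object_embs, attribute_embs, weights, bias, threshold=0.5):
    """Score each object from its attribute affinities and keep those above threshold.

    Assumption: a single linear layer plus sigmoid stands in for the
    paper's learned score function; weights and bias would be trained.
    """
    selected = []
    for idx, row in enumerate(affinity_matrix(object_embs, attribute_embs)):
        logit = sum(w * s for w, s in zip(weights, row)) + bias
        prob = 1.0 / (1.0 + math.exp(-logit))
        if prob >= threshold:
            selected.append(idx)
    return selected

# Toy stand-ins for embeddings of three detected object crops (stage one output)
# and two adjective attributes describing what the task needs.
object_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
attribute_embs = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]

# Hypothetical parameters; in a trained system these come from optimization.
chosen = select_objects(object_embs, attribute_embs, weights=[4.0, 4.0], bias=-6.0)
print(chosen)
```

The sketch keeps detection and selection separate, which mirrors the two-stage design: the detector proposes boxes without any task knowledge, and only the small scoring step would need training for the task.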

Paper Structure (20 sections, 9 equations, 4 figures, 6 tables)

This paper contains 20 sections, 9 equations, 4 figures, 6 tables.

Introduction
Related Work
Task-oriented Object Detection
Vision Language Model
Method
Task visual attributes preparation
Text and visual embedding vector generation
Global attention
Vision and text embedding space recalibration
Score function and training
Select-by-grouping mechanism
Experiments
Datasets and Metric
Implementation Details
Comparisons with Previous Works
...and 5 more sections

Figures (4)

Figure 1: (a) Ambiguity of task-oriented object detection. (b) Data scarcity and imbalance: suitable objects only take up a tiny portion of the total training samples. (c) Embedding misalignment when directly applying VLMs for object selection.
Figure 2: (a) CoTDet (Tang et al., 2023) workflow. (b) TaskCLIP (this work) workflow.
Figure 3: TaskCLIP model architecture. (a) The overall framework of the design. (b) The architecture of the aligner module. (c) Detailed design of the score function.
Figure 4: Visualization of TaskCLIP prediction results (dashed blue rectangles) and ground truth (solid red rectangles). (a) Examples with good performance. (b) Examples with unsatisfactory performance.
