Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts

Zhaoyang Zhang, Yantao Shen, Kunyu Shi, Zhaowei Cai, Jun Fang, Siqi Deng, Hao Yang, Davide Modolo, Zhuowen Tu, Stefano Soatto

TL;DR

Musketeer addresses the challenge of jointly training a single vision–language model across heterogeneous tasks without task-specific heads. It introduces the Task Explanation Prompt (TEP), a structured textual prompt that encodes the dataset, input/output formats, and task targets so that a shared encoder–decoder can carry out task-specific processing. Across seven tasks, Musketeer performs competitively with or better than specialist models and outperforms baselines that use simpler prompts; it also demonstrates strong few-shot and zero-shot transfer. The approach reduces architectural complexity while maintaining high performance, at the cost of a modest latency increase from longer prompts, and it paves the way for scaling to stronger backbones and additional tasks.

Abstract

We present a vision-language model whose parameters are jointly trained on all tasks and fully shared among multiple heterogeneous tasks which may interfere with each other, resulting in a single model which we named Musketeer. The integration of knowledge across heterogeneous tasks is enabled by a novel feature called Task Explanation Prompt (TEP). With rich and structured information such as task input/output format, TEP reduces interference among tasks, allowing the model to focus on their shared structure. With a single model, Musketeer achieves results comparable to or better than strong baselines trained on single tasks, almost uniformly across multiple tasks.
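
To make the idea of a structured prompt concrete, here is a minimal sketch of how a TEP might be assembled from subprompts. The four fields follow the components named in the TL;DR (dataset, input format, output format, task target); the class name, field names, separator, and example wording are all assumptions made for illustration and are not the paper's actual prompt text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskExplanationPrompt:
    """Hypothetical container for the subprompts of one task."""

    dataset: str        # where the training data comes from
    input_format: str   # what the model receives
    output_format: str  # what the model must emit
    target: str         # what the output is supposed to mean

    def render(self) -> str:
        """Join the subprompts into one text sequence for the shared model."""
        parts = [
            ("Dataset", self.dataset),
            ("Input format", self.input_format),
            ("Output format", self.output_format),
            ("Target", self.target),
        ]
        return " ".join(f"{name}: {text.strip()}" for name, text in parts)


# Illustrative wording only (assumed, not taken from the paper).
grounding_tep = TaskExplanationPrompt(
    dataset="An image-text dataset with region annotations.",
    input_format="An image and a text description of a region.",
    output_format="Four coordinates: x0, y0, x1, y1.",
    target="The bounding box of the described region.",
)

prompt_text = grounding_tep.render()
print(prompt_text)
```

Because every task is described with the same set of fields, tasks that share an input or output format also share the corresponding part of the prompt, which is one plausible reading of how TEP lets the model exploit shared structure.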

Paper Structure (24 sections, 3 figures, 17 tables)

Figures (3)

  • Figure 1: Example of TEP and baseline prompts for visual grounding. One-hot Prompt: represents the task as a fixed vector. Base Prompt: the standard prompting adopted in prior work (wang2022ofa; lu2022unified).
  • Figure 2: Overview of Musketeer's multi-task pipeline. "TEP-Task X" denotes the Task Explanation Prompt (TEP) for a specific task, e.g., visual grounding. After multi-task fine-tuning, Musketeer can perform a variety of tasks in a sequence-to-sequence manner with a single architecture and fully shared parameters. Each task is specified by a structured Task Explanation Prompt, which provides explicit instructions for carrying out that task.
  • Figure 3: Similarity matrices of TEP subprompts. Each matrix is built by feeding the TEP subprompts into a language model to obtain embeddings and then computing cosine distances between those embeddings. The matrices show how similar the subprompts are across tasks.
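
The construction described for Figure 3 can be sketched as follows. This is a minimal illustration, assuming a toy bag-of-words embedding as a stand-in for the language model used in the paper; the subprompt strings and function names are hypothetical. The code computes cosine similarity, and the corresponding cosine distance is 1 minus that value.

```python
import math
from collections import Counter


def embed(text: str, vocab: list[str]) -> list[float]:
    """Toy embedding (assumption): word counts over a shared vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(subprompts: dict[str, str]) -> list[list[float]]:
    """Pairwise cosine similarity between the subprompts of all tasks."""
    vocab = sorted({w for text in subprompts.values() for w in text.lower().split()})
    vectors = [embed(text, vocab) for text in subprompts.values()]
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical input-format subprompts for three tasks.
input_formats = {
    "visual grounding": "a text string describing a region and an image",
    "captioning": "an image",
    "vqa": "a question and an image",
}

matrix = similarity_matrix(input_formats)
for task, row in zip(input_formats, matrix):
    print(task, [round(value, 2) for value in row])
```

Swapping `embed` for sentence embeddings from an actual language model would leave the rest of the computation unchanged.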