Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts

Zhaoyang Zhang; Yantao Shen; Kunyu Shi; Zhaowei Cai; Jun Fang; Siqi Deng; Hao Yang; Davide Modolo; Zhuowen Tu; Stefano Soatto

TL;DR

Musketeer addresses the challenge of jointly training a single vision–language model across heterogeneous tasks without task-specific heads. It introduces Task Explanation Prompts (TEP), a structured textual prompt that encodes dataset, input/output formats, and task targets to instantiate task-specific processing within a shared encoder–decoder. Across seven tasks, Musketeer achieves competitive or superior performance to specialist models and outperforms baselines that use simpler prompts, with strong few-shot and zero-shot transfer demonstrated. The approach reduces architectural complexity while maintaining high performance, though it incurs a modest latency increase due to longer prompts, and it paves the way for scaling to stronger backbones and additional tasks.

Abstract

We present a vision-language model whose parameters are jointly trained on all tasks and fully shared among multiple heterogeneous tasks which may interfere with each other, resulting in a single model which we named Musketeer. The integration of knowledge across heterogeneous tasks is enabled by a novel feature called Task Explanation Prompt (TEP). With rich and structured information such as task input/output format, TEP reduces interference among tasks, allowing the model to focus on their shared structure. With a single model, Musketeer achieves results comparable to or better than strong baselines trained on single tasks, almost uniformly across multiple tasks.
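
To make the idea of a structured prompt concrete, here is a minimal sketch of how a TEP might be assembled from the components the summary names (dataset, input format, output format, task target). The class name, field names, rendering order, and example strings are all hypothetical illustrations, not the wording or layout used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskExplanationPrompt:
    """Hypothetical container for the TEP components named in the summary."""

    dataset: str        # description of the data source (assumed field)
    input_format: str   # what the model receives (assumed field)
    output_format: str  # what the model should emit (assumed field)
    target: str         # what the task aims to achieve (assumed field)

    def render(self) -> str:
        # Join the subprompts in a fixed order so every task shares one layout.
        parts = [
            ("Dataset", self.dataset),
            ("Input format", self.input_format),
            ("Output format", self.output_format),
            ("Target", self.target),
        ]
        return " ".join(f"{name}: {text}." for name, text in parts)


# Placeholder text for a visual grounding task; not taken from the paper.
grounding = TaskExplanationPrompt(
    dataset="placeholder description of a grounding dataset",
    input_format="an image and a text phrase",
    output_format="bounding box coordinates",
    target="locate the image region the phrase refers to",
)
prompt = grounding.render()
print(prompt)
```

Rendering every task through the same ordered template is one way to expose shared structure: two tasks with the same output format would produce identical text for that subprompt.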

Paper Structure (24 sections, 3 figures, 17 tables)

This paper contains 24 sections, 3 figures, 17 tables.

Introduction
Key Contributions in Relation to Prior Work
Other Related Work in Broader Context
Musketeer
Tasks & Datasets
Task Explanation Prompt
Architecture & Training
Task similarity matrices for TEP and other subprompts
Experiments
Training Dataset Composition
Experimental Setup
Effectiveness of Musketeer
Comparison with state-of-the-art methods
Few-shot finetuning results
Zero-shot finetuning results
...and 9 more sections

Figures (3)

Figure 1: Example of TEP and baseline prompts for visual grounding. One-hot Prompt: represents the task as a fixed vector. Base Prompt: the standard prompting adopted by prior work (wang2022ofa; lu2022unified).
Figure 2: Overview of Musketeer's multi-task pipeline. "TEP-Task X" denotes the Task Explanation Prompt (TEP) for a specific task, e.g., visual grounding. After multi-task fine-tuning, Musketeer can perform a variety of tasks with a single architecture and fully shared parameters in a sequence-to-sequence manner. Each task is specified by a structured Task Explanation Prompt, which provides explicit instructions for carrying out that task.
Figure 3: Similarity matrices of TEP subprompts. They are constructed by feeding each TEP subprompt into a language model and computing cosine distances between the resulting representations. The matrices show how similar the subprompts are across tasks.
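
The construction described for Figure 3 can be sketched as follows. This is an illustration under stated assumptions: the paper obtains subprompt representations from a language model, whereas the `embed` function below is a toy bag-of-words stand-in so the example runs without external dependencies, and the three subprompt strings are invented placeholders. The sketch reports cosine similarity; cosine distance is one minus that value.

```python
import math
from collections import Counter


def embed(text):
    # Toy bag-of-words vector; a stand-in for a language-model embedding.
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    # Dot product over shared tokens divided by the product of the norms.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with an empty prompt as zero
    return dot / (norm_a * norm_b)


def similarity_matrix(subprompts):
    # Pairwise similarities between all subprompts, one row per task.
    vectors = [embed(p) for p in subprompts]
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical output-format subprompts for three tasks (not from the paper).
output_subprompts = [
    "a text caption describing the image",  # captioning
    "a text answer to the question",        # visual question answering
    "bounding box coordinates",             # visual grounding
]
matrix = similarity_matrix(output_subprompts)
for row in matrix:
    print(["%.2f" % value for value in row])
```

In this toy setting the two text-generating tasks score as more similar to each other than either does to grounding, which is the kind of cross-task relationship such a matrix is meant to surface.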
