InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation

Rongyao Fang, Shilin Yan, Zhaoyang Huang, Jingqiu Zhou, Hao Tian, Jifeng Dai, Hongsheng Li

TL;DR

The paper tackles the fragmentation of vision tasks under rigid, template-based prompts by introducing InstructSeq, a unified, instruction-conditioned framework that handles both dense (pixel-level) and textual outputs. It combines a ViT-based visual encoder, a frozen text encoder, and an autoregressive transformer, and it learns to follow flexible directives by training on a large set of natural-language instructions generated by an external LLM. Across semantic segmentation, referring expression segmentation/comprehension, and image captioning, InstructSeq achieves competitive or superior results without task-specific tuning and generalizes to new instructions and open-vocabulary categories. The authors also introduce sampling-based prediction and confidence estimation, which they present as practical aids to robustness and reliability in multi-task vision systems.

Abstract

Empowering models to dynamically accomplish tasks specified through natural language instructions represents a promising path toward more capable and general artificial intelligence. In this work, we introduce InstructSeq, an instruction-conditioned multi-modal modeling framework that unifies diverse vision tasks through flexible natural language control and handling of both visual and textual data. InstructSeq employs a multimodal transformer architecture encompassing visual, language, and sequential modeling. We utilize a visual encoder to extract image features and a text encoder to encode instructions. An autoregressive transformer fuses the representations and generates sequential task outputs. By training with LLM-generated natural language instructions, InstructSeq acquires a strong comprehension of free-form instructions for specifying visual tasks. This provides an intuitive interface for directing capabilities using flexible natural instructions. Without any task-specific tuning, InstructSeq achieves compelling performance on semantic segmentation, referring expression segmentation/comprehension, and image captioning. The flexible control and multi-task unification empower the model with more human-like versatility and generalizability for computer vision. The code will be released soon at https://github.com/rongyaofang/InstructSeq.

Paper Structure (16 sections, 3 figures, 6 tables)

Figures (3)

  • Figure 1: InstructSeq comprises a visual encoder, a frozen instruction encoder, and an autoregressive transformer. The visual encoder extracts image features while the instruction encoder encodes free-form textual instructions. These representations are input to the transformer, which generates discrete token sequences. This architecture allows the model to produce various output types to accomplish diverse vision tasks based on natural language directives.
  • Figure 2: Qualitative results of InstructSeq across all vision tasks.
  • Figure 3: Left: semantic segmentation map generated from the InstructSeq model. Right: confidence map obtained during InstructSeq token sampling. The yellow areas denote low-confidence areas and the purple areas denote high-confidence areas.
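
The confidence map in Figure 3 is described only as being "obtained during InstructSeq token sampling"; the exact definition is not given here. As a minimal sketch, assume the confidence at each output position is the softmax probability of the token that was sampled there. The function name `sample_with_confidence`, the `(positions, vocab)` logits layout, and the grid shape are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def sample_with_confidence(logits, grid_hw, rng):
    """Sample one token per position and keep the probability of that token.

    ASSUMPTION: confidence = probability of the sampled token. The source
    does not state how the paper computes its confidence map.
    """
    h, w = grid_hw
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Draw one token id per position from its categorical distribution.
    tokens = np.array([rng.choice(probs.shape[-1], p=p) for p in probs])
    # Look up the probability assigned to each sampled token.
    confidence = probs[np.arange(len(tokens)), tokens]
    return tokens.reshape(h, w), confidence.reshape(h, w)

# Toy usage: 6 output positions arranged as a 2x3 grid, vocabulary of 5 tokens.
rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(6, 5))
token_map, confidence_map = sample_with_confidence(toy_logits, (2, 3), rng)
print(token_map.shape, confidence_map.shape)
```

Under this assumption, positions where the distribution is spread across several tokens receive low confidence (the yellow areas in the figure), while sharply peaked positions receive high confidence (the purple areas).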