Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, Amjad Almahairi

TL;DR

VistaLLM presents a unified general-purpose vision-language model for coarse-to-fine reasoning and grounding across single- and multi-image inputs. It introduces an instruction-guided image tokenizer and a gradient-aware adaptive sampling scheme to serialize segmentation masks into sequences, enabling end-to-end generation with a Vicuna-based decoder. Trained on CoinIt (6.8 million samples) including AttCoSeg (804k) and evaluated on 15 VL benchmarks, VistaLLM achieves state-of-the-art results and strong generalization, notably across multi-image tasks like CoSeg and NLVR2. The approach reduces task-specific engineering, supports diverse input-output formats, and demonstrates practical impact for broad VL understanding and grounding tasks.

Abstract

The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems, unifying various vision-language (VL) tasks by instruction tuning. However, due to the enormous diversity in input-output formats in the vision domain, existing general-purpose models fail to successfully integrate segmentation and multi-image inputs with coarse-level tasks into a single framework. In this work, we introduce VistaLLM, a powerful visual system that addresses coarse- and fine-grained VL tasks over single and multiple input images using a unified framework. VistaLLM utilizes an instruction-guided image tokenizer that filters global embeddings using task descriptions to extract compressed and refined features from numerous images. Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences, significantly improving over previously used uniform sampling. To bolster the desired capability of VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning dataset with 6.8M samples. We also address the lack of multi-image grounding datasets by introducing a novel task, AttCoSeg (Attribute-level Co-Segmentation), which boosts the model's reasoning and grounding capability over multiple input images. Extensive experiments on a wide range of V- and VL tasks demonstrate the effectiveness of VistaLLM by achieving consistent state-of-the-art performance over strong baselines across all downstream tasks. Our project page can be found at https://shramanpramanick.github.io/VistaLLM/.
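The abstract contrasts gradient-aware adaptive sampling with uniform sampling but does not spell out the procedure. The sketch below illustrates the general idea only and is not the authors' implementation. It assumes the mask boundary is already available as a dense closed contour, and it uses the turning angle at each point as a stand-in for the "gradient" score. The function names (`square_contour`, `uniform_sample`, `adaptive_sample`, `polygon_area`) are hypothetical. Points are kept where the contour bends most, so a small budget of points goes to corners rather than straight runs.

```python
import math


def square_contour(side, start):
    """Dense closed contour of an axis-aligned square, one point per unit step.

    `start` rotates the list so that sampling does not begin on a corner.
    """
    pts = (
        [(x, 0) for x in range(side)]
        + [(side, y) for y in range(side)]
        + [(side - x, side) for x in range(side)]
        + [(0, side - y) for y in range(side)]
    )
    return pts[start:] + pts[:start]


def turning_angle(prev, cur, nxt):
    """Absolute change of direction (radians) along prev -> cur -> nxt."""
    a1 = math.atan2(cur[1] - prev[1], cur[0] - prev[0])
    a2 = math.atan2(nxt[1] - cur[1], nxt[0] - cur[0])
    d = abs(a2 - a1) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def uniform_sample(contour, n):
    """Keep n points at evenly spaced indices along the dense contour."""
    m = len(contour)
    return [contour[(i * m) // n] for i in range(n)]


def adaptive_sample(contour, n):
    """Keep the n points with the largest turning angle, in contour order.

    Assumption: turning angle approximates the gradient-based score.
    """
    m = len(contour)
    scores = [
        turning_angle(contour[i - 1], contour[i], contour[(i + 1) % m])
        for i in range(m)
    ]
    top = sorted(range(m), key=lambda i: -scores[i])[:n]
    return [contour[i] for i in sorted(top)]


def polygon_area(points):
    """Area of the polygon reassembled from sampled points (shoelace formula)."""
    total = 0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


contour = square_contour(10, start=5)              # true mask area is 100
print(polygon_area(uniform_sample(contour, 4)))    # 50.0 (a diamond)
print(polygon_area(adaptive_sample(contour, 4)))   # 100.0 (the four corners)
```

With the same budget of four points, uniform sampling reassembles a diamond that covers half of the square, while the curvature-driven selection recovers the square exactly. This is a toy case of the information loss that the paper's Figure 4(a) measures as the highest achievable mIoU.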

Paper Structure (22 sections, 1 equation, 19 figures, 13 tables, 1 algorithm)

Figures (19)

  • Figure 1: VistaLLM achieves state-of-the-art performance across a broad range of single- and multi-image coarse-to-fine reasoning and grounding tasks (see Table \ref{tab:dataset_details} for details) among general-purpose baselines. Notably, no existing baseline has unified segmentation and multi-image tasks in a single system. We show officially reported numbers for every baseline.
  • Figure 2: Overview of the proposed system, VistaLLM, which integrates single- and multi-image coarse- and fine-grained vision-language tasks into a unified general-purpose framework. VistaLLM contains three key design modules: ($i$) an image encoder that extracts the global image embedding, ($ii$) an instruction-guided image tokenizer, which refines and compresses the global image embeddings using the task instruction, enabling the model to filter the visual information required for the current task, and ($iii$) an LLM (Vicuna)-based decoder that jointly processes image and language features and generates the desired output. VistaLLM uses a gradient-aware adaptive sampling technique to efficiently represent segmentation masks as a point sequence, described in Section \ref{sec:sequence_generation}. All parameters except the image encoder are trained in stage 1, while only the image tokenizer is fine-tuned in stage 2 (see Sections \ref{sec:model_architecture} and \ref{sec:implementation_details} for details).
  • Figure 3: Visualization of uniform and adaptive sampling strategies. (a) illustration of sampled points and comparison of reassembled curves, (b) illustration of sampled points and comparison of reassembled masks.
  • Figure 4: Ablation experiments on the RES task. (a) Comparison of the highest possible mIoU achievable with adaptive and uniform sampling, indicating less information loss with adaptive sampling, (b) Effect of the number of sampled points on the performance of VistaLLM.
  • Figure 5: Examples demonstrating VistaLLM's capability for single- and multi-image reasoning and grounding tasks. More visualizations are shown in the supplementary material. Best viewed zoomed in and in color.
  • ...and 14 more figures