VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

Chris Kelly, Luhui Hu, Bang Yang, Yu Tian, Deshun Yang, Cindy Yang, Zaoshan Huang, Zihao Li, Jiayin Hu, Yuexian Zou

TL;DR

VisionGPT proposes a generalized multimodal framework that uses an LLM as a central orchestrator to translate natural language requests into executable action proposals and coordinate multiple vision foundation models. The approach enables open-world visual perception and vision-language understanding by automating task decomposition, model selection, execution, and post-processing, with in-context and few-shot learning guiding action proposal generation. Experimental sections demonstrate text-conditioned image understanding and generation/editing by integrating YOLO, SAM, and diffusion models, highlighting the system’s adaptability to diverse tasks. While promising, the work notes limitations related to evolving foundation models and the complexity of coordinating multiple experts, pointing to future opportunities for continuous integration and personalization.

Abstract

With the emergence of large language models (LLMs) and vision foundation models, how to combine the intelligence and capacity of these open-source or API-available models to achieve open-world visual perception remains an open question. In this paper, we introduce VisionGPT to consolidate and automate the integration of state-of-the-art foundation models, thereby facilitating vision-language understanding and the development of vision-oriented AI. VisionGPT builds upon a generalized multimodal framework that distinguishes itself through three key features: (1) utilizing LLMs (e.g., LLaMA-2) as the pivot to break down users' requests into detailed action proposals that call suitable foundation models; (2) integrating multi-source outputs from foundation models automatically and generating comprehensive responses for users; (3) adapting to a wide range of applications such as text-conditioned image understanding/generation/editing and visual question answering. This paper outlines the architecture and capabilities of VisionGPT, demonstrating its potential to revolutionize the field of computer vision through enhanced efficiency, versatility, generalization, and performance. Our code and models will be made publicly available.

Keywords: VisionGPT, Open-world visual perception, Vision-language understanding, Large language model, Foundation model
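
The workflow described in the abstract (an LLM turns a request into action proposals, suitable models are selected and executed, and their outputs are integrated) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Action` structure, the keyword-based `propose_actions` planner standing in for the LLM, and the stub models in `REGISTRY` are hypothetical and are not taken from the VisionGPT code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Action:
    """One step of a hypothetical action proposal: a model name plus its arguments."""
    model: str
    args: Dict[str, Any]


def propose_actions(request: str) -> List[Action]:
    """Stand-in for the LLM planner. A real system would prompt an LLM with
    few-shot examples; simple keyword rules keep this sketch runnable."""
    text = request.lower()
    actions: List[Action] = []
    if "detect" in text or "segment" in text:
        actions.append(Action("detector", {"query": request}))
    if "segment" in text:
        actions.append(Action("segmenter", {}))
    return actions


def fake_detector(state: Dict[str, Any], query: str) -> Dict[str, Any]:
    """Stub detector: always reports one fixed bounding box."""
    state["boxes"] = [(10, 20, 110, 220)]
    return state


def fake_segmenter(state: Dict[str, Any]) -> Dict[str, Any]:
    """Stub segmenter: produces one placeholder mask per detected box."""
    state["masks"] = [f"mask-for-{box}" for box in state.get("boxes", [])]
    return state


# Hypothetical registry mapping model names to callables (stubs, not real models).
REGISTRY: Dict[str, Callable[..., Dict[str, Any]]] = {
    "detector": fake_detector,
    "segmenter": fake_segmenter,
}


def run(request: str) -> Dict[str, Any]:
    state: Dict[str, Any] = {"request": request}
    for action in propose_actions(request):   # request understanding
        model = REGISTRY[action.model]        # model selection
        state = model(state, **action.args)   # execution
    # Post-processing and integration: merge model outputs into one response.
    boxes = len(state.get("boxes", []))
    masks = len(state.get("masks", []))
    state["summary"] = f"{boxes} box(es), {masks} mask(s)"
    return state


print(run("Segment the person")["summary"])
```

Keeping the planner and the model registry separate means a new model can be added by registering a callable, without touching the execution loop; this mirrors the claim that the framework can accommodate new foundation models, though the actual mechanism in VisionGPT may differ.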

Paper Structure (13 sections, 2 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: An example of how YOLO and SAM collaborate to achieve instance segmentation. Given image (a) and the objects of interest "fork" and "person", YOLO is first applied for detection; SAM then segments only the detected regions of interest, which improves efficiency.
  • Figure 2: Overview of VisionGPT. It leverages LLMs (e.g., Llama 2) as the pivot to understand user intentions and state-of-the-art foundation models as executors to produce desired output. Building upon such a generalized multimodal framework, VisionGPT can accommodate any up-to-date foundation models and facilitate collaboration between models to address complex and challenging tasks.
  • Figure 3: An example of the workflow of VisionGPT, which can be broken down into four key steps: 1) request understanding via LLMs, 2) foundation model selection, 3) execution, and 4) post-processing and integration.
  • Figure 4: Text-conditioned image understanding results. For each subfigure, the caption and the left image are user inputs, whereas the right image is the final response of VisionGPT. VisionGPT follows user requirements to detect and segment the correct objects, even in cases (f), (g), and (h), where the user does not specify an object name.
  • Figure 5: Text-conditioned image generation/editing results. For each subfigure, the caption and the left image are user inputs, whereas the right image is the final response of VisionGPT. VisionGPT generates image outputs aligned with users' custom requests.
  • ...and 2 more figures
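
The detect-then-segment collaboration described in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: `detect` and `segment_in_box` are hypothetical stubs standing in for a YOLO-style detector and a box-prompted SAM-style segmenter, and the image is a small grid of integers rather than real pixels.

```python
from typing import Dict, List, Set, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive
Image = List[List[int]]
Mask = List[List[bool]]


def detect(image: Image) -> List[Dict]:
    """Hypothetical detector stub: returns fixed labeled boxes,
    as a YOLO-style model would after running on the image."""
    return [
        {"label": "person", "box": (0, 0, 2, 2)},
        {"label": "fork", "box": (2, 2, 4, 4)},
        {"label": "table", "box": (0, 2, 2, 4)},
    ]


def segment_in_box(image: Image, box: Box) -> Mask:
    """Hypothetical segmenter stub: marks nonzero pixels inside the box as
    foreground. Pixels outside the box are never visited."""
    x0, y0, x1, y1 = box
    mask = [[False] * len(row) for row in image]
    for y in range(y0, y1):
        for x in range(x0, x1):
            mask[y][x] = image[y][x] > 0
    return mask


def segment_objects(image: Image, wanted: Set[str]) -> List[Tuple[str, Mask]]:
    """Run detection once, then segment only the boxes whose label was requested."""
    results = []
    for det in detect(image):
        if det["label"] in wanted:
            results.append((det["label"], segment_in_box(image, det["box"])))
    return results


image = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
]
for label, mask in segment_objects(image, {"fork", "person"}):
    print(label, sum(cell for row in mask for cell in row))
```

The efficiency gain mentioned in the caption corresponds here to two filters: boxes with unrequested labels ("table") are skipped entirely, and the segmenter only examines pixels inside each requested box.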