Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild

Donggyun Kim, Seongwoong Cho, Semin Kim, Chong Luo, Seunghoon Hong

TL;DR

Chameleon addresses data-efficient generalization in dense visual prediction, where a model must adapt to unseen tasks whose output structures vary. It builds on Visual Token Matching, adding a flexible multi-modal encoder and a hierarchical, task-adaptive matching mechanism, together with large-scale, diverse meta-training and model scaling. It performs strongly on six real-world downstream tasks spanning video, 3D, medical, biological, and user-interactive scenarios using only dozens of labeled examples, outperforming prior data-efficient generalists and approaching specialist baselines in several cases. The work makes dense-vision generalists more practical by enabling flexible adaptation to new label structures and input modalities with minimal task-specific supervision.

Abstract

Large language models have evolved into data-efficient generalists, benefiting from the universal language interface and large-scale pre-training. However, constructing a data-efficient generalist for dense visual prediction presents a distinct challenge due to the variation in label structures across different tasks. Consequently, generalization to unseen dense prediction tasks in the low-data regime is not straightforward and has received less attention from previous vision generalists. In this study, we explore a universal model that can flexibly adapt to unseen dense label structures with a few examples, enabling it to serve as a data-efficient vision generalist in diverse real-world scenarios. To this end, we base our method on a powerful meta-learning framework and explore several axes to improve its performance and versatility for real-world problems, such as flexible adaptation mechanisms and scalability. We evaluate our model across a spectrum of unseen real-world scenarios where low-shot learning is desirable, including video, 3D, medical, biological, and user-interactive tasks. Equipped with a generic architecture and an effective adaptation mechanism, our model flexibly adapts to all of these tasks with at most 50 labeled images, showcasing a significant advancement over existing data-efficient generalist approaches. Code is available at https://github.com/GitGyun/chameleon.

Paper Structure

This paper contains 65 sections, 5 equations, 24 figures, 4 tables.

Figures (24)

  • Figure 1: Chameleon is a data-efficient generalist that can adapt to various unseen dense visual prediction tasks in the wild with arbitrary output structures using a few dozen examples. It can also learn to utilize multi-modal inputs and user interactions.
  • Figure 2: Existing generalist models struggle to learn out-of-distribution tasks whose label semantics (6D pose) or label structure (animal keypoint) were unseen during training. ICL and PT denote that in-context learning and prompt tuning, respectively, are used for adaptation.
  • Figure 3: Encoding mechanism of the image encoder to handle multiple input images.
  • Figure 4: Task-adaptive feature re-weighting mechanism with a hierarchical architecture. The figure highlights the matching module at the third level of the hierarchy ($l=3$).
  • Figure 5: Summary of our meta-training dataset. Left: image domains (outer circle) and source datasets (inner circle). Sizes correspond to the dataset size. Right: task categories (inner circle) and specific tasks (outer circle). Sizes correspond to the sampling ratio.
  • ...and 19 more figures
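
The matching module named in the Figure 4 caption builds on token matching. As a rough illustration of that general idea only (not the paper's implementation), the sketch below predicts each query token's label embedding as a similarity-weighted average of support label embeddings. Everything in it is an assumption made for illustration: the function names (`match_tokens`, `softmax`, `dot`), the `temperature` parameter, the hand-written toy features, and the single-level, single-head formulation. The learned encoders, the hierarchy, and the task-adaptive re-weighting described in the captions are not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def match_tokens(query_feats, support_feats, support_label_feats, temperature=1.0):
    """Hypothetical single-level token matching.

    For every query image token, compute its similarity to every support
    image token, normalise the similarities with a softmax, and return the
    weighted average of the corresponding support label tokens.
    """
    label_dim = len(support_label_feats[0])
    predictions = []
    for q in query_feats:
        scores = [dot(q, s) / temperature for s in support_feats]
        weights = softmax(scores)
        pred = [
            sum(w * label[d] for w, label in zip(weights, support_label_feats))
            for d in range(label_dim)
        ]
        predictions.append(pred)
    return predictions


# Toy data (made up): two support tokens with opposite labels.
support_feats = [[1.0, 0.0], [0.0, 1.0]]
support_label_feats = [[1.0], [0.0]]
# Each query token resembles exactly one support token.
query_feats = [[5.0, 0.0], [0.0, 5.0]]

pred = match_tokens(query_feats, support_feats, support_label_feats)
print(pred)
```

In this toy setup the first query token inherits a label close to 1 and the second a label close to 0, because each one is most similar to a different support token. In a real system the features would come from learned image and label encoders, and the predicted label embeddings would be decoded back into a dense output map.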