MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models

Ming-Chang Chiu, Shicheng Wen, Pin-Yu Chen, Xuezhe Ma

TL;DR

MegaCoin addresses a gap in vision-language research by providing 220k real images, each annotated with medium-grained foreground color, background color, and physical environment, for a total of 660k labels. It supports both multimodal instruction tuning (MegaCoin-Instruct) and evaluation (MegaCoin-Bench), and enables domain-generalization studies with a tiered QA design (Tiered-MQA). Empirical results show that fine-tuning with MegaCoin-Instruct can enable open-source VLMs such as LLaVA-1.5 and Bunny-1.1 to outperform GPT-4o on several tasks, while also revealing persistent weaknesses in color perception and environmental reasoning. The work demonstrates MegaCoin's utility for advancing VLM alignment and for benchmarking domain-generalization algorithms, with implications for improving robustness across real-world visual contexts.

Abstract

In vision-language models (VLMs), the ability to perceive and interpret color and physical environment is crucial for achieving contextually accurate understanding and interaction. However, despite advances in multimodal modeling, there remains a significant lack of specialized datasets that rigorously evaluate a model's capacity to discern subtle color variations and spatial context, both of which are critical for situational comprehension and reliable deployment across real-world applications. Toward that goal, we curate MegaCOIN, a high-quality, human-labeled dataset based on real images with various contextual attributes. MegaCOIN consists of two parts: MegaCOIN-Instruct, which serves as a supervised fine-tuning (SFT) dataset for VLMs; and MegaCOIN-Bench, an annotated test set that can be used as a stand-alone QA dataset. MegaCOIN provides three annotated features for 220,000 real images: foreground color, background color, and a description of an object's physical environment, constituting 660k human annotations. In addition, MegaCOIN can be applied to benchmark domain generalization (DG) algorithms. We explore benchmarking DG methods in the linear probing setup for VLMs and report new insights. Finally, we show that VLMs, including GPT-4o, have subpar color recognition capabilities, and that fine-tuning with MegaCOIN can improve performance on visual evaluation tasks. In certain cases, small-scale open-source models such as LLaVA and Bunny, once fine-tuned with MegaCOIN, can outperform the closed-source GPT-4o. We hope the utilities of MegaCOIN can shed light on the directions in which VLMs can improve and provide a more complex platform for domain generalization algorithms.

Paper Structure

This paper contains 37 sections, 2 figures, 13 tables.

Figures (2)

  • Figure 1: Overview of MegaCoin. (a) Examples of our human-annotated MegaCoin, which covers three distinct attributes: foreground color, background color, and physical environment. (b & c) We use MegaCoin both as instruction fine-tuning data (MegaCoin-Instruct) and as a benchmark (MegaCoin-Bench). (b) Examples of the 3-tier MegaCoin-Bench evaluation for a single image. (c) Example of MegaCoin-Instruct SFT pairs for a single image.
  • Figure 2: Failure cases on MegaCoin-Bench before and after training with MegaCoin-Instruct. After fine-tuning with MegaCoin-Instruct, the VLMs recognize the correct colors and environments.