Assessing Graphical Perception of Image Embedding Models using Channel Effectiveness

Soohyun Lee, Minsuk Chang, Seokhyeon Park, Jinwook Seo

TL;DR

The paper tackles the problem of evaluating how vision systems perceive charts, arguing that existing benchmarks miss underlying perceptual mechanisms. It introduces a channel-effectiveness framework that separately analyzes channel accuracy via embedding linearity and discriminability via embedding distances, using CLIP as a testbed across six channels (length, tilt, area, luminance, saturation, curvature). Key findings show CLIP's channel accuracy order often diverges from human perception and that certain channels exhibit distinct discriminability patterns, with Weber's law-like behavior observed in some channels. The work proposes a foundational benchmark for reliable visual encoders in chart understanding and outlines directions to extend low-level perceptual evaluations to improve chart QA and captioning tasks.

Abstract

Recent advancements in vision models have greatly improved their ability to handle complex chart understanding tasks, like chart captioning and question answering. However, it remains challenging to assess how these models process charts. Existing benchmarks only roughly evaluate model performance without evaluating the underlying mechanisms, such as how models extract image embeddings. This limits our understanding of the model's ability to perceive fundamental graphical components. To address this, we introduce a novel evaluation framework to assess the graphical perception of image embedding models. For chart comprehension, we examine two main aspects of channel effectiveness: accuracy and discriminability of various visual channels. Channel accuracy is assessed through the linearity of embeddings, measuring how well the perceived magnitude aligns with the size of the stimulus. Discriminability is evaluated based on the distances between embeddings, indicating their distinctness. Our experiments with the CLIP model show that it perceives channel accuracy differently from humans and shows unique discriminability in channels like length, tilt, and curvature. We aim to develop this work into a broader benchmark for reliable visual encoders, enhancing models for precise chart comprehension and human-like perception in future applications.
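The idea of scoring channel accuracy by the linearity of embeddings can be illustrated with a small sketch. The abstract does not specify the exact metric, so the code below uses one plausible proxy as an assumption: it projects the embeddings onto their first principal axis and reports the squared Pearson correlation between that projection and the stimulus magnitude. The function name `linearity_score` and the synthetic embeddings are hypothetical; in practice the rows would come from an image encoder such as CLIP.

```python
import numpy as np


def linearity_score(embeddings, magnitudes):
    """Illustrative linearity proxy (an assumption, not the paper's stated metric).

    Projects the embeddings onto their first principal axis and returns the
    squared Pearson correlation between that projection and the stimulus
    magnitudes. A value near 1 means the embeddings move along a straight
    line at a rate proportional to the stimulus.
    """
    X = np.asarray(embeddings, dtype=float)
    m = np.asarray(magnitudes, dtype=float)

    # Center the embeddings so the SVD yields principal axes.
    Xc = X - X.mean(axis=0)

    # The first right singular vector is the first principal axis.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    projection = Xc @ vt[0]

    # Squaring removes the arbitrary sign of the principal axis.
    r = np.corrcoef(projection, m)[0, 1]
    return float(r ** 2)


# Synthetic stand-ins for encoder outputs at 11 stimulus levels (0% to 100%).
levels = np.linspace(0.0, 1.0, 11)

# Embeddings that move along one direction in proportion to the stimulus.
linear_embeddings = np.outer(levels, [3.0, 4.0]) + np.array([1.0, 2.0])

# Embeddings that respond quadratically to the stimulus.
curved_embeddings = np.column_stack([levels ** 2, np.zeros_like(levels)])

linear_result = linearity_score(linear_embeddings, levels)
curved_result = linearity_score(curved_embeddings, levels)
```

Under this proxy, the proportional embeddings score essentially 1, while the quadratic response scores lower because equal stimulus steps no longer produce equal embedding steps.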

Paper Structure (19 sections, 4 figures)

Figures (4)

  • Figure 1: Examples of our variations for each channel. A Length or Area of 100% means the line or square fills the screen. Tilt ranges from 0$^\circ$ to 90$^\circ$. Curvature ranges from a straight line to a semicircular arc. Color hue is fixed at 0 (red) while Luminance and Saturation increase from 0% to 100%.
  • Figure 2: Linearity of various visual channels across different CLIP models. The Y-axis lists the channels, ordered by how accurately humans perceive them. Each channel's linearity varies between models and does not closely align with human perceptual accuracy. The ViT-L/14@336px model usually shows better accuracy than the other models.
  • Figure 3: Box plot of the linearity scores for each channel under every combination of controlled variables, showing general patterns and deviations. Since area cannot be applied together with length or curvature, we generated the combinations without area. According to the plot, all models rank the channels similarly overall (Color saturation $>$ Curvature $>$ Length $>$ Tilt $=$ Color luminance). The whiskers of the box plot mark the minimum and maximum of the full data.
  • Figure 4: The smoothed plot of the Euclidean distances between image embeddings for incremental changes in each visual channel. Sample images below the chart are illustrations of stimuli variations in each channel. Peaks represent thresholds where the model perceives significant differences between images, indicating the discriminability of each channel. This visualization aids in identifying how many perceptual groups the model can distinguish in each channel.
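The distance analysis described for Figure 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the moving-average smoother, the simple local-maximum rule, and the helper names `step_distances`, `moving_average`, and `peak_indices` are all hypothetical choices, and the embeddings are synthetic stand-ins for encoder outputs.

```python
import numpy as np


def step_distances(embeddings):
    """Euclidean distance between each pair of consecutive embeddings."""
    X = np.asarray(embeddings, dtype=float)
    return np.linalg.norm(np.diff(X, axis=0), axis=1)


def moving_average(values, window=3):
    """Smooth a 1-D sequence with a centered moving average (assumed smoother)."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="same")


def peak_indices(values):
    """Indices that rise above the left neighbor and do not fall below the right one."""
    v = np.asarray(values, dtype=float)
    return [
        i for i in range(1, len(v) - 1)
        if v[i] > v[i - 1] and v[i] >= v[i + 1]
    ]


# Synthetic embeddings for 10 stimulus levels: small steps within two groups,
# with one large jump between the fifth and sixth stimulus.
positions = [0, 1, 2, 3, 4, 54, 55, 56, 57, 58]
embeddings = np.array([[p, 0.0] for p in positions])

distances = step_distances(embeddings)   # 9 distances for 10 stimuli
smoothed = moving_average(distances)     # same length as distances
peaks = peak_indices(distances)          # location of the large jump
group_count = len(peaks) + 1             # each peak separates two groups
```

In this toy setup the single peak sits at the jump between the two clusters, so the sketch reports two perceptual groups. On real embeddings, the smoothing window and the peak rule would need tuning, since noise can create spurious local maxima.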