SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking

Sifan Li, Yujun Cai, Yiwei Wang

TL;DR

Leading vision-language models fail to detect visually hidden content that requires perceptual adjustments, as shown on HC-Bench with 112 synthetic scenes. SemVink remedies this by downsampling images to 32–128 pixels, which suppresses redundant high-frequency cues and yields >99% accuracy, exposing a fundamental bias toward high-level semantics. The work advocates integrating multi-scale, low-level visual processing into multimodal architectures to improve robustness in real-world tasks like medical imaging and security. It also analyzes embedding redundancy to explain why high-resolution representations hinder hidden-content detection and discusses limitations due to synthetic data and preprocessing costs.

Abstract

Vision-language models (VLMs) excel in semantic tasks but falter at a core human capability: detecting hidden content in optical illusions or AI-generated images through perceptual adjustments like zooming. We introduce HC-Bench, a benchmark of 112 images with hidden text, objects, and illusions, revealing that leading VLMs achieve near-zero accuracy (0-5.36%), even with explicit prompting. Humans resolve such ambiguities instinctively, yet VLMs fail due to an overreliance on high-level semantics. We propose SemVink (Semantic Visual Thinking), which simply scales images to low resolutions (32-128 pixels); strikingly, this unlocks >99% accuracy by eliminating redundant visual noise. This exposes a critical architectural flaw: VLMs prioritize abstract reasoning over low-level visual operations crucial for real-world robustness. Our work urges a shift toward hybrid models integrating multi-scale processing, bridging the gap between computational vision and human cognition for applications in medical imaging, security, and beyond.
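The downscaling step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which resampling filter is used or whether "32-128 pixels" refers to the longer side, so both are assumptions here: the sketch uses plain area averaging (a box filter) and scales the longer side to `target`. The function names `downsample` and `make_demo`, and the toy image itself, are hypothetical and not taken from the paper.

```python
def downsample(image, target=64):
    """Area-average (box filter) downsample of a grayscale image.

    `image` is a list of rows of numbers. The longer side is reduced to
    `target` pixels (assumption: the paper's 32-128 px refers to that side).
    Images that are already small enough are returned as a copy.
    """
    src_h, src_w = len(image), len(image[0])
    scale = target / max(src_h, src_w)
    if scale >= 1:
        return [row[:] for row in image]
    dst_h = max(1, round(src_h * scale))
    dst_w = max(1, round(src_w * scale))
    out = []
    for y in range(dst_h):
        # Source rows covered by output row y (at least one row).
        y0 = y * src_h // dst_h
        y1 = max(y0 + 1, (y + 1) * src_h // dst_h)
        row = []
        for x in range(dst_w):
            # Source columns covered by output column x (at least one column).
            x0 = x * src_w // dst_w
            x1 = max(x0 + 1, (x + 1) * src_w // dst_w)
            block = [image[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def make_demo(size=256):
    """Hypothetical toy 'illusion': a fine checkerboard texture everywhere,
    plus a faint brightness offset on the right half standing in for the
    hidden content."""
    return [
        [200 * ((x + y) % 2) + (55 if x >= size // 2 else 0) for x in range(size)]
        for y in range(size)
    ]
```

Calling `downsample(make_demo(), 32)` averages away the checkerboard texture, so the left half becomes a flat 100.0 and the right half a flat 155.0. The low-frequency "hidden" region that was buried under high-frequency detail is then the only structure left, which is the intuition behind scaling to low resolution. In practice an image library's resize routine would replace this hand-written loop.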

Paper Structure

This paper contains 32 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Illusion images can contain hidden text or hidden images within an otherwise ordinary background scene.
  • Figure 2: o4-mini, one of the strongest state-of-the-art VLMs, fails to recognize the hidden text and objects within images even when prompted directly with the correct answers. The hidden items in these images are "MARS", the Colosseum, "YES", a cat, and "NEW YORK", respectively.
  • Figure 3: Two methods that help humans recognize the hidden content (a Labrador retriever) within the image: zooming out to view the image as if from a distance, and squinting to reduce the perceived brightness so that the hidden content stands out.
  • Figure 4: Visualization of the embeddings of the input prompts with the image. In the left condition (6 consecutive image tokens, shown as the consecutive yellow region in the heatmap) and the center condition (10 consecutive image tokens), VLMs can recognize the hidden content. In the right condition (666 consecutive image tokens), VLMs cannot find it. This suggests that redundant, repeated image information is the key obstacle to finding the hidden content.
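The redundancy described in the Figure 4 caption can be made concrete with a small sketch. The caption does not say how the repeated image information was measured, so this is only one hypothetical way to quantify it: count the longest run of consecutive token embeddings that are nearly identical to their predecessor under cosine similarity. The function names and the 0.99 threshold are assumptions for illustration, not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def longest_redundant_run(embeddings, threshold=0.99):
    """Length of the longest run of consecutive tokens in which each token's
    embedding has cosine similarity >= threshold with the previous token's.

    `threshold` is a hypothetical cutoff chosen for illustration.
    """
    if not embeddings:
        return 0
    best = run = 1
    for prev, cur in zip(embeddings, embeddings[1:]):
        # Extend the run if this token nearly repeats the previous one.
        run = run + 1 if cosine(prev, cur) >= threshold else 1
        best = max(best, run)
    return best
```

For a toy sequence such as `[[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]`, the function returns 3, the length of the repeated middle block. Under this reading, a high-resolution input would produce long runs of near-duplicate image tokens, while a downscaled input would produce short ones.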