
Less Detail, Better Answers: Degradation-Driven Prompting for VQA

Haoxuan Han, Weijie Wang, Zeyu Zhang, Yefei He, Bohan Zhuang

Abstract

Recent advances in Vision-Language Models (VLMs) have significantly pushed the boundaries of Visual Question Answering (VQA). However, high-resolution details can sometimes become noise that leads to hallucinations or reasoning errors. In this paper, we propose Degradation-Driven Prompting (DDP), a novel framework that improves VQA performance by strategically reducing image fidelity to force models to focus on essential structural information. We evaluate DDP across two distinct tasks. Physical attributes targets images prone to human misjudgment; here, DDP employs a combination of 80p downsampling, structural visual aids (white background masks and orthometric lines), and In-Context Learning (ICL) to calibrate the model's focus. Perceptual phenomena addresses machine-susceptible visual anomalies and illusions, including Visual Anomaly (VA), Color (CI), Motion (MI), Gestalt (GI), Geometric (GSI), and Visual Illusions (VI). For this task, DDP integrates a task-classification stage with specialized tools such as blur masks and contrast enhancement alongside downsampling. Our experimental results demonstrate that less is more: by intentionally degrading visual inputs and providing targeted structural prompts, DDP enables VLMs to bypass distracting textures and achieve superior reasoning accuracy on challenging visual benchmarks.
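To make the degradation step concrete, the following is a minimal sketch of the downsampling described in the abstract, in Python with Pillow. The function name, interpolation choice, and aspect-ratio handling are our own illustrative assumptions; the white background masks and orthometric-line overlays are omitted here.

```python
from PIL import Image


def degrade_to_80p(img: Image.Image, target_long_side: int = 80) -> Image.Image:
    """Hypothetical helper: shrink an image so its longer side is ~80 px.

    Discarding fine texture at low fidelity forces attention onto
    coarse structure, per the degradation strategy in the abstract.
    """
    scale = target_long_side / max(img.size)
    new_size = (max(1, round(img.width * scale)), max(1, round(img.height * scale)))
    # Low-fidelity resize: drops distracting detail, keeps global shape.
    return img.resize(new_size, Image.BILINEAR)


# Example: a 500x400 input becomes 80x64, as in Figure 1.
```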



Figures (4)

  • Figure 1: Less is More in Visual Perception. Comparison between a standard high-resolution pipeline and our proposed DDP strategy. While high-resolution inputs ($500 \times 400$p) can paradoxically lead to misinterpretation (e.g., identifying "18" instead of "73"), our DDP leverages lower resolution ($80 \times 64$p) to eliminate background noise. This approach achieves a 50% reduction in response time and a 50% improvement in accuracy for basic physical attribute tasks by focusing the model on essential structural information.
  • Figure 2: Overcoming visual reasoning bottlenecks via the DDP framework. Standard VLMs are easily deceived by optical illusions or occlusions (e.g., a dog seemingly split by a tree). Our DDP approach introduces a "divide-and-conquer" strategy: the Classifier categorizes the image type, the Tool Manager invokes specialized visual tools (e.g., draw_rectangle and crop) to highlight suspicious regions, and the Critic synthesizes these visual cues. The Critic detects that the front and back portions mismatch the trunk width, thereby correcting the initial misconception and arriving at the correct answer (Option C: 2).
  • Figure 3: Overview of the DDP-based VLM enhancement framework. The workflow consists of three primary stages: (1) Classifier: the input image and multiple-choice question are categorized into preset domains (e.g., motion illusion, colorblindness, or color attributes) to guide subsequent tool selection. (2) Tool Manager: after an initial noise-removal downsampling, a DDP agent performs iterative function calls to select specialized visual tools from the tool library, such as Polar/Cartesian auxiliary lines, crops, and masks, to highlight key features. (3) Target Prompting: despite the extreme downsampling (structural bottleneck $<$ 80p), the processed image is fed into the Critic module. By leveraging task-specific prompts and auxiliary tips based on the ClassID, the system generates a structured Chain-of-Thought (CoT) reasoning process to produce the final option. (A minimal pipeline sketch follows this figure list.)
  • Figure 4: A case study demonstrating how DDP leverages external tools to solve visual perception bottlenecks. The pipeline iteratively classifies the task, invokes specific image-processing tools (blurring and contrast enhancement), and utilizes the resulting "degraded" yet cleaner visual features to perform robust reasoning on perception-intensive tasks.
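To make the three-stage workflow of Figures 3 and 4 concrete, here is a minimal sketch, assuming Pillow for the image tools. The `classify`, `select_tools`, and `critic` callables stand in for the underlying VLM calls, and the tool registry, names, and signatures are illustrative assumptions rather than the paper's actual API.

```python
from PIL import Image, ImageFilter, ImageEnhance

# Illustrative tool registry mirroring the tools named in Figures 3-4.
TOOLS = {
    "blur": lambda im: im.filter(ImageFilter.GaussianBlur(radius=2)),
    "contrast": lambda im: ImageEnhance.Contrast(im).enhance(2.0),
}


def ddp_pipeline(img, question, classify, select_tools, critic):
    """Sketch of the Classifier -> Tool Manager -> Critic workflow."""
    # Stage 1 (Classifier): map the image/question to a preset domain.
    class_id = classify(img, question)

    # Stage 2 (Tool Manager): initial noise-removal downsampling to the
    # structural bottleneck (< 80p), then iterative tool invocation.
    scale = 80 / max(img.size)
    img = img.resize(
        (max(1, round(img.width * scale)), max(1, round(img.height * scale))),
        Image.BILINEAR,
    )
    for name in select_tools(class_id, img):
        img = TOOLS[name](img)

    # Stage 3 (Critic): a task-specific prompt keyed on the ClassID drives
    # structured Chain-of-Thought reasoning to the final option.
    return critic(img, question, class_id)
```

In practice, each stand-in callable would wrap a VLM call with the corresponding system prompt; the sketch's point is the control flow between stages, not the exact tool set.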