IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth

Md Touhidul Islam, Imran Kabir, Md Alimoor Reza, Syed Masum Billah

TL;DR

This paper tackles the challenge of evaluating open-vocabulary vision-language models in video object recognition where ground truth is unavailable. It introduces IKIWISI, a cognitive-audit interface that converts model outputs into a binary heat map of presence/absence, enabling users to identify reliability patterns without exhaustive inspection. A spy-object mechanism and a structured user study with 15 participants demonstrate that visual patterns correlate with objective $F_1$-based metrics when available, and that non-expert users can perform reliable assessments efficiently. The results suggest that human-patterned evaluation can complement automated metrics, enhance transparency, and democratize model assessment for real-world, context-specific deployments.

Abstract

We present IKIWISI ("I Know It When I See It"), an interactive visual pattern generator for assessing vision-language models in video object recognition when ground truth is unavailable. IKIWISI transforms model outputs into a binary heatmap where green cells indicate object presence and red cells indicate object absence. This visualization leverages humans' innate pattern recognition abilities to evaluate model reliability. IKIWISI introduces "spy objects," adversarial instances that users know are absent, to reveal when models hallucinate nonexistent items. The tool functions as a cognitive audit mechanism, surfacing mismatches between human and machine perception by visualizing where models diverge from human understanding. Our study with 15 participants found that users considered IKIWISI easy to use, made assessments that correlated with objective metrics when available, and reached informed conclusions by examining only a small fraction of heatmap cells. This approach not only complements traditional evaluation methods through visual assessment of model behavior with custom object sets, but also reveals opportunities for improving alignment between human perception and machine understanding in vision-language systems.
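
The heatmap and spy-object ideas in the abstract can be illustrated with a minimal sketch. It assumes model outputs arrive as one dictionary per keyframe mapping object names to a yes/no answer; the function names, the sample data, and the false-positive summary are hypothetical illustrations, not the paper's implementation.

```python
from typing import Dict, List, Set

def build_heatmap(responses: List[Dict[str, bool]], objects: List[str]) -> List[List[int]]:
    """Rows are objects, columns are keyframes.

    1 = model reports the object as present (a green cell),
    0 = model reports it as absent (a red cell).
    """
    return [[1 if frame.get(obj, False) else 0 for frame in responses]
            for obj in objects]

def spy_false_positive_rate(heatmap: List[List[int]],
                            objects: List[str],
                            spies: Set[str]) -> float:
    """Fraction of spy-object cells marked present.

    Spy objects are known to be absent, so every green cell in a
    spy row is a hallucination.
    """
    cells = [cell
             for obj, row in zip(objects, heatmap) if obj in spies
             for cell in row]
    return sum(cells) / len(cells) if cells else 0.0

# Hypothetical outputs for three keyframes; "chair" is the spy object.
responses = [
    {"car": True, "tree": True, "chair": False},
    {"car": True, "tree": False, "chair": True},
    {"car": False, "tree": True, "chair": False},
]
objects = ["car", "tree", "chair"]

heatmap = build_heatmap(responses, objects)
rate = spy_false_positive_rate(heatmap, objects, {"chair"})
```

In this toy example the spy row contains one green cell out of three, which is the kind of pattern a user would read as a hallucination signal without needing ground truth for the other objects.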

Paper Structure

This paper contains 85 sections, 3 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Object Selection Panel (E) enlarged from Fig. \ref{fig:dashboard_final}. Objects prefixed with '*' and displayed in violet function as adversarial 'spy' instances (e.g., 'Chair' in this case) that test the model's ability to recognize object absence.
  • Figure 2: Binary Heat Map (F) enlarged from Fig. \ref{fig:dashboard_final}, showing the core visualization where green cells indicate objects the model recognizes and red cells represent objects it does not recognize.
  • Figure 3: IKIWISI's click-to-zoom feature in action. When a user clicks keyframe 2 (highlighted in Fig. \ref{fig:dashboard_final}), the system opens an enlarged view in the operating system's default image viewer (shown here in macOS Preview). This external window allows users to inspect details, adjust magnification, and manipulate the view as needed for thorough analysis.
  • Figure 4: Example prompts to GPT4V (left column), GPV-1 (center column), and BLIP (right column), and the model-generated responses for the first frame in Fig. \ref{fig:dashboard_final}. For each image, GPT4V was prompted once for a set of $N_o$ objects and responded with a dictionary, as shown in the left column. The other two models were prompted $N_o$ times per image, once for each object. Correct responses are in green, and incorrect ones are in red.
  • Figure 5: Early design 1: Two heat maps, two models, and the same objects. Users can pick two models and compare their heat maps for the same selected objects.
  • ...and 13 more figures
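
The two prompting patterns described in the Figure 4 caption (one prompt per image returning a dictionary, versus one prompt per object) can be sketched as follows. This is only an illustration under stated assumptions: the model callables are stand-in stubs, and the function names, answer format, and "yes" parsing rule are hypothetical rather than the prompts or parsers used in the paper.

```python
from typing import Callable, Dict, List

def presence_from_batch(ask_batch: Callable[[str, List[str]], Dict[str, str]],
                        image: str, objects: List[str]) -> Dict[str, bool]:
    """One prompt per image: the model answers for all objects in one dictionary."""
    answers = ask_batch(image, objects)
    return {obj: str(answers.get(obj, "no")).strip().lower().startswith("yes")
            for obj in objects}

def presence_from_single(ask_one: Callable[[str, str], str],
                         image: str, objects: List[str]) -> Dict[str, bool]:
    """One prompt per object: the model is queried len(objects) times."""
    return {obj: ask_one(image, obj).strip().lower().startswith("yes")
            for obj in objects}

# Stand-in models (assumptions): real vision-language models would be called here.
def fake_batch_model(image: str, objects: List[str]) -> Dict[str, str]:
    return {obj: ("Yes" if obj == "car" else "No") for obj in objects}

def fake_single_model(image: str, obj: str) -> str:
    return "yes, there is" if obj in {"car", "tree"} else "no"

objects = ["car", "tree", "chair"]
batch_row = presence_from_batch(fake_batch_model, "frame_001.jpg", objects)
single_row = presence_from_single(fake_single_model, "frame_001.jpg", objects)
```

Either pattern yields one boolean per object for a keyframe, which is the input a binary heat map column needs regardless of how the model was queried.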