TACIT Benchmark: A Programmatic Visual Reasoning Benchmark for Generative and Discriminative Models

Daniel Nobrega Medeiros

TL;DR

This paper introduces the TACIT Benchmark, a programmatic visual reasoning benchmark comprising 10 tasks across 6 reasoning domains: spatial navigation, abstract pattern completion, causal simulation, logical constraint satisfaction, graph theory, and topology.

Abstract

Existing visual reasoning benchmarks predominantly rely on natural language prompts, evaluate narrow reasoning modalities, or depend on subjective scoring procedures such as LLM-as-judge. We introduce the TACIT Benchmark, a programmatic visual reasoning benchmark comprising 10 tasks across 6 reasoning domains: spatial navigation, abstract pattern completion, causal simulation, logical constraint satisfaction, graph theory, and topology. The benchmark provides dual-track evaluation: a generative track in which models must produce solution images verified through deterministic computer-vision pipelines, and a discriminative track offering five-way multiple choice with structurally plausible near-miss distractors. Each distractor violates exactly one structural constraint, requiring models to reason about fine-grained visual differences rather than exploit superficial cues. Version 0.1.0 distributes 6,000 puzzles (108,000 PNG images across three resolutions) with fully deterministic seeded generation and reproducible verification. The dataset, generation code, and evaluation harness are released under the Apache 2.0 license on HuggingFace (DOI: 10.57967/hf/7904).
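
To make the ideas of seeded generation and single-violation distractors concrete, the following is a minimal sketch on a toy graph-coloring puzzle. It is not the released TACIT code: the function names `make_puzzle` and `violated_edges`, the path-shaped graph, the fixed choice of three colors, and the returned dictionary layout are all assumptions made for illustration.

```python
import random

K = 3  # number of colors (an assumption of this sketch)


def violated_edges(coloring, edges):
    """Return the edges whose two endpoints share a color."""
    return [(u, v) for u, v in edges if coloring[u] == coloring[v]]


def make_puzzle(seed, n_nodes=6):
    """Hypothetical generator: identical seeds yield identical puzzles."""
    rng = random.Random(seed)  # private RNG, so no global state is touched

    # Toy graph: a path through the nodes in a seeded random order.
    order = list(range(n_nodes))
    rng.shuffle(order)
    edges = [(order[i], order[i + 1]) for i in range(n_nodes - 1)]

    # Proper coloring by construction: cycle a seeded palette along the path.
    palette = list(range(K))
    rng.shuffle(palette)
    solution = [0] * n_nodes
    for pos, node in enumerate(order):
        solution[node] = palette[pos % K]

    # Near-miss distractors: recolor one node so that exactly one edge breaks.
    # Recolorings that stay valid (zero violations) are filtered out.
    near_misses = []
    for node in range(n_nodes):
        for color in range(K):
            if color == solution[node]:
                continue
            candidate = solution.copy()
            candidate[node] = color
            if len(violated_edges(candidate, edges)) == 1:
                near_misses.append(candidate)
    distractors = rng.sample(near_misses, 4)

    # Five-way multiple choice: shuffle the solution in among the distractors.
    choices = distractors + [solution]
    rng.shuffle(choices)
    return {"edges": edges, "choices": choices, "answer": choices.index(solution)}
```

Under these assumptions, calling `make_puzzle(42)` twice returns equal dictionaries, and a checker built on `violated_edges` accepts exactly one of the five candidates. Filtering candidates through the same verifier that scores answers is one simple way to keep every distractor a genuine near miss rather than an accidental second solution.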

Paper Structure (50 sections, 4 figures, 5 tables)

Introduction
Related Work
Visual reasoning benchmarks.
Multimodal evaluation suites.
Programmatic and procedural benchmarks.
Positioning of TACIT.
Benchmark Design
Design Principles
Dual-Track Evaluation Architecture
Track 1: Generative evaluation.
Track 2: Discriminative evaluation.
Generator Protocol and Deterministic Seeding
Rendering Pipeline
Distractor System
Tasks
...and 35 more sections

Figures (4)

Figure 1: Dual-track evaluation architecture. Track 1 (generative) requires the model to produce a solution image verified by a deterministic CV pipeline. Track 2 (discriminative) presents five candidates for multiple-choice selection.
Figure 2: Representative puzzle instances from all 10 TACIT Benchmark tasks, grouped by domain. Row 1: Spatial and Pattern tasks. Row 2: Logical, Graph, and Topology tasks. Row 3: Geometric Projection tasks. Each puzzle is specified entirely through visual encoding with minimal text annotations.
Figure 3: Generation and verification pipeline. SVGs are generated deterministically, rasterized to multi-resolution PNGs, and verified against candidates using task-specific CV pipelines.
Figure 4: Puzzle–solution pairs for four representative tasks. (a–b) Multi-layer maze with blue path from start (green) to end (red). (c–d) Graph $k$-coloring: gray uncolored nodes and their proper coloring. (e–f) Orthographic projection: 3D voxel solid and its 2D silhouette. (g–h) Raven's matrix with missing tile completed.
