TACIT: Transformation-Aware Capturing of Implicit Thought
Daniel Nobrega
TL;DR
We tackle visual reasoning without language by modeling the problem-to-solution transformation as a deterministic flow in pixel space, trained with rectified flow. TACIT learns a velocity field $v=f_\theta(x_t,t)$ that transports an unsolved maze image $x_0$ to its solution $x_1$ over $t\in[0,1]$, integrated with 10 Euler steps. Trained on 1 million maze pairs, it achieves a $192\times$ reduction in training loss and a $22.7\times$ improvement in $L_2$ distance to ground truth. A striking phase transition occurs at $t^*=0.70$: the solution emerges within 2% of the transformation time, and all samples exhibit simultaneous emergence across spatial regions, suggesting non-sequential, gestalt-like reasoning. Because the flow is noise-free and operates in pixel space, the model's intermediate reasoning states can be visualized directly, offering empirical insight into tacit knowledge and unconscious processing in neural networks. These findings suggest that language-free visual reasoning, guided by interpretable transformations, can yield robust, holistic problem solving and pave the way for studying implicit cognitive strategies in neural systems.
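To make the transport from $x_0$ to $x_1$ concrete, the following is a minimal NumPy sketch of the standard rectified-flow ingredients: the straight-line interpolant $x_t=(1-t)x_0+t x_1$, its regression target $x_1-x_0$, and fixed-step Euler integration. It is an illustration under stated assumptions, not the TACIT implementation: the function names are hypothetical, random arrays stand in for maze images, and an "oracle" velocity replaces the learned network $f_\theta$.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Straight-line path of rectified flow: x_t = (1 - t) * x0 + t * x1."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Regression target for the velocity field along the straight path."""
    return x1 - x0

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler.

    Returns the final state and every intermediate state, so the
    trajectory can be inspected frame by frame.
    """
    x = np.asarray(x0, dtype=np.float64).copy()
    dt = 1.0 / num_steps
    trajectory = [x.copy()]
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
        trajectory.append(x.copy())
    return x, trajectory

# Toy stand-ins for an unsolved maze image (x0) and its solution (x1).
rng = np.random.default_rng(0)
x0 = rng.random((8, 8, 3))
x1 = rng.random((8, 8, 3))

# Assumption: an oracle velocity replaces the trained model f_theta.
def oracle_velocity(x, t):
    return target_velocity(x0, x1)

x_final, trajectory = euler_sample(oracle_velocity, x0, num_steps=10)
```

With the oracle velocity the path is exactly straight, so 10 Euler steps land on `x1` up to floating-point error. A learned $f_\theta$ would only approximate this, and `trajectory` holds the 11 intermediate images one would visualize.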
Abstract
We present TACIT (Transformation-Aware Capturing of Implicit Thought), a diffusion-based transformer for interpretable visual reasoning. Unlike language-based reasoning systems, TACIT operates entirely in pixel space using rectified flow, enabling direct visualization of the reasoning process at each inference step. We demonstrate the approach on maze-solving, where the model learns to transform images of unsolved mazes into solutions. Key results on 1 million synthetic maze pairs include:
- 192x reduction in training loss over 100 epochs
- 22.7x improvement in L2 distance to ground truth
- Only 10 Euler steps required (vs. 100-1000 for typical diffusion models)
Quantitative analysis reveals a striking phase transition phenomenon: the solution remains invisible for 68% of the transformation (zero recall), then emerges abruptly at t=0.70 within just 2% of the process. Most remarkably, 100% of samples exhibit simultaneous emergence across all spatial regions, ruling out sequential path construction and providing evidence for holistic rather than algorithmic reasoning. This "eureka moment" pattern -- long incubation followed by sudden crystallization -- parallels insight phenomena in human cognition. The pixel-space design with noise-free flow matching provides a foundation for understanding how neural networks develop implicit reasoning strategies that operate below and before language.
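One way such an emergence analysis could be set up is sketched below. Everything in it is an assumption made for illustration: the abstract does not define its recall metric, so the threshold-based recall, the quadrant split used as "spatial regions", the function names, and the synthetic trajectory are all hypothetical rather than the paper's protocol.

```python
import numpy as np

def path_recall(frame, path_mask, threshold=0.5):
    """Fraction of ground-truth path pixels whose intensity exceeds threshold."""
    predicted = frame > threshold
    return float((predicted & path_mask).sum() / path_mask.sum())

def emergence_time(frames, times, path_mask, min_recall=0.5):
    """First time at which recall reaches min_recall, or None if it never does."""
    for frame, t in zip(frames, times):
        if path_recall(frame, path_mask) >= min_recall:
            return float(t)
    return None

def quadrant_emergence_times(frames, times, path_mask, min_recall=0.5):
    """Emergence time per image quadrant (None where a quadrant has no path)."""
    h, w = path_mask.shape
    results = []
    for rows in (slice(0, h // 2), slice(h // 2, h)):
        for cols in (slice(0, w // 2), slice(w // 2, w)):
            sub_mask = path_mask[rows, cols]
            if sub_mask.sum() == 0:
                results.append(None)
                continue
            sub_frames = [f[rows, cols] for f in frames]
            results.append(emergence_time(sub_frames, times, sub_mask, min_recall))
    return results

# Synthetic example: a "solution path" on two rows, touching all quadrants.
path_mask = np.zeros((8, 8), dtype=bool)
path_mask[2, :] = True
path_mask[5, :] = True

# 10 Euler steps give 11 states; here the path appears at index 7 (t near 0.7).
times = np.linspace(0.0, 1.0, 11)
frames = [path_mask.astype(float) if k >= 7 else np.zeros((8, 8))
          for k in range(len(times))]

t_star = emergence_time(frames, times, path_mask)
per_quadrant = quadrant_emergence_times(frames, times, path_mask)
```

In this synthetic trajectory all four quadrants report the same emergence time as the whole image, which is the signature of simultaneous emergence. A sequentially constructed path would instead yield different times per region.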
