TACIT: Transformation-Aware Capturing of Implicit Thought

Daniel Nobrega

TL;DR

We tackle visual reasoning without language by modeling the problem-to-solution transformation as a deterministic, pixel-space flow using rectified flow. TACIT learns a velocity field $v=f_\theta(x_t,t)$ that smoothly transports an unsolved maze image $x_0$ to its solution $x_1$ over $t\in[0,1]$ with 10 Euler steps, achieving a $192\times$ loss reduction and a $22.7\times$ improvement in $\text{L2}$ distance on 1 million maze pairs. A striking phase transition occurs at $t^*=0.70$, where the solution emerges within 2% of transformation time, and all samples exhibit simultaneous, holistic emergence, suggesting non-sequential, gestalt-like reasoning. The pixel-space, noise-free design enables direct visualization of the model's intermediate reasoning states, offering empirical insight into tacit knowledge and unconscious processing in neural networks. These findings imply that language-free visual reasoning, guided by interpretable transformations, can yield robust, holistic problem solving and pave the way for studying implicit cognitive strategies in neural systems.
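The transport described above can be illustrated numerically. The sketch below is a minimal NumPy example, not the author's implementation: `euler_sample` and `oracle_velocity` are hypothetical names, and the oracle stands in for the trained network $f_\theta$ by returning $x_1 - x_0$, the usual rectified-flow target along a straight interpolation path (this excerpt does not spell out the training target, so treat that as an assumption).

```python
import numpy as np

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=np.float64).copy()
    dt = 1.0 / num_steps
    trajectory = [x.copy()]          # keep every intermediate state for inspection
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
        trajectory.append(x.copy())
    return x, trajectory

# Random stand-ins for an unsolved maze image x0 and its solution x1 (not real data).
rng = np.random.default_rng(0)
x0 = rng.random((64, 64, 3))
x1 = rng.random((64, 64, 3))

def oracle_velocity(x, t):
    """Hypothetical stand-in for f_theta: the constant direction from x0 to x1."""
    return x1 - x0

x_final, trajectory = euler_sample(oracle_velocity, x0, num_steps=10)
```

With a perfectly straight path the oracle reaches $x_1$ regardless of step count. A learned velocity field is only approximately straight, which is presumably why the step count matters in practice. The stored `trajectory` is what makes the intermediate states directly viewable as images.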

Abstract

We present TACIT (Transformation-Aware Capturing of Implicit Thought), a diffusion-based transformer for interpretable visual reasoning. Unlike language-based reasoning systems, TACIT operates entirely in pixel space using rectified flow, enabling direct visualization of the reasoning process at each inference step. We demonstrate the approach on maze-solving, where the model learns to transform images of unsolved mazes into solutions.

Key results on 1 million synthetic maze pairs include:

- 192x reduction in training loss over 100 epochs
- 22.7x improvement in L2 distance to ground truth
- Only 10 Euler steps required (vs. 100-1000 for typical diffusion models)

Quantitative analysis reveals a striking phase transition phenomenon: the solution remains invisible for 68% of the transformation (zero recall), then emerges abruptly at t=0.70 within just 2% of the process. Most remarkably, 100% of samples exhibit simultaneous emergence across all spatial regions, ruling out sequential path construction and providing evidence for holistic rather than algorithmic reasoning. This "eureka moment" pattern (long incubation followed by sudden crystallization) parallels insight phenomena in human cognition. The pixel-space design with noise-free flow matching provides a foundation for understanding how neural networks develop implicit reasoning strategies that operate below and before language.
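One way such an emergence time could be measured is to track, at each inference step, the recall of the ground-truth path pixels. The sketch below is an assumption-laden illustration rather than the paper's metric: the function names, the red-detection thresholds, and the 0.5 recall cutoff are all hypothetical, and the toy trajectory is a plain linear fade that does not reproduce the reported transition at t=0.70.

```python
import numpy as np

def red_mask(img, hi=0.6, lo=0.4):
    """Pixels that look red: strong R, weak G and B (thresholds are assumed)."""
    return (img[..., 0] > hi) & (img[..., 1] < lo) & (img[..., 2] < lo)

def path_recall(frame, target_mask):
    """Fraction of ground-truth path pixels already rendered red in `frame`."""
    total = target_mask.sum()
    if total == 0:
        return 0.0
    return float((red_mask(frame) & target_mask).sum() / total)

def emergence_time(trajectory, target_mask, threshold=0.5):
    """First t in [0, 1] at which recall reaches `threshold`, else None."""
    num_steps = len(trajectory) - 1
    for k, frame in enumerate(trajectory):
        if path_recall(frame, target_mask) >= threshold:
            return k / num_steps
    return None

# Toy data (not the paper's): a white image whose "path" row fades linearly to red.
x0 = np.ones((8, 8, 3))
x1 = x0.copy()
x1[3, :, 1:] = 0.0                      # row 3 is pure red in the target
target_mask = red_mask(x1)

steps = 50
trajectory = [(1 - k / steps) * x0 + (k / steps) * x1 for k in range(steps + 1)]

recalls = [path_recall(f, target_mask) for f in trajectory]
t_star = emergence_time(trajectory, target_mask)
```

Applying `path_recall` separately to sub-regions of `target_mask` and comparing the resulting emergence times would be one way to test whether emergence is simultaneous or sequential across space.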

Paper Structure

This paper contains 73 sections, 12 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: TACIT architecture overview. Input images ($64\times64\times3$) are tokenized into 64 patches via convolutional embedding, enriched with 2D sinusoidal positional encodings, and processed through 8 transformer blocks with adaptive layer normalization (adaLN) conditioned on the timestep embedding. The final layer reconstructs the predicted velocity field in pixel space. Orange arrows indicate timestep conditioning; the model learns to predict the direction from problem to solution at each point along the interpolation path.
  • Figure 2: Training loss over 100 epochs (log scale). The model exhibits three phases: rapid learning (epochs 1-25), refinement (epochs 25-60), and fine-tuning (epochs 60-100). Total loss reduction: 192$\times$ from epoch 5 to epoch 100.
  • Figure 3: Prediction quality measured as L2 distance to ground truth over training epochs. Lower values indicate better predictions. The model achieves 22.7$\times$ improvement from epoch 5 to epoch 100.
  • Figure 4: Visual evolution of model outputs across training. Each row shows a different maze sample. Columns show model predictions at epochs 5, 10, 25, 50, 75, and 100 (left to right). Early epochs produce blurry outputs, while later epochs produce accurate solutions with the correct path marked in red.
  • Figure 5: Inference trajectory visualization across 50 Euler steps for 8 different mazes. Each row shows a sample's transformation from $t=0$ (input) to $t=1$ (solution). The solution path (red) remains invisible until approximately $t=0.7$, then emerges abruptly and simultaneously across the entire trajectory. This pattern (long incubation followed by sudden crystallization) is consistent across all samples.
  • ...and 2 more figures
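
The tokenization arithmetic in the Figure 1 caption can be checked with a short sketch. Assuming square, non-overlapping patches, 64 patches from a 64×64 image implies an 8×8 grid of 8×8-pixel patches, each holding 8·8·3 = 192 values. A convolution whose kernel size equals its stride is equivalent to a linear projection of each flattened patch, which is what the NumPy code below uses. The patch size is inferred rather than stated in this excerpt, and `embed_dim = 256` is a purely hypothetical value.

```python
import numpy as np

def patchify(img, patch=8):
    """Split an (H, W, C) image into (num_patches, patch*patch*C) rows."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    x = img.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)       # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))            # random stand-in for a maze image

tokens_raw = patchify(img, patch=8)      # one row per patch, row-major over the grid

embed_dim = 256                          # hypothetical; not given in this excerpt
W = rng.normal(scale=0.02, size=(8 * 8 * 3, embed_dim))
b = np.zeros(embed_dim)
tokens = tokens_raw @ W + b              # linear patch embedding (conv with kernel = stride)

print(tokens_raw.shape, tokens.shape)
```

Positional encodings and the adaLN-conditioned transformer blocks would then operate on `tokens`; they are omitted here because their details are not given in this excerpt.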