TextualVerifier: Verify TextGrad Step-by-Step
Eugenius Mario Situmorang, Adila Alfa Krisnadhi, Ari Wibisono
TL;DR
TextualVerifier addresses a critical gap in TextGrad by introducing a self-verification framework that uses chain-of-thought decomposition, variant generation, majority voting, and consensus aggregation. It integrates verification at both the loss and optimization stages without requiring numerical gradients, leveraging LLMs to improve reasoning validity and reliability. Empirical results show statistically significant improvements across datasets: the loss-only verification configuration provides robust gains (e.g., +2.2 pp overall), and the phase-one standalone evaluation shows a 29% improvement in reasoning-step validity. The work demonstrates the feasibility and value of LLM-based verification for text-based optimization and points to future avenues in multimodal verification and domain-adaptive strategies.
Abstract
TextGrad is a novel approach to text-based automatic differentiation that enables composite AI systems to perform optimization without explicit numerical equations. However, it currently lacks self-verification mechanisms that ensure reasoning validity in text-based decision making. This research introduces TextualVerifier, a verification framework that leverages chain-of-thought reasoning and majority voting with large language models to address this verification gap. TextualVerifier implements a four-stage workflow: chain-of-thought decomposition, variant generation, majority voting, and consensus aggregation. It integrates non-invasively with TextGrad at two points: verification of the loss function and verification of the optimization result. Experimental evaluation using the Gemini 1.5 Pro model is conducted in two phases: (1) standalone evaluation on PRM800K, and (2) integrated evaluation with TextGrad on the GPQA-Diamond, MMLU-ML, and MMLU-CP benchmarks. Results show statistically significant improvements (p < 0.001). In phase one, TextualVerifier improves the validity of reasoning steps by 29 percent. In phase two, integration into the TextGrad loss function yields a 2.2 percentage point gain, from 68.2 to 70.4 percent, with a moderate overhead of 5.9 LLM calls on average. Further evaluation across TextualVerifier versions yields improvements of 8.08, 10.71, and 3.92 percentage points on GPQA, MMLU-ML, and MMLU-CP, respectively. TextualVerifier thus presents the first self-verification framework for TextGrad, built on LLM-based techniques without requiring numerical gradients, enabling more reliable reasoning and opening new directions for verification in text-based optimization.
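The four-stage workflow named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`decompose`, `generate_variants`, `majority_vote`, `verify`), the prompt wording, the line-based step splitting, and the `stub_llm` stand-in for a real model call are all assumptions made for illustration only.

```python
from collections import Counter
from typing import Callable, List


def decompose(solution: str) -> List[str]:
    """Stage 1 (assumed form): treat each non-empty line as one reasoning step."""
    return [line.strip() for line in solution.splitlines() if line.strip()]


def generate_variants(step: str, context: List[str],
                      llm: Callable[[str], str], n: int) -> List[str]:
    """Stage 2: ask the LLM n times to check and restate a single step."""
    joined = "\n".join(context)
    return [
        llm(f"Context:\n{joined}\nVerify and restate step (attempt {i}):\n{step}")
        for i in range(n)
    ]


def majority_vote(variants: List[str]) -> str:
    """Stage 3: pick the most frequent variant after light normalization."""
    normalized = [v.strip().lower() for v in variants]
    winner, _ = Counter(normalized).most_common(1)[0]
    # Return the first variant whose normalized form won the vote.
    return next(v for v in variants if v.strip().lower() == winner)


def verify(solution: str, llm: Callable[[str], str], n_variants: int = 3) -> str:
    """Stage 4: merge the per-step winners back into one verified solution."""
    verified: List[str] = []
    for step in decompose(solution):
        variants = generate_variants(step, verified, llm, n_variants)
        verified.append(majority_vote(variants))
    return "\n".join(verified)


def stub_llm(prompt: str) -> str:
    """Hypothetical deterministic stand-in for a model call (no network access).

    It echoes the step on attempt 0 and corrects one known arithmetic slip on
    later attempts, so the vote has something to decide.
    """
    step = prompt.rsplit("\n", 1)[-1]
    if "attempt 0" in prompt:
        return step
    return step.replace("2 + 2 = 5", "2 + 2 = 4")


print(verify("Let x = 2 + 2.\n2 + 2 = 5", stub_llm))
```

In this sketch the faulty second step is corrected because two of the three variants agree on the fix. A real deployment would replace `stub_llm` with an actual model call and would likely need a more tolerant comparison than exact string matching when voting.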
