TextualVerifier: Verify TextGrad Step-by-Step

Eugenius Mario Situmorang, Adila Alfa Krisnadhi, Ari Wibisono

TL;DR

TextualVerifier addresses a critical gap in TextGrad by introducing a self-verification framework that uses chain-of-thought decomposition, variant generation, majority voting, and consensus aggregation. It integrates verification at both the loss and optimization stages without requiring numerical gradients, leveraging LLMs to improve reasoning validity and reliability. Empirical results show statistically significant improvements across datasets: the loss-only verification configuration provides robust gains (+2.2 percentage points overall, from 68.2% to 70.4%), and the standalone phase-one evaluation shows a 29% improvement in the validity of reasoning steps. The work demonstrates the feasibility and value of LLM-based verification for text-based optimization and points to future avenues in multimodal verification and domain-adaptive strategies.

Abstract

TextGrad is a novel approach to text-based automatic differentiation that enables composite AI systems to perform optimization without explicit numerical equations. However, it currently lacks self-verification mechanisms that ensure reasoning validity in text-based decision-making. This research introduces TextualVerifier, a verification framework that leverages chain-of-thought reasoning and majority voting with large language models to address this verification gap. TextualVerifier implements a four-stage workflow: chain-of-thought decomposition, variant generation, majority voting, and consensus aggregation. It integrates non-invasively with TextGrad at two points: the loss function and the verification of optimization results. Experimental evaluation using the Gemini 1.5 Pro model is conducted in two phases: (1) standalone evaluation on PRM800K, and (2) integrated evaluation with TextGrad on the GPQA-Diamond, MMLU-ML, and MMLU-CP benchmarks. Results show statistically significant improvements (p < 0.001). In phase one, TextualVerifier improves the validity of reasoning steps by 29 percent. In phase two, integration into the TextGrad loss function yields a 2.2 percentage point gain, from 68.2 to 70.4 percent, with a moderate overhead of 5.9 LLM calls on average. Further evaluations of TextualVerifier versions yield improvements of 8.08, 10.71, and 3.92 percentage points on GPQA, MMLU-ML, and MMLU-CP, respectively. TextualVerifier thus presents the first self-verification framework for TextGrad through LLM-based techniques without requiring numerical gradients, enabling more reliable reasoning and opening new directions for verification in text-based optimization.
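The four-stage workflow named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the prompt wording, the line-based step splitting, the exact-match voting rule, and the `llm` callable (prompt in, text out) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to model text.
LLM = Callable[[str], str]


def decompose(reasoning: str) -> List[str]:
    """Stage 1 (assumed form): treat each non-empty line as one reasoning step."""
    return [line.strip() for line in reasoning.splitlines() if line.strip()]


def generate_variants(llm: LLM, context: str, step: str, n: int) -> List[str]:
    """Stage 2 (assumed form): ask the model n times to check and restate a step."""
    prompt = (
        f"Verified steps so far:\n{context}\n\n"
        f"Check this step and restate it correctly:\n{step}"
    )
    return [llm(prompt).strip() for _ in range(n)]


def majority_vote(variants: List[str]) -> str:
    """Stage 3 (assumed form): pick the most frequent variant.

    Variants are compared after lowercasing and collapsing whitespace;
    ties go to the variant that appeared first.
    """
    normalized = [" ".join(v.lower().split()) for v in variants]
    winner, _ = Counter(normalized).most_common(1)[0]
    return variants[normalized.index(winner)]


def verify(llm: LLM, reasoning: str, n_variants: int = 3) -> str:
    """Run all stages; Stage 4 (assumed form) joins the winning steps in order."""
    verified: List[str] = []
    for step in decompose(reasoning):
        context = "\n".join(verified)
        variants = generate_variants(llm, context, step, n_variants)
        verified.append(majority_vote(variants))
    return "\n".join(verified)
```

In this sketch the number of model calls is the number of steps times `n_variants`, which gives a rough sense of how verification adds LLM calls on top of the base optimization loop.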

Paper Structure

This paper contains 21 sections, 2 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: TextualVerifier Architecture showing the four-stage verification workflow: (1) Chain-of-Thought Decomposition, (2) Step Breakdown and Extraction, (3) Variant Generation with Multiple Perspectives, and (4) Majority Voting and Consensus Aggregation.
  • Figure 2: Integration points with TextGrad showing Loss Function Verification and Optimization Phase Verification within the TextGrad optimization workflow.
  • Figure 3: Experiment Phase 1 High-Level Flow using PRM800K dataset.
  • Figure 4: Experiment Phase 2 High-Level Flow using GPQA-Diamond, MMLU-ML, and MMLU-CP datasets.
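The two integration points named in the Figure 2 caption (loss function verification and optimization phase verification) can be pictured as optional wrappers around an otherwise unchanged optimization step. The sketch below does not use the TextGrad API; `loss_fn`, `update_fn`, `verifier`, and the flag names are hypothetical stand-ins, each assumed to be a plain text-to-text function.

```python
from typing import Callable

# Hypothetical interface: every component maps text to text.
TextFn = Callable[[str], str]


def with_verification(fn: TextFn, verifier: TextFn) -> TextFn:
    """Wrap fn so that its output is passed through verifier before use."""
    def wrapped(text: str) -> str:
        return verifier(fn(text))
    return wrapped


def optimization_step(
    solution: str,
    loss_fn: TextFn,
    update_fn: TextFn,
    verifier: TextFn,
    verify_loss: bool = False,
    verify_update: bool = False,
) -> str:
    """One text-based optimization step with optional verification.

    verify_loss checks the textual feedback produced by the loss function;
    verify_update checks the revised solution produced by the optimizer.
    Neither loss_fn nor update_fn is modified, only wrapped.
    """
    loss = with_verification(loss_fn, verifier) if verify_loss else loss_fn
    update = with_verification(update_fn, verifier) if verify_update else update_fn
    feedback = loss(solution)
    return update(f"{solution}\n\nFeedback:\n{feedback}")
```

Setting only `verify_loss=True` corresponds, under these assumptions, to a loss-only configuration; enabling both flags verifies at both points at the cost of extra verifier calls.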