CAST: Achieving Stable LLM-based Text Analysis for Data Analytics

Jinxiang Xie, Zihao Li, Wei He, Rui Ding, Shi Han, Dongmei Zhang

TL;DR

This paper tackles the instability of LLM-generated text analyses in tabular data contexts by formalizing Text Analysis for Data Analytics (TADA) and introducing CAST, a framework that constrains latent reasoning via Algorithmic Prompting (AP) and Thinking-before-Speaking (TbS). CAST reduces the entropy of reasoning paths, yielding more reproducible corpus-level summaries and row-level tags while maintaining or improving output quality. Stability is measured with two new metrics, CAST-S for bulleted summarization and CAST-T for tagging. The authors formalize a probabilistic view of stability, propose concrete design principles, and implement CAST as a single call with explicit intermediate commitments and self-validation. Across multilingual summarization and diverse tagging tasks, CAST achieves the best stability among the compared baselines on multiple LLM backbones, with metrics that align with human judgments and competitive efficiency, indicating practical potential for dependable LLM-driven data analytics pipelines.

Abstract

Text analysis of tabular data relies on two core operations: summarization for corpus-level theme extraction and tagging for row-level labeling. A critical limitation of employing large language models (LLMs) for these tasks is their inability to meet the high standards of output stability demanded by data analytics. To address this challenge, we introduce CAST (Consistency via Algorithmic Prompting and Stable Thinking), a framework that enhances output stability by constraining the model's latent reasoning path. CAST combines (i) Algorithmic Prompting to impose a procedural scaffold over valid reasoning transitions and (ii) Thinking-before-Speaking to enforce explicit intermediate commitments before final generation. To measure progress, we introduce CAST-S and CAST-T, stability metrics for bulleted summarization and tagging, and validate their alignment with human judgments. Experiments across publicly available benchmarks on multiple LLM backbones show that CAST consistently achieves the best stability among all baselines, improving Stability Score by up to 16.2%, while maintaining or improving output quality.

Paper Structure (51 sections, 12 equations, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: Illustration of the summarization and tagging operations for TADA. These atomic operations can be composed and reused in complex TADA tasks.
  • Figure 2: Overview of the CAST framework. Top Left: Traditional methods, such as Few-shot CoT, Tree of Thoughts (ToT) and Self-Consistency (SC), operate with uncontrolled reasoning paths, resulting in wide, high-entropy output distributions. Top Right: CAST mitigates this instability via Algorithmic Prompting and Thinking-before-Speaking. By analyzing intermediate consistency and enforcing stable steps as hard constraints, CAST effectively collapses the generation process into a sharply concentrated output distribution. Bottom: The proposed stability evaluation suite, which rigorously quantifies consistency using a hybrid metric of Semantic Score ($S_{sem}$) for content overlap and Positional Score ($S_{pos}$) derived from Kendall's Tau ($\tau$).
  • Figure 3: Output-length stability under different prompting strategies. KDE-smoothed distributions of summary length (word count) compare (i) direct prompting with no intermediate state, (ii) prompting that elicits an irrelevant intermediate state, (iii) prompting that elicits a relevant intermediate state, and (iv) the full CAST prompt. Irrelevant intermediate states yield broader and more diffuse distributions, while relevant intermediate states partially tighten the spread. The CAST prompt produces the sharpest and most concentrated distribution, indicating substantially improved run-to-run stability with outputs tightly clustered around a central value. Shaded bands mark the central 10-90% mass, and dashed lines denote medians.
  • Figure 4: The CAST framework for tagging, illustrating a pipeline that begins with query decomposition and domain identification to guide the core algorithmic prompting stage, and concludes with output validation.
  • Figure 7: Visualization of reasoning path convergence. Unconstrained reasoning paths (orange), where the LLM freely chooses its intermediate steps, are highly divergent. In contrast, paths guided by the CAST framework (blue) converge into a single, structured sequence, demonstrating a significant reduction in reasoning path entropy.
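
The Figure 2 caption describes a hybrid stability metric that combines a Semantic Score ($S_{sem}$) for content overlap with a Positional Score ($S_{pos}$) derived from Kendall's Tau. The following is a minimal sketch of that idea for two runs of a bulleted summary, not the paper's actual CAST-S definition. Several pieces are assumptions made only for illustration: exact-match Jaccard overlap stands in for $S_{sem}$, Kendall's tau over the shared bullets is rescaled to $[0, 1]$ for $S_{pos}$, and the two are mixed with a hypothetical `weight` parameter. Bullets are assumed to be unique within each run.

```python
from itertools import combinations


def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same unique items (no ties)."""
    pos_b = {item: i for i, item in enumerate(order_b)}
    pairs = list(combinations(order_a, 2))
    if not pairs:
        return 1.0  # fewer than two items: the ordering trivially agrees
    # A pair is concordant when both runs place its items in the same relative order.
    concordant = sum(1 for x, y in pairs if pos_b[x] < pos_b[y])
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


def stability_sketch(run_a, run_b, weight=0.5):
    """Return (s_sem, s_pos, hybrid) for two runs of bullet labels.

    Assumptions (not taken from the paper): s_sem is Jaccard overlap on
    exact string matches, s_pos is Kendall's tau on the shared bullets
    rescaled from [-1, 1] to [0, 1], and `weight` is a hypothetical mixing knob.
    """
    set_a, set_b = set(run_a), set(run_b)
    shared = set_a & set_b
    union = set_a | set_b
    s_sem = len(shared) / len(union) if union else 1.0

    # Compare ordering only on bullets that appear in both runs.
    a_shared = [b for b in run_a if b in shared]
    b_shared = [b for b in run_b if b in shared]
    s_pos = (kendall_tau(a_shared, b_shared) + 1) / 2

    hybrid = weight * s_sem + (1 - weight) * s_pos
    return s_sem, s_pos, hybrid


if __name__ == "__main__":
    run_1 = ["price", "battery", "screen", "shipping"]
    run_2 = ["battery", "price", "screen", "support"]
    print(stability_sketch(run_1, run_2))
```

In practice the semantic component would need a softer notion of match than string equality (for example, embedding similarity), since two runs rarely phrase the same theme identically; the exact-match version is used here only to keep the sketch self-contained.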

Theorems & Definitions (1)

  • Definition 1: Output Stability
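
The text of Definition 1 is not reproduced on this page, so the following is only a loosely related illustration of the probabilistic view of stability mentioned in the TL;DR, not the paper's definition. Assuming the same prompt is run several times and each run yields one discrete output (for instance, a tag for a row), one simple way to quantify run-to-run spread is the empirical Shannon entropy of the observed outputs, alongside the share of runs that agree with the most common output. The function names are hypothetical.

```python
import math
from collections import Counter


def empirical_entropy(outputs):
    """Shannon entropy (in bits) of the observed outputs; 0 means every run agreed.

    Assumes `outputs` is a non-empty list of hashable values.
    """
    counts = Counter(outputs)
    n = len(outputs)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mode_agreement(outputs):
    """Fraction of runs that produced the most common output."""
    counts = Counter(outputs)
    return counts.most_common(1)[0][1] / len(outputs)


if __name__ == "__main__":
    stable_runs = ["Billing", "Billing", "Billing", "Billing"]
    unstable_runs = ["Billing", "Refund", "Billing", "Refund"]
    print(empirical_entropy(stable_runs), mode_agreement(stable_runs))
    print(empirical_entropy(unstable_runs), mode_agreement(unstable_runs))
```

Lower entropy and higher mode agreement both indicate a more concentrated output distribution, which is the qualitative behavior the Figure 2 caption attributes to CAST.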