How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs

Yapei Chang, Kyle Lo, Mohit Iyyer, Luca Soldaini

TL;DR

How2Everything is a scalable, end-to-end framework for mining real-world how-to procedures from the web and using them to evaluate and improve goal-conditioned, step-by-step procedure generation in LLMs. It combines How2Mine for web-scale data collection, How2Bench for evaluation, How2Score for validity scoring, and How2Judge for scalable judging; RL with How2Score as the reward improves How2Bench performance by more than 10 points across three models. The benchmark shows clear scaling trends across model sizes and training stages, model rankings that stay stable across judges, and no systematic regressions on standard benchmarks, suggesting a practical closed loop of capability evaluation and improvement driven by web-derived procedural data. Overall, the work shows that pretraining web data can seed scalable evaluation and improvement loops for a core user-facing LLM capability: reliable, executable how-to procedures.

Abstract

Generating step-by-step "how-to" procedures is a key LLM capability: how-to advice is commonly requested in chatbots, and step-by-step planning is critical for reasoning over complex tasks. Yet, measuring and improving procedural validity at scale on real-world tasks remains challenging and understudied. To address this, we introduce How2Everything, a scalable framework to evaluate and improve goal-conditioned procedure generation. Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics and readily scales to larger corpora. From this pool we build How2Bench, a 7K-example evaluation set balanced across topics. To reliably score model outputs, we develop How2Score, an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal. For low-cost, reproducible evaluation, we distill a frontier model into an open 8B model, achieving 80.5% agreement with human annotators. How2Bench reveals clear scaling trends across model sizes and training stages, providing signal early in pretraining. Finally, RL using How2Score as a reward improves performance on How2Bench by >10 points across three models without systematic regressions on standard benchmarks, with gains robust to superficial source-document memorization or format compliance. Taken together, How2Everything shows how pretraining web data can support a closed loop of capability evaluation and improvement at scale.
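
To make the scoring idea concrete, here is a minimal sketch of how binary critical-failure verdicts could be aggregated into a benchmark-level number. It assumes (the abstract does not state the exact formula) that the reported score is the percentage of generations the judge finds free of any critical failure. The function name `how2score_style` is hypothetical; the label strings follow the has_failure/no_failure classes named in the Figure 2 caption.

```python
from typing import Iterable

# Assumption: the judge emits exactly one of these two labels per generation.
ALLOWED = {"has_failure", "no_failure"}


def how2score_style(verdicts: Iterable[str]) -> float:
    """Return the percentage of generations judged free of critical failure.

    This aggregation is an illustrative assumption, not the paper's
    confirmed definition of How2Score.
    """
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("need at least one verdict")
    unknown = set(verdicts) - ALLOWED
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    passed = sum(v == "no_failure" for v in verdicts)
    return 100.0 * passed / len(verdicts)


if __name__ == "__main__":
    toy = ["no_failure", "has_failure", "no_failure", "no_failure"]
    print(how2score_style(toy))
```

Under this assumption, a single critical failure zeroes out an example regardless of how many other steps are correct, which matches the abstract's framing of detecting "any critical failure that would prevent achieving the goal."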

Paper Structure (57 sections, 4 equations, 22 figures, 11 tables)

Figures (22)

  • Figure 1: Given a sample of 980K topic-stratified web documents, How2Mine yields 351K procedure instances (goal + resources list + reference steps), and can be easily scaled to larger corpora.
  • Figure 2: Agreement between LLM judges and the human majority label on critical-failure detection (N=200), reported overall and stratified by the human-majority class (has_failure/no_failure). See the judging-prompts appendix for the judge prompt and the human-annotation appendix for annotation details.
  • Figure 3: How2Bench results on selected models. We report How2Score computed with How2Judge along with the average generated tokens for each model. The average reference length is 97.44 tokens. For open models, Base denotes the final non-post-trained checkpoint, and Instruct denotes the post-trained checkpoint.
  • Figure 4: Cross-judge robustness check for self-preference bias on closed models spanning the GPT, Gemini, and Claude families: we rescore the same generations with four judges (How2Judge, GPT 5, Gemini 2.5 Pro, Claude 4.5 Opus) and find the ranking is unchanged.
  • Figure 5: Post-training from different Olmo 3 7B pretraining checkpoints (x-axis). RL gains grow substantially at later checkpoints, while SFT yields modest improvements.
  • ...and 17 more figures
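
The agreement measure described in the Figure 2 caption (judge vs. human majority label, overall and stratified by the human-majority class) can be sketched as follows. This is an illustrative sketch only: the helper names `majority_label` and `agreement_report` are hypothetical, and the toy data uses three annotators per item as an assumption, since the number of annotators is not given here.

```python
from collections import Counter
from typing import Dict, List, Sequence

LABELS = ("has_failure", "no_failure")


def majority_label(votes: Sequence[str]) -> str:
    """Return the most common label; refuse to break ties silently."""
    ranked = Counter(votes).most_common()
    if not ranked:
        raise ValueError("no votes for this item")
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        raise ValueError("tie between labels; use an odd number of annotators")
    return ranked[0][0]


def agreement_report(
    human_votes: List[Sequence[str]], judge_labels: Sequence[str]
) -> Dict[str, float]:
    """Judge agreement with the human majority, overall and per majority class."""
    if len(human_votes) != len(judge_labels):
        raise ValueError("human_votes and judge_labels must align item by item")
    if not judge_labels:
        raise ValueError("need at least one item")
    gold = [majority_label(v) for v in human_votes]
    hits = [g == j for g, j in zip(gold, judge_labels)]
    report = {"overall": sum(hits) / len(hits)}
    for label in LABELS:
        # Stratify by the human-majority class, as in the caption.
        subset = [h for g, h in zip(gold, hits) if g == label]
        if subset:
            report[label] = sum(subset) / len(subset)
    return report


if __name__ == "__main__":
    humans = [
        ["has_failure", "has_failure", "no_failure"],
        ["no_failure", "no_failure", "no_failure"],
        ["has_failure", "no_failure", "has_failure"],
        ["no_failure", "no_failure", "has_failure"],
    ]
    judge = ["has_failure", "no_failure", "no_failure", "no_failure"]
    print(agreement_report(humans, judge))
```

Reporting the stratified numbers alongside the overall figure guards against a judge that looks accurate only because one class dominates the sample.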