How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs
Yapei Chang, Kyle Lo, Mohit Iyyer, Luca Soldaini
TL;DR
How2Everything is a scalable, end-to-end framework for mining real-world how-to procedures from the web and using them to evaluate and improve goal-conditioned, step-by-step procedure generation in LLMs. It combines How2Mine for web-scale data collection, How2Bench for evaluation, How2Score for validity scoring, and How2Judge for scalable judging, together with RL that uses How2Score as a reward. How2Bench reveals clear scaling trends across model sizes and training stages, and RL with How2Score rewards improves How2Bench performance by more than 10 points across three models, with robust cross-judge performance and no systematic regressions on standard benchmarks. Overall, the work shows that the web data used for pretraining can also seed a closed loop of evaluation and improvement for a core user-facing LLM capability: reliable, executable how-to procedures.
Abstract
Generating step-by-step "how-to" procedures is a key LLM capability: how-to advice is commonly requested in chatbots, and step-by-step planning is critical for reasoning over complex tasks. Yet, measuring and improving procedural validity at scale on real-world tasks remains challenging and understudied. To address this, we introduce How2Everything, a scalable framework to evaluate and improve goal-conditioned procedure generation. Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics and readily scales to larger corpora. From this pool we build How2Bench, a 7K-example evaluation set balanced across topics. To reliably score model outputs, we develop How2Score, an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal. For low-cost, reproducible evaluation, we distill a frontier model into an open 8B model, achieving 80.5% agreement with human annotators. How2Bench reveals clear scaling trends across model sizes and training stages, providing signal early in pretraining. Finally, RL using How2Score as a reward improves performance on How2Bench by >10 points across three models without systematic regressions on standard benchmarks, with gains robust to superficial source-document memorization or format compliance. Taken together, How2Everything shows how pretraining web data can support a closed loop of capability evaluation and improvement at scale.
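As a rough illustration of the two quantities mentioned above, the sketch below computes a benchmark-level score from binary judge verdicts (whether a generation contains any critical failure) and a simple judge-human agreement rate. The function names, the boolean verdict encoding, and the plain averaging are assumptions made for illustration; the abstract does not specify the exact aggregation used by How2Score.

```python
from typing import Sequence


def pass_rate(has_critical_failure: Sequence[bool]) -> float:
    """Fraction of generations judged free of any critical failure.

    Hypothetical aggregation: each verdict is True when the judge found
    at least one critical failure in the generated procedure.
    """
    if not has_critical_failure:
        raise ValueError("need at least one verdict")
    passed = sum(not failed for failed in has_critical_failure)
    return passed / len(has_critical_failure)


def agreement(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of examples where judge and human verdicts match."""
    if len(judge) != len(human) or not judge:
        raise ValueError("verdict lists must be non-empty and equal length")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Toy verdicts, invented for this example only.
judge_verdicts = [False, True, False, False]
human_verdicts = [False, True, True, False]

print(pass_rate(judge_verdicts))                  # 0.75
print(agreement(judge_verdicts, human_verdicts))  # 0.75
```

Under these assumptions, the same per-example pass/fail signal could in principle serve as an RL reward, though the reward design actually used is not described in the abstract.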
