FM SO.P: A Progressive Task Mixture Framework with Automatic Evaluation for Cross-Domain SOP Understanding

Siyuan Huang, Ziyu Wang, Chao Pan, Han Zhao

TL;DR

FM SO.P addresses cross-domain SOP understanding by decoupling procedural reasoning into three progressive task types and pairing them with an automatic, domain-adaptive evaluation system. Training proceeds in stages over contrastive data for concept disambiguation, sequential action understanding, and graph-based conditional reasoning, with each stage retaining the data of earlier stages to support stability and transfer. A three-agent evaluation system generates domain-specific rubrics, builds stratified test sets, and scores outputs against those rubrics, which supports deployment across diverse SOP domains. On SOPBench, the 32B model achieves a 48.3% pass rate, surpassing the Qwen-2.5-72B-Instruct baseline (34.4%), while the 7B model reaches 34.3%, on par with that baseline at roughly one tenth of the parameters.

Abstract

Standard Operating Procedures (SOPs) are critical for enterprise operations, yet existing language models struggle with SOP understanding and cross-domain generalization. Current methods fail because joint training cannot differentiate between the reasoning capabilities that SOPs require: terminology precision, sequential ordering, and constraint reasoning. We propose FM SO.P, which addresses these challenges through two contributions. First, we introduce progressive task mixtures that build capabilities in stages across three task types with cumulative data: concept disambiguation for terminology precision, action sequence understanding for procedural correctness, and scenario-aware graph reasoning for conditional logic. Second, we propose an automatic multi-agent evaluation system in which three agents generate rubrics, build stratified test sets, and perform rubric-based scoring, adapting to each domain (e.g., temporal constraints for DMV, regulatory compliance for banking). Evaluated on SOPBench across seven domains (Bank, DMV, Healthcare, Market, University, Library, Hotel), FM SO.P achieves a 48.3% pass rate with our 32B model and 34.3% with our open-source 7B model, matching the Qwen-2.5-72B-Instruct baseline (34.4%) with 10x fewer parameters.
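The "cumulative data" idea in the abstract can be illustrated with a minimal sketch: the training set for stage k is the union of the data from stages 1 through k, as the Figure 1 caption states. The stage identifiers, the `cumulative_mixture` helper, and the toy samples below are hypothetical names chosen for illustration; they are not taken from the paper's code.

```python
from typing import Dict, List

# Hypothetical identifiers for the three task types, in training order.
STAGES = ["concept_disambiguation", "action_sequence", "graph_reasoning"]


def cumulative_mixture(stage_data: Dict[str, List[str]], k: int) -> List[str]:
    """Return the training set for stage k (1-indexed): data of stages 1..k."""
    if not 1 <= k <= len(STAGES):
        raise ValueError(f"k must be between 1 and {len(STAGES)}, got {k}")
    mixture: List[str] = []
    for name in STAGES[:k]:
        # Earlier stages are retained rather than replaced.
        mixture.extend(stage_data[name])
    return mixture


# Toy placeholder samples (assumed, not real SOPBench data).
stage_data = {
    "concept_disambiguation": ["c1", "c2"],
    "action_sequence": ["a1"],
    "graph_reasoning": ["g1", "g2", "g3"],
}

for k in range(1, len(STAGES) + 1):
    print(k, cumulative_mixture(stage_data, k))
```

In this sketch the stage 3 mixture contains every sample from stages 1 and 2, which is the property the framework relies on for stability across stages.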

Paper Structure (36 sections, 9 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Progressive Task Mixture Architecture. Our three-stage framework builds SOP understanding through cumulative data. Stage 1 focuses on concept disambiguation using term substitution. Stage 2 adds action sequence understanding through procedural error injection (reordering, omission, insertion). Stage 3 incorporates scenario-aware graph reasoning with constraint violations (cycles, preconditions, invalid edges). Each stage retains all previous data ($\mathcal{D}^{(k)}_{\text{train}} = \bigcup_{i=1}^{k} \mathcal{D}_i$) and optimizes stage-specific contrastive losses, giving the overall objective $\min_{\theta}[\alpha_1\mathcal{L}_1 + \alpha_2\mathcal{L}_2 + \alpha_3\mathcal{L}_3]$.
  • Figure 2: Automatic Multi-Agent Evaluation System. Agent 1 analyzes the SOP corpus to generate domain-adaptive rubrics $\mathcal{R}(\mathcal{D}) = \{(r_i, w_i)\}$ with dimensions and weights. Agent 2 creates a stratified test set $\mathcal{B}$ across complexity levels, question types, and rubric coverage. Agent 3 computes multi-dimensional scores $\mathbf{e} = [e_1, \ldots, e_{|\mathcal{D}|}]^\top$ and aggregates them as $e_{\text{final}} = \sum_i w_i \cdot e_i$, adapting to domain requirements.
  • Figure 3: Borda counts for models across domains. Higher scores are better. FM SO.P variants trained with the progressive task mixture achieve superior quality across all model sizes.
  • Figure 4: Negative sample ratio ablation across task mixture stages. Pass Rate (%) with varying positive:negative ratios.
  • Figure 5: Domain-Adaptive Rubric Analysis. Radar plots showing FM SO.P performance across generated rubrics for seven domains. Each domain has unique rubric dimensions (e.g., Banking: security protocol and financial accuracy; DMV: temporal reasoning and eligibility). Blue (7B), orange (14B), and green (32B) show scaling effects across model sizes.
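The aggregation step in the Figure 2 caption, $e_{\text{final}} = \sum_i w_i \cdot e_i$, can be sketched as a weighted sum over rubric dimensions. The dimension names follow the DMV example in the Figure 5 caption, but the weights, the per-dimension scores, and the `aggregate_score` helper are assumptions made for illustration only; the paper's agents generate the actual rubrics and scores.

```python
from typing import Dict


def aggregate_score(weights: Dict[str, float], scores: Dict[str, float]) -> float:
    """Compute e_final = sum_i w_i * e_i over matching rubric dimensions."""
    if set(weights) != set(scores):
        raise ValueError("weights and scores must cover the same rubric dimensions")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical DMV rubric: weights are assumed to sum to 1, which the
# source does not state explicitly.
dmv_weights = {"temporal_reasoning": 0.6, "eligibility": 0.4}
# Hypothetical per-dimension scores in [0, 1] for one model answer.
dmv_scores = {"temporal_reasoning": 0.5, "eligibility": 1.0}

e_final = aggregate_score(dmv_weights, dmv_scores)
print(f"e_final = {e_final:.2f}")
```

Because each domain gets its own rubric dimensions and weights, the same answer quality can yield different final scores in different domains, which is the sense in which the evaluation is domain-adaptive.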