
Humains-Junior: A 3.8B Language Model Achieving GPT-4o-Level Factual Accuracy by Directed Exoskeleton Reasoning

Nissan Yaron, Dan Bystritsky, Ben-Etzion Yaron

TL;DR

This work tackles factual grounding in language models by showing that a small 3.8B model (Humains-Junior) can achieve GPT-4o-level grounding on the FACTS benchmark within a practical margin of $\pm 5$ pp while delivering substantial cost savings (~19× cheaper on cloud) and near-zero marginal cost on edge hardware. The core idea, Exoskeleton Reasoning, combines a minimal directed validation scaffold with behavioral fine-tuning to enforce epistemic discipline, enabling reliable multi-step reasoning without requiring frontier-scale models. Across extensive ablations and multi-tier evaluations, the authors demonstrate a synergistic interaction where fine-tuning enables scaffold use, yielding up to $+17.7$ pp gains, and show substantial variance reduction for production-grade reliability. The results imply that directed reasoning and meta-cognitive scaffolds can deliver frontier-level factual reliability at far lower computational and financial costs, with practical deployment guidance and open-source reproducibility artifacts to accelerate adoption and further research.

Abstract

We introduce Humains-Junior, a 3.8B model that matches GPT-4o on the FACTS Grounding public subset within a $\pm 5$ pp equivalence margin.

Results. On Q1--Q500 under identical judges, GPT-4o scores 73.5% (95% CI 69.5--77.2) and Humains-Junior 72.7% (95% CI 68.7--76.5); the paired difference is 0.8 pp (bootstrap 95% CI $-3.1$ to $+4.7$; permutation $p = 0.72$; Cohen's $d = 0.023$). TOST establishes equivalence at $\pm 5$ pp (not at $\pm 3$ pp). When purchased as managed APIs, Humains-Junior's base model (Phi-3.5-mini-instruct) is $\approx 19\times$ less expensive than GPT-4o on Microsoft AI Foundry pricing; self-hosted or edge deployments can drive incremental inference cost toward zero. Measured vs. estimated pricing sources are tabulated in Appendix E.

Method. Our approach combines minimal directed "Exoskeleton Reasoning" scaffolds with behavioral fine-tuning that teaches protocol compliance (epistemic discipline) rather than domain answers. Fine-tuning alone adds little; combined, they synergize (+17.7 pp, $p < 0.001$) and reduce variance ($\approx 25\%$). In prompt-only settings on frontier models (Q1--Q100; non-comparable), directed reasoning improved GPT-4o by +11.8 pp to 85.3% and Gemini-2.5-Pro by +5.0 pp to 93.3% (baseline 88.3%, $n = 100$); see Section 5.

TL;DR. A 3.8B model achieves GPT-4o-level FACTS accuracy (equivalent within $\pm 5$ pp on Q1--Q500). Cloud pricing shows $\approx 19\times$ lower cost versus GPT-4o, and self-hosted/edge deployments can approach zero marginal cost. Pricing sources are listed in Appendix E. Frontier prompt-only gains (Q1--Q100; non-comparable) and optimized-prompt exploratory results under earlier judges are summarized in Appendix F.

Keywords: Small Language Models, Factual Grounding, Directed Reasoning, Fine-Tuning, Model Alignment, Cost-Efficient AI
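
The equivalence claim in the abstract rests on a paired difference whose interval falls inside $\pm 5$ pp but not inside $\pm 3$ pp. The sketch below illustrates that logic, not the authors' analysis code: it computes a percentile bootstrap interval over per-item score differences and then checks whether the interval lies inside a margin. The function names (`paired_bootstrap_ci`, `equivalent_within`), the resampling settings, and the synthetic scores are all assumptions made for illustration. The interval-inclusion check is a simple stand-in for TOST, which is conventionally read off a 90% interval at $\alpha = 0.05$, so applying it to a 95% interval is the more conservative choice.

```python
import random


def paired_bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b) over paired items.

    `a` and `b` are per-item scores for two systems on the same questions.
    Returns (observed mean difference, lower bound, upper bound).
    """
    rng = random.Random(seed)
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    boot_means = []
    for _ in range(n_boot):
        # Resample question indices with replacement, keeping pairs intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot_means.append(sum(sample) / n)
    boot_means.sort()
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lo, hi


def equivalent_within(lo, hi, margin):
    """True when the whole interval [lo, hi] lies strictly inside +/- margin."""
    return -margin < lo and hi < margin


# Synthetic per-item scores (hypothetical, not the paper's data).
scores_a = [1.0 if i % 4 else 0.0 for i in range(200)]  # 75% accuracy
scores_b = [1.0 if i % 5 else 0.0 for i in range(200)]  # 80% accuracy
diff, lo, hi = paired_bootstrap_ci(scores_a, scores_b)

# Applying the margin check to the interval reported in the abstract (in pp).
print(equivalent_within(-3.1, 4.7, 5.0))  # interval fits inside +/- 5 pp
print(equivalent_within(-3.1, 4.7, 3.0))  # upper bound 4.7 exceeds 3 pp
```

With the reported interval of $-3.1$ to $+4.7$ pp, the check passes at $\pm 5$ pp and fails at $\pm 3$ pp because the upper bound exceeds 3, which matches the abstract's statement.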

Paper Structure

This paper contains 104 sections, 1 equation, 3 figures.

Figures (3)

  • Figure 1: Baseline vs. Exoskeleton performance comparison across model families on FACTS Grounding.
  • Figure 2: Exoskeleton Reasoning vs. Standard Prompting Architecture.
  • Figure 3: Progressive Performance Across All Models.