ReaderLM-v2: Small Language Model for HTML to Markdown and JSON
Feng Wang, Zesheng Shi, Bo Wang, Nan Wang, Han Xiao
TL;DR
This paper presents ReaderLM-v2, a compact 1.5B-parameter model trained to extract structured content from long HTML documents into Markdown or JSON. It introduces a novel three-stage Draft-Refine-Critique data synthesis pipeline and a four-stage training regimen (continued pretraining, supervised fine-tuning, direct preference optimization, and self-play iterative tuning) to handle long-context inputs of up to 512K tokens. Empirical results show that ReaderLM-v2 matches or outperforms larger models on HTML-to-Markdown conversion and performs robustly on JSON extraction under constrained resources, which supports the viability of small LMs for structured web content tasks. The work emphasizes long-context processing, quality control of synthetic data, and multi-stage optimization. The model is publicly available on Hugging Face for further research and deployment.
Abstract
We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy, making it an ideal tool for grounding large language models. The model's effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high-quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Extensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.
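
To make the draft, refine, and critique steps of the data synthesis pipeline more concrete, the following is a minimal single-pass sketch, not the authors' implementation. The function `draft_refine_critique`, the `Sample` type, and the three toy stage functions are all hypothetical names introduced for illustration. In the actual pipeline each stage would presumably be driven by a language model and may be repeated; here they are simple regex-based stand-ins so the control flow can run on its own.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Sample:
    """One synthetic training pair: raw HTML and its extracted text."""
    html: str
    output: str


def draft_refine_critique(
    html: str,
    draft: Callable[[str], str],
    refine: Callable[[str, str], str],
    critique: Callable[[str, str], bool],
) -> Optional[Sample]:
    """Run one draft -> refine -> critique pass (hypothetical helper).

    Returns a Sample if the critique stage accepts the refined output,
    otherwise None so the caller can discard the example.
    """
    first_pass = draft(html)             # Draft: initial conversion of the HTML
    improved = refine(html, first_pass)  # Refine: clean up the draft against the source
    if critique(html, improved):         # Critique: accept or reject the result
        return Sample(html=html, output=improved)
    return None


# Toy stand-ins for the three stages (assumptions, not the paper's models).
def toy_draft(html: str) -> str:
    # Replace every tag with a space, leaving only text content.
    return re.sub(r"<[^>]+>", " ", html)


def toy_refine(html: str, text: str) -> str:
    # Collapse runs of whitespace left behind by tag removal.
    return " ".join(text.split())


def toy_critique(html: str, text: str) -> bool:
    # Accept only non-empty output with no leftover markup.
    return bool(text) and "<" not in text


accepted = draft_refine_critique(
    "<p>Hello <b>world</b></p>", toy_draft, toy_refine, toy_critique
)
rejected = draft_refine_critique("<br>", toy_draft, toy_refine, toy_critique)
```

In this sketch `accepted` holds a sample whose output is the cleaned text, while `rejected` is `None` because the critique stage filters out the empty extraction. Rejecting samples at the critique step is one simple way to express the quality control the pipeline is meant to provide.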
