ReaderLM-v2: Small Language Model for HTML to Markdown and JSON

Feng Wang, Zesheng Shi, Bo Wang, Nan Wang, Han Xiao

TL;DR

This paper presents ReaderLM-v2, a compact 1.5B-parameter model trained to extract structured content from long HTML documents into Markdown or JSON. It introduces a three-stage Draft-Refine-Critique data synthesis pipeline and a four-stage training regimen (continued pretraining, supervised fine-tuning, direct preference optimization, and self-play iterative tuning) to handle long-context inputs of up to 512K tokens. Empirical results show that ReaderLM-v2 is competitive with or superior to larger models on HTML-to-Markdown conversion and on JSON extraction under constrained resources, highlighting the viability of small LMs for structured web content tasks. The work emphasizes long-context processing, synthetic data quality control, and multi-stage optimization, and the model is publicly available on Hugging Face for further research and deployment.

Abstract

We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy -- making it an ideal tool for grounding large language models. The model's effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high-quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Intensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.

Paper Structure

This paper contains 19 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The iterative training process of ReaderLM-v2 (https://huggingface.co/jinaai/ReaderLM-v2). Our approach combines (a) a novel three-stage data synthesis pipeline (Draft-Refine-Critique) that generates high-quality training data, with (b) a comprehensive training strategy incorporating continuous pre-training, supervised fine-tuning, direct preference optimization, and self-play iterative tuning. The iterative nature of both components allows for continuous model improvement through cycles of data generation and model refinement.
  • Figure 2: Dataset statistics of WebMarkdown-1M.
  • Figure 3: The three-step data synthesis pipeline for ReaderLM-v2 (https://huggingface.co/jinaai/ReaderLM-v2).
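
The Draft-Refine-Critique pipeline named in the figure captions can be illustrated with a minimal sketch. This is not the authors' implementation: the text above only states that data is produced by drafting, refining, and critiquing, so the function names `draft`, `refine`, `critique`, and `synthesize`, the `Sample` type, and the rule-based logic inside each step are all hypothetical stand-ins (a real pipeline would presumably call a language model at each stage). The sketch only shows the control flow, in which an extraction is drafted, cleaned up, and then accepted or rejected.

```python
import re
from dataclasses import dataclass


@dataclass
class Sample:
    """One synthetic training example (hypothetical structure)."""
    html: str
    markdown: str
    accepted: bool


def draft(html: str) -> str:
    """Hypothetical draft step: a crude first-pass HTML-to-Markdown conversion."""
    # Turn <h1>...</h1> into a Markdown heading, then replace every other tag with a space.
    text = re.sub(r"<h1[^>]*>(.*?)</h1>", r"# \1\n", html, flags=re.S)
    return re.sub(r"<[^>]+>", " ", text)


def refine(markdown: str) -> str:
    """Hypothetical refine step: collapse whitespace and drop empty lines."""
    lines = [" ".join(line.split()) for line in markdown.splitlines()]
    return "\n".join(line for line in lines if line)


def critique(html: str, markdown: str) -> bool:
    """Hypothetical critique step: accept only non-empty output with no leftover tags.

    The html argument is unused here; it is kept so a real critic could
    compare the output against its source.
    """
    no_tags_left = re.search(r"<[^>]+>", markdown) is None
    return bool(markdown.strip()) and no_tags_left


def synthesize(html: str) -> Sample:
    """Run draft -> refine -> critique and record the verdict."""
    markdown = refine(draft(html))
    return Sample(html=html, markdown=markdown, accepted=critique(html, markdown))


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><p>Hello   <b>world</b></p></body></html>"
    sample = synthesize(page)
    print(sample.markdown)
    print("accepted:", sample.accepted)
```

In a setup like this, accepted samples could feed supervised fine-tuning while rejected ones could serve as negatives for preference optimization; that routing is an assumption about how such a pipeline might be used, not a detail taken from the text above.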