2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, Lidong Bing

TL;DR

This work introduces a multimodal textbook assembled from 2.5 years of instructional videos to pretrain vision–language models with richer foundational knowledge and tighter image–text–logic alignment. A knowledge-taxonomy-guided collection and a multi-level video-to-textbook pipeline (ASR, OCR, keyframes, and chronological interleaving) yield 6.5M keyframes and 0.75B text tokens across 75K videos. Pretraining with this textbook improves performance on knowledge- and reasoning-centric benchmarks (e.g., MathVista, ScienceQA) and enhances in-context learning by leveraging coherent interleaved context. Ablation studies confirm the value of ASR refinement, OCR, and SSIM-based keyframe extraction, and demonstrate the importance of maintaining temporal image–text coherence for effective multimodal pretraining.

Abstract

Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally, as humans do. However, existing interleaved datasets are crawled from webpages and suffer from low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts vast numbers of instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual (OCR) knowledge from the videos, and organize it into an image-text interleaved corpus based on temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly on knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving. Our code is available at https://github.com/DAMO-NLP-SG/multimodal_textbook.
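To make the "organize by temporal order" step concrete, the following is a minimal sketch of interleaving keyframes with ASR and OCR text by timestamp. It is not the authors' released code: the `Keyframe` and `TextSegment` types, the `interleave` function, the file paths, and the rule that an image precedes text at an equal timestamp are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Keyframe:
    """Hypothetical record for one extracted keyframe."""
    time_s: float
    path: str


@dataclass(frozen=True)
class TextSegment:
    """Hypothetical record for one ASR or OCR text span."""
    time_s: float
    text: str
    source: str  # "asr" or "ocr"


Item = Union[Keyframe, TextSegment]


def interleave(keyframes: List[Keyframe], segments: List[TextSegment]) -> List[Item]:
    """Merge images and text into one sequence ordered by timestamp.

    Assumption: when an image and a text span share a timestamp,
    the image is placed first (False sorts before True).
    """
    items: List[Item] = [*keyframes, *segments]
    return sorted(items, key=lambda it: (it.time_s, isinstance(it, TextSegment)))


keyframes = [Keyframe(0.0, "clip0/frame_000.jpg"), Keyframe(12.5, "clip0/frame_001.jpg")]
segments = [
    TextSegment(3.0, "Today we look at the water cycle.", "asr"),
    TextSegment(12.5, "Evaporation -> Condensation", "ocr"),
    TextSegment(20.0, "Water vapor rises and cools.", "asr"),
]
sample = interleave(keyframes, segments)
order = [type(it).__name__ for it in sample]
```

Here `order` lists the item types in the sequence a model would see them, so each text span follows the keyframe that was on screen when it was spoken or displayed.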

Paper Structure (28 sections, 2 equations, 10 figures, 7 tables, 1 algorithm)

Figures (10)

  • Figure 1: Previous interleaved datasets, e.g., MMC4 and OBELICS, suffer from limitations like weak text-image relations, low knowledge density, and incoherent image sequences. Our multimodal textbook, sourced from massive tutorial videos, employs coarse-to-fine knowledge extraction and multi-level filtering to create a high-quality, textbook-level dataset. It interleaves video keyframes with tutorial texts (extracted from ASR and OCR), enabling VLMs to acquire rich knowledge through tightly coupled text and images and more coherent logic.
  • Figure 2: An illustration of constructing a multimodal textbook from instructional videos. We first instruct LLMs to construct a knowledge taxonomy, then retrieve and filter videos at the metadata level, collecting 159K instructional videos. A video-to-textbook pipeline then performs multi-level knowledge extraction. ① We filter out non-instructional videos using ASR transcripts, retaining 75K high-quality videos. ② We use ASR timestamps to segment long videos into short clips, discarding those with misaligned visuals and ASR. ③ We detect keyframes in each clip and extract text and symbols by OCR. Our pipeline produces 6.5M keyframes, 259M ASR tokens, and 500M OCR tokens and organizes them into an image-text interleaved textbook.
  • Figure 3: We randomly select 20%, 50%, and 100% of the samples from each dataset and shuffle the image order within each selected sample. The datasets with shuffled images are also used for pretraining. Accuracy denotes the average over seven benchmarks.
  • Figure 4: Top: We plot six subjects along with their corresponding sub-courses. Due to space constraints, we visualize only the courses with the highest proportions. Bottom: We count the distribution of knowledge points belonging to each subject and its courses.
  • Figure 5: A case presented in our textbook illustrates the water cycle within the domain of earth science.
  • ...and 5 more figures
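
The keyframe detection step in Figure 2 relies on SSIM according to the TL;DR. The following is a rough sketch of how such a selection could work, not the authors' implementation. It makes several assumptions: SSIM is computed over the whole frame as a single window (standard SSIM averages over local windows), frames are flat lists of grayscale pixel values, each frame is compared against the last kept keyframe, and the `threshold` of 0.9 is an arbitrary illustrative value.

```python
from statistics import fmean
from typing import List, Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], dynamic_range: float = 255.0) -> float:
    """Single-window SSIM between two equal-size flat grayscale frames."""
    if len(x) != len(y) or not x:
        raise ValueError("frames must be non-empty and the same size")
    # Standard stabilizing constants from the SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean((a - mu_x) ** 2 for a in x)
    var_y = fmean((b - mu_y) ** 2 for b in y)
    cov = fmean((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


def select_keyframes(frames: List[Sequence[float]], threshold: float = 0.9) -> List[int]:
    """Return indices of frames that differ enough from the last kept keyframe.

    Assumption: a frame becomes a new keyframe when its SSIM against the
    previous keyframe falls below `threshold` (a hypothetical value).
    """
    if not frames:
        return []
    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        if global_ssim(frames[kept[-1]], frames[i]) < threshold:
            kept.append(i)
    return kept


# Toy 4x4 frames: two black frames, two white frames, then black again.
black, white = [0.0] * 16, [255.0] * 16
indices = select_keyframes([black, black, white, white, black])
```

Comparing against the last kept keyframe rather than the immediately preceding frame is one design choice among several; it avoids missing slow, gradual changes such as a slide being written on line by line.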