Self-Foveate: Enhancing Diversity and Difficulty of Synthesized Instructions from Unsupervised Text via Multi-Level Foveation

Mingzhe Li, Xin Lu, Yanyan Zhao

TL;DR

Self-Foveate introduces a multi-level foveation framework (micro, scatter, macro) for synthesizing instructions from unsupervised text, augmented by a re-synthesis module that improves fidelity to the source text and overall quality. By explicitly extracting fine-grained details, cross-entity relationships, and holistic rhetorical patterns, the method generates instructions that are both more diverse and more difficult than those produced by existing baselines. Experiments across multiple datasets and base models show consistent gains in diversity, difficulty, and downstream task performance, and ablations support the contribution of each foveation level and of the re-synthesis module. The work demonstrates a scalable, automated approach to producing high-quality instruction data from unlabeled text and provides code for reproduction.

Abstract

Synthesizing high-quality instruction data from unsupervised text is a promising paradigm for training large language models (LLMs), yet automated methods for this task still exhibit significant limitations in the diversity and difficulty of synthesized instructions. To address these challenges, we propose Self-Foveate, an LLM-driven method for instruction synthesis. Inspired by hierarchical human visual perception, Self-Foveate introduces a "Micro-Scatter-Macro" multi-level foveation methodology that guides the extraction of textual information at three complementary granularities, from fine-grained details through cross-region connections to holistic patterns, thereby enhancing both the diversity and difficulty of synthesized instructions. Furthermore, a re-synthesis module is incorporated to improve the fidelity of instructions to source text and their overall quality. Comprehensive experiments across multiple unsupervised corpora and diverse model architectures demonstrate that Self-Foveate consistently outperforms existing methods. We publicly release our code at https://github.com/Mubuky/Self-Foveate
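
As a rough illustration of the three granularities described above, the following Python sketch shows one way a Micro-Scatter-Macro pass could be organized. It is not the released implementation: the `LLM` callable, the prompt wording, the `Foveation` record, and the choice to form scatter-level groups as combinations of micro-level elements are all assumptions made for illustration, and the re-synthesis module is omitted.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Dict, List

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class Foveation:
    level: str  # "micro", "scatter", or "macro"
    focus: str  # the extracted detail an instruction should target


def _lines(completion: str) -> List[str]:
    """Split a completion into non-empty, stripped lines."""
    return [ln.strip() for ln in completion.splitlines() if ln.strip()]


def foveate(text: str, llm: LLM, group_size: int = 2) -> List[Foveation]:
    # Micro: fine-grained details ("foveate elements"), one per line.
    elements = _lines(llm(f"List key details, one per line:\n{text}"))
    # Scatter: cross-region connections ("foveate groups").
    # Assumption: groups are simply combinations of micro-level elements.
    groups = [" + ".join(g) for g in combinations(elements, group_size)]
    # Macro: holistic patterns ("foveate segments"), one per line.
    segments = _lines(llm(f"Summarize overall patterns, one per line:\n{text}"))
    return (
        [Foveation("micro", e) for e in elements]
        + [Foveation("scatter", g) for g in groups]
        + [Foveation("macro", s) for s in segments]
    )


def synthesize(text: str, llm: LLM) -> List[Dict[str, str]]:
    """Produce one candidate instruction per extracted focus."""
    rows = []
    for fov in foveate(text, llm):
        prompt = f"Write one question about [{fov.focus}] answerable from:\n{text}"
        rows.append(
            {"level": fov.level, "focus": fov.focus, "instruction": llm(prompt)}
        )
    return rows
```

Because the model is passed in as a plain callable, the sketch can be exercised with a stub function before any real LLM client is wired in.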

Paper Structure

This paper contains 72 sections, 3 equations, 3 figures, 12 tables, 3 algorithms.

Figures (3)

  • Figure 1: Illustration of (a) Self-Foveate in contrast with (b) Baseline Self-QA. For Self-Foveate, the multi-level foveation enables the LLM to extract details (highlighted in distinct colors) of the text, subsequently synthesizing instructions with diversity and difficulty via distinct synthesis paradigms. In comparison, Self-QA employs single-step generation that produces simple and monotonous instruction candidates.
  • Figure 2: The Self-Foveate workflow is designed for instruction synthesis based on unsupervised text. Self-Foveate takes unsupervised text as input, extracts foveate elements, foveate groups, and foveate segments, then synthesizes instruction tuning data through these extracted details.
  • Figure 3: Impact of instruction set scale on model fine-tuning performance for Self-Foveate and baselines: (a) Recall and (b) LLM Accuracy.