Self-Foveate: Enhancing Diversity and Difficulty of Synthesized Instructions from Unsupervised Text via Multi-Level Foveation

Mingzhe Li; Xin Lu; Yanyan Zhao

TL;DR

Self-Foveate is a multi-level foveation framework (micro, scatter, macro) for synthesizing instructions from unsupervised text, paired with a re-synthesis module that improves fidelity to the source text and overall quality. By explicitly extracting fine-grained details, cross-entity relationships, and holistic rhetorical patterns, the method generates instructions that are more diverse and more difficult than those produced by existing baselines. Experiments across multiple datasets and base models show consistent gains in diversity, difficulty, and downstream task performance, and ablations support the contribution of each component, including the re-synthesis module. The approach is automated and scalable, and the code is released for reproduction.

Abstract

Synthesizing high-quality instruction data from unsupervised text is a promising paradigm for training large language models (LLMs), yet automated methods for this task still exhibit significant limitations in the diversity and difficulty of synthesized instructions. To address these challenges, we propose Self-Foveate, an LLM-driven method for instruction synthesis. Inspired by hierarchical human visual perception, Self-Foveate introduces a "Micro-Scatter-Macro" multi-level foveation methodology that guides the extraction of textual information at three complementary granularities, from fine-grained details through cross-region connections to holistic patterns, thereby enhancing both the diversity and difficulty of synthesized instructions. Furthermore, a re-synthesis module is incorporated to improve the fidelity of instructions to source text and their overall quality. Comprehensive experiments across multiple unsupervised corpora and diverse model architectures demonstrate that Self-Foveate consistently outperforms existing methods. We publicly release our code at https://github.com/Mubuky/Self-Foveate
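
To make the pipeline described in the abstract more concrete, the following is a minimal sketch of how a three-level extraction followed by a re-synthesis pass might be wired together. It is not the authors' implementation (see the released repository for that). The prompt wording, the `Instruction` record, the function names `foveate` and `synthesize`, and the `llm` callable (any function mapping a prompt string to a reply string) are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical prompt templates, one per foveation level.
# The actual prompts used by Self-Foveate are not given in this text.
LEVEL_PROMPTS: Dict[str, str] = {
    "micro": "List fine-grained details in the text, one per line:\n{text}",
    "scatter": "List connections between separate regions of the text, one per line:\n{text}",
    "macro": "Describe holistic patterns of the text, one per line:\n{text}",
}


@dataclass
class Instruction:
    level: str     # which foveation level produced the focus point
    focus: str     # the extracted focus point
    question: str  # the instruction after the re-synthesis pass


def foveate(text: str, llm: Callable[[str], str]) -> Dict[str, List[str]]:
    """Collect focus points per level, one per non-empty line of the reply."""
    points: Dict[str, List[str]] = {}
    for level, template in LEVEL_PROMPTS.items():
        reply = llm(template.format(text=text))
        points[level] = [ln.strip() for ln in reply.splitlines() if ln.strip()]
    return points


def synthesize(text: str, llm: Callable[[str], str]) -> List[Instruction]:
    """Draft one instruction per focus point, then revise it against the source."""
    results: List[Instruction] = []
    for level, focuses in foveate(text, llm).items():
        for focus in focuses:
            draft = llm(
                f"Write one question about '{focus}' answerable from:\n{text}"
            )
            # Assumed form of re-synthesis: ask the model to revise the draft
            # so that it stays faithful to the source text.
            final = llm(
                "Revise this question so it stays faithful to the text.\n"
                f"Question: {draft}\nText:\n{text}"
            )
            results.append(Instruction(level, focus, final.strip()))
    return results
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client; in practice `llm` would wrap whichever API or local model is being used.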
