PRefLexOR: Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning and Agentic Thinking

Markus J. Buehler

TL;DR

This work introduces PRefLexOR (Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning), a framework that integrates preference optimization with reinforcement learning concepts for self-improving scientific reasoning, and finds that even small models self-teach deeper reasoning, solving open-domain problems effectively.

Abstract

PRefLexOR (Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning) combines preference optimization with concepts from Reinforcement Learning to enable models to self-teach through iterative reasoning improvements. We propose a recursive learning approach that engages the model in multi-step reasoning, revisiting, and refining intermediate steps before producing a final output in training and inference phases. Through multiple training stages, the model first learns to align its reasoning with accurate decision paths by optimizing the log odds between preferred and non-preferred responses. During this process, PRefLexOR builds a dynamic knowledge graph by generating questions from random text chunks and retrieval-augmentation to contextualize relevant details from the entire training corpus. In the second stage, preference optimization enhances model performance by using rejection sampling to fine-tune reasoning quality by continually producing in-situ training data while masking the reasoning steps. Recursive optimization within a thinking token framework introduces iterative feedback loops, where the model refines reasoning, achieving deeper coherence, consistency, and adaptability. Using small language models with only 3 billion parameters, we show that even tiny models can iteratively teach themselves to reason with greater depth and reflectivity. Our implementation is straightforward and can be incorporated into any existing pretrained LLM. We focus our examples on applications in biological materials science and demonstrate the method in a variety of case studies that range from in-domain to cross-domain applications. Using reasoning strategies that include thinking and reflection modalities, we build a multi-agent recursive self-improving inference approach to successively improve responses via repeated sampling at inference time.
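
The phrase "optimizing the log odds between preferred and non-preferred responses" can be illustrated with a small numerical sketch. The abstract does not spell out the exact objective, so the form below (an odds-ratio penalty of the kind used in ORPO-style preference losses) is an assumption. The function names `log_odds` and `odds_ratio_penalty` are hypothetical, and the inputs stand in for length-averaged log-probabilities that a model would assign to each response.

```python
import math


def log_odds(logp: float) -> float:
    """Return log(p / (1 - p)) given log p, for 0 < p < 1."""
    return logp - math.log1p(-math.exp(logp))


def odds_ratio_penalty(logp_chosen: float, logp_rejected: float) -> float:
    """Assumed odds-ratio term: -log(sigmoid(log-odds gap)).

    The gap is log_odds(chosen) - log_odds(rejected). The penalty shrinks
    as the preferred response becomes more likely than the rejected one.
    """
    z = log_odds(logp_chosen) - log_odds(logp_rejected)
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Illustrative values only: the preferred response has probability 0.8,
# the rejected response has probability 0.2.
penalty = odds_ratio_penalty(math.log(0.8), math.log(0.2))
```

When both responses are equally likely the penalty equals log 2, and it falls below that value once the model favors the preferred response. In a full training loss this term would typically be added to a standard language-modeling loss on the preferred response, which is again an assumption rather than something stated here.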

Paper Structure

This paper contains 43 sections, 10 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Illustration of the workflow and design principles behind generative materials informatics. Panel a: The process of transforming information into knowledge and actionable outcomes. Each individual piece of information (left) is synthesized into a network of interconnected knowledge, leading to informed decisions and innovative designs (right). Panel b: Conventional approaches in materials science rely on data-driven models, partial differential equations (PDEs), and experimental results, focusing on single-step predictions. Panel c: In contrast, generative materials informatics models built on the PRefLexOR framework proposed in this paper use "thinking" and "reflection" explicitly by incorporating iterative reasoning and contextual understanding, allowing for more complex, multi-step predictions. This approach expands from single inference steps, includes multiple modalities of data and responses, integrates real-world feedback and physics, and leverages self-assessment and self-learning. Using reinforcement learning (RL) principles, the discovery of principles or the solution of specific tasks is further inspired by biological paradigms, using bio-inspired neural network designs. These advanced methods support continuous improvement in material predictions, enabling more adaptable and intelligent designs.
  • Figure 2: Strategic Dataset Generation Process with Structured Thought Integration. This figure illustrates a novel approach to generating datasets, where random text chunks are selected from raw data sources (e.g., papers, books, documents, notes, etc.) and used to develop question-answer pairs in a structured and strategic manner. Panel a: The process begins with raw data, such as research papers or books, which is converted into a markup format. This allows the data to be broken down into smaller, manageable text chunks. These chunks form the basis for generating questions in the subsequent steps. Panel b: A random selection of text chunks is used to generate question-answer pairs. This step involves creating a question from the text chunk and deriving an initial answer from the content. However, what distinguishes this approach is the next phase where a structured reasoning process is applied. Panel c: The system incorporates strategic reasoning and reflection, facilitated by the use of special thinking tokens (for instance: <|thinking|> and <|/thinking|>). Within this structured reasoning framework, the system iterates over several steps: Identifying relevant materials and concepts from the text, forming reasoning steps, and generating hypotheses. These processes are crucial to refining and validating the answer. Reflection, reasoning, and hypothesis generation are integrated to ensure that the answers are derived thoughtfully and are not merely surface-level extractions from the text. The thinking and reflection phases add depth to the question-answer generation, making the dataset richer and more valuable for subsequent learning tasks.
  • Figure 3: PRefLexOR: Model development and training strategy overview. The process starts with a pretrained model (here, meta-llama/Llama-3.2-3B-Instruct). Phase 1 focuses on structured thought integration, with on-the-fly dataset generation as input. Phase 2 develops independent reasoning capabilities by first generating a dataset, applying masking, and then proceeding with training. The final result is an aligned model with reasoning capabilities.
  • Figure 4: Training performance using the EXO method across three key metrics, during Independent Reasoning Development. Panel a: The increase in rewards/margins over the course of training, indicating progressive improvement as the model learns. Panel b: The corresponding decrease in loss, showcasing successful convergence and optimization of the model, as reflected in a continuous decline in the loss function. Panel c: Rewards/accuracy during training, demonstrating rapid convergence toward high accuracy early in training, stabilizing after approximately 200 steps, with consistently high performance maintained throughout.
  • Figure 5: Structured Thought and Reflection in Answer Generation. This diagram illustrates the multi-step process of answer generation, incorporating both structured thinking and reflection phases to ensure thoroughness and accuracy. As in the original approach, the process begins with the <|thinking|> phase, where key reasoning steps are identified. This phase involves the following steps: (i) outlining the reasoning steps based on the available data, (ii) referencing relevant materials and concepts that support the reasoning process, (iii) forming hypotheses to guide the conclusion. After the initial thinking process, the system moves to the <|reflection|> phase, where the generated answer is refined. During this phase, improvements and corrections are made to ensure that the final output is accurate and relevant. The combination of these two phases—structured thinking and reflection—results in a robust and refined final answer, which is shown at the bottom of the diagram.
  • ...and 7 more figures
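
The thinking and reflection phases described in the captions for Figures 2 and 5 suggest that a generated response can be separated into its reasoning, its reflection, and its final answer. The following is a minimal sketch of that separation, not the paper's implementation. The `<|thinking|>` and `<|/thinking|>` tokens come from the Figure 2 caption. The closing token `<|/reflection|>` is an assumption made by analogy, and the helper names `_extract` and `split_response` are hypothetical.

```python
import re

THINK_OPEN, THINK_CLOSE = "<|thinking|>", "<|/thinking|>"
# The closing reflection token is assumed by analogy with <|/thinking|>.
REFLECT_OPEN, REFLECT_CLOSE = "<|reflection|>", "<|/reflection|>"


def _extract(text, open_tok, close_tok):
    """Return (first delimited section or None, text with that section removed)."""
    pattern = re.escape(open_tok) + r"(.*?)" + re.escape(close_tok)
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        return None, text
    remainder = text[:match.start()] + text[match.end():]
    return match.group(1).strip(), remainder


def split_response(response):
    """Split a response into thinking, reflection, and final answer parts."""
    thinking, rest = _extract(response, THINK_OPEN, THINK_CLOSE)
    reflection, rest = _extract(rest, REFLECT_OPEN, REFLECT_CLOSE)
    return {"thinking": thinking, "reflection": reflection, "answer": rest.strip()}


# Illustrative input only; the text is made up for this example.
example = (
    "<|thinking|>Identify key concepts, then form a hypothesis.<|/thinking|>"
    "<|reflection|>Check the hypothesis against the source.<|/reflection|>"
    "Final answer text."
)
parts = split_response(example)
```

Keeping the reasoning separate from the answer in this way is also a convenient starting point for steps such as the masking mentioned for Phase 2 in Figure 3, although how masking is applied during training is not specified in this summary.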