Fine-tuning large language models for domain adaptation: Exploration of training strategies, scaling, model merging and synergistic capabilities

Wei Lu, Rachel K. Luu, Markus J. Buehler

TL;DR

This paper investigates how to adapt large language models to domain-specific engineering and materials science tasks via a systematic suite of fine-tuning and merging strategies. It evaluates Continued Pretraining (CPT), Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Odds Ratio Preference Optimization (ORPO), and demonstrates that merging multiple specialized models with SLERP can yield emergent capabilities not present in the parents, particularly in larger models. Across Llama-3.1, Mistral, and SmolLM architectures, the study highlights scale as a key factor for emergent behavior, while also showing that data quality and command of domain prompts significantly influence performance. The work also extends to interactive, cross-domain tasks including multi-turn conversations and image-generation prompts, illustrating practical avenues for applying domain-adapted LLMs in materials design and urban planning. These findings inform strategies for building scalable, domain-focused reasoning engines and raise open questions about data quality, scaling, and the limits of model merging.

Abstract

The advancement of Large Language Models (LLMs) for domain applications in fields such as materials science and engineering depends on the development of fine-tuning strategies that adapt models for specialized, technical capabilities. In this work, we explore the effects of Continued Pretraining (CPT), Supervised Fine-Tuning (SFT), and various preference-based optimization approaches, including Direct Preference Optimization (DPO) and Odds Ratio Preference Optimization (ORPO), on fine-tuned LLM performance. Our analysis shows how these strategies influence model outcomes and reveals that merging multiple fine-tuned models can lead to the emergence of capabilities that surpass the individual contributions of the parent models. We find that model merging leads to new functionalities that neither parent model could achieve alone, resulting in improved performance in domain-specific assessments. Experiments with different model architectures are presented, including Llama 3.1 8B and Mistral 7B models, where similar behaviors are observed. To explore whether the results also hold for much smaller models, we use a tiny LLM with 1.7 billion parameters and show that very small LLMs do not necessarily feature emergent capabilities under model merging, suggesting that model scaling may be a key component. In open-ended yet consistent chat conversations between a human and AI models, our assessment reveals detailed insights into how different model variants perform and shows that the smallest model achieves a high intelligence score across key criteria including reasoning depth, creativity, clarity, and quantitative precision. Other experiments include the development of image generation prompts based on disparate biological material design concepts, to create new microstructures, architectural concepts, and urban design based on biological materials-inspired construction principles.

Paper Structure

This paper contains 31 sections, 13 equations, 19 figures, 7 tables.

Figures (19)

  • Figure 1: Overview of the approach used in this study, including the scientific training corpus and information processing. Panel A: The training corpus comprises raw text from various sources such as papers, documents, and websites. This text undergoes extraction of key insights, reasoning, and logical deduction, leading to the generation of question-answer or instruction-response pairs. Panel B: Visualization of the transformation from individual pieces of information (here shown as scattered nodes of varying sizes) to a structured network of interconnected insights, illustrating the consolidation of knowledge through the training process. This overall schematic illustrates the goals of this research, to build models for complex problems that integrate distinct features, modalities, and concepts. The image above "Actionable outcomes" was generated using lamm-mit/leaf-flux.
  • Figure 2: Model training, merging and assessment stages. Panel A: A conventional training pipeline where a base model undergoes Continued Pre-Training (CPT), followed by Supervised Fine-Tuning (SFT), and is then optimized using methods like Direct Preference Optimization (DPO) or Odds Ratio Preference Optimization (ORPO) to produce a trained model. The model can be assessed at each stage, for example by using the SFT results for benchmarking. Panel B: An alternative pipeline where, after CPT, SFT, and optimization (e.g., DPO, ORPO), the model is further enhanced by merging it with another fine-tuned model (e.g., a general-purpose model). Merging can be done with models extracted from various training stages, such as after CPT, after SFT, or at the final stage.
  • Figure 3: Comparison of SLERP (Spherical Linear Interpolation) and LERP (Linear Interpolation) between two points on a unit sphere, illustrating their application in merging Large Language Model (LLM) parameters. SLERP interpolates between points $\mathbf{p}_1$ and $\mathbf{p}_2$ along a spherical path on the surface of the sphere, calculated as $\text{SLERP}(t)$, where $t$ is the interpolation parameter (see the main text for the equations). In contrast, LERP interpolates linearly between the same two points, following a straight line through the sphere. Intermediate points at 30% and 70% along both paths are highlighted, showing the difference in how SLERP and LERP handle interpolation. In the context of LLMs, SLERP is particularly effective for merging model parameters from different pre-trained models, facilitating the emergence of new abilities that neither parent model possessed alone. The smooth, nonlinear path of SLERP helps to preserve the underlying structure of the model parameters, represented by the unit sphere, potentially unlocking novel interactions between features that lead to enhanced performance and the development of emergent capabilities. The sphere in this context represents the inherent structure of the model's parameter space. By maintaining the geometric relationship between the parameters, SLERP ensures that the interpolation respects this original structure and does not puncture it (as the LERP points would), leading to meaningful and coherent blending of capabilities rather than random, unstructured changes. A key point is that because the merged points are congruent with the model geometry (that is, they lie on the sphere used here for demonstration) and because they realize new points not previously accessed, emergent features and capabilities could potentially be unlocked.
  • Figure 4: Performance evaluation $P$ of Llama-3.1 model variants across benchmarks. Panel A: Accuracy results for various variants on different benchmarks: Spider Silk, Bio-inspired/Biological Materials, and Overall Accuracy. The models were evaluated after undergoing different training and optimization strategies (CPT, SFT, ORPO/DPO, model merging). Panel B: Relative improvement of model variants over the meta-llama/Meta-Llama-3.1-8B-Instruct baseline model. This highlights how each training strategy contributes to the model's performance gains or losses across the various benchmarks, providing insight into the effectiveness of different approaches. It is notable that models that underwent CPT, SFT, and to some extent preference optimization (e.g., DPO, ORPO) show a deterioration in performance, as indicated by negative relative improvement values. However, after applying the SLERP merging technique, these same models exhibit significant performance gains, surpassing the baseline model. This highlights the effectiveness of model merging in combining the strengths of different specialized models, resulting in a robust final model with superior overall performance. Overall, the results show that the models that have undergone SLERP merging (especially those combined with DPO and ORPO strategies) generally show the highest accuracy across benchmarks. Merging in this case is always done with meta-llama/Meta-Llama-3.1-8B-Instruct. All models have been trained with the same datasets in all stages, as shown in Table \ref{tab:dataset_summary}.
  • Figure 5: Performance evaluation $P$ of Mistral-7B-v0.3 model variants. Panel A: Accuracy results for various Mistral-7B-v0.3 model variants on the Spider Silk, Bio-inspired/Biological Materials, and Overall Accuracy benchmarks. Initial models trained with CPT and SFT show moderate performance. Models further optimized using ORPO or DPO exhibit significant improvements in accuracy across all benchmarks. Model merging results in further significant improvements. The relative improvements are even more pronounced than those seen in the Llama-3.1 models (here exceeding 20% versus around 12%), indicating the particular effectiveness of these techniques for the Mistral series. Panel B: Relative improvement of model variants over the baseline mistralai/Mistral-7B-Instruct-v0.3 model. The Base model subjected to CPT alone initially shows a decrease in relative performance. However, after SFT and ORPO, and especially after SLERP merging combined with ORPO or DPO optimization, these models demonstrate substantial positive relative improvement, surpassing the baseline by a greater margin than the improvements seen in the Llama-3.1 models. This highlights the powerful impact of these combined strategies in enhancing the overall performance of the Mistral models. It is notable that a direct merge of the Base-CPT-SFT model yields strong performance, close to the Instruct-CPT-SFT strategy. Merging is always done with mistralai/Mistral-7B-Instruct-v0.3. The same training set is used for all experiments, as defined in Table \ref{tab:dataset_summary}.
  • ...and 14 more figures
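
The geometric contrast described in the Figure 3 caption can be checked numerically. The Python snippet below is a minimal sketch under stated assumptions, not the authors' merging code: the function names `lerp` and `slerp`, the `eps` threshold, and the two example points are hypothetical, and it applies the standard SLERP formula to small vectors rather than to actual LLM weight tensors. It shows that for two points on the unit sphere, the SLERP point at $t = 0.3$ stays on the sphere, while the LERP point falls inside it.

```python
import numpy as np


def lerp(p1, p2, t):
    """Linear interpolation: a straight line (chord) between p1 and p2."""
    p1 = np.asarray(p1, dtype=np.float64)
    p2 = np.asarray(p2, dtype=np.float64)
    return (1.0 - t) * p1 + t * p2


def slerp(p1, p2, t, eps=1e-8):
    """Spherical linear interpolation along the arc between p1 and p2.

    Standard formula:
        sin((1 - t) * omega) / sin(omega) * p1 + sin(t * omega) / sin(omega) * p2
    where omega is the angle between p1 and p2. `eps` is an assumed
    threshold for falling back to LERP when sin(omega) is close to zero
    (nearly parallel or anti-parallel vectors).
    """
    p1 = np.asarray(p1, dtype=np.float64)
    p2 = np.asarray(p2, dtype=np.float64)

    # Angle between the two directions, computed from normalized copies.
    n1 = p1 / np.linalg.norm(p1)
    n2 = p2 / np.linalg.norm(p2)
    cos_omega = np.clip(np.dot(n1, n2), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    sin_omega = np.sin(omega)

    if sin_omega < eps:
        # The spherical weights are ill-defined here, so use the chord.
        return lerp(p1, p2, t)

    w1 = np.sin((1.0 - t) * omega) / sin_omega
    w2 = np.sin(t * omega) / sin_omega
    return w1 * p1 + w2 * p2


if __name__ == "__main__":
    # Two hypothetical points on the unit sphere, 90 degrees apart.
    a = np.array([1.0, 0.0, 0.0])
    b = np.array([0.0, 1.0, 0.0])

    for t in (0.3, 0.7):
        s = slerp(a, b, t)
        l = lerp(a, b, t)
        print(f"t={t}: |SLERP|={np.linalg.norm(s):.4f}  |LERP|={np.linalg.norm(l):.4f}")
```

In a merging setting, the same interpolation would typically be applied tensor by tensor to flattened weights of two models with identical architectures; how the paper's pipeline handles per-layer interpolation weights is not specified in this excerpt, so that detail is left out of the sketch.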