Less for More: Enhanced Feedback-aligned Mixed LLMs for Molecule Caption Generation and Fine-Grained NLI Evaluation

Dimitris Gkoumas, Maria Liakata

TL;DR

This work addresses the challenge of improving chemical language models with limited additional training by combining model merging and alignment fine-tuning in a crossmodal Molecule-Caption Generation setting. It introduces universal models obtained via weight-based (SLERP) and subspace-based (TIES) merging to fuse molecule and caption tasks without retraining on full data, followed by crossmodal alignment using SFT, DPO, CPO, or KTO with offline preferences. A novel atomic-level cross-NLI evaluation framework decomposes text into atomic premises and hypotheses to measure hallucination and coverage, demonstrating superior granularity and discriminative power over traditional NLI approaches. Experiments on the L+M-24 benchmark show that model merging with minimal training substantially outperforms fully trained baselines on out-of-distribution data, and the atomic-level NLI evaluation reveals nuanced insights into content integrity and completeness in generated captions.
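The weight-based merging step can be illustrated with a minimal SLERP sketch. It assumes each per-task model exposes its parameters as a dictionary of NumPy arrays with matching names and shapes. The function names, the per-tensor application, and the mixing ratio `t` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def slerp(w_a, w_b, t, eps=1e-8):
    """Spherical linear interpolation between two equally shaped weight arrays.

    t = 0 returns w_a, t = 1 returns w_b (up to floating-point error).
    """
    a = np.asarray(w_a, dtype=np.float64).ravel()
    b = np.asarray(w_b, dtype=np.float64).ravel()

    # Angle between the two flattened weight vectors.
    a_unit = a / (np.linalg.norm(a) + eps)
    b_unit = b / (np.linalg.norm(b) + eps)
    omega = np.arccos(np.clip(np.dot(a_unit, b_unit), -1.0, 1.0))
    sin_omega = np.sin(omega)

    if sin_omega < eps:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        merged = (1.0 - t) * a + t * b
    else:
        merged = (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / sin_omega
    return merged.reshape(np.shape(w_a))


def merge_state_dicts(sd_a, sd_b, t=0.5):
    """Apply SLERP tensor by tensor (hypothetical helper for illustration)."""
    return {name: slerp(sd_a[name], sd_b[name], t) for name in sd_a}


# Toy usage with two orthogonal "weight" vectors.
molecule_weights = {"layer": np.array([1.0, 0.0])}
caption_weights = {"layer": np.array([0.0, 1.0])}
merged = merge_state_dicts(molecule_weights, caption_weights, t=0.5)
```

Varying `t` corresponds to changing the per-task weight mixing ratio; the value 0.5 here is only a placeholder.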

Abstract

Scientific language models drive research innovation but require extensive fine-tuning on large datasets. This work enhances such models by improving their inference and evaluation capabilities with minimal or no additional training. Focusing on molecule caption generation, we explore post-training synergies between alignment fine-tuning and model merging in a cross-modal setup. We reveal intriguing insights into the behaviour and suitability of such methods while significantly surpassing state-of-the-art models. Moreover, we propose a novel atomic-level evaluation method leveraging off-the-shelf Natural Language Inference (NLI) models for use in the unseen chemical domain. Our experiments demonstrate that our evaluation operates at the right level of granularity, effectively handling multiple content units and subsentence reasoning, while widely adopted NLI methods consistently misalign with assessment criteria.
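The idea of scoring captions at the level of atomic content units can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's exact procedure: the decomposition into atomic units is taken as given, the aggregation rule (a unit counts as supported if any unit on the other side entails it above a threshold) is an assumption, and `toy_entails` is a hypothetical token-overlap stand-in for an off-the-shelf NLI model.

```python
from typing import Callable, Sequence, Tuple


def toy_entails(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for an NLI model: share of hypothesis tokens in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    return sum(tok in premise_tokens for tok in hypothesis_tokens) / len(hypothesis_tokens)


def cross_nli_scores(
    generated_units: Sequence[str],
    reference_units: Sequence[str],
    entails: Callable[[str, str], float] = toy_entails,
    threshold: float = 0.9,
) -> Tuple[float, float]:
    """Return (hallucination, coverage) over atomic units.

    Assumed definitions for this sketch:
    - hallucination: share of generated units not entailed by any reference unit
    - coverage: share of reference units entailed by some generated unit
    """

    def supported(hypothesis: str, premises: Sequence[str]) -> bool:
        return any(entails(p, hypothesis) >= threshold for p in premises)

    unsupported = [g for g in generated_units if not supported(g, reference_units)]
    covered = [r for r in reference_units if supported(r, generated_units)]

    hallucination = len(unsupported) / len(generated_units) if generated_units else 0.0
    coverage = len(covered) / len(reference_units) if reference_units else 0.0
    return hallucination, coverage


generated = ["the molecule is an inhibitor", "the molecule is toxic"]
reference = ["the molecule is an inhibitor", "the molecule is a kinase ligand"]
hallucination, coverage = cross_nli_scores(generated, reference)
```

In practice the `entails` argument would wrap a real NLI classifier's entailment probability; the two directions (generated against reference, reference against generated) are what make the check "cross".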

Paper Structure

This paper contains 38 sections, 4 equations, 19 figures, and 5 tables.

Figures (19)

  • Figure 1: Overview of our proposed post-training approach to address key limitations in chemical LLMs. Top: Merging per-task pretrained models to create a universal model (see the model merging section). Bottom: Generating synthetic preference data using pretrained per-task encoder–decoders (see the setup section) for alignment tuning.
  • Figure 2: Model merging techniques for obtaining universal models. (A) Weight-based merging via spherical interpolation. (B) Subspace-based merging by pruning and merging parameter magnitudes. $\tau_1$ and $\tau_2$ are task vectors obtained from pretrained molecule and caption generation models, respectively.
  • Figure 3: The process of atomic-level cross-NLI evaluation when measuring the level of hallucination.
  • Figure 4: Ablation of the best-performing model, CPO+SLERP, on the Mol2Cap and Cap2Mol tasks, evaluating the effect of per-task model weight mixing ratios.
  • Figure 5: Score distributions from our atomic-level cross-NLI evaluation comparing (A) hallucination and (B) coverage between our top models and Meditron.
  • ...and 14 more figures
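The subspace-based merging in Figure 2(B) operates on task vectors such as $\tau_1$ and $\tau_2$. A minimal sketch of TIES-style merging (trim by magnitude, elect a sign per parameter, average the agreeing values) is given below. It assumes both per-task models were derived from a shared base checkpoint and are available as NumPy arrays; the `density` and `lam` values and the function name are illustrative assumptions rather than the paper's settings.

```python
import numpy as np


def ties_merge(base, finetuned, density=0.2, lam=1.0):
    """Merge fine-tuned weights into `base` with a TIES-style procedure.

    base:      array of base-model weights
    finetuned: list of arrays, one per task, same shape as `base`
    density:   fraction of largest-magnitude task-vector entries to keep
    lam:       scaling applied to the merged task vector
    """
    trimmed = []
    for weights in finetuned:
        tau = weights - base  # task vector for this task
        k = max(1, int(round(density * tau.size)))
        # Keep entries at or above the k-th largest magnitude (ties may keep a few more).
        thresh = np.sort(np.abs(tau).ravel())[-k]
        trimmed.append(np.where(np.abs(tau) >= thresh, tau, 0.0))
    stacked = np.stack(trimmed)

    # Elect, per parameter, the sign carrying the larger total magnitude.
    elected = np.sign(stacked.sum(axis=0))

    # Average only the non-zero entries whose sign matches the elected one.
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    counts = agree.sum(axis=0)
    merged_tau = (stacked * agree).sum(axis=0) / np.maximum(counts, 1)

    return base + lam * merged_tau


base = np.zeros(4)
molecule_model = np.array([1.0, -0.1, 0.5, 0.0])
caption_model = np.array([0.8, 0.05, -2.0, 0.0])
universal = ties_merge(base, [molecule_model, caption_model], density=0.5)
```

In this toy case the third parameter has conflicting signs across the two task vectors, so only the larger-magnitude update survives instead of the two cancelling out.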

Theorems & Definitions (1)

  • Definition 1: HALOs