Towards Label-Free Biological Reasoning Synthetic Dataset Creation via Uncertainty Filtering

Josefa Lia Stoisser, Lawrence Phillips, Aditya Misra, Tom A. Lamb, Philip Torr, Marc Boubnovski Martell, Julien Fauqueur, Kaspar Märtens

TL;DR

The paper tackles the scarcity of labeled supervision for synthetic chain-of-thought data in biology by introducing an uncertainty-based filtering pipeline that relies on model-internal signals rather than external labels. It samples multiple reasoning traces per perturbation–gene input and uses the CoCoA score, a hybrid of self-consistency and perplexity, to select low-uncertainty traces, including a per-class filtering variant. On the PerturbQA benchmark, uncertainty-filtered data enable label-free supervised fine-tuning that outperforms unfiltered synthetic data and narrows the gap to ground-truth training; keeping the top 10% of traces (those with the lowest uncertainty) yields strong performance and balanced class accuracy. Ablations show that per-class filtering and the CoCoA-based hybrid metric provide the strongest improvements, suggesting general principles for data-efficient curation of reasoning data. Overall, the approach demonstrates that model-internal confidence can drive efficient dataset creation, extending the applicability of large reasoning models to domains where supervision is expensive.

Abstract

Synthetic chain-of-thought (CoT) traces are widely used to train large reasoning models (LRMs), improving generalization by providing step-level supervision. Yet most approaches require ground-truth labels to seed or filter these traces, an expensive bottleneck in domains like biology where wet-lab data are scarce. We propose a label-free alternative: uncertainty-based filtering, which uses a model's own confidence (quantified through established uncertainty metrics such as self-consistency and predictive perplexity) as a substitute for external labels. We sample multiple reasoning traces and retain only low-uncertainty subsets. Applied to biological perturbation prediction, a domain where wet-lab labels are especially costly, we show that the filtered subset has higher accuracy than the unfiltered traces, and that supervised fine-tuning (SFT) on uncertainty-filtered data outperforms unfiltered synthetic data, narrows the gap to ground-truth training, and surpasses strong LRM baselines. Ablations show that per-class filtering corrects for class-specific uncertainty scales and that hybrid uncertainty metrics yield higher-quality datasets. Our results suggest that model-internal confidence is a powerful signal for efficient reasoning dataset creation, enabling LRMs in domains where supervision is expensive.
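The filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the dictionary layout of a trace, and in particular the way self-consistency and perplexity are combined (a simple product of disagreement and perplexity) are assumptions made for illustration, and may differ from the CoCoA score used in the paper.

```python
import math
from collections import Counter, defaultdict


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def score_traces(traces):
    """Attach a hybrid uncertainty to each trace sampled for ONE input.

    traces: list of dicts with keys 'answer' (predicted class) and
    'token_logprobs' (log-probabilities of the generated tokens).
    """
    counts = Counter(t["answer"] for t in traces)
    scored = []
    for t in traces:
        # Self-consistency: share of sampled traces with the same answer.
        agreement = counts[t["answer"]] / len(traces)
        # ASSUMPTION: hybrid score = disagreement * perplexity.
        # This is an illustrative stand-in, not the paper's exact metric.
        uncertainty = (1.0 - agreement) * perplexity(t["token_logprobs"])
        scored.append({**t, "uncertainty": uncertainty})
    return scored


def filter_low_uncertainty(scored, keep_fraction=0.1, per_class=True):
    """Keep the lowest-uncertainty fraction, optionally within each class."""
    groups = defaultdict(list)
    for t in scored:
        groups[t["answer"] if per_class else "all"].append(t)
    kept = []
    for group in groups.values():
        group.sort(key=lambda t: t["uncertainty"])
        n_keep = max(1, math.ceil(keep_fraction * len(group)))
        kept.extend(group[:n_keep])
    return kept
```

With `per_class=True`, the threshold is applied separately to each predicted class, so a class whose traces have systematically higher uncertainty is not filtered out entirely. That is the intuition behind the per-class variant mentioned in the ablations.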

Paper Structure

This paper contains 29 sections, 1 equation, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Uncertainty-filtered synthetic reasoning pipeline. Step 1: Generate multiple synthetic reasoning traces with predicted outcomes from unlabeled perturbation–gene pairs. Step 2: Score each trace for internal uncertainty (self-consistency and perplexity) and retain only low-uncertainty traces. Step 3: Use the retained traces as a label-free dataset for supervised fine-tuning (SFT), improving reasoning models without ground-truth labels.
  • Figure A2: F1 score of upregulated genes stratified by CoCoA uncertainty deciles. Lower-uncertainty subsets yield consistently higher F1, with a clear monotonic trend across deciles. This confirms that uncertainty is strongly predictive of reasoning quality.
  • Figure A3: F1 score of downregulated genes stratified by CoCoA uncertainty deciles. Lower-uncertainty subsets yield consistently higher F1, with a clear monotonic trend across deciles. This confirms that uncertainty is strongly predictive of reasoning quality.
  • Figure A4: Prompt template used for data generation, SFT, and evaluation.
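
The stratified analysis summarized in Figures A2 and A3 (per-class F1 across uncertainty deciles) could be computed along the following lines. This is a hedged sketch: the record layout `(uncertainty, predicted_class, true_class)` and both function names are hypothetical, and ground-truth labels are used here only for evaluation, not for filtering.

```python
def f1_for_class(preds, labels, cls):
    """One-vs-rest F1 for a single class; returns 0.0 if undefined."""
    tp = sum(p == cls and y == cls for p, y in zip(preds, labels))
    fp = sum(p == cls and y != cls for p, y in zip(preds, labels))
    fn = sum(p != cls and y == cls for p, y in zip(preds, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_by_uncertainty_bin(records, cls, n_bins=10):
    """Sort records by uncertainty, split into equal-size bins, score each.

    records: list of (uncertainty, predicted_class, true_class) tuples.
    Bin 0 holds the lowest-uncertainty records.
    """
    ordered = sorted(records, key=lambda r: r[0])
    n = len(ordered)
    scores = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        preds = [r[1] for r in chunk]
        labels = [r[2] for r in chunk]
        scores.append(f1_for_class(preds, labels, cls))
    return scores
```

If uncertainty is predictive of quality, the returned list should tend to decrease from the first bin to the last, which is the trend the figure captions describe.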