Learning Extrapolative Sequence Transformations from Markov Chains

Sophia Hager, Aleem Khan, Andrew Wang, Nicholas Andrews

TL;DR

This work presents a data-driven framework for learning extrapolative sequence transformations by training an autoregressive model $q_\theta$ on carefully selected Markov-chain states produced by Metropolis-Hastings sampling with mask-infilling proposals. The model is fine-tuned to iteratively transform an initial sequence toward higher target scores, using a two-phase setup: (1) construct an energy-based surrogate via $p(x) \propto \exp(s(x))$ with an intractable normalizing constant $Z$, and (2) train $q_\theta$ on short training episodes that condition on history and scores. Across protein engineering (ACE2 ddG stability), sentiment control on Yelp data, and text anonymization, $q_\theta$ achieves competitive or superior extrapolation performance with substantially fewer iterations than MCMC, and in some cases surpasses the best MCMC results. The approach leverages pre-trained denoising language models as proposals, enabling scalable, sample-efficient extrapolation and offering practical benefits for sequence design and controllable generation.
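To make the first phase concrete, the following is a minimal sketch of Metropolis-Hastings sampling against a target $p(x) \propto \exp(s(x))$. It is an illustration under stated assumptions, not the authors' implementation: the toy alphabet, the `score` function (a count of the letter `A`), and the `propose` function are all hypothetical. In particular, `propose` resamples one position uniformly at random, which is a symmetric stand-in for the mask-infilling proposals from a pre-trained denoising language model described above. Because the acceptance ratio only involves $p(y)/p(x)$, the intractable constant $Z$ cancels and never needs to be computed.

```python
import math
import random

ALPHABET = "ACDE"  # hypothetical toy alphabet, for illustration only


def score(x: str) -> float:
    """Hypothetical toy score s(x): the number of 'A' characters in x."""
    return float(x.count("A"))


def propose(x: str, rng: random.Random) -> str:
    """Resample one position uniformly from ALPHABET.

    This proposal is symmetric, q(y|x) == q(x|y), so the proposal ratio
    drops out of the acceptance rule. A learned mask-infilling proposal
    is generally not symmetric and would need that ratio.
    """
    i = rng.randrange(len(x))
    return x[:i] + rng.choice(ALPHABET) + x[i + 1:]


def metropolis_hastings(x0: str, n_steps: int, seed: int = 0):
    """Return the chain as a list of (state, score) pairs, starting at x0."""
    rng = random.Random(seed)
    x, sx = x0, score(x0)
    chain = [(x, sx)]
    for _ in range(n_steps):
        y = propose(x, rng)
        sy = score(y)
        # p(y) / p(x) = exp(s(y) - s(x)); the normalizer Z cancels.
        accept_prob = math.exp(min(0.0, sy - sx))
        if rng.random() < accept_prob:
            x, sx = y, sy
        chain.append((x, sx))
    return chain


if __name__ == "__main__":
    chain = metropolis_hastings("CCCCCCCC", n_steps=200)
    best_state, best_score = max(chain, key=lambda pair: pair[1])
    print(best_state, best_score)
```

Every step records the current state, including rejected proposals (where the previous state repeats), so the returned chain has `n_steps + 1` entries.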

Abstract

Most successful applications of deep learning involve similar training and test conditions. However, tasks such as biological sequence design involve searching for sequences that improve desirable properties beyond previously known values, which requires novel hypotheses that extrapolate beyond training data. In these settings, extrapolation may be achieved by using random search methods such as Markov chain Monte Carlo (MCMC), which, given an initial state, sample local transformations to approximate a target density that rewards states with the desired properties. However, even with a well-designed proposal, MCMC may struggle to explore large structured state spaces efficiently. Rather than relying on stochastic search, it would be desirable to have a model that greedily optimizes the properties of interest, successfully extrapolating in as few steps as possible. We propose to learn such a model from the Markov chains resulting from MCMC search. Specifically, our approach uses selected states from Markov chains as a source of training data for an autoregressive model, which is then able to efficiently generate novel sequences that extrapolate along the sequence-level properties of interest. The proposed approach is validated on three problems: protein sequence design, text sentiment control, and text anonymization. We find that the autoregressive model can extrapolate as well as or better than MCMC, but with the additional benefits of scalability and significantly higher sample efficiency.
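The idea of using selected chain states as training data can be sketched as follows. This is a hedged illustration only: the selection rule here (keep a state when its score beats the last kept state by more than `min_gain`) and the pairing of consecutive kept states into source/target examples are assumptions made for this example, and the function names are hypothetical. The abstract does not specify the actual selection criteria or how training episodes are encoded. The sketch assumes higher scores are better; for a property that should decrease, the scores can be negated first.

```python
from typing import List, Tuple

State = Tuple[str, float]  # (sequence, score)


def select_improving_states(chain: List[State], min_gain: float = 0.0) -> List[State]:
    """Keep the initial state, then every later state whose score exceeds
    the most recently kept state's score by more than min_gain."""
    if not chain:
        return []
    kept = [chain[0]]
    for seq, s in chain[1:]:
        if s - kept[-1][1] > min_gain:
            kept.append((seq, s))
    return kept


def to_training_pairs(kept: List[State]) -> List[Tuple[str, str]]:
    """Pair each kept state with the next kept state as (source, target)."""
    return [(kept[i][0], kept[i + 1][0]) for i in range(len(kept) - 1)]


if __name__ == "__main__":
    # A hand-written toy chain; repeated and lower-scoring states are skipped.
    toy_chain = [
        ("CCCC", 0.0),
        ("CACC", 1.0),
        ("CCCC", 0.0),
        ("CACC", 1.0),
        ("AACC", 2.0),
        ("AACD", 2.0),
        ("AAAD", 3.0),
    ]
    kept = select_improving_states(toy_chain)
    for source, target in to_training_pairs(kept):
        print(source, "->", target)
```

On the toy chain, seven sampled states collapse to four kept states and three training pairs, which mirrors the motivation stated in the abstract: a model trained on such pairs can aim to cover in a few steps what the stochastic search covered in many.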

Paper Structure

This paper contains 57 sections, 4 equations, 2 figures, 14 tables.

Figures (2)

  • Figure 1: The sentiment extrapolation task (see the section on sentiment control) requires generating reviews with ratings beyond the range observed at training time. The search process is illustrated using a toy 1D representation of the features (x-axis) and rating (y-axis). Monte Carlo exploration can produce reviews that extrapolate, but many steps are required. However, once good state sequences have been discovered, we can sub-sample the transitions that decrease the rating (A $\rightarrow$ C $\rightarrow$ N) and use them to learn an extrapolative model. The reviews shown to the right for states B, C, and N are actual reviews generated by our method, while A is a genuine review from the validation data.
  • Figure 2: In the protein engineering task, MCMC performance (solid line) over ten epochs, or 830 steps, is compared with the performance of $q_\theta$ (dotted line) trained on MCMC data generated in one epoch, or 83 steps. We find that MCMC does not approach the performance of $q_\theta$ and does not notably improve after even two epochs.