A Boltzmann-machine-enhanced Transformer for DNA Sequence Classification

Zhixuan Cao, Yishu Xu, Xuang Wu

Abstract

DNA sequence classification requires not only high predictive accuracy but also the ability to uncover latent site interactions, combinatorial regulation, and epistasis-like higher-order dependencies. Although the standard Transformer provides strong global modeling capacity, its softmax attention is continuous, dense, and weakly constrained, making it better suited for information routing than explicit structure discovery. In this paper, we propose a Boltzmann-machine-enhanced Transformer for DNA sequence classification. Built on multi-head attention, the model introduces structured binary gating variables to represent latent query-key connections and constrains them with a Boltzmann-style energy function. Query-key similarity defines local bias terms, learnable pairwise interactions capture synergy and competition between edges, and latent hidden units model higher-order combinatorial dependencies. Since exact posterior inference over discrete gating graphs is intractable, we use mean-field variational inference to estimate edge activation probabilities and combine it with Gumbel-Softmax to progressively compress continuous probabilities into near-discrete gates while preserving end-to-end differentiability. During training, we jointly optimize classification and energy losses, encouraging the model to achieve accurate prediction while favoring low-energy, stable, and interpretable structures. We further derive the full framework step by step, from the energy function and variational free energy to the mean-field fixed-point equations, the Gumbel-Softmax relaxation, and the final joint objective. The proposed framework provides a unified view of integrating Boltzmann machines, differentiable discrete optimization, and Transformers for structured learning on biological sequences.
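To make the gating mechanism concrete, the following is a minimal PyTorch sketch of the inference path the abstract describes, not the authors' implementation: the function name `mean_field_gates`, the flattened edge-coupling matrix `J`, the iteration count, and the Gumbel-Softmax temperature are all illustrative assumptions, and the latent hidden units are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def mean_field_gates(scores, J, n_iters=5, tau=0.5, eps=1e-9):
    """Mean-field estimate of binary query-key gates, then Gumbel-Softmax.

    scores : (B, H, L, L) scaled query-key similarities (local bias terms).
    J      : (L*L, L*L) symmetric pairwise couplings between gate edges.
    Iterates the fixed point mu = sigmoid(b + J mu), then compresses mu
    into near-discrete gates with straight-through Gumbel-Softmax so the
    whole path stays differentiable end to end.
    """
    B, H, L, _ = scores.shape
    b = scores.reshape(B, H, L * L)          # local field per edge
    mu = torch.sigmoid(b)                    # mean-field initialization
    for _ in range(n_iters):
        mu = torch.sigmoid(b + mu @ J)       # fixed-point update (J symmetric)
    on_off = torch.stack([(1 - mu + eps).log(), (mu + eps).log()], dim=-1)
    g = F.gumbel_softmax(on_off, tau=tau, hard=True)[..., 1]
    return g.reshape(B, H, L, L)

# Toy usage inside one attention layer: gate the dense map and renormalize.
B, H, L, d = 2, 4, 16, 8
q, k, v = (torch.randn(B, H, L, d) for _ in range(3))
J = 0.01 * torch.randn(L * L, L * L)
J = 0.5 * (J + J.T)                          # enforce symmetric couplings
scores = q @ k.transpose(-2, -1) / d ** 0.5
gates = mean_field_gates(scores, J)
attn = F.softmax(scores, dim=-1) * gates
attn = attn / attn.sum(-1, keepdim=True).clamp_min(1e-9)  # guard empty rows
out = attn @ v                               # (B, H, L, d) gated output
```

The straight-through `hard=True` sample keeps the forward pass near-discrete while gradients flow through the relaxed probabilities, and renormalizing each gated attention row guards against configurations where every edge of a query is switched off.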

Paper Structure

This paper contains 67 sections, 52 equations, 6 figures, and 1 table.

Figures (6)

  • Figure 1: End-to-end training pipeline of the proposed BM-Transformer. The forward branch generates $Q$, $K$, and $V$, constructs a differentiable positive-phase structure through Gumbel-Softmax, and feeds the resulting gated representation into the downstream classifier. In parallel, the energy-learning branch evaluates positive- and negative-phase structures with the energy function, where the negative phase can optionally be obtained from an external solver or sampler. The final objective jointly combines task loss and energy loss, and gradients flow back through both branches to update the underlying network and the energy parameters (see the sketch after this list).
  • Figure 2: Comparison of training accuracy curves. The left panel shows the Full BM-Transformer and the right panel shows the Plain Transformer. Both models converge stably and achieve similar validation accuracies near epoch 10.
  • Figure 3: Average activation strengths of the 16 latent hidden units. The clear differences across units indicate that the model does not rely uniformly on all latent modules, but instead emphasizes a smaller subset of more active structural components.
  • Figure 4: Visualization of pairwise interactions. The left panel is a clustered heatmap of the effective interaction-strength matrix, showing block-like coupling structures among positions; the right panel is a network formed by the strongest 0.5% positive and negative interaction edges, with red and blue denoting couplings of different signs.
  • Figure 5: Latent module-position map. The upper panel shows the signed module-position interaction strengths, and the lower panel shows their absolute values, which can be used to examine the hotspot distributions of different latent hidden units across the sequence.
  • ...and 1 more figure
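As a companion to the Figure 1 caption, here is a minimal sketch of the joint objective, assuming a cross-entropy task loss and a Boltzmann energy with local biases `b` and pairwise couplings `J` over flattened gate vectors; the weight `lam`, the omission of the hidden-unit terms, and the function names are hypothetical choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def boltzmann_energy(g, b, J):
    """E(g) = -b.g - 0.5 * g^T J g for flattened gate configurations g."""
    return -(g * b).sum(-1) - 0.5 * torch.einsum("...i,ij,...j->...", g, J, g)

def joint_loss(logits, labels, g_pos, g_neg, b, J, lam=0.1):
    """Task loss plus a contrastive energy loss over the two phases."""
    task = F.cross_entropy(logits, labels)
    # Positive phase: lower the energy of the data-driven structure.
    # Negative phase: raise the energy of the sampled/solver structure.
    energy = (boltzmann_energy(g_pos, b, J).mean()
              - boltzmann_energy(g_neg, b, J).mean())
    return task + lam * energy
```

The contrastive form mirrors the two phases in Figure 1: gradients pull down the energy of structures induced by the data and push up the energy of sampled alternatives, so the learned energy landscape favors the low-energy, stable structures the abstract describes.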