Factorization Machine with Quadratic-Optimization Annealing for RNA Inverse Folding and Evaluation of Binary-Integer Encoding and Nucleotide Assignment

Shuta Kikuchi, Shu Tanaka

TL;DR

This study evaluated all 24 possible assignments of the four nucleotides to the ordered integers, in combination with four binary-integer encoding methods, and demonstrated that one-hot and domain-wall encodings outperform binary and unary encodings in terms of the normalized ensemble defect value.

Abstract

The RNA inverse folding problem aims to identify nucleotide sequences that preferentially adopt a given target secondary structure. While various heuristic and machine learning-based approaches have been proposed, many require a large number of sequence evaluations, which limits their applicability when experimental validation is costly. We propose a method to solve the problem using a factorization machine with quadratic-optimization annealing (FMQA). FMQA is a discrete black-box optimization method reported to obtain high-quality solutions with a limited number of evaluations. Applying FMQA to the problem requires converting nucleotides into binary variables. However, the influence of integer-to-nucleotide assignments and binary-integer encoding on the performance of FMQA has not been thoroughly investigated, even though such choices determine the structure of the surrogate model and the search landscape, and thus can directly affect solution quality. Therefore, this study aims both to establish a novel FMQA framework for RNA inverse folding and to analyze the effects of these assignments and encoding methods. We evaluated all 24 possible assignments of the four nucleotides to the ordered integers (0-3), in combination with four binary-integer encoding methods. Our results demonstrated that one-hot and domain-wall encodings outperform binary and unary encodings in terms of the normalized ensemble defect value. In domain-wall encoding, nucleotides assigned to the boundary integers (0 and 3) appeared with higher frequency. In the RNA inverse folding problem, assigning guanine and cytosine to these boundary integers promoted their enrichment in stem regions, which led to more thermodynamically stable secondary structures than those obtained with one-hot encoding.
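
As a rough illustration of the design space described in the abstract, the sketch below enumerates the 24 integer-to-nucleotide assignments and writes out one common convention for each of the four binary-integer encodings of the integers 0-3. The function names, bit orderings, and the choice of a canonical unary bit pattern are assumptions made for this example and may differ from the conventions used in the paper.

```python
from itertools import permutations

NUCLEOTIDES = "AUGC"
K = len(NUCLEOTIDES)  # four integer levels: 0, 1, 2, 3

# --- Encoders: integer k in {0,1,2,3} -> list of bits (assumed conventions) ---

def encode_binary(k):
    # 2 bits, most significant bit first.
    return [(k >> 1) & 1, k & 1]

def encode_unary(k):
    # 3 bits whose sum equals k. Any pattern with k ones is valid;
    # this returns one canonical representative.
    return [1] * k + [0] * (K - 1 - k)

def encode_one_hot(k):
    # 4 bits, exactly one of which is 1 (at position k).
    return [1 if i == k else 0 for i in range(K)]

def encode_domain_wall(k):
    # 3 bits: k ones followed by zeros, so there is a single "wall".
    return [1] * k + [0] * (K - 1 - k)

# --- Decoders: return None when the bit pattern violates the encoding ---

def decode_binary(bits):
    return 2 * bits[0] + bits[1]

def decode_unary(bits):
    return sum(bits)  # every 3-bit pattern is feasible

def decode_one_hot(bits):
    return bits.index(1) if sum(bits) == 1 else None

def decode_domain_wall(bits):
    # Feasible only if no 0 is followed by a 1.
    if any(a < b for a, b in zip(bits, bits[1:])):
        return None
    return sum(bits)

# All 4! = 24 assignments, each a dict {integer: nucleotide}.
ASSIGNMENTS = [dict(enumerate(p)) for p in permutations(NUCLEOTIDES)]

def encode_sequence(seq, assignment, encoder):
    """Flatten an RNA sequence into one bit list under a given assignment."""
    to_int = {nuc: k for k, nuc in assignment.items()}
    bits = []
    for nuc in seq:
        bits.extend(encoder(to_int[nuc]))
    return bits

if __name__ == "__main__":
    assignment = ASSIGNMENTS[0]  # {0: 'A', 1: 'U', 2: 'G', 3: 'C'}
    for name, enc in [("binary", encode_binary), ("unary", encode_unary),
                      ("one-hot", encode_one_hot),
                      ("domain-wall", encode_domain_wall)]:
        print(name, encode_sequence("GCAU", assignment, enc))
```

Under these conventions, the boundary integers 0 and 3 map to the all-zero and all-one bit patterns in domain-wall encoding, while binary and unary encodings have no infeasible bit patterns and one-hot and domain-wall encodings do.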

Paper Structure (20 sections, 8 equations, 10 figures, 1 table)

Figures (10)

  • Figure 1: Overview of the RNA inverse folding problem. White circles represent arbitrary nucleotides in the target secondary structure, and black lines indicate base pairs formed through hydrogen bonding. A, U, G, and C denote adenine, uracil, guanine, and cytosine, respectively.
  • Figure 2: Schematic illustration of the proposed FMQA for the RNA inverse folding problem. As an example, the black-box (BB) function is defined as the normalized ensemble defect (NED).
  • Figure 3: NED values obtained by FMQA under different combinations of binary-integer encoding methods and integer-to-nucleotide assignments. Crosses indicate the average NED over $10$ runs. The upper and lower whiskers denote the maximum and minimum NED values, respectively. Black circles, red triangles, blue squares, and green diamonds represent outliers. Panels in the top row correspond to assignments in which A or U was assigned to integer $0$, whereas panels in the bottom row correspond to assignments in which G or C was assigned to integer $0$.
  • Figure 4: Success rate obtained by FMQA under different combinations of binary-integer encoding methods and integer-to-nucleotide assignments. Panels in the top row correspond to assignments in which A or U was assigned to integer $0$, whereas panels in the bottom row correspond to assignments in which G or C was assigned to integer $0$.
  • Figure 5: Minimum free energy (MFE) values obtained by FMQA under different combinations of binary-integer encoding methods and integer-to-nucleotide assignments. Only successful solutions are plotted. When two or more successful solutions were obtained, their average MFE value is indicated by a cross marker. When three or more successful solutions were obtained, the standard deviation is shown as error bars. Panels in the top row correspond to assignments in which A or U was assigned to integer $0$, whereas panels in the bottom row correspond to assignments in which G or C was assigned to integer $0$.
  • ...and 5 more figures