The fitness landscape of overlapping genes

Orson Kirsch, Nicole Wood, Steven A Redford, Kabir Husain

Abstract

Natural genomes sometimes encode two different proteins in staggered reading frames of the same DNA sequence. Despite the prevalence of these 'overlapping genes' across the tree of life, it remains unknown whether arbitrary protein pairs can overlap, to what extent such overlaps are feasible, or what design principles govern them. Here, we study compatibility, frustration, and connectivity in the fitness landscape of overlapping genes. We computationally design sequences de novo that satisfy the dual functional constraints of two distinct protein families. The joint fitness landscape, inferred via Potts models from multiple sequence alignments, reveals a fundamental trade-off between the two proteins and provides a simple criterion for when overlap is feasible. We find widespread compatibility between protein families, with one class of reading frames markedly more permissive than others. By exploring alternative genetic codes, we find that the natural genetic code is uniquely well-suited to support overlapping genes. Constructing mutational paths between sequences, we find that sequence-diverse overlapped genes can be connected via a network of near-neutral mutations. Overall, our results suggest that protein fitness landscapes are flexible enough to accommodate the stringent, orthogonal requirements of overlapping genes.
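To make the notion of staggered reading frames concrete, the following is a minimal sketch of how a single DNA sequence yields different peptides depending on where translation starts and which strand is read. The function names, the example sequence, and the integer `offset` convention are illustrative assumptions; they are not the reading-frame nomenclature defined in the paper.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, offset=0):
    """Translate complete codons of `dna`, starting `offset` nucleotides in."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


# Hypothetical 12-nt sequence, chosen only for illustration.
dna = "ATGGCTAAAGGT"
frame_a = translate(dna)                      # same strand, no shift
frame_b = translate(dna, offset=1)            # same strand, shifted by one nucleotide
frame_c = translate(reverse_complement(dna))  # opposite strand
```

Every nucleotide in the shared region contributes to a codon in each frame, so a single substitution generally changes the coding context of both proteins at once. This is the coding constraint that the two families must jointly satisfy.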

Paper Structure

This paper contains 17 sections, 5 equations, 6 figures.

Figures (6)

  • Figure 1: Overlapping genes must satisfy functional as well as coding constraints. (a) Schematic of a pair of overlapping genes, in which multiple reading frames of a single coding sequence are translated into different proteins. (b) Definition and nomenclature of reading frames studied in this work. (c) Compatible pairs of amino acids a and b (as defined in (b)) that can be encoded across from each other in each of the three reading frames. (d) Diagrammatic representation of the fitness landscape of overlapping genes, which must satisfy the constraints of folding and function of each gene family, as well as the coding constraints imposed by the overlapping reading frames.
  • Figure 2: Overlapping genes de novo with MSA-trained generative models. (a) Sequence alignments of homologues are used to train a Potts model representation of each protein's fitness landscape, which we combine to obtain a generative model for overlapping genes. (b) Potts model energies over Monte Carlo iterations for a sequence overlapping a Fibronectin type III domain (PFAM PF00041) with a two-component response regulator domain (PFAM PF00072). 300 replicate trajectories are shown, each independently initialised. (c) Histogram of initial and final energies of each protein product, normalised as z-scores to the distribution of natural energies, for the different replicates in (b). (d) Comparison of crystal structures of representative members of each protein family (left; PDB IDs 1TEN and 6TNE) with AlphaFold predictions for gene products from coding sequences with increasing overlap.
  • Figure 3: Replica exchange Monte Carlo maps the joint fitness landscape of overlapping genes. (a) Decreasing temperature $T_1$ ($T_2$) acts as an increased selection pressure that favours the constraints of protein family 1 (2) in the sampled sequence. (b) Histograms of sampled energies, normalised as z-scores, for sequences sampled at indicated temperatures. For this overlap (PF00004 $\times$ PF00072 at 217 nt), decreasing the temperature of one protein increases the energy of the other (indicated by an orange arrowhead). (c) We implement a replica exchange Monte Carlo method to scan over all pairs of temperatures. (d, e) Heatmaps of z-scores for indicated protein pairs. Marked cross on left indicates the point at which both proteins simultaneously achieve the energy of the naturals (i.e. a z-score of 0). (f) Schematic of analysis, in which the energies of sequences sampled at all temperatures are plotted to identify a trade-off between satisfying the constraints of each protein family. (g, h) Trade-off analysis for indicated protein pairs. Grey points are samples from the joint fitness landscape. Dashed lines and black marker indicate the natural energies, with solid black arms denoting standard deviations of the naturals. In (h), the natural energies lie outside the region of achievable energies, indicating that the landscape is 'frustrated' and an overlap is not feasible.
  • Figure 4: A systematic overlap of protein families finds widespread compatibility in the -2 reading frame. (a) Schematic workflow, in which we survey the compatibility of overlap between all pairs of 17 protein families. (b) Heat map showing the maximum compatible overlap length for each pair as (lower left) nucleotides, and (upper right) a fraction of the smaller gene. (c) Protein-family pairs successfully overlapped, as a function of overlap length and split by reading frame. Solid line is a smoothed curve, shown as a guide to the eye.
  • Figure 5: The standard genetic code is more permissive to overlapping than randomised codes. (a) Schematic of analysis, in which we shuffle the genetic code to produce two classes of randomised codes: type I, which preserves the degeneracy of the standard code (i.e., number of codons for each amino acid), and type II, which preserves the synonymous structure of the standard code (i.e., synonymous mutations in the standard code remain synonymous). For each shuffled code, we repeat the trade-off analysis of Fig. 4 and measure the distance of the natural energies from the trade-off front. (b, d) Trade-offs for indicated overlapping pairs, computed for the standard code as well as one representative example each of the type I and type II shuffled codes. (c, e) Histograms of distance of the natural energies (in units of the z-score) from the trade-off fronts, computed over 100 type I and type II shuffled codes. Positive values indicate a frustrated overlap (i.e., natural energies are outside the trade-off front), while negative values indicate a feasible overlap (i.e., natural energies are inside the trade-off front). The dashed line marks the value under the standard code.
  • ...and 1 more figure
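The two classes of shuffled genetic code described in the Figure 5 caption can be sketched as follows. This is one plausible reading of the caption, not the authors' implementation. The function names are hypothetical, and treating the stop signal `*` as just another label to be shuffled is an assumption that may differ from the paper.

```python
import random
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
STANDARD_CODE = dict(zip(CODONS, AMINO_ACIDS))


def degeneracy(code):
    """Number of codons assigned to each amino acid (or stop)."""
    return Counter(code.values())


def shuffle_type_one(code, rng):
    """Type I: reassign labels to codons at random.

    Each amino acid keeps its number of codons, but the codons that
    encode it are no longer grouped as in the standard code.
    """
    labels = list(code.values())
    rng.shuffle(labels)
    return dict(zip(code.keys(), labels))


def shuffle_type_two(code, rng):
    """Type II: permute labels between synonymous blocks.

    Codons that were synonymous stay synonymous, but an amino acid may
    inherit a block of a different size (assumption: '*' is permuted too).
    """
    labels = sorted(set(code.values()))
    permuted = labels[:]
    rng.shuffle(permuted)
    relabel = dict(zip(labels, permuted))
    return {codon: relabel[aa] for codon, aa in code.items()}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
type_one_code = shuffle_type_one(STANDARD_CODE, rng)
type_two_code = shuffle_type_two(STANDARD_CODE, rng)
```

Under this reading, a type I code tests whether the degeneracy alone explains the result, while a type II code tests whether the block structure of synonymous codons alone explains it.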