Prediction of Alternative RNA Conformations with Dependency Map and VQVAE
Introduction: Why alternative RNA structures matter
Imagine an RNA transcript that can fold into two different secondary structures depending on its environment – each fold changes which bases are exposed, thereby altering interactions with proteins or other RNAs. These alternative conformations act as molecular switches in regulation, and recovering them from experimental data is a central challenge in regulatory genomics.

Dimethyl sulfate mutational profiling with sequencing (DMS-MapSeq) probes RNA structure in vivo by chemically modifying unpaired adenines and cytosines; the modifications appear as mutations in sequencing reads and provide a window into structural states [1]. In cases of structural heterogeneity, reads are mixtures of signals from different conformations, and deconvolving these mixed populations typically requires high coverage and careful modeling.
The approach described here investigates how structural priors, encoded as dependency maps (learned or computed pairwise relationships between nucleotides), interact with discrete latent models – specifically Vector-Quantized Variational Autoencoders (VQ-VAEs) – to improve robustness of conformation discovery from sparse DMS-like data. Rather than focusing primarily on final accuracy numbers, the emphasis is on conceptual mechanisms: how structural information can be represented, injected into a model, and leveraged to stabilise unsupervised clustering when the mutation signal is weak.
Conceptual ingredients
Three conceptual building blocks underpin the study: (1) the DMS signal as a noisy indicator of single-nucleotide pairing status, (2) dependency maps as compact structural priors, and (3) VQ-VAE as a discrete latent model that can partition reads into conformation-specific clusters.
DMS signal and its limitations
DMS mutations provide direct but noisy evidence for unpaired bases. The strength of this signal depends on the experimental mutation rate and on how long a read spans the region of interest (window size). At low mutation rates or short windows, per-read information can be nearly indistinguishable from noise. Under such regimes, purely data-driven clustering can fail to form biologically meaningful groups.
Dependency maps as structural priors
A dependency map captures pairwise associations between nucleotide positions (for example, derived from a language model, covariance analysis, or prior structural predictions). Conceptually, it encodes which positions are likely to be interacting (e.g., base pairs) or otherwise co-constrained. The dependency map can be used in multiple ways:
- as an attention bias that nudges the model to consider particular pairwise interactions more strongly,
- as an auxiliary objective that encourages learned discrete representations to align with a reduced structural profile, or
- as an additional input track that provides direct per-position structural importance.

Providing this prior helps the model disambiguate weak mutation signals by offering an orthogonal source of information tied to folding physics and evolutionary constraints.
VQ-VAE for discrete clustering
Vector-Quantized VAEs produce a discrete latent representation by mapping encoder outputs to a small codebook of prototypes. Formally, an encoder produces continuous vectors \(z_e\) which are quantized to the nearest codebook vectors \(z_q\) drawn from \(E=\{e_1,\dots,e_K\}\). Discrete latents are desirable when the goal is to partition reads into a small number of conformational states (e.g., \(K=4\) codebook entries in the experiments). The codebook is optimized jointly with the encoder and decoder via a combination of reconstruction and codebook/commitment losses, stabilised using the straight-through estimator for gradient flow [2].
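The quantization step can be illustrated with a minimal numpy sketch (forward pass only; function names are illustrative, and the stop-gradient \(\text{sg}(\cdot)\) is the identity in a forward pass, so the two loss terms coincide numerically and differ only in which parameters receive gradients):

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder vector to its nearest codebook entry (L2 distance).

    z_e:      (N, D) continuous encoder outputs
    codebook: (K, D) prototype vectors e_1..e_K
    Returns discrete indices and the quantized vectors z_q.
    """
    # Pairwise squared distances between encoder outputs and prototypes.
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    idx = d2.argmin(axis=1)                                       # (N,)
    z_q = codebook[idx]                                           # (N, D)
    return idx, z_q

def vq_losses(z_e, z_q):
    """Codebook and commitment terms. In a forward pass both evaluate to
    the same number; in training, sg(.) routes gradients so the first
    updates the codebook and the second updates the encoder."""
    l_codebook = ((z_e - z_q) ** 2).sum(-1).mean()
    l_commit = ((z_e - z_q) ** 2).sum(-1).mean()
    return l_codebook, l_commit
```

In a real implementation the straight-through estimator copies gradients from \(z_q\) back to \(z_e\) so the encoder remains trainable through the non-differentiable argmin.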
Model design: a Multi-track, Dependency-Aware pipeline
The model is a two-stage pipeline: first, a Transformer is pre-trained with a Masked Language Modeling (MLM) objective to learn local context in mutation sequences; second, the pre-trained Transformer becomes the encoder of a VQ-VAE (the DMSVQVAE), yielding discrete latent representations of DMS profiles.
Multi-track input representation
The Transformer accepts a multi-track representation with four conceptual channels:
- Position track: per-position tokens \(p_i\) that encode sequence coordinates and provide spatial context.
- Mutation track: the primary binary input (‘0’ wild-type, ‘1’ mutated) representing the DMS read.
- Condition track: a global condition token repeated across positions (e.g., ligand vs. control).
- Dependency map track: a 1-D summary of a 2-D dependency map produced by averaging pairwise scores per nucleotide, binning the resulting profile, and encoding bins as tokens.
This modular design enables direct experiments on how each modality contributes to representation learning.
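The dependency-map track reduction described above (average pairwise scores per position, bin the resulting profile, encode bins as tokens) can be sketched as follows; the function name and the bin count are illustrative choices, not values from the text:

```python
import numpy as np

def dep_map_track(dep_map, n_bins=8):
    """Collapse an (L, L) pairwise dependency map into per-position tokens.

    1. Average each row to get one structural-importance score per position.
    2. Discretize the scores into equal-width bins.
    3. Return the bin indices, which serve as input tokens for the track.
    """
    profile = dep_map.mean(axis=1)                          # (L,) per-position score
    edges = np.linspace(profile.min(), profile.max(), n_bins + 1)
    tokens = np.clip(np.digitize(profile, edges[1:-1]), 0, n_bins - 1)
    return tokens
```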
Dependency-Aware attention: SoftGateAttention
A central architectural innovation is a modified attention mechanism – SoftGateAttention – that incorporates the dependency map as an additive bias to attention scores prior to the softmax:
\[ \tilde{A}_{ij} = \frac{(QK^\top)_{ij}}{\sqrt{d_{\text{model}}}} + \gamma \cdot \text{dep\_map}_{ij}, \]\[ \text{Attention}(Q,K,V) = \text{softmax}(\tilde{A})V, \]where \(\gamma\) controls the strength of the structural bias and \(\text{dep\_map}_{ij}\) encodes the prior association between positions \(i\) and \(j\). Activating this bias guides the model to allocate attention mass to structurally plausible interactions, which is especially useful when mutation evidence alone is insufficient.
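A single-head numpy sketch of this biased attention makes the mechanism concrete (the function name is illustrative):

```python
import numpy as np

def soft_gate_attention(Q, K, V, dep_map, gamma=1.0):
    """Scaled dot-product attention with an additive structural bias.

    The dependency map enters as gamma * dep_map added to the raw
    attention scores before the softmax, steering attention mass toward
    position pairs with prior structural evidence.
    """
    d_model = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_model) + gamma * dep_map  # (L, L)
    scores -= scores.max(axis=-1, keepdims=True)           # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V, weights
```

With \(\gamma = 0\) this reduces to standard scaled dot-product attention, which is why annealing \(\gamma\) upward is a natural way to phase the prior in.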
VQ-VAE integration
The encoder (the pre-trained Transformer) maps inputs to continuous latent vectors \(z_e\), which the Vector Quantizer maps to the nearest codebook vectors \(z_q\). The optimization objective combines MLM, reconstruction, codebook, and commitment losses:
\[ L_{\text{codebook}} = \| \text{sg}(z_e) - z_q \|_2^2, \]\[ L_{\text{commit}} = \| z_e - \text{sg}(z_q) \|_2^2, \]and the overall loss
\[ L_{\text{total}} = \alpha_{\text{MLM}} L_{\text{MLM}} + \alpha_{\text{recon}} L_{\text{recon}} + L_{\text{codebook}} + \beta L_{\text{commit}}. \]Hyperparameters set in the experiments were \(\alpha_{\text{MLM}}=\alpha_{\text{recon}}=1.0\), \(\beta=0.25\), and codebook size \(K=4\).

Mechanisms for injecting structural priors – conceptual tradeoffs
Three distinct strategies for incorporating the dependency map were investigated. Each has a different conceptual footprint in terms of expressivity and inductive bias.
Attention Bias (early inductive nudging)
Injecting dep_map as an additive bias to attention scores acts early in the representation stack. Conceptually, this is an inductive nudging mechanism: the model’s receptive field is reweighted to prefer pairs with prior structural evidence. This can accelerate discovery of conformation-relevant patterns but risks over-committing to the prior when it is noisy.
Auxiliary loss on the codebook (global alignment)
Reducing the 2-D dep_map to a 1-D structural profile and aligning the mean codebook embedding to this profile via an MSE term encourages the discrete prototypes themselves to capture structural roles. Conceptually, this is a global regularizer on the discrete latent space: it does not force per-read assignments, but it biases the codebook prototypes toward reflecting the overall structural profile of positions. This can stabilise the codebook when individual reads are weakly informative, but it may reduce per-read flexibility.
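The alignment term can be sketched in numpy as below; this assumes the 1-D structural profile has already been mapped into the codebook's embedding space (that projection is glossed over here, and the function name is illustrative):

```python
import numpy as np

def codebook_alignment_loss(codebook, structural_profile):
    """MSE between the mean codebook embedding and a reduced structural
    profile, acting as a global regularizer on the discrete latent space.

    codebook:           (K, D) discrete prototypes
    structural_profile: (D,) dependency-map profile in embedding space
    """
    mean_code = codebook.mean(axis=0)                  # (D,) average prototype
    return ((mean_code - structural_profile) ** 2).mean()
```

Because only the mean of the codebook is constrained, individual prototypes remain free to specialize per conformation, which is the intended tradeoff of this global regularizer.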
Dependency map as a separate input track (direct feature provision)
Providing the dep_map as an input track gives the model explicit, per-position structural signals. This is the least prescriptive approach: the model can learn how to combine sequence and structural inputs. In low-signal regimes, direct access to structural scores is often the most robust option because the model can rely on the track rather than attempting to infer structure purely from mutations.
Experimental plan and conceptual evaluation
The experiments were designed on a fully synthetic toy dataset consisting of sequences that adopt two distinct conformations. The controlled setup allows isolation of conceptual effects: mutation rate (2%, 5%, 20%), window size (e.g., 50), and the presence/absence of the sequence track.
Key evaluation modalities focused on interpretability of learned representations rather than high-throughput benchmarks.
Training-time diagnostics
- PCA of latent embeddings (z_e): visual checks for cluster separation corresponding to ground-truth conformations.
- Cluster purity: a simple, real-time metric to assess whether codebook indices concentrate on single conformations.
These diagnostics reveal how early and how stably the model discovers distinct structural states.
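Cluster purity as used above admits a very short implementation: for each codebook index, count how many reads belong to that cluster's majority ground-truth conformation, then divide by the total number of reads. A sketch:

```python
from collections import Counter

def cluster_purity(code_indices, true_labels):
    """Fraction of reads whose codebook cluster's majority ground-truth
    conformation matches their own label."""
    clusters = {}
    for idx, lab in zip(code_indices, true_labels):
        clusters.setdefault(idx, []).append(lab)
    # For each cluster, the majority label count is the number of
    # "correctly" grouped reads.
    correct = sum(Counter(labs).most_common(1)[0][1]
                  for labs in clusters.values())
    return correct / len(code_indices)
```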
Post-training metrics
- Validation loss and reconstruction (binary cross-entropy) measure fit to data.
- Perplexity of codebook usage diagnoses collapse or under-utilization of discrete latents.
- Balanced accuracy of mapping from codebook indices to ground-truth conformations measures the final clustering quality.
- Minimal base pair distance (using RNAsubopt with a codebook-derived constraint) probes biological relevance of learned prototypes.
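Codebook-usage perplexity, the second metric above, is the exponential of the entropy of the empirical code distribution; a sketch (illustrative function name, \(K=4\) as in the experiments):

```python
import numpy as np

def codebook_perplexity(code_indices, K=4):
    """Perplexity of codebook usage: exp of the entropy of the empirical
    code distribution. Equals K under uniform usage of all codes and 1
    when the codebook has collapsed onto a single entry."""
    counts = np.bincount(code_indices, minlength=K).astype(float)
    p = counts / counts.sum()
    entropy = -(p[p > 0] * np.log(p[p > 0])).sum()
    return float(np.exp(entropy))
```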
Key findings – conceptual summary
The experiments emphasize concept over raw numbers. The main conceptual takeaways are:
- Signal strength dictates learnability. At high mutation rates (e.g., 20%) and sufficiently long windows (e.g., 50), DMS patterns alone provide enough information for meaningful clustering; all variants including the baseline VQ-VAE form reasonable clusters. In such regimes, structural priors offer modest gains.
- Structural priors stabilise learning under weak signal. At low mutation rates (2–5%) or with short windows, the mutation track is insufficiently informative. In these settings, the dependency map as a direct input track (the DepMap Track variant) consistently improved clustering purity and balanced accuracy. Conceptually, this demonstrates that giving the model explicit structural features is a powerful way to regularise latent learning when the primary signal is weak.
- Method of integration matters. Attention biases (DepMap Attention) provide a lightweight inductive push that can help when priors are relatively accurate; however, if the dep_map is noisy, attention bias risks amplifying incorrect interactions. The auxiliary codebook loss offers global alignment benefits but can overly constrain prototype flexibility if weighting is too strong. The input-track approach strikes a favorable balance between robustness and flexibility because it lets the model decide how much to rely on the prior.
- Discrete prototypes reveal structure and failure modes. Codebook analysis and RNAsubopt-based probing showed that some codebook prototypes capture conformation-specific structural profiles, whereas others act as catch-alls when the model cannot disambiguate reads. Perplexity is a useful diagnostic: extremely low perplexity indicates codebook collapse and loss of discriminative power.
Final Thoughts
Limitations and conceptual implications
Several important caveats inform the practical use of these ideas:
- Toy dataset vs. biology. Controlled synthetic sequences illuminate mechanisms but do not capture full biological complexity (longer-range tertiary interactions, experimental biases, sequencing errors). The conceptual findings should therefore guide, not guarantee, performance on real DMS experiments.
- Dependency map quality matters. The usefulness of structural priors depends on their fidelity. Learned or predicted dep_maps derived from miscalibrated models can mislead attention biases; hence, uncertainty about priors should be explicitly modelled or annealed.
- Choice of K and codebook dynamics. Small codebooks (e.g., \(K=4\)) are interpretable and align with the notion of a few dominant conformations, but more complex ensembles may require larger or hierarchical discrete spaces.
- Integration hyperparameters require care. Additive attention biases and auxiliary loss weights are potent levers; conceptually, they should be tuned with cross-validation or held-out structural signals to avoid overfitting to the prior.
Practical recommendations (conceptual checklist)
For practitioners interested in using structural priors with discrete latent models for RNA structural deconvolution, the following conceptual checklist may help:
- Assess mutation signal strength (mutation rate, read length). When strong, simpler models can suffice; when weak, inject priors.
- Use dep_map as an input track first – this offers robustness without strong prescriptive constraints.
- Monitor codebook perplexity and PCA trajectories during training to detect collapse or late separation.
- If using attention biases, anneal the bias strength (parameter \(\gamma\)) from small to larger values only if the prior is believed to be reliable.
- Validate prototypes biologically (e.g., RNAsubopt probing) to ensure learned codebook vectors correspond to plausible structures.
Conclusion
Recovering alternative RNA conformations from noisy DMS-like read data is fundamentally an inference problem at the intersection of noisy observations and structural constraints. The conceptual contributions of this work are threefold:
- the design of a multi-track Transformer that flexibly represents mutations, position, conditions, and structural priors;
- the articulation and comparison of three mechanisms for injecting dependency maps (attention bias, auxiliary codebook alignment, and direct input track), together with their conceptual tradeoffs; and
- the demonstration that structural priors stabilise discrete representation learning when primary signals are weak, with the input-track strategy providing the most consistent robustness in the toy experiments.
These insights form a conceptual blueprint for future applications: when signal is sparse, explicitly providing biologically informed priors to representation learners can convert an ill-posed clustering problem into a tractable one – provided that the prior is integrated in a flexible, well-regularised manner.
List of Abbreviations
| Abbreviation | Meaning |
|---|---|
| DMS | Dimethyl sulfate |
| MLM | Masked Language Modeling |
| VQ-VAE | Vector-Quantized Variational Autoencoder |
| dep_map | Dependency map (pairwise nucleotide association) |
| STE | Straight-Through Estimator |
| PCA | Principal Component Analysis |
References
[1] E. Morandi et al., “Genome-scale deconvolution of RNA structure ensembles,” Nat Methods, vol. 18, no. 3, pp. 249–252, Mar. 2021. doi: 10.1038/s41592-021-01075-w.
[2] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural Discrete Representation Learning,” arXiv:1711.00937, May 30, 2018. doi: 10.48550/arXiv.1711.00937.
[3] P. T. da Silva et al., “Nucleotide dependency analysis of DNA language models reveals genomic functional elements,” bioRxiv, July 27, 2024. doi: 10.1101/2024.07.27.605418.
[4] L. Moyon, “Prediction of alternative RNA conformations with RNALMs and VQVAEs.”
[5] R. J. Penić et al., “RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks,” arXiv, accessed June 10, 2025. Available: https://arxiv.org/abs/2403.00043v2.
[6] I. Borovská et al., “RNA secondary structure ensemble mapping in a living cell identifies conserved RNA regulatory switches and thermometers,” bioRxiv, Sept. 16, 2024. doi: 10.1101/2024.09.16.613214.
[7] E. Nguyen et al., “Sequence modeling and design from molecular to genome scale with Evo,” Science, vol. 386, no. 6723, p. eado9336, Nov. 2024. doi: 10.1126/science.ado9336.