Researchers construct models to predict SARS-CoV-2 mutable and constrained positions

Many variants of concern (VOCs) carry mutations at significant positions in the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome. These mutations often cause more severe disease, more rapid transmission, or increased immune evasion. In a study published in PNAS, researchers set out to build a model that can predict which positions are likely to mutate in the future.

Study: Epistatic models predict mutable sites in SARS-CoV-2 proteins and epitopes.

The study

The researchers extracted multiple sequence alignments (MSAs) for the 39 protein domains that make up the SARS-CoV-2 proteome from publicly available databases. The extracted sequences come not from SARS-CoV-2 itself but from other Coronaviridae. The alignments were then used to train independent (IND) and epistatic direct coupling analysis (DCA) models, which were applied to a wild-type SARS-CoV-2 reference strain to predict the mutability of each site.
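The difference between the two model types can be illustrated with a minimal sketch. It assumes the standard Potts-model form used in DCA-style models, in which a sequence's score (its statistical energy, with lower values being more favorable) is built from per-site terms plus pairwise coupling terms; an independent model is the special case with no couplings. The function names, the parameter values, and the two-residue toy sequence are all made up for illustration and are not taken from the study. The sign convention (reference minus variant) follows the article's wording.

```python
def potts_energy(seq, fields, couplings):
    """Statistical energy: E = -sum_i h_i(a_i) - sum_{i<j} J_ij(a_i, a_j)."""
    energy = -sum(fields[i].get(a, 0.0) for i, a in enumerate(seq))
    for (i, j), table in couplings.items():
        energy -= table.get((seq[i], seq[j]), 0.0)
    return energy


def energy_change(reference, pos, new_residue, fields, couplings):
    """E(reference) - E(variant); positive means the variant scores as more favorable."""
    variant = reference[:pos] + new_residue + reference[pos + 1:]
    return potts_energy(reference, fields, couplings) - potts_energy(variant, fields, couplings)


# Hypothetical toy parameters for a two-site sequence (not fitted to any data).
fields = [{"A": 1.0, "D": 0.5}, {"K": 1.0, "E": 0.5}]
couplings = {(0, 1): {("D", "K"): 1.5, ("A", "E"): 1.0}}
no_couplings = {}  # an independent (IND-like) model has no pairwise terms

# The same mutation (A -> D at site 0) in two different sequence backgrounds.
ind_in_AK = energy_change("AK", 0, "D", fields, no_couplings)
ind_in_AE = energy_change("AE", 0, "D", fields, no_couplings)
dca_in_AK = energy_change("AK", 0, "D", fields, couplings)
dca_in_AE = energy_change("AE", 0, "D", fields, couplings)

print("independent:", ind_in_AK, ind_in_AE)
print("epistatic:  ", dca_in_AK, dca_in_AE)
```

Without couplings, the mutation receives the same score in both backgrounds. With couplings, the score depends on the residue at the other site, which is the context dependence that the term "epistatic" refers to.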

The researchers validated the models against deep mutational scanning (DMS) data for protein expression, comparing the experimentally measured mutational effects with the model predictions. Sequences extracted from GISAID were then used to estimate empirical variability.

For both the IND and DCA approaches, the MSA of homologous proteins was used to learn a family-specific sequence landscape, the 'statistical energy' (SE), which assigns lower values to functional sequences and higher values to non-functional ones. Each entry in a sequence may be any of the 20 natural amino acids or an alignment gap, so any variant with at least one mutation can be characterized by its change in statistical energy. This change, calculated by subtracting the SE of the variant from the SE of the reference, can be averaged over all changes at a single position that are reachable by one mutation, which gives the mutability score for that position.
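The scoring procedure described above can be sketched for the simpler independent case. This is a minimal illustration under several assumptions that do not come from the article: the SE is taken as the negative sum of log column frequencies, a uniform pseudocount is added to avoid zero frequencies, and the average runs over all 19 alternative amino acids rather than only those reachable by one mutation. The toy alignment and all function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the alignment gap


def site_frequencies(msa, pseudocount=1.0):
    """Per-column symbol frequencies, smoothed with a uniform pseudocount."""
    total = len(msa) + pseudocount * len(ALPHABET)
    freqs = []
    for i in range(len(msa[0])):
        counts = Counter(seq[i] for seq in msa)
        freqs.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return freqs


def statistical_energy(seq, freqs):
    """Independent-site SE: lower values for sequences that resemble the MSA."""
    return -sum(math.log(freqs[i][a]) for i, a in enumerate(seq))


def mutability_scores(reference, freqs):
    """Average of SE(reference) - SE(variant) over substitutions at each site."""
    e_ref = statistical_energy(reference, freqs)
    scores = []
    for i, wild_type in enumerate(reference):
        deltas = []
        for a in ALPHABET:
            if a == wild_type or a == "-":
                continue  # simplification: only amino acid substitutions
            variant = reference[:i] + a + reference[i + 1:]
            deltas.append(e_ref - statistical_energy(variant, freqs))
        scores.append(sum(deltas) / len(deltas))
    return scores


# Hypothetical toy alignment: column 0 is conserved, column 2 is variable.
toy_msa = ["MKVA", "MKIA", "MRLA", "MKVG"]
freqs = site_frequencies(toy_msa)
scores = mutability_scores("MKVA", freqs)
print([round(s, 3) for s in scores])
```

Under this sign convention, a higher (less negative) score marks a position that tolerates change more readily, so the variable column 2 scores higher than the fully conserved column 0.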

To test those mutability scores, the researchers extracted a second MSA for the same 39 domains from SARS-CoV-2 variants in the GISAID database. Redundant amino acid sequences were removed so that each distinct sequence was kept only once, because the highly uneven sequencing efforts across countries raised the risk of frequency biases. Position-specific observed variability is the number of distinct sequences in the MSA that carry a mutation at a given position relative to the wild-type reference sequence. Because the receptor-binding domain (RBD) is the domain with the most existing data, it was used to validate the predictions for the remainder of the domains.
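The observed-variability count can be sketched in a few lines. The example assumes that all sequences are already aligned to the reference and have the same length; the function name and the four-residue toy sequences are hypothetical and are not real GISAID data.

```python
def observed_variability(sequences, reference):
    """Count distinct sequences that differ from the reference at each position."""
    distinct = set(sequences)  # keep each distinct sequence only once
    counts = [0] * len(reference)
    for seq in distinct:
        for i, (residue, wild_type) in enumerate(zip(seq, reference)):
            if residue != wild_type:
                counts[i] += 1
    return counts


# Hypothetical toy data: "YEKT" is sampled twice but counted once.
reference = "NEKT"
sampled = ["NEKT", "YEKT", "YEKT", "NKKT", "YKKT"]
variability = observed_variability(sampled, reference)
print(variability)
```

Collapsing duplicates before counting means a heavily sequenced lineage contributes no more than a rarely sequenced one, which is the frequency bias the deduplication step is meant to limit.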

The researchers then compared the predicted mutational effects with experimental protein expression, focusing on position-specific mutability. This was obtained by averaging both predictions and experimental measurements over all accessible amino acid changes at a given position. The DCA model outperformed the IND model here, showing a stronger correlation with experimental expression. The trend held when individual, amino-acid-specific predictions were checked. The model did predict some mutations to be deleterious that were listed as neutral, which the scientists suggest is due to either undersampling or a lack of effect on expression.
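A comparison of this kind can be sketched as follows. The article does not say which correlation measure was used, so the choice of Spearman rank correlation here is an assumption, as are the function names and the toy numbers, which are invented for illustration only.

```python
import math


def average_by_position(effects):
    """Average per-mutation values, keyed by (position, amino_acid), per position."""
    grouped = {}
    for (position, _amino_acid), value in effects.items():
        grouped.setdefault(position, []).append(value)
    return {pos: sum(vals) / len(vals) for pos, vals in grouped.items()}


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in order[start:end + 1]:
            result[k] = (start + end) / 2 + 1
        start = end + 1
    return result


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy values: model predictions and measured expression effects.
predicted = {(1, "A"): -1.0, (1, "D"): -3.0, (2, "A"): -0.5,
             (2, "D"): -0.5, (3, "A"): -4.0, (3, "D"): -2.0}
measured = {(1, "A"): -0.2, (1, "D"): -0.6, (2, "A"): 0.0,
            (2, "D"): -0.2, (3, "A"): -1.5, (3, "D"): -0.5}

pred_by_pos = average_by_position(predicted)
meas_by_pos = average_by_position(measured)
shared = sorted(pred_by_pos.keys() & meas_by_pos.keys())
rho = spearman([pred_by_pos[p] for p in shared], [meas_by_pos[p] for p in shared])
print(rho)
```

A rank-based measure only asks whether the model orders positions the same way the experiment does, which suits scores that sit on different scales.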

The scientists then tested whether the epistatic model could be used to predict new variants by identifying positions with favorable mutability scores. They compared currently observable variability with the model-based mutability score and with the mutations expected from experimental protein expression. The DCA model correlated significantly more closely with observed variability than the IND model did. The scientists used the deep learning tool DeepSequence to further confirm these results. They found that its predictions correlated well with the DCA predictions, albeit with weaker correlations with protein expression and variability.

Following this, the researchers plotted the Immune Epitope Database (IEDB) response frequency (RF; the number of responding subjects relative to the total number tested, averaged over all epitopes covering a single position) against the DCA mutability score for each position in the RBD. They found a restricted set of nine positions that showed high DCA and RF scores simultaneously, four of which are observed circulating in variants of concern. Prominent positions such as N501 and E484 are among them. The researchers highlight this technique's potential for identifying which positions are likely candidates for mutations that could lead to some immune evasion.
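The selection step can be sketched as a simple two-threshold filter. The sketch computes a per-position RF as defined above and then keeps positions where both scores are high. The thresholds, the function names, the assumption that a higher mutability score means a more mutable position, and every number in the toy data are hypothetical; none of them come from the study or from the IEDB.

```python
def position_response_frequency(epitopes, length):
    """Average RF (responders / tested) over all epitopes covering each position.

    `epitopes` holds (start, end, responded, tested) tuples with inclusive,
    zero-based coordinates.
    """
    sums = [0.0] * length
    counts = [0] * length
    for start, end, responded, tested in epitopes:
        rf = responded / tested
        for pos in range(start, end + 1):
            sums[pos] += rf
            counts[pos] += 1
    return {pos: sums[pos] / counts[pos] for pos in range(length) if counts[pos]}


def candidate_positions(mutability, response_frequency, mut_threshold, rf_threshold):
    """Positions scoring at or above both thresholds."""
    shared = mutability.keys() & response_frequency.keys()
    return sorted(
        pos for pos in shared
        if mutability[pos] >= mut_threshold and response_frequency[pos] >= rf_threshold
    )


# Hypothetical toy data for a six-residue region.
epitopes = [(0, 2, 8, 10), (2, 4, 2, 10)]
rf = position_response_frequency(epitopes, length=6)
mutability = {0: -1.6, 1: -0.4, 2: -0.3, 3: -0.2, 4: -1.5, 5: -0.1}

candidates = candidate_positions(mutability, rf, mut_threshold=-0.5, rf_threshold=0.4)
print(candidates)
```

Position 0 is strongly targeted but scores as constrained, while position 3 is mutable but rarely targeted, so only positions 1 and 2 pass both filters in this toy example.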


The scientists have shown that their computational predictions can anticipate which positions in SARS-CoV-2 are likely to mutate and which have a high potential to confer immune escape. Four of the nine identified positions are currently mutated in variants of concern or variants of interest, and the researchers advise monitoring the other positions in new variants. This information could be useful for epidemiologists and could help predict the next dominant variant, perhaps informing public health policy.

Journal reference:
  • Rodriguez-Rivas, J. et al. (2022) "Epistatic models predict mutable sites in SARS-CoV-2 proteins and epitopes", Proceedings of the National Academy of Sciences, 119(4), p. e2113118119. doi: 10.1073/pnas.2113118119.


Written by

Sam Hancock

Sam completed his MSci in Genetics at the University of Nottingham in 2019, fuelled initially by an interest in genetic ageing. As part of his degree, he also investigated the role of rnh genes in originless replication in archaea.
