Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused worldwide disruption, with over 5.21 million deaths and a global economy shut down for months.
The virus deserves to be studied in detail in order to find novel and effective ways to combat it, as well as to develop broad-spectrum approaches to, hopefully, combat future pandemics.
A new study reveals the results of examining the transcripts of the virus in order to understand how it expresses its gene products in the infected host. This could help develop effective drugs to fight the virus in the future.
Study: Nanopore ReCappable Sequencing maps SARS-CoV-2 5′ capping sites and provides new insights into the structure of sgRNAs.
A preprint version of the study is available on the bioRxiv* server, while the article undergoes peer review.
Background
SARS-CoV-2 is a virus with a single-stranded, positive-sense ribonucleic acid (RNA) genome that is among the largest RNA virus genomes known so far. The RNA molecule has a capped 5′ untranslated region (UTR) with a leader transcription regulatory sequence (TRS-L), a large open reading frame (ORF) called ORF1ab, and several other ORFs.
While ORF1ab encodes one large polyprotein that is then cleaved at specific sites to yield an array of non-structural viral proteins (NSPs), the other ORFs encode structural and accessory proteins. The 3′ end has a polyadenylated UTR. The NSPs arise by translation of the genome early in the replication cycle and form the replication-transcription complex (RTC).
The other ORFs are expressed from nested subgenomic RNA (sgRNA) molecules, which are transcribed within the RTC from negative-sense RNA intermediates. These sgRNAs express the viral structural and accessory proteins, which are required for successful viral particle assembly along with genomic RNA replication. Each ORF is preceded by a body TRS (TRS-B).
During negative-strand synthesis, the RTC can switch template at a TRS-B sequence, producing discontinuous negative-sense sgRNA intermediates. At each of these sites, the 5′-UTR TRS-L fuses with the TRS-B immediately preceding the ORF of interest. The result is a set of partially overlapping sgRNAs with lengths ranging from ~200 nt to over 8,000 nt.
These negative-sense sgRNAs act as templates for the synthesis of positive-strand coding sgRNAs, which receive a 5′ cap so that they can be translated into structural and accessory proteins.
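The leader-to-body fusion described above can be sketched on a toy sequence. This is only an illustration, not the study's method: the genome string is made up, the function name `build_sgrnas` is hypothetical, and the `ACGAAC` motif is the commonly reported SARS-CoV-2 TRS core, which the article itself does not state.

```python
# Toy illustration of discontinuous transcription: the 5' leader is fused
# to the genome at each body TRS (TRS-B), giving nested sgRNAs that all
# share the same 5' leader and the same 3' end.

TRS_CORE = "ACGAAC"  # assumption: commonly reported TRS core motif

def build_sgrnas(genome: str, trs: str = TRS_CORE) -> list[str]:
    """Return toy positive-sense sgRNAs, longest first."""
    trs_l = genome.find(trs)           # first TRS is treated as TRS-L
    if trs_l == -1:
        return []
    leader = genome[:trs_l]            # sequence upstream of TRS-L
    sgrnas = []
    pos = genome.find(trs, trs_l + 1)  # every later TRS is a TRS-B
    while pos != -1:
        # Fuse leader to the body; the TRS appears once at the junction.
        sgrnas.append(leader + genome[pos:])
        pos = genome.find(trs, pos + 1)
    return sgrnas

# Made-up genome: leader, TRS-L, ORF1ab-like stretch, then two ORFs,
# each preceded by its own TRS-B.
toy_genome = "GGUU" + TRS_CORE + "GUGUGUGU" + TRS_CORE + "CCCC" + TRS_CORE + "UU"
sgrnas = build_sgrnas(toy_genome)
for s in sgrnas:
    print(len(s), s)
```

In this toy case the two sgRNAs differ only in how much of the body they retain, which is why reads from the shared 3′ region alone cannot tell them apart.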
To examine these complex transcripts, long-read sequencing technologies such as nanopore direct RNA sequencing (DRS) are used. With this technique, native RNA molecules are sequenced directly, without reverse transcription or amplification, by measuring the changes in ionic current that result from the passage of nucleotides through a nanopore.
This yields full-length reads that make it easier to assemble such complex RNA molecules. However, DRS does not detect the 5′ cap, which makes it impossible to distinguish fragments of transcripts from genuine full-length transcripts. It therefore cannot reliably identify or quantify full-length coding transcripts.
Secondly, RNA breakdown produces abundant reads that map to the 3′ region, which is shared by all sgRNAs. This makes it hard to accurately quantify sgRNA expression.
For this reason, the current study used a new technique capable of identifying capped full-length RNAs, called Nanopore ReCappable Sequencing (NRCeq). This technique replaces the endogenous 5′ cap with a 5′ cap-linked RNA sequencing adapter. The aim was to produce a complete annotation of the sgRNAs of this virus, showing all the capping sites across the full length of the 30 kb genome.
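The practical effect of tagging capped molecules can be shown with a small sketch. It assumes, purely for illustration, that a read from a capped RNA begins with a known adapter sequence; the adapter string, the `Read` class, and the read data are all hypothetical, and real nanopore reads contain errors that would call for tolerant matching rather than an exact prefix check.

```python
from collections import Counter
from dataclasses import dataclass

ADAPTER = "CUCUCU"  # hypothetical 5' adapter sequence, made up for this sketch

@dataclass
class Read:
    sgrna: str     # transcript the read was assigned to (hypothetical label)
    sequence: str  # read sequence written 5' to 3'

def count_capped(reads: list[Read]) -> Counter:
    """Count only reads that start with the adapter, i.e. capped molecules."""
    return Counter(r.sgrna for r in reads if r.sequence.startswith(ADAPTER))

# Made-up reads: fragments lack the adapter because they lost their 5' end.
reads = [
    Read("S", ADAPTER + "GGUUACGAAC"),
    Read("S", "ACGAAC"),                 # 5'-truncated fragment
    Read("N", ADAPTER + "GGUUACGAACUU"),
    Read("N", ADAPTER + "GGUUACGAACUU"),
    Read("N", "AACUU"),                  # 5'-truncated fragment
]

all_counts = Counter(r.sgrna for r in reads)
capped_counts = count_capped(reads)
print("all reads:   ", dict(all_counts))
print("capped reads:", dict(capped_counts))
```

Counting only adapter-bearing reads removes the fragments from the totals, which is the kind of filtering that an unlabelled DRS run cannot perform.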
What did the study show?
The results showed that discontinuous transcription is the norm for this virus. NRCeq appears to capture the full-length capped viral sgRNA transcripts in all their complexity.
Their work resulted in an annotation of capping sites across the viral genome that could allow the RNA start sites to be identified without the need to assemble the whole transcriptome. This is the first time such data has become available and could be very useful in studying SARS-CoV-2 transcription.
The researchers assembled a de novo transcriptome that includes all previously annotated ORFs. This was confirmed and refined using bioinformatic tools that detect deletions as well as novel non-canonical sgRNAs. Some non-canonical sgRNAs could be due to optimization errors, though others are undoubtedly genuine and may regulate RNA structure, stability, or interactions with proteins.
The results support the existence of an sgRNA for ORF10, which has been controversial, with over 100 reads spanning the leader-to-body junction of this ORF, as well as TRS-B sequences. The researchers also found a novel canonical sgRNA, called ORF9d, that encodes a shortened version of the nucleocapsid protein. The significance of this ORF remains unknown.
One of the novel sgRNAs uses a leader-to-body junction site that has never been noted before. The researchers also noted that this technique falsely reports a lower spike ORF expression, unlike standard DRS data.
They were able to obtain accurate estimates of sgRNA expression in various cell lines and for various viral isolates. These show that the longer the sgRNA, the lower its expression, probably because the RTC can switch templates at each TRS-B sequence it comes across, so fewer complexes reach the more distant junctions that give rise to longer sgRNAs.
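The suggested link between sgRNA length and expression can be illustrated with a toy calculation. It assumes, only for the sake of the example, that the RTC switches template with the same fixed probability at every TRS-B it meets; the function name `switch_fractions` and the probability value are hypothetical and are not taken from the study.

```python
def switch_fractions(n_sites: int, p_switch: float) -> list[float]:
    """Fraction of complexes that switch at each successive TRS-B.

    Index 0 is the first TRS-B encountered (shortest sgRNA); later
    indices correspond to more distant sites and longer sgRNAs.
    """
    fractions = []
    still_reading = 1.0  # fraction of complexes that have not yet switched
    for _ in range(n_sites):
        fractions.append(still_reading * p_switch)
        still_reading *= 1.0 - p_switch
    return fractions

# Assumed switching probability of 0.5 at each of three TRS-B sites.
fractions = switch_fractions(3, 0.5)
print(fractions)  # [0.5, 0.25, 0.125]
```

Under this assumption the expected abundance falls geometrically with each extra TRS-B the complex must read through. Real switching probabilities are unlikely to be uniform, so the sketch shows only the direction of the trend.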
Conclusion
“NRCeq is a robust technique that permits assembly and quantification of complex transcriptomes. The data generated in this work constitute a useful resource for the scientific community and provide important insights into the mechanisms that regulate the transcription of SARS-CoV-2 sgRNAs.”
*Important notice: bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice/health-related behavior, or treated as established information.