
By sequencing many cases of SARS-CoV-2, the virus that causes COVID-19, we can learn about how, where, and when it is transmitted. Sequencing is a process used to find out the order of bases (the “letters” of DNA or, for SARS-CoV-2, RNA) in a genome. All viruses (and every living thing, for that matter) make small errors when replicating their genomes. While these errors (called “mutations”) are generally harmless, they are passed on, so every later infection descended from that virus carries the same error. As the virus accumulates these errors, we can sequence viral genomes from different cases and use the errors they share to link cases together.
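The idea of linking cases through shared errors can be sketched in a few lines of Python. This is a minimal illustration, not our analysis pipeline: the 10-base reference, the case names, and the helper functions `mutations` and `shared_mutations` are all invented for the example, and it assumes the sequences are already aligned to the reference.

```python
from collections import defaultdict


def mutations(reference, sample):
    """Return the set of (position, reference_base, sample_base) differences.

    Positions are 1-based. "N" (an unreadable base) is not counted as a mutation.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return frozenset(
        (i + 1, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base and sample_base != "N"
    )


def shared_mutations(reference, samples):
    """Map each mutation found in more than one sample to the samples carrying it."""
    carriers = defaultdict(list)
    for name, sequence in samples.items():
        for mutation in mutations(reference, sequence):
            carriers[mutation].append(name)
    return {m: sorted(names) for m, names in carriers.items() if len(names) > 1}


# Toy data, invented for illustration only (real genomes are ~30,000 bases long).
REFERENCE = "ACGTACGTAC"
SAMPLES = {
    "case_1": "ACGTACGTAC",  # identical to the reference
    "case_2": "ACTTACGTAC",  # G->T at position 3
    "case_3": "ACTTACGTAT",  # G->T at position 3, plus C->T at position 10
    "case_4": "ACGTACGAAC",  # T->A at position 8 only
}

linked = shared_mutations(REFERENCE, SAMPLES)
for mutation, cases in linked.items():
    print(mutation, cases)
```

In this toy data set, case_2 and case_3 share the change at position 3, which suggests they belong to the same chain of transmission, while case_4 carries an unrelated change.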

Lead investigator Nathan Grubaugh explains genomic epidemiology in the context of COVID-19

What we can – and can’t – learn from pathogen genomics

Genomics can be used to answer a wide variety of questions. In fact, the novel coronavirus was first discovered when a large group of people suddenly became sick with a strange pneumonia, and researchers at the Shanghai Public Health Clinical Center & School of Public Health sequenced samples from the patients to identify what was making them sick. That genome is still used by many others as a reference. We’re using genomics to answer a few different questions related to transmission in Connecticut (and surrounding states), such as:

  • Is SARS-CoV-2 frequently imported from other outbreaks, or are most cases related to SARS-CoV-2 already circulating in Connecticut?
  • When did the epidemic begin in Connecticut?
  • Where was the virus introduced from?
  • What factors are contributing to spread within Connecticut?

While genomics is a powerful tool for understanding the SARS-CoV-2 pandemic, there are some important caveats to consider. First, and most important, the sequences generated are not a random sample of cases and are not representative of all outbreaks. The virus will only be sequenced from a small fraction of COVID-19 cases, and these cases are typically located closer to research or public health institutions with sequencing capabilities. This means that, while we may make general inferences about geographic spreading patterns, we can’t draw exact conclusions about where a virus came from without data from many different places. For example, if the closest genetic relative of a SARS-CoV-2 virus sequenced in location A is an earlier sequence from location B, that does not mean the virus moved directly from B to A; there were likely many unsampled cases in between. The virus may have traveled from B to C, C to D, and finally D to A.

From Trevor Bedford
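The B-to-C-to-D-to-A example can be made concrete with a small Python sketch. It is a deliberate simplification under stated assumptions: the transmission chain is hypothetical, each location has exactly one source, and "closest sampled relative" is approximated by walking back along the chain until a sequenced location is reached. The function name `nearest_sampled_source` is made up for this illustration.

```python
# Hypothetical chain: each location maps to the location it was infected from.
INFECTED_FROM = {"A": "D", "D": "C", "C": "B", "B": None}


def nearest_sampled_source(location, infected_from, sampled):
    """Walk back along the chain until a location with sequence data is found.

    Returns (source, hops), where hops is the number of transmission steps
    between `location` and the nearest sampled source (None if there is none).
    """
    hops = 0
    current = infected_from[location]
    while current is not None:
        hops += 1
        if current in sampled:
            return current, hops
        current = infected_from[current]
    return None, hops


# Only A and B were sequenced: B looks like the source, but it is 3 steps away.
print(nearest_sampled_source("A", INFECTED_FROM, sampled={"A", "B"}))

# With every location sequenced, the direct source D becomes visible.
print(nearest_sampled_source("A", INFECTED_FROM, sampled={"A", "B", "C", "D"}))
```

With sparse sampling, the data point to B even though the virus passed through C and D on the way, which is why apparent links between two locations should be read as "related to", not "came directly from".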

In addition to this, sequencing isn’t perfect. It is standard to go back and sequence samples again if we didn’t get the entire genome in one go, so sequences may be updated over time. Data on this site should be considered preliminary. Because our focus is on Connecticut, we don’t include all of the sequences that have been generated worldwide. We choose representative samples of outbreaks in other countries and in other regions of the US in order to figure out how often SARS-CoV-2 is being introduced, but we can’t draw any conclusions regarding transmission outside of Connecticut from these data.

How does it work?

The exact steps we take for SARS-CoV-2 genomics might vary a little based on the question we want to answer, but they follow the same general outline. The approach we are using to sequence is based on the ARTIC Network protocol and is similar to what was used for real-time sequencing during the 2013-2016 Ebola virus epidemic. If you want more details, check out our protocols. We were able to generate a sequence of the first detected case in Connecticut within 14 hours, and we plan to release more sequences and a discussion of what they mean on a weekly basis moving forward.

Once we sequence samples, we build an evolutionary tree. An evolutionary tree, or phylogenetic tree, is a branching diagram that shows the evolutionary relationships between sequences. Because the virus mutates at a fairly steady rate, these mutations can serve as a “molecular clock”, where the number of differences between two sequences corresponds to the amount of time since they shared a common ancestor. That ancestor is often referred to as the most recent common ancestor, or MRCA. The timing and location of the MRCA, which we estimate from our data, can tell us about when and where a virus was introduced.
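As a rough illustration of the molecular clock, the Python sketch below turns a count of differences between two sequences into a time estimate. It is a back-of-the-envelope calculation under stated assumptions, not the method we use: real analyses fit the clock across the whole tree using sampling dates. The genome length (about 30,000 bases) and rate (8e-4 changes per site per year) are round, assumed values chosen for illustration, and the calculation assumes both samples were collected at about the same time.

```python
GENOME_LENGTH = 30_000           # assumed, approximate SARS-CoV-2 genome length in bases
RATE_PER_SITE_PER_YEAR = 8e-4    # assumed clock rate, for illustration only


def count_differences(seq_a, seq_b):
    """Count positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def years_since_common_ancestor(differences, genome_length, rate):
    """Strict-clock estimate of time back to the common ancestor.

    Both lineages accumulate changes independently after they split, so
    differences ~= 2 * rate * genome_length * time.
    """
    return differences / (2 * rate * genome_length)


# Example: two genomes collected around the same time that differ at 6 positions.
years = years_since_common_ancestor(6, GENOME_LENGTH, RATE_PER_SITE_PER_YEAR)
print(f"about {years * 365:.0f} days since the common ancestor")
```

Under these assumptions, six differences correspond to roughly an eighth of a year (about a month and a half) since the two viruses shared an ancestor. The factor of 2 is there because changes build up along both branches leading away from the ancestor.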

We can also make a map showing how the virus spreads geographically. Because sequencing has not been done to the same extent everywhere in the world, we can only draw general conclusions about where the virus is coming from outside of Connecticut, like we did in our paper on the Coast-to-coast spread of SARS-CoV-2 during the early epidemic in the United States. However, because we are sampling many viruses throughout the state, we will be able to learn about how the virus is spreading in Connecticut and which factors are increasing or preventing transmission.

Why does this matter?

Tracking the spread of SARS-CoV-2 in Connecticut can allow for better-informed decisions on how to control it. Geographic transmission patterns can tell us whether interventions like border closures and reduced daily work commutes are working, and provide insight into how the virus is still spreading. In addition, as the pandemic declines, the routine evaluation of genome sequences can show whether new introductions are still occurring or whether new cases are the result of local transmission in the community, which could tell us whether it is safe to start lifting restrictions.