Update 7 | 2020.05.13

Yale SARS-CoV-2 Genomic Surveillance Initiative

This week, we’ve worked on reanalyzing our genomes with a new approach, now that we are seeing a lot of transmission within Connecticut (CT). While our previous analyses showed whether there was substantial transmission within CT and where outside the state the virus was introduced from, they lacked detail on how transmission was occurring within CT. For this report, we used the existing 121 genomes, but with location data showing more precisely where the samples were collected. This means that not only can we map whether cases were acquired in CT or introduced from elsewhere, but we can also gain some insight into how and where the spread within CT is happening. Here, we present the preliminary version of this analysis; the interactive Nextstrain view can also be found here.

As shown in Figure 1, most genomes sequenced by our team were collected in the southwestern region of Connecticut. We can’t capture all of the spread happening in CT with 121 genomes, but we can understand the big picture based on the analysis below. Importantly, our dataset includes genomes collected through the end of April. This is a good start, but as we collect and sequence more samples, we’ll gain a clearer and more up-to-date idea of how SARS-CoV-2 is circulating in our state. For a general overview of why this information is important and how genomic epidemiology works, please check out this page.

⚠️ WARNING: These results should be considered preliminary, as they may change in light of new data.


Previously, we explained how SARS-CoV-2, the virus that causes COVID-19, was introduced into CT from New York and the west coast and is now spreading within our state (Figure 1).

Figure 1. Set of genomes sequenced by the Yale SARS-CoV-2 Genomic Surveillance Initiative. Each circle represents one or more genomes (with size being proportional to sampling), and the edges connecting them depict events of spread between countries, states and towns. The resolution in the CT map is based on a scheme that takes CT zip code areas and the population of those areas to define different geographic locations.

What’s new?

Transmission within Connecticut

This week, we have improved the geographic data showing the sampling origin for all of our genomes, which means we can study how the virus is spreading between different parts of the state. In the same way that we connected virus genomes collected in CT to those collected in New York, we can now connect virus genomes collected in different parts of our state with one another. This gives us a much better idea of how transmission is occurring in CT.

Massachusetts & Connecticut

Figure 2. Relationship among genomes collected in Connecticut (mint green), New York (green), and Massachusetts (pink).

Surprisingly, we have not yet found evidence of interstate spread between CT and Massachusetts (MA) in our genomic data. The MA genomes in our dataset cluster in a group in which only one CT genome is observed (Figure 2). This cluster most likely originated from an independent introduction from Western Europe. Most CT genomes are clustered within other groups, like the “NY-clade”, which originated from an introduction from Europe. If CT-MA interstate spread had occurred, we would see MA genomes interspersed among CT genomes in distinct groups all over the tree.
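The reasoning above can be sketched as a simple check: group genomes by cluster and look for clusters that contain several genomes from both states. This is only an illustration, not the analysis behind Figure 2. The function names, the cluster labels, the `min_each` cutoff, and the toy records are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def state_counts_by_cluster(records):
    """Count genomes per state within each cluster.

    records: iterable of (cluster_label, state) pairs (hypothetical input).
    """
    counts = defaultdict(Counter)
    for cluster, state in records:
        counts[cluster][state] += 1
    return counts

def mixed_clusters(counts, state_a, state_b, min_each=2):
    """Return clusters holding at least `min_each` genomes from both states."""
    return sorted(
        cluster
        for cluster, by_state in counts.items()
        if by_state[state_a] >= min_each and by_state[state_b] >= min_each
    )

# Toy records, invented for illustration only (not our data).
toy_records = (
    [("cluster_1", "MA")] * 5
    + [("cluster_1", "CT")]
    + [("cluster_2", "CT")] * 6
    + [("cluster_2", "NY")] * 4
)

counts = state_counts_by_cluster(toy_records)
print(mixed_clusters(counts, "CT", "MA"))  # no cluster mixes CT and MA
print(mixed_clusters(counts, "CT", "NY"))  # cluster_2 mixes CT and NY
```

In the toy data, a single CT genome inside an otherwise MA cluster does not count as interspersed, which mirrors the pattern described for Figure 2.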

What does this mean?

Evidence-based policies

Social distancing measures must be sustainable. As these measures continue into the summer, people are growing weary of restrictions. Outbreaks driven by local transmission are still unfolding in major urban centers, and most of the population in CT (and in the US) is still susceptible. Considering the prevalence of state-to-state viral spread, it is clear that the implications of social distancing policies transcend state borders. Therefore, as we look towards relaxing social distancing measures, policymakers must work across municipal and state boundaries and implement data-driven responses, using epidemiological data and information on viral dynamics to plan safe reopening strategies. Such data, coupled with large-scale testing, are essential for making smart decisions about when and how we reopen, so as to limit viral transmission and the risk of resurgence.

The Bottom Line

Using genomes sampled from residents of CT, genomic epidemiology reveals that over the first few weeks, outbreaks in CT (and all over the USA) were driven by state-to-state and coast-to-coast spread of SARS-CoV-2. In more recent weeks, there have been multiple independent viral lineages circulating within CT, especially in urban hubs near the coastline towards New York. This was especially true for the area that spans the western towns of Ridgefield and Easton and those further east, like Haddam, Westbrook, and the Greater New Haven area.

Outbreaks are still unfolding in major urban centers, and most of the population in the country is still susceptible to the virus. It is critical that we closely track the spread of the virus using epidemiological data to develop safe reopening strategies.

The Technical Details

SARS-CoV-2 Lineages found in Connecticut

Lineage | Sub-lineage | Introduced to CT from | Sampling dates | Number of sequences
A | A.1 | Initially introduced to the Western U.S./Canada at least twice, now sustained in CT | March 8 to April 9 | 21
A | None | East Asia/Oceania to U.S. Northeast | March 13 to April 6 | 4
B | B.1 (mostly B.1.3 and B.1.1) | New York | March 11 to April 17 | 95
B | B.2 | Southeastern Asia to U.S. Northeast | March 6 to April 10 | 3

Two main lineages in CT

Genomes sequenced by our team can be found in both major lineages of SARS-CoV-2: A and B. A lineage is a group of viruses that share a common ancestor. These lineages were proposed by Rambaut and colleagues, and are labeled in the interactive phylogeny on our Nextstrain page. In lineage A, most of the viral genomes that our lab has sequenced cluster with sub-lineage A.1, mainly within the WA-clade (see Figure 3), a group of viruses that has so far been traced back to common ancestors from Washington state.

Figure 3. SARS-CoV-2 lineage A.1 and its spread coast-to-coast. Viruses from this lineage are found in Canada, USA-West, Midwest, South, and Northeast.

Within the B lineage, the biggest group of SARS-CoV-2 worldwide, most genomes sequenced in CT belong to sub-lineage B.1 (shown in Figure 4). Within B.1, most of our genomes fall in a group that also contains many genomes from New York state (shown as ‘NY-clade’ in Figure 4), or in tight clusters that split off before B.1 was introduced to NY, like lineage B.1.11, with an early origin in Western Europe. Smaller clusters can also be seen in lineage B.2, which can be traced back to a common ancestor genome from Southern Asia.

Figure 4. The NY-clade. In this sub-lineage we can find 79 out of 117 (about 68%) of all CT genomes sequenced by our team. The lineage B.1 genomes we sequenced are the result of independent introductions from NY, followed by community spread (mainly within New Haven County).

Transmission within Connecticut

After multiple introductions of SARS-CoV-2 from various (likely domestic) sources, the most recent genomes (like the ones released in our updates 4, 5 and 6) point to within-state transmission, as can be seen in Figure 5. This figure shows multiple introductions of SARS-CoV-2 from places like New York to CT, followed by the spread of the virus among towns in CT, likely from mid-March. The within-state transmission involves both lineage A.1 and lineage B.1. Higher geographic resolution allows us to see more specifically where outbreaks within CT come from and which areas in CT are experiencing introductions from other states.

Figure 5. Spread of the virus causing COVID-19 in Connecticut. Events of viral spread started to be more evident from mid-March.
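To make the idea of "introductions followed by spread among towns" concrete, here is a minimal sketch of how location changes along the branches of a tree can be tallied. It assumes every node of the tree, including ancestors, already carries an inferred location label, and it is not the method used to produce Figure 5. The function names, the `CT/` label prefix, and the toy tree are assumptions made for the example.

```python
from collections import Counter

def count_moves(edges, location):
    """Count parent-to-child branches on which the location label changes.

    edges: list of (parent, child) node names.
    location: dict mapping node name to a place label (assumed already inferred).
    """
    moves = Counter()
    for parent, child in edges:
        src, dst = location[parent], location[child]
        if src != dst:
            moves[(src, dst)] += 1
    return moves

def summarize(moves, prefix="CT/"):
    """Split moves into introductions into CT and moves between CT places."""
    introductions = sum(
        n for (src, dst), n in moves.items()
        if not src.startswith(prefix) and dst.startswith(prefix)
    )
    within = sum(
        n for (src, dst), n in moves.items()
        if src.startswith(prefix) and dst.startswith(prefix)
    )
    return introductions, within

# Toy tree, invented for illustration only.
toy_edges = [("root", "n1"), ("root", "n2"), ("n1", "s1"), ("n1", "s2"), ("n2", "s3")]
toy_location = {
    "root": "New York",
    "n1": "CT/New Haven",
    "n2": "New York",
    "s1": "CT/New Haven",
    "s2": "CT/Westbrook",
    "s3": "CT/Ridgefield",
}

moves = count_moves(toy_edges, toy_location)
print(summarize(moves))  # (introductions into CT, moves between CT places)
```

In the toy tree there are two introductions from New York and one move between places inside CT. Finer location labels are what make the second number visible at all: with a single "CT" label, every within-state move would disappear.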


We used 121 genomes that we previously sequenced using a MinION platform, following an ARTIC protocol. To perform a preliminary analysis, we also downloaded 653 other genomes available on GISAID, from around the world and the US, to uncover recent patterns of viral spread within and from the Northeastern USA over the past weeks. Sequence alignment and phylogenetic analysis were performed using a Nextstrain pipeline. Geographic information for each sequence was aggregated by zip code areas with more than 50,000 inhabitants, mostly matching existing CT town and county borders.
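One possible reading of the aggregation step is sketched below: zip codes are merged, in a given order, until each area holds more than 50,000 inhabitants. This is an assumption about how such a scheme could work, not a description of the scheme actually used. The function name, the input ordering (taken here to mean "neighboring zip codes are listed next to each other"), and the placeholder zip labels and populations are all hypothetical.

```python
def group_zip_codes(zip_populations, threshold=50_000):
    """Greedily merge consecutive zip codes into areas above a population threshold.

    zip_populations: list of (zip_label, population) pairs, assumed to be
    ordered so that neighboring zip codes are adjacent in the list.
    Returns a list of (zip_labels, total_population) tuples.
    """
    groups, current, total = [], [], 0
    for zip_label, population in zip_populations:
        current.append(zip_label)
        total += population
        if total > threshold:
            groups.append((current, total))
            current, total = [], 0
    # Leftover zip codes that never reached the threshold join the last area.
    if current:
        if groups:
            last_zips, last_total = groups.pop()
            groups.append((last_zips + current, last_total + total))
        else:
            groups.append((current, total))
    return groups

# Placeholder labels and populations, invented for illustration only.
toy_zips = [("zip_a", 30_000), ("zip_b", 25_000), ("zip_c", 60_000), ("zip_d", 10_000)]
for zips, population in group_zip_codes(toy_zips):
    print(zips, population)
```

With the toy input, `zip_a` and `zip_b` form one area of 55,000 people, and `zip_c` absorbs the small leftover `zip_d` to form a second area of 70,000.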

Data availability

The directories consensus_genomes and metadata in our GitHub repository contain all of our current SARS-CoV-2 genomes and metadata. The directory auspice contains a JSON file that was produced using the Nextstrain pipeline. A list of GISAID accession numbers of genomes used in this report can be downloaded from a link at the bottom of our Nextstrain page.
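For readers who want to explore the JSON file themselves, the sketch below counts sampled genomes (tree tips) per location. It assumes the file follows the nested layout commonly used by Auspice, where each node has an optional `children` list and attributes stored as `node_attrs[<name>]["value"]`; the file in the repository may be organized differently. The function names, the `division` attribute, and the toy tree are assumptions made for the example.

```python
from collections import Counter

def iter_tips(node):
    """Yield every node without children (a sampled genome) below `node`."""
    children = node.get("children", [])
    if not children:
        yield node
    for child in children:
        yield from iter_tips(child)

def tips_per_value(tree, attr):
    """Count tips by the value of one node attribute, e.g. a location field."""
    counts = Counter()
    for tip in iter_tips(tree):
        value = tip.get("node_attrs", {}).get(attr, {}).get("value", "unknown")
        counts[value] += 1
    return counts

# Toy tree in the assumed layout, invented for illustration only.
toy_tree = {
    "name": "root",
    "children": [
        {"name": "tip_1", "node_attrs": {"division": {"value": "Connecticut"}}},
        {
            "name": "internal_1",
            "node_attrs": {"division": {"value": "New York"}},
            "children": [
                {"name": "tip_2", "node_attrs": {"division": {"value": "New York"}}},
                {"name": "tip_3", "node_attrs": {"division": {"value": "Connecticut"}}},
                {"name": "tip_4", "node_attrs": {}},
            ],
        },
    ],
}

print(tips_per_value(toy_tree, "division"))
```

To try this on a real file, load it with Python's `json` module and pass the part of the document that holds the tree. In the layout assumed here, that is the top-level `tree` entry.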


Mary Petrone, Anne Wyllie, Chantal Vogels, Ed Courchiane, Sarah Prophet, and Isabel Ott performed the viral RNA extractions. Tara Alpert and Joseph Fauver prepared samples for sequencing and assembled the SARS-CoV-2 genomes. Cole Jensen and Anderson Brito performed the phylogenetic analysis. Cole Jensen, Chaney Kalinich, Mary Petrone, Anderson Brito and Nathan Grubaugh wrote and reviewed this report. Chaney Kalinich and Peter Neugebauer developed and maintain this COVIDTrackerCT website. Nathan Grubaugh leads the Yale SARS-CoV-2 Genomic Epidemiology Initiative. Finally, we also thank the authors of the genomes in our complementary dataset for making their data freely available to other researchers: a full list of authors is provided at the bottom of our dedicated nextstrain page.

Grubaugh Lab | Yale School of Public Health (YSPH) | https://grubaughlab.com/
