This long divergence period suggests there are unsampled virus lineages circulating in horseshoe bats that have zoonotic potential due to the ancestral position of the human-adapted contact residues in the SARS-CoV-2 RBD. Collectively our analyses point to bats being the primary reservoir for the SARS-CoV-2 lineage. Another similarity between SARS-CoV and SARS-CoV-2 is their divergence time (4070years ago) from currently known extant bat virus lineages (Fig. 1, vev016 (2015). Sliding window analysis of changes in the patterns of sequence similarity between human SARS-CoV-2, and pangolin and bat coronaviruses as described further in Fig. The inset represents divergence time estimates based on NRR1, NRR2 and NRA3. 21, 15081514 (2015). For weather, science, and COVID-19 . Intraspecies diversity of SARS-like coronaviruses in Rhinolophus sinicus and its implications for the origin of SARS coronaviruses in humans. The canine viral genome was excluded from the Bayesian phylogenetic analyses because temporal signal analyses (see below) indicated that it was an outlier. Note that six of these sequences fall under the terms of use of the GISAID platform. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. This underscores the need for a global network of real-time human disease surveillance systems, such as that which identified the unusual cluster of pneumonia in Wuhan in December 2019, with the capacity to rapidly deploy genomic tools and functional studies for pathogen identification and characterization. Zhou, H. et al. Biol. Specifically, progenitors of the RaTG13/SARS-CoV-2 lineage appear to have recombined with the Hong Kong clade (with inferred breakpoints at 11.9 and 20.8kb) to form the CoVZXC21/CoVZC45-lineage. We say that this approach is conservative because sequences and subregions generating recombination signals have been removed, and BFRs were concatenated only when no PI signals could be detected between them. To evaluate the performance procedure, we confirmed that the recombination masking resulted in (1) a markedly different outcome of the PHI test64, (2) removal of well-supported (bootstrap value >95%) incompatible splits in Neighbor-Net65 and (3) a near-complete reduction of mosaic signal as identified by 3SEQ. Coronavirus Disease 2019 (COVID-19) Situation Report 51 (World Health Organization, 2020). performed recombination analysis for non-recombining regions1 and 2, breakpoint analysis and phylogenetic inference on recombinant segments. Sequence similarity. As illustrated by the dashed arrows, these two posteriors motivate our specification of prior distributions with standard deviations inflated 10-fold (light color). The lineage B.1 has been the major basal and widespread lineage from the initial SARS-CoV-2 spread and it became the more prevalent lineage in Colombia ( 13 ), while the B.1.111 lineage, first detected in the USA from a sample collected on March 7, 2020 and subsequently in Colombia on March 13, 2020 is currently circulating and mainly represented Of the countries that have contributed SARS-CoV-2 data, 30% had genomes of this lineage. A counting renaissance: combining stochastic mapping and empirical Bayes to quickly detect amino acid sites under positive selection. Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins. Viral metagenomics revealed Sendai virus and coronavirus infection of Malayan pangolins (Manis javanica). Trends Microbiol. The genetic distances between SARS-CoV-2 and Pangolin Guangdong 2019 are consistent across all regions except the N-terminal domain, implying that a recombination event between these two sequences in this region is unlikely. We named the length-sorted BFRs as: BFRA (ntpositions 13,29119,628, length=6,338nt), BFRB (ntpositions 3,6259,150, length=5,526nt), BFRC (ntpositions 9,26111,795, length=2,535nt), BFRD (ntpositions 27,70228,843, length=1,142nt) and six further regions (EJ). The extent of sarbecovirus recombination history can be illustrated by five phylogenetic trees inferred from BFRs or concatenated adjacent BFRs (Fig. These rate priors are subsequently used in the Bayesian inference of posterior rates for NRR1, NRR2, and NRA3 as indicated by the solid arrows. Figure 1 (top) shows the distribution of all identified breakpoints (using 3SEQs exhaustive triplet search) by the number of candidate recombinant sequences supporting them. performed Srecombination analysis. 5, 536544 (2020). and X.J. Time-measured phylogenetic reconstruction was performed using a Bayesian approach implemented in BEAST42 v.1.10.4. =0.00025. A novel bat coronavirus closely related to SARS-CoV-2 contains natural insertions at the S1/S2 cleavage site of the Spike protein. This dataset comprises an updated version of that used in Hon et al.15 and includes a cluster of genomes sampled in late 2003 and early 2004, but the evolutionary rate estimate without this cluster (0.00175 substitutions per siteyr1 (0.00117,0.00229)) is consistent with the complete dataset (0.00169 substitutions per siteyr1, (0.00131,0.00205)). 6, eabb9153 (2020). You are using a browser version with limited support for CSS. Removal of five sequences that appear to be recombinants and two small subregions of BFRA was necessary to ensure that there were no phylogenetic incongruence signals among or within the three BFRs. 3 Priors and posteriors for evolutionary rate of SARS-CoV-2. Suchard, M. A. et al. Sorting these breakpoint-free regions (BFRs) by length results in two segments >5kb: an ORF1a subregion spanning nucleotides (nt) 3,6259,150 and the first half of ORF1b spanning nt13,29119,628 (sequence numbering given in Source Data, https://github.com/plemey/SARSCoV2origins). 24, 490502 (2016). 725422-ReservoirDOCS). Chernomor, O. et al. PubMed Central master 4 branches 94 tags Code AngieHinrichs Add entries for pangolin-data/-assignment 1.18.1.1 ( #512) ad16752 4 days ago 990 commits .github/ workflows Update pangolin.yml 7 months ago docs docs need guide tree now 3 years ago pangolin The origins we present in Fig. To gauge the length of time this lineage has circulated in bats, we estimate the time to the most recent common ancestor (TMRCA) of SARS-CoV-2 and RaTG13. In this approach, we considered a breakpoint as supported only if it had three types of statistical support: from (1) mosaic signals identified by 3SEQ, (2) PI signals identified by building trees around 3SEQs breakpoints and (3) the GARD algorithm35, which identifies breakpoints by identifying PI signals across proposed breakpoints. Bioinformatics 28, 32483256 (2012). PubMed 16, e1008421 (2020). Bayesian evolutionary rate and divergence date estimates were shown to be consistent for these three approaches and for two different prior specifications of evolutionary rates based on HCoV-OC43 and MERS-CoV. We find that the sarbecovirusesthe viral subgenus containing SARS-CoV and SARS-CoV-2undergo frequent recombination and exhibit spatially structured genetic diversity on a regional scale in China. Virological.org http://virological.org/t/ncov-2019-codon-usage-and-reservoir-not-snakes-v2/339 (2020). Sign up for the Nature Briefing newsletter what matters in science, free to your inbox daily. BEAGLE 3: improved performance, scaling, and usability for a high-performance computing library for statistical phylogenetics. Complete genome sequence data were downloaded from GenBank and ViPR; accession numbers of all 68sequences are available in Supplementary Table 4. These are in general agreement with estimates using NRR2 and NRA3, which result in divergence times of 1982 (19482009) and 1948 (18791999), respectively, for SARS-CoV-2, and estimates of 1952 (19061989) and 1970 (19321996), respectively, for the divergence time of SARS-CoV from its closest known bat relative. Methods Ecol. & Bedford, T. MERS-CoV spillover at the camelhuman interface. This study provides an integration of existing classifications and describes evolutionary trends of the SARS-CoV . Although the human ACE2-compatible RBD was very likely to have been present in a bat sarbecovirus lineage that ultimately led to SARS-CoV-2, this RBD sequence has hitherto been found in only a few pangolin viruses. EPI_ISL_410538, EPI_ISL_410539, EPI_ISL_410540, EPI_ISL_410541 and EPI_ISL_410542) for the use of sequence data via the GISAID platform. Zhang, Y.-Z. CAS Lancet 395, 565574 (2020). Dis. Biol. Natl Acad. Conservatively, we combined the three BFRs >2kb identified above into non-recombining region1 (NRR1). Thank you for visiting nature.com. SARS-CoV-2 genetic lineages in the United States are routinely monitored through epidemiological investigations, virus genetic sequence-based surveillance, and laboratory studies. We aimed to analyze 3 naso-oropharyngeal swab samples collected between August and December 2021 to describe the amino acid changes present in the sequence reads that may have a role in the emergence of new . This is notable because the variable-loop region contains the six key contact residues in the RBD that give SARS-CoV-2 its ACE2-binding specificity27,37. These shy, quirky but cute mammals are one of the most heavily trafficked yet least understood animals in the world. From this perspective, it may be useful to perform surveillance for more closely related viruses to SARS-CoV-2 along the gradient from Yunnan to Hubei. collected SARS-CoV data and assisted in analyses of SARS-CoV and SARS-CoV-2 data. Holmes, E. C., Rambaut, A. Menachery, V. D. et al. Eden, J.-S., Tanaka, M. M., Boni, M. F., Rawlinson, W. D. & White, P. A. Recombination within the pandemic norovirus GII.4 lineage. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is the causative agent for the current coronavirus disease (COVID-19) pandemic that has affected more than 35 million people and caused . Lie, P., Chen, W. & Chen, J.-P. With horseshoe bats currently the most plausible origin of SARS-CoV-2, it is important to consider that sarbecoviruses circulate in a variety of horseshoe bat species with widely overlapping species ranges57. 3) clusters with viruses from provinces in the centre, east and northeast of China. Rev. Lond. The latter was reconstructed using IQTREE66 v.2.0 under a general time-reversible (GTR) model with a discrete gamma distribution to model inter-site rate variation. Are pangolins the intermediate host of the 2019 novel coronavirus (SARS-CoV-2)? Here, we analyse the evolutionary history of SARS-CoV-2 using available genomic data on sarbecoviruses. 84, 31343146 (2010). The coronavirus genome that these researchers had assembled, from pangolin lung-tissue samples, contained some gene regions that were ninety-nine per cent similar to equivalent parts of the SARS . We thank T. Bedford for providing M.F.B. PubMed 5). https://doi.org/10.1093/molbev/msaa163 (2020). Sibling lineages to RaTG13/SARS-CoV-2 include a pangolin sequence sampled in Guangdong Province in March 2019 and a clade of pangolin sequences from Guangxi Province sampled in 2017. 87, 62706282 (2013). Since experts have suggested that pangolins may be the reservoir species for COVID-19, the scaly anteater has been catapulted into headlines, news reports, and conversationsand some are calling COVID-19 "the revenge of the . Phylogenies of subregions of NRR1 depict an appreciable degree of spatial structuring of the bat sarbecovirus population across different regions (Fig. 4, vey016 (2018). acknowledges support by the Research FoundationFlanders (Fonds voor Wetenschappelijk OnderzoekVlaanderen (nos. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic, https://doi.org/10.1038/s41564-020-0771-4. SARS-like WIV1-CoV poised for human emergence. Nature 558, 180182 (2018). & Li, X. Crossspecies transmission of the newly identified coronavirus 2019nCoV. 1 Phylogenetic relationships in the C-terminal domain (CTD). Nature 503, 535538 (2013). All sequence data analysed in this manuscript are available at https://github.com/plemey/SARSCoV2origins. 5. Because these subclades had different phylogenetic relationships in regionD (Supplementary Fig. 91, 10581062 (2010). Google Scholar. By mid-January 2020, the virus was spreading widely within Hubei province and by early March SARS-CoV-2 was declared a pandemic8. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Avian influenza a virus (H7N7) epidemic in The Netherlands in 2003: course of the epidemic and effectiveness of control measures. 30, 21962203 (2020). RegionC showed no PI signals within it. 92, 433440 (2020). P.L. 4), that region and shorter BFRs were not included in combined putative non-recombinant regions. Identifying the origins of an emerging pathogen can be critical during the early stages of an outbreak, because it may allow for containment measures to be precisely targeted at a stage when the number of daily new infections is still low. Furthermore, the other key feature thought to be instrumental in the ability of SARS-CoV-2 to infect humansa polybasic cleavage site insertion in the Sproteinhas not yet been seen in another close bat relative of the SARS-CoV-2 virus. 62,63), the GTR+ model and 100bootstrap replicateswas inferred for each BFR >500nt. Phylogenetic trees and exact breakpoints for all ten BFRs are shown in Supplementary Figs. Virus Evol. (Yes, Pango is a tongue-in-cheek reference to pangolins, which were briefly suspected to have had a role in the coronavirus's originseveral of the team's computational tools are named after. Anderson, K. G., Rambaut, A., Lipkin, W. I., Holmes, E. C. & Garry, R. F. The proximal origin of SARS-CoV-2. Scientists defined the pangolin lineage of this variant to be B.1.1.523 and it was originally recognized as a variant under monitoring on July 14, 2021. It allows a user to assign a SARS-CoV-2 genome sequence the most likely lineage (Pango lineage) to SARS-CoV-2 query sequences. 110. Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding. BFRs were concatenated if no phylogenetic incongruence signal could be identified between them. It is RaTG13 that is more divergent in the variable-loop region (Extended Data Fig. When the first genome sequence of SARS-CoV-2, Wuhan-Hu-1, was released on 10January 2020 (GMT) on Virological.org by a consortium led by Zhang6, it enabled immediate analyses of its ancestry. Early detection via genomics was not possible during Southeast Asias initial outbreaks of avian influenza H5N1 (1997 and 20032004) or the first SARS outbreak (20022003). Temporal signal was tested using a recently developed marginal likelihood estimation procedure41 (Supplementary Table 1). Biazzo et al. All three approaches to removal of recombinant genomic segments point to a single ancestral lineage for SARS-CoV-2 and RaTG13. 4), but also by markedly different evolutionary rates. Influenza viruses reassort17 but they do not undergo homologous recombination within RNA segments18,19, meaning that origins questions for influenza outbreaks can always be reduced to origins questions for each of influenzas eight RNA segments. Because 3SEQ identified ten BFRs >500nt, we used GARDs (v.2.5.0) inference on 10, 11 and 12 breakpoints. The pangolin coronaviruses show lower similarity to SARS-CoV-2 than bat coronavirus RaTG13 across the whole genome, but higher similarity in the spike receptor binding domain, although the similarity at either scale remains too low to implicate . The difficulty in inferring reliable evolutionary histories for coronaviruses is that their high recombination rate48,49 violates the assumption of standard phylogenetic approaches because different parts of the genome have different histories. 6, e14 (2017). The sizes of the black internal node circles are proportional to the posterior node support. Posterior means with 95% HPDs are shown in Supplementary Information Table 2. Now, the two researchers used genomic sequencing to compare the DNA of the new coronavirus in humans with that in animals and found a 99% match with pangolins. Schierup, M. H. & Hein, J. Recombination and the molecular clock. Gray inset shows majority rule consensus trees with mean posterior branch lengths for the two regions, with posterior probabilities on the key nodes showing the relationships among SARS-CoV-2, RaTG13, and Pangolin 2019. Preprint at https://doi.org/10.1101/2020.04.20.052019 (2020). In regionA, we removed subregion A1 (ntpositions 3,8724,716 within regionA) and subregion A4 (nt1,6422,113) because both showed PI signals with other subregions of regionA. A deep dive into the genetics of the novel coronavirus shows it seems to have spent some time infecting both bats and pangolins before it jumped into humans, researchers said . NTD, N-terminal domain; CTD, C-terminal domain. The web application was developed by the Centre for Genomic Pathogen Surveillance. 36, 7597 (2002). The presence in pangolins of an RBD very similar to that of SARS-CoV-2 means that we can infer this was also probably in the virus that jumped to humans. Preprint at https://doi.org/10.1101/2020.05.28.122366 (2020). Microbes Infect. It is available as a command line tool and a web application. 35, 247251 (2018). Dudas, G., Carvalho, L. M., Rambaut, A. Unlike other viruses that have emerged in the past two decades, coronaviruses are highly recombinogenic14,15,16. Boni, M. F., de Jong, M. D., van Doorn, H. R. & Holmes, E. C. Guidelines for identifying homologous recombination events in influenza A virus. J. Virol. GitHub - cov-lineages/pangolin: Software package for assigning SARS-CoV-2 genome sequences to global lineages. Posterior rate distributions for MERS-CoV (far left) and HCoV-OC43 (far right) using BEAST on n=27 sequences spread over 4 years (MERS-CoV) and n=27 sequences spread over 49 years (HCoV-OC43). Press, H.) 3964 (Springer, 2009). Posterior distributions were approximated through Markov chain Monte Carlo sampling, which were run sufficiently long to ensure effective sampling sizes >100. A tag already exists with the provided branch name. (2020) with additional (and higher quality) snake coding sequence data and several miscellaneous eukaryotes with low genomic GC content failed to find any meaningful clustering of the SARS-CoV-2 with snake genomes (a). Holmes, E. C., Dudas, G., Rambaut, A. Nat. If stopping an outbreak in its early stages is not possibleas was the case for the COVID-19 epidemic in Hubeiidentification of origins and point sources is nevertheless important for containment purposes in other provinces and prevention of future outbreaks. Liu, P. et al. 4). Curr. pango-designation Public Repository for suggesting new lineages that should be added to the current scheme Python 968 73 pangolin Public Software package for assigning SARS-CoV-2 genome sequences to global lineages. Note that breakpoints can be shared between sequences if they are descendants of the same recombination events. Gorbalenya, A. E. et al. Discovery and genetic analysis of novel coronaviruses in least horseshoe bats in southwestern China. We extracted a similar number (n=35) of genomes from a MERS-CoV dataset analysed by Dudas et al.59 using the phylogenetic diversity analyser tool60 (v.0.5). Google Scholar. Sequencing from Malayan pangolins collected during anti-smuggling operations in southern China detected coronavirus lineages related to SARS-CoV-2. Virus Evol. Mol. & Holmes, E. C. A genomic perspective on the origin and emergence of SARS-CoV-2. 13, e1006698 (2017). After removal of A1 and A4, we named the new region A. PI signals were identified (with bootstrap support >80%) for seven of these eight breakpoints: positions 1,684, 3,046, 9,237, 11,885, 21,753, 22,773 and 24,628. SARS-CoV-2 itself is not a recombinant of any sarbecoviruses detected to date, and its receptor-binding motif, important for specificity to human ACE2 receptors, appears to be an ancestral trait shared with bat viruses and not one acquired recently via recombination. Genet. J. Virol. 2, bottom) show that SARS-CoV-2 is unlikely to have acquired the variable loop from an ancestor of Pangolin-2019 because these two sequences are approximately 1015% divergent throughout the entire Sprotein (excluding the N-terminal domain). We extracted a total of 2189 full-length SARS-CoV-2 viral genomes from various states of India from the EpiCov repository of the GISAID initiative on 12 June 2020. and JavaScript. While such models have recently been made available, we lack the information to calibrate the rate decline over time (for example, through internal node calibrations44). Kosakovsky Pond, S. L., Posada, D., Gravenor, M. B., Woelk, C. H. & Frost, S. D. W. Automated phylogenetic detection of recombination using a genetic algorithm. T.L. Because coronaviruses are known to be highly recombinant, we used three different approaches to identify non-recombinant regions for use in our Bayesian time-calibrated phylogenetic inference. Boni, M. F., Posada, D. & Feldman, M. W. An exact nonparametric method for inferring mosaic structure in sequence triplets. Extended Data Fig. Extensive diversity of coronaviruses in bats from China. Evol. Epidemiology, genetic recombination, and pathogenesis of coronaviruses. performed recombination analysis for non-recombining alignment3, calibration of rate of evolution and phylogenetic reconstruction and dating. B 281, 20140732 (2014). Unfortunately, a response that would achieve containment was not possible. Aiewsakun, P. & Katzourakis, A. Time-dependent rate phenomenon in viruses. As a proxy, it would be possible to model the long-term purifying selection dynamics as a major source of time-dependent rates43,44,52, but this is beyond the scope of the current study. The unsampled diversity descended from the SARS-CoV-2/RaTG13 common ancestor forms a clade of bat sarbecoviruses with generalist propertieswith respect to their ability to infect a range of mammalian cellsthat facilitated its jump to humans and may do so again. 68, 10521061 (2019). https://doi.org/10.1038/s41564-020-0771-4, DOI: https://doi.org/10.1038/s41564-020-0771-4. The divergence time estimates for SARS-CoV-2 and SARS-CoV from their respective most closely related bat lineages are reasonably consistent among the three approaches we use to eliminate the effects of recombination in the alignment. "This is an extremely interesting . Evol. In December 2019, a cluster of pneumonia cases epidemiologically linked to an open-air live animal market in the city of Wuhan (Hubei Province), China1,2 led local health officials to issue an epidemiological alert to the Chinese Center for Disease Control and Prevention and the World Health Organizations (WHO) China Country Office. T.T.-Y.L. and T.A.C. Developed by the Centre for Genomic Pathogen Surveillance. A SARS-like cluster of circulating bat coronaviruses shows potential for human emergence. Nat. To begin characterizing any ancestral relationships for SARS-CoV-2, NRRs of the genome must be identified so that reliable phylogenetic reconstruction and dating can be performed. Virus Evol. The red and blue boxplots represent the divergence time estimates for SARS-CoV-2 (red) and the 2002-2003 SARS-CoV (blue) from their most closely related bat virus, with the light- and dark-colored versions based on the HCoV-OC43 and MERS-CoV centered priors, respectively. Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia. Divergence time estimates based on the HCoV-OC43-centred rate prior for the separate BFRs (Supplementary Table 3) show consistency in TMRCA estimates across the genome. The presence of SARS-CoV-2-related viruses in Malayan pangolins, in silico analysis of the ACE2 receptor polymorphism and sequence similarities between the Receptor Binding Domain (RBD) of the spike proteins of pangolin and human Sarbecoviruses led to the proposal of pangolin as intermediary. Trends Microbiol. 1. Our results indicate the presence of a single lineage circulating in bats with properties that allowed it to infect human cells, as previously described for bat sarbecoviruses related to the first SARS-CoV lineage29,30,31. However, the coronavirus isolated from pangolin is similar at 99% in a specific region of the S protein, which corresponds to the 74 amino acids involved in the ACE (Angiotensin Converting Enzyme . Python 379 102 pangoLEARN Public Store of the trained model for pangolin to access. One study suggests that over a century ago, one lineage of coronavirus circulating in bats gave rise to SARS-CoV-2, RaTG13 and a Pangolin coronavirus known as Pangolin-2019, Live Science .