The origin of SARS-CoV-2 is knowable.
At the highest level, there are two theories: either SARS-CoV-2 emerged as a natural, zoonotic event affiliated with non-scientific human activities like the animal trade, or SARS-CoV-2 emerged as a consequence of scientific human activities.
When we zoom in, however, the specific details of these theories are not entirely resolved. What looks like two disjoint possibilities from 40,000 feet becomes, upon closer inspection, two clusters of scenarios, knowledge graphs of one fact implying another. The evidence assembled by the world’s scientists and forensic sleuths doesn’t tilt a simple scale comparing two theories, but rather it differentially lights up the streets of two theoretical metropoles and slowly but surely illuminates the true path SARS-CoV-2 took from a bat to a human.
When most people say one theory or another is “more likely”, they often do so without ever calculating likelihoods or, where there may be likelihoods scattered around papers, they don’t critically evaluate the methods used to calculate likelihoods.
Others may try to estimate likelihoods with common Bayesian methods from forensics and theoretical sciences. Bayesian methods boil down to establishing some “prior”, your initial probability estimate of various theories being true, estimating the conditional probabilities of the evidence under the various theories, and then combining this information to get estimates of the posterior probabilities of various theories being true. Bayesian approaches are superior in that they actually require we contend with likelihood estimation. At a bare minimum, Bayesian approaches force researchers to justify their priors (unjustifiable priors reveal biases), estimate the probability of a furin cleavage site emerging in a bat SARSr-CoV (unjustifiable estimates again reveal biases), estimate the probability of a SARSr-CoV emerging in Wuhan, and repeat this process for every piece of evidence. Bayesian methods force researchers to formally weigh the evidence against their priors with transparent methods - one can’t just brush aside weighty inconvenient truths, nor can one tilt the scales with biased estimates of conditional probabilities.
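To make that recipe concrete, here is a minimal Python sketch of the prior-times-likelihood update for two theories. The function name `posterior` and every number in it are hypothetical placeholders chosen for illustration; none of them are estimates from this article.

```python
def posterior(priors, likelihoods):
    """Posterior probability of each theory via Bayes' Theorem.

    priors:      dict mapping theory -> P(theory)
    likelihoods: dict mapping theory -> P(evidence | theory)
    """
    # Unnormalized posterior: prior times likelihood of the evidence
    unnorm = {t: priors[t] * likelihoods[t] for t in priors}
    # P(evidence), the normalizing constant summed over all theories
    total = sum(unnorm.values())
    return {t: v / total for t, v in unnorm.items()}


# Hypothetical numbers, for illustration only
priors = {"Lab": 0.5, "Zoo": 0.5}
likelihoods = {"Lab": 0.2, "Zoo": 0.1}
post = posterior(priors, likelihoods)
print(post)
```

With equal priors, the posterior simply mirrors the ratio of the two likelihoods, which is why the justification of each likelihood estimate matters so much.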
If what I’m talking about sounds too technical, please forgive me. It’s about to get more technical. There is simply no way to know the origins of SARS-CoV-2 without deep-diving into extremely technical topics, and knowing the origins of SARS-CoV-2 requires a technical breadth like few other issues in science. Most biologists, for example, could not write down Bayes’ Theorem off the top of their heads, so our Bayesian excursion above is in some sense too technical for even many biology professors.
Just as Bayes’ Theorem may be too technical for the immediate comprehension of most biologists, most experts familiar with Bayes’ Theorem are not familiar with the field ecological programs for catching animals & sampling them to discover wildlife viruses. Most field ecologists who can catch bats and handle them according to IACUC (Institutional Animal Care and Use Committee) guidelines are not familiar with the molecular biological and virological protocols for cultivating, modifying, and studying viruses in a lab.
To understand the origin of SARS-CoV-2, one must understand all of these topics spanning biology and statistics well enough to evaluate evidence and estimate likelihoods. In fact, if you really want to know the answer to this battle between theories, you must even be better than most Bayesians. Most Bayesian forensic efforts treat each piece of evidence as independent, but if you’re sufficiently well-versed in biology it’s clear the pieces of evidence are not independent, as some pieces of evidence can imply others. In the math expression below, the numbered X’s are our pieces of evidence. If the pieces of evidence are independent when conditioned on a Theory, we can separate the conditional probabilities of all pieces of evidence into the product of conditional probabilities of each piece of evidence.
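Written in LaTeX, with X_1 through X_n as the numbered pieces of evidence, the independence assumption just described would read as follows. This is a reconstruction from the definitions in the text, and the index n is an assumed label for the number of pieces of evidence.

```latex
P(X_1, X_2, \ldots, X_n \mid \text{Theory}) = \prod_{i=1}^{n} P(X_i \mid \text{Theory})
```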
The assumption of independence underlies why many Bayesian attempts to resolve SARS-CoV-2 origins tabulate pieces of evidence and give them “Bayes factors”. Bayes factors are simply the ratios of conditional probabilities of a piece of evidence under two theories. If we express our set of evidence as X, and have two theories “Lab” and “Zoo”, then the ratio of posterior probabilities P(Theory|X) is calculated below:
If we assume pieces of evidence are independent, then the middle fraction decomposes to
and Bayes factors are all the fractions on the right-hand-side of the equation above. A Bayes factor of 10 for some piece of evidence would imply the evidence is 10-times more likely to be observed under a lab origin than a zoonotic origin, thus tilting the scales of our posterior beliefs towards a lab origin.
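For reference, the two expressions described in the preceding sentences can be reconstructed from the surrounding definitions as below. The first is the ratio of posteriors, whose middle fraction is the likelihood ratio; the second is how that middle fraction splits into per-evidence Bayes factors under independence. The indexing X_1 through X_n is an assumed notation.

```latex
\frac{P(\text{Lab} \mid X)}{P(\text{Zoo} \mid X)}
  = \frac{P(X \mid \text{Lab})}{P(X \mid \text{Zoo})}
    \times \frac{P(\text{Lab})}{P(\text{Zoo})}

\frac{P(X \mid \text{Lab})}{P(X \mid \text{Zoo})}
  = \frac{P(X_1 \mid \text{Lab})}{P(X_1 \mid \text{Zoo})}
    \times \frac{P(X_2 \mid \text{Lab})}{P(X_2 \mid \text{Zoo})}
    \times \cdots
    \times \frac{P(X_n \mid \text{Lab})}{P(X_n \mid \text{Zoo})}
```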
The pieces of evidence in the origin of SARS-CoV-2 can help bring this mathematical expression to life. One piece of evidence on the origin of SARS-CoV-2 is the fact that the first known outbreak of human cases was in Wuhan.
What are the conditional probabilities of this observation under our two theories?
Under a zoonotic origin theory, we’d probably use the methods I was using prior to the COVID-19 pandemic to forecast the location of a zoonotic event. I was working on forecasting the geographic incidence of outbreaks by bat viruses (henipaviruses) and while we can’t pinpoint the exact location of emergence any more than we can pinpoint the exact location of a drop of rain, the probabilities of incidence over larger geographic regions and populations can be estimated just as the probability of rain over some period of time and the estimated amount of rain (in cm) can be estimated for, say, the city you live in tomorrow.
Bat SARSr-CoVs are known to infect bats that live across SE Asia. Below is a figure made by some of my pre-COVID colleagues in the bat-virus world showing the areas where humans overlap with bat species known (or believed/estimated) to be reservoirs of bat SARS-related coronaviruses (bat SARSr-CoVs).
To a first approximation, the colored pixels overlap with the homes of approximately 1 billion people. If we include Wuhan’s population of 10 million people and assume all people living in the colored regions above are equally likely to be infected by a bat SARSr-CoV, then we would estimate a 1% (10 million divided by 1 billion) chance that a SARSr-CoV outbreak would first emerge in Wuhan. Of course, the assumption that people are equally likely to be infected by a bat-human contact is not accurate because human-wildlife contacts tend to happen outside of big cities like Wuhan, so our 1% estimate is probably generous to zoonotic origin theories. If instead of bat-human contacts we were to estimate the likelihood of the virus emerging near a wet market, we’d have to look at the 40,000 wet markets across SE Asia and this could easily yield a probability less than 1% as well. To be generous to zoonotic origin theory, then, we’ll use:
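In symbols, the population-share estimate worked out in the paragraph above would be written roughly as follows (a reconstruction using the 10 million and 1 billion figures from the text):

```latex
P(\text{Wuhan} \mid \text{Zoo}) \approx \frac{10^{7}}{10^{9}} = 0.01
```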
What, then, is the probability of a Wuhan emergence under a lab origin theory?
How would we estimate this likelihood? Should we consider every biology lab a possible site of emergence, or should we focus on labs that study bat SARSr-CoVs as we focused on bat species that could serve as reservoirs to bat SARSr-CoVs?
Here, we can begin to see how evidence is not independent, and in fact there were hidden assumptions in our reasoning above. Our reasoning above didn’t look at the entire world of pathogens, but instead we focused on a specific clade believed to be ecologically similar to SARS-CoV-2, we focused on their reservoirs, and we focused only on humans that overlap with these reservoirs and not, say, humans that are at risk of acquiring Ebola, Lyme disease, Hendra, or other zoonotic pathogens. The zoonotic theory implicitly assumes we’re studying zoonotic spillover of a particular type, namely a SARS-CoV spillover.
In other words, if we were concerned more broadly with zoonotic emergence then we’d have to consider an additional piece of evidence defining the clade of virus that emerged, such as a coronavirus:
or possibly a sarbecovirus:
Since sarbecoviruses are ecologically very different from, say, MERS coronaviruses or alphacoronaviruses, “sarbecovirus” is a reasonable phylogenetic or taxonomic scale to limit our analysis and estimation of likelihoods. We may want to estimate the conditional probability above, or we may want to simply accept that this was a sarbecovirus and all other analyses are conditioned on not just a zoonotic origin or lab origin theory, but these theories AND a sarbecovirus, e.g.
or we estimated the probability of Wuhan emergence given the emergence was zoonotic AND the virus that emerged was a sarbecovirus.
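The three quantities discussed in this passage might be written as below, in the order they were introduced: the probability that the emerging virus is a coronavirus, the probability that it is a sarbecovirus, and the probability of a Wuhan emergence once the sarbecovirus is taken as given. The labels CoV and Sarbecov are shorthand assumed here, and the expressions are a reconstruction from the surrounding text.

```latex
P(\text{CoV} \mid \text{Zoo}), \qquad
P(\text{Sarbecov} \mid \text{Zoo}), \qquad
P(\text{Wuhan} \mid \text{Zoo}, \text{Sarbecov})
```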
If we do this, however, then we need to reframe our priors by explicitly scoping our theories accordingly. Our theories were not simply zoonotic emergence or a lab origin, but both of these respectively for a sarbecovirus. I’ll continue using Zoo and Lab for each theory, but we’ll note that we’re really talking about a theory that combines Zoo + Sarbecov and Lab + Sarbecov.
It may seem asinine to spend so much time moving “sarbecovirus” from the left-hand side of our conditional probability to the right-hand side, but it’s actually a very important step in our logic, especially when we mind this step in subsequent conditional probabilities.
For example, let’s jointly consider two more pieces of evidence for the lab origin: the Wuhan site of emergence AND the DEFUSE grant. Where pieces of evidence are independent, we can simply decompose our conditional probabilities:
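Under that independence assumption, the decomposition for these two pieces of evidence would read as follows (a reconstruction from the text; as argued further on, independence does not actually hold under a lab origin):

```latex
P(\text{Wuhan}, \text{DEFUSE} \mid \text{Lab})
  = P(\text{Wuhan} \mid \text{Lab}) \, P(\text{DEFUSE} \mid \text{Lab})
```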
The DEFUSE grant is a grant that proposed to study bat sarbecoviruses in Wuhan. Prior to the emergence of SARS-CoV-2, the NY-based non-profit EcoHealth Alliance would sample animals all over the world and somehow (a very important somehow we’ll discuss later) get those samples into labs for further study. In fact, EcoHealth Alliance had sent an estimated 15,000 samples likely to contain coronaviruses to their colleagues at the Wuhan Institute of Virology.
Under a lab origin, we would expect scientists to have somehow financed their endeavor, so we expect something like DEFUSE - a recent grant proposing to move coronaviruses to the city of interest - to be highly likely. We should briefly consider whether the piece of evidence is “DEFUSE” or whether an appropriate piece of evidence is the years-long program shipping bat sarbecoviruses to Wuhan, creating Wuhan as the global hotspot of sarbecoviruses in labs.
If we estimate the conditional probability of a Wuhan origin with logically similar methods that we used for a zoonotic origin, then we can’t give every scientist around the world an equal likelihood of being patient zero, but rather we have to focus on scientists that overlap with sarbecovs during their work. There are a handful of labs around the world that study SARS, many of which are in China, and the most prominent of which is in Wuhan. While most labs studying sarbecoviruses are studying SARS in BSL-3 or BSL-4 environments, the Wuhan Institute of Virology was unique in studying sarbecoviruses in BSL-2. Higher biosafety levels significantly reduce the odds of a laboratory escape, and so we can’t weigh all labs equally.
DEFUSE, and the long research collaborations leading up to DEFUSE, thus informs our estimated likelihood of a Wuhan origin by revealing the extraordinary stockpiles of sarbecoviruses in Wuhan and the documented evidence of this collaboration making reverse genetics systems and enhancing the transmissibility of bat SARSr-CoVs. The combination of DEFUSE and Wuhan is more than simply the products of the two conditional probabilities. We can’t assume independence here as the research in DEFUSE makes a lab origin in Wuhan more likely.
This non-independence grows stronger with the furin cleavage site (FCS).
The DEFUSE grant proposed to insert an FCS in a sarbecovirus in Wuhan.
SARS-CoV-2 is a sarbecovirus that emerged in Wuhan with an FCS.
An FCS has never been seen before nor since in a wild sarbecovirus. If we’re again being generous to the zoonotic origin we might look at the 1,000 years of evolutionary time on the sarbecovirus evolutionary tree and, to be circular in our logic and beg the question, assume the FCS in SARS-CoV-2 was natural, thus we have 1 FCS in 1,000 years evolutionary time. On any given point in time when a sarbecovirus emerged, we’d then estimate
Under a zoonotic origin theory, all of the pieces of evidence believed to be consistent with a lab origin are coincidences. Thus, it’s okay for us to assume independence of these observations under the zoonotic theory.
Under the lab origin theory, however, our theory involves a specific research program and the evidence is not independent. What do we make of the combination of a DEFUSE grant, the Wuhan origin, and the FCS?
We can’t simply separate the FCS from the grant proposing to insert an FCS. Nor should we separate these two facts from the emergence of a virus fitting the grant’s description - a sarbecovirus with an FCS - in the exact city where the grant proposed to do this work.
Thankfully, there is a mathematical approach to handling dependence among our observations. Conditional probabilities are defined as
which we can re-arrange to say
If, for example, we wanted to look at the FCS and DEFUSE, we could write
and thus we can decompose dependent pieces of evidence on the left-hand side by conditioning our probability estimates on the right-hand side.
Under a lab origin, this becomes
and now we can estimate the probability of an FCS given DEFUSE and a lab-origin being appropriately high. If we apply this method to the FCS, Wuhan, and DEFUSE, we get
showing how dependent pieces of evidence can have conditional probabilities estimated by serial conditioning.
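The chain of expressions walked through above can be reconstructed in LaTeX as follows. The generic event labels A and B are assumed for the definition; the remaining lines use the evidence names from the text and follow the order in which they were introduced.

```latex
\begin{align}
P(A \mid B) &= \frac{P(A, B)}{P(B)} \\
P(A, B) &= P(A \mid B)\, P(B) \\
P(\text{FCS}, \text{DEFUSE}) &= P(\text{FCS} \mid \text{DEFUSE})\, P(\text{DEFUSE}) \\
P(\text{FCS}, \text{DEFUSE} \mid \text{Lab})
  &= P(\text{FCS} \mid \text{DEFUSE}, \text{Lab})\, P(\text{DEFUSE} \mid \text{Lab}) \\
P(\text{FCS}, \text{Wuhan}, \text{DEFUSE} \mid \text{Lab})
  &= P(\text{FCS} \mid \text{Wuhan}, \text{DEFUSE}, \text{Lab})\,
     P(\text{Wuhan} \mid \text{DEFUSE}, \text{Lab})\,
     P(\text{DEFUSE} \mid \text{Lab})
\end{align}
```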
If I’ve lost you, you’re not alone. The math I’m describing here is not obvious even to many quantitatively skilled scientists, through no fault of their own. My main point here is that when I (Alex Washburne) estimate posterior probabilities, I account for the dependence in our observations because this is the accurate way to decompose the full problem - the totality of circumstances & evidence and the zoonotic evidence we lack - into an estimate of posterior probabilities. I think this way because I received my PhD studying theoretical ecology and evolutionary biology, or how to weigh pieces of evidence when evaluating competing theories regarding the way species interact or how species originated. My background is thus a niche skillset that requires knowing the biology and the math.
Any scientific theory worth considering has implications, as a theory without implications is untestable. We shake-down theories in ecology and evolutionary biology by assuming some subset of facts - such as assuming DEFUSE was funded somehow and a lab accident occurred, or assuming animal trade led to the introduction of a sarbecovirus to Wuhan - and then use the subset of facts we assume to assess the consilience of the remaining facts. The conditional probabilities expounded above are just a formalization of this logic we use to test theories.
We should, as implied in the last paragraph, apply this logic to steel-man zoonotic origin theories as well. Thus, the emergence of the virus in Wuhan and an outbreak in a wet market and raccoon dog DNA being found in that wet market must be considered jointly… along with inconvenient facts like raccoon dogs not being suitable hosts for SARS-CoV-2, a surge of care-seeking behaviors closer to the Wuhan Institute of Virology prior to the wet market outbreak, an anomalous December 1, 2019 uptick in the use of the word “SARS” on the Chinese social media app Weibo, and more.
All pieces of evidence must be weighed together, but unlike physical weights that simply add up or independent probabilities that multiply, we have to allow the pieces of evidence to enhance each other - as with DEFUSE, FCS, and Wuhan under a lab origin theory - or competitively inhibit each other and destroy the theory, as evidence contradictory to the wet market origin has done.
Pairwise interactions between pieces of evidence can be represented as a Bayesian knowledge graph. For example, under the zoonotic origin theory the DEFUSE and Wuhan origin pieces of evidence are both independent pieces of evidence. Under a zoonotic origin, the conditional probability of a Wuhan origin is not affected by further conditioning on the DEFUSE grant and so there is no connection between these pieces of evidence in the knowledge graph. Under a lab origin, however, the conditional probability of a Wuhan origin is higher when we condition this on our discovery of the DEFUSE grant (and prior collaborations), since more evidence of coronavirus work in Wuhan on the precise clade of viruses that emerges will increase the odds of a Wuhan origin under a lab accident.
We can connect these arrows either way and sequentially condition on evidence arranged in any order, but some orders are more powerful by virtue of improving our estimates of probabilities, especially when we can raise probabilities under one theory close to 1. The value of connecting the evidential dots in a Bayesian knowledge graph grows with increasing strength of dependencies, and hence when considering the totality of circumstances around the origin of SARS-CoV-2 it’s valuable to think carefully about scientific dependencies of a lab origin theory.
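As a small check on the claim that evidence can be conditioned in any order, here is a Python sketch using a toy joint distribution over three binary pieces of evidence under a single theory. The joint probabilities are made up for illustration, and the names `joint`, `prob`, and `chain_rule` are hypothetical. The sketch also shows that multiplying the three marginals, as an independence assumption would, gives a different answer when the evidence is dependent.

```python
from itertools import permutations

# Hypothetical joint distribution over three binary pieces of evidence
# (indices 0, 1, 2) under one theory; the numbers are made up and sum to 1.
joint = {
    (1, 1, 1): 0.30, (1, 1, 0): 0.05, (1, 0, 1): 0.05, (1, 0, 0): 0.10,
    (0, 1, 1): 0.05, (0, 1, 0): 0.10, (0, 0, 1): 0.10, (0, 0, 0): 0.25,
}

def prob(fixed):
    """Probability that each index in `fixed` takes its given value."""
    return sum(p for outcome, p in joint.items()
               if all(outcome[i] == v for i, v in fixed.items()))

def chain_rule(order):
    """P(all three observed) built by serial conditioning in `order`."""
    result, seen = 1.0, {}
    for i in order:
        given = prob(seen)            # P(evidence so far); 1 when empty
        seen[i] = 1
        result *= prob(seen) / given  # P(next piece | evidence so far)
    return result

# Every ordering of the three pieces recovers the same joint probability
chain_values = [chain_rule(order) for order in permutations(range(3))]
# Product of marginals, i.e. what an independence assumption would give
independent = prob({0: 1}) * prob({1: 1}) * prob({2: 1})
print(chain_values, independent)
```

In this toy case the order changes nothing about the final joint probability; what it changes is which conditional probabilities one has to estimate along the way, which is the practical point made above.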
All Roads lead to Reverse Genetics Systems
The DEFUSE grant proposed to insert a furin cleavage site in a bat SARSr-CoV reverse genetics system in Wuhan.
SARS-CoV-2 emerged in Wuhan with a furin cleavage site. Valentin Bruttel, Tony VanDongen and I showed that the genome of SARS-CoV-2 is also consistent with a reverse genetics system.
Now, in addition to Bayesian knowledge graphs that leave biologists in the dust, we have to contend with bioengineering and molecular biological logistics that leave most statisticians in the dust. To be honest, it’s been discouraging to see how difficult it is to convey these complex topics without support or wide venues. A lot of that challenge is that there simply aren’t good venues to discuss this topic with the technical awareness and forensic seriousness it deserves. We feel virtually alone in our efforts to popularize reverse genetics systems. People on our side of the theoretical divide often don’t understand what this is, and people on the other side of the theoretical divide know what this is and are doing everything they can to prevent people from understanding this. Understanding the centrality of reverse genetics systems to the origin of SARS-CoV-2 would, after all, undermine the zoonotic origin theory and potentially lead to severe regulations of virology work, so it’s understandable they don’t want to popularize what we found.
It was after finding this evidence of a reverse genetics system that I became outspoken about the origins of SARS-CoV-2. There’s a reason for that.
Logistics of wildlife virology
When my colleagues were catching bats across SE Asia, they wanted to surveil wildlife and find new viruses. That was the goal of most wildlife virology work. From the USAID’s PREDICT program or the Wellcome Trust + Bill & Melinda Gates Foundation supported Global Virome Project (supported through CEPI) to the DARPA PREEMPT grant that I was a part of or the DEFUSE proposal to the same DARPA PREEMPT program, everyone wanted to find wildlife viruses, characterize their diversity, and study their functions in the labs.
When you catch a bat in Asia, especially when you’re a scientist catching bats with the express purpose of finding viruses in bats, you can’t just ship a sample with live viruses overseas. On one hand, you can freeze your sample, but that requires getting dry ice to whichever remote, humid, buggy location you’re catching bats from and setting up the logistics of moving samples smoothly without risk of the samples warming up. RNA viruses like CoVs are very fragile things and the half life of infectious virions is on the order of 1 day at room temperature.
To transport viruses from the field to the lab, you need to stop the decay. While viruses can survive freezing, especially if you minimize the number of freeze/thaw cycles, freezing samples from the field is difficult because it requires carrying liquid nitrogen and/or dry ice to remote, inhospitable field conditions.
A far superior method for moving wildlife viral samples from the field to the lab is to stop the degradation of RNA by a combination of ethanol/methanol and chemicals that inhibit the activity of “RNAse” enzymes that break down RNA. Storing samples that could contain viruses in fluids that stabilize the RNA and inactivate the virus are beneficial in that they can preserve your viruses and, if they stop infectivity, reduce the risk of anyone getting infected while the sample is in transit or anyone intercepting a sample and acquiring a live potential pandemic pathogen.
Don’t just take my word for it: here’s Ben Hu and the DEFUSE PIs writing their methods in 2017 on how they caught bats & transported samples (Hu et al. 2017):
The authors caught bats, stored samples in a viral transport medium, and stored the samples at -80 degrees C.
Did they thaw out the samples and cultivate them to study live viruses?
Nope. They sent the samples straight into PCR screening and sequencing.
Wildlife virologists were interested in surveying the viruses present in wildlife, and often it was enough to just plan on testing samples (PCR screening) and sequencing strains to get the data we needed to publish papers… that is, until we needed to study the functions of whole viruses.
Logistics of laboratory research
In other words, the lab work in Wuhan did not appear to involve people bringing live animals into the lab, but rather most of the viruses were transported in transportation media and stored at -80 before being sent for sequencing. When could they have been infected by a live virus in this process?
It most likely would’ve been during studies of whole viruses.
How do you study the functions of whole viruses after samples have been frozen and then destroyed via sequencing? Well, you need to have a whole virus. You can try to culture live isolates from the frozen samples, but that’s a very inefficient way to go because (1) most culturing efforts will fail as the live isolates have died during transport/freeze-thaws, (2) it’s not always clear which cell types will best enable culturing, and (3) it’s not clear you’ll get the most interesting virus when you culture things.
A far more efficient method is what Ralph Baric, Shi ZhengLi’s mentor, called an “efficient reverse genetics system”. Reverse genetic systems allow you to go straight from sequences on the computer to, with time, a live virus whose genome is almost identical to that on the computer save a few edits required to build the reverse genetic system.
A reverse genetic system for an RNA virus is a DNA copy of the virus’ RNA genome. The DNA copy can be stored in chunks docked inside plasmids and each chunk manipulated separately prior to forming the full-length clone, or the full-length clone can be stored together inside a ‘bacterial artificial chromosome’ or BAC.
In Hu et al. 2017 mentioned above, the authors find a bunch of bat SARSr-CoVs and use a reverse genetic system from the previous year, rWIV1. Below is the image from Zeng et al. (2016) building rWIV1. (A) shows the genome and all the separate coding regions like ORF1a, ORF1b, the Spike protein, and so on. (B) shows how the authors modified the genome sequence on their computer by adding (solid triangles), removing (white-filled triangle), and utilizing pre-existing (no triangles) cutting/pasting sites for a specific enzyme, BglI. Part (C) shows how the authors built their full-length DNA clone of the coronavirus WIV1 by ordering 8 fragments of DNA, clipping the ends with BglI, and utilizing unique sticky-ends (the unpaired bases like the TAA on the top-strand on the right-hand side of fragment A that pairs with complementary bases ATT on the lower-strand on the left-hand side of fragment B).
The authors put a promoter called “CMV” on the left-hand (5’) side of their full-length clone, and this promoter allows the authors to turn-on transcription of this DNA copy, forming a full-length ssRNA clone of WIV1. The authors electrocute cells to form pores (a process called “electroporation”) and insert the ssRNA rWIV1 inside the cells, after which the cells’ machinery gets hijacked by the virus and live rWIV1 virions are made by immaculate conception of modern biotechnology.
If rWIV1 caused a pandemic prior to its publication, we would be able to identify its synthetic origin because of the regularly spaced BglI restriction enzyme recognition sequences and the hotspot of silent mutations found within exactly these restriction enzyme recognition sequences.
Catch a bat, store the sample, sequence the sample, look at sequences, and then focus your effort on making swappable parts and/or reverse genetic systems with the viruses you want to study. That was wildlife virology pre-COVID. It’s alphabet soup, it’s complicated molecular biology, but it’s also essential to how we do our Bayesian reasoning when evaluating a lab origin theory. All of this, together, is why I became convinced enough of a lab origin that I felt it was worth potentially burning all my bridges with wildlife virology by publishing our findings and popularizing this (or trying to).
The reverse genetic system above enabled Hu et al.’s 2017 work. With the reverse genetics system above, they could swap Spike genes using common tools used from synthetic biology. Specifically, the authors could order special fragments that contain additional cutting/pasting sites with the enzymes BsaI and BsmBI, and these sites could be used similar to BglI above to assemble various chimeras containing Spike genes from different viruses, allowing researchers to evaluate parts from sequenced viruses in the whole-virus of rWIV1.
SARS-CoV-2 has an unusual even-spacing of BsaI and BsmBI restriction enzyme recognition sequences, the same restriction enzymes used in Hu et al. to swap spike genes and build chimeras. The specific arrangement of these recognition sequences in SARS-CoV-2 would even allow the Spike gene segments to be made entirely flanked by BsaI, enabling the re-use of parts from Hu et al. 2017 if one so desired, flanking S-genes with BsmBI for easy insertion.
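To illustrate what a “restriction map” means computationally, here is a minimal Python sketch that locates BsaI (GGTCTC) and BsmBI (CGTCTC) recognition sequences on both strands of a sequence and reports the spacing between them. It is not the analysis pipeline behind the findings described here: the function names are hypothetical, the input is a made-up toy string rather than a viral genome, and splitting at the start of each recognition site is a simplification (these enzymes actually cut a few bases away from the site they recognize).

```python
# Recognition sequences, written 5'->3'
SITES = {"BsaI": "GGTCTC", "BsmBI": "CGTCTC"}

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def site_positions(genome, sites=SITES):
    """Sorted 0-based start positions of every site on either strand."""
    positions = set()
    for motif in sites.values():
        # A site on the opposite strand appears as the reverse complement
        for pattern in (motif, revcomp(motif)):
            start = genome.find(pattern)
            while start != -1:
                positions.add(start)
                start = genome.find(pattern, start + 1)
    return sorted(positions)

def fragment_lengths(genome, positions):
    """Lengths of the pieces obtained by splitting at each site start."""
    cuts = [0] + positions + [len(genome)]
    return [b - a for a, b in zip(cuts, cuts[1:])]

# Made-up toy sequence, not a real viral genome
toy = "A" * 10 + "GGTCTC" + "A" * 20 + "GAGACG" + "A" * 15 + "CGTCTC" + "A" * 5
pos = site_positions(toy)
frags = fragment_lengths(toy, pos)
print(pos, frags)
```

On a real genome, the list of fragment lengths is the kind of quantity one would inspect when asking whether sites are unusually evenly spaced.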
Most importantly, most significantly, and most often overlooked: the BsaI + BsmBI recognition sequences in SARS-CoV-2 are modified from close relatives by an unusually high concentration of silent mutations. The P-value we obtained was around 1 in 20 million odds of finding as-or-more extreme a concentration of silent mutations within these sites than in the rest of the genome. Every other criterion for a reverse genetics system was met, down to the unique sticky-ends for faithful assembly. Combined, we estimate approximately 1 in 50 billion odds in finding a wild coronavirus with a restriction map so consistent with a reverse genetic system.
It could be a coincidence, but it would be quite a shockingly large coincidence, especially since the only times BsaI and BsmBI had been used on a CoV infectious clone prior to the COVID-19 pandemic was in Wuhan in 2017 by the same authors who wrote DEFUSE just one year later.
Some people say the restriction sites in SARS-CoV-2 are found in other viruses but, first, those are just sequences of other viruses, including sequences from the lab believed to have been at the heart of lab origin theory, so it’s not clear the sequences are actual viruses as they can be easily manipulated. The risk of deception is high, and with such high stakes it’s important to be aware of such risks not only when evaluating sequences but also when analyzing early case data that were filtered by the government with the most to lose from a lab origin theory.
Second, the restriction sites found in other viruses and the possibility that this happened by recombination doesn’t explain the hotspots of silent mutations within these sites. Recombination should swap parts more or less at random and there’s no reason for recombination to favor swapping parts that lead to hotspots of silent mutations and evenly-spaced sites.
The 1-in-50-billion odds estimates stand. I was surprised by the statistical analysis that uncovered this finding, and because it is so significant I have been less surprised by follow-on corroboration such as DEFUSE grants containing order forms for BsmBI or the Wuhan Institute of Virology publishing two new pre-COVID reverse genetics systems assembled with BsaI and BsmBI, using silent mutations to disrupt pre-existing sites, exactly as we hypothesized.
Revisiting Lab Origin Theory
The lab origin theory of SARS-CoV-2 contains two variations that are not equally likely. One theory holds that a researcher became infected with a CoV while sampling animals, brought that CoV back to Wuhan, and started the pandemic. This theory becomes unlikely because the virus that they claim infected a field worker just so happens to have the furin cleavage site that was proposed to be inserted in a lab, and the virus just so happens to have that 1-in-50-billion anomaly.
The other variation of lab origin theory is the synthetic variation. Under the synthetic origin theory, SARS-CoV-2 was first just a genome on a screen. To cultivate the virus, researchers needed a reverse genetic system. To insert a furin cleavage site, researchers needed a reverse genetic system. The furin cleavage site is docked inside another restriction enzyme’s recognition sequences (BsaXI) which could allow for golden mutagenesis… but only if they had a reverse genetic system. To swap receptor binding domains as some hypothesize, researchers needed a reverse genetic system. To swap parts and build a pan-coronavirus vaccine, researchers needed a reverse genetic system. For a live virus to have emerged in Wuhan as a consequence of this synthetic biological work, it needed to start its existence by immaculate conception of modern biotechnology: as a reverse genetic system.
The evidence of a lab origin is not independent. The presence of the furin cleavage site, the Wuhan origin, and the DEFUSE proposal all combine to imply the virus is very likely to have originated (under lab origin theory) as a reverse genetic system.
With that knowledge, I set out with Valentin Bruttel and Tony VanDongen over 2 years ago to evaluate whether or not SARS-CoV-2 is consistent with a synthetic origin.
We found a pattern so unusual, we conservatively estimate 1-in-50-billion odds of that pattern arising in a wild coronavirus. The probability of a pattern like this emerging in a lab-derived virus, conditioned on DEFUSE, an FCS, and a Wuhan origin, is nearly 1. That alone leaves us with Bayes factors close to 50 billion in favor of a lab origin.
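Spelled out, the arithmetic behind that Bayes factor is roughly as follows, taking the numerator as approximately 1 and the denominator as the 1-in-50-billion estimate. The label “pattern” is shorthand assumed here for the restriction-map pattern described above.

```latex
\frac{P(\text{pattern} \mid \text{Lab}, \text{DEFUSE}, \text{FCS}, \text{Wuhan})}
     {P(\text{pattern} \mid \text{Zoo})}
  \approx \frac{1}{1 / (5 \times 10^{10})}
  = 5 \times 10^{10}
```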
Until we can popularize Bayesian knowledge graphs for theoretical evaluation of dependent pieces of evidence beside our popularization of wildlife virology and laboratory virology methods, it will be difficult for the world to know the origin of SARS-CoV-2.
For those brave & interdisciplinary enough to study it carefully…
The origin of SARS-CoV-2 is knowable. Knowledge is simply true belief, and we arrive at true beliefs by rigorous syntheses of evidence. Know the evidence, know the biology and the theoretical graph of implications connecting evidence, and know the math to weigh things properly, and one may arrive sooner at true beliefs.
Excellent. For a (only slightly) less technical argument, my favorite supporting the lab leak hypothesis is that the animal species to which SARS-CoV-2 was best adapted at the time of its discovery is the human species. This virus does not replicate in any of the animal species present at the wet market, nor in bats... The ability to infect a species correlated with the binding affinity of Spike for the ACE2 receptor of that species. And the SARS-CoV-2 Spike protein had few mutations in the first few months of the virus propagation, in contrast to the situation with SARS-CoV in 2002-2003 when its Spike sequence evolved quickly to adapt to its new host, again indicating that SARS-CoV-2 was already well adapted to humans at the beginning of 2020.