The Biggest Challenge for AI Biodesign Tools
The mathematical biological crux nobody is talking about is an opportunity for the most innovative technologists
Like William F. Buckley Jr. in a lab coat, I constantly find myself standing athwart the train of biological hype yelling “wait!”
Years ago, it was the microbiome world’s methods for analyzing data, the need for phylogenetic structure and microbial reference frames. In COVID, it was the need to appreciate a large iceberg of unascertained cases, evidence consistent with the NYC March/April 2020 outbreak depleting the city’s susceptible population, and evidence pointing to a synthetic origin of SARS-CoV-2.
Now, I’m watching billions of dollars be thrown around for AI biodesign tools that predict protein folding. Just as microbiome data was still cool, COVID was still a serious pandemic, and zoonotic diseases are still important, today’s neural networks predicting the structure of proteins from genetic sequences are genuinely cool. My critique isn’t of the dream of predicting biological structures and functions from the molecular building blocks we can synthesize. Rather, I aim to point out something everyone’s overlooking, something extremely important: a subtle, tiny little gear in this large hype machine whose turning (or not) determines the success of the whole operation.
The problem is simple to state, very difficult to solve:
Biological data is, for all intents and purposes, infinitely sparse.
Every data point on organismal or protein sequence, structure, and function in all of human history produces a dataset that, still, touches the most infinitesimal fraction of biological possibilities.
Take proteins. The average length of a protein is 300-500 amino acids. There are 20 standard amino acids to choose from at each position.
Now for some math. Upper estimates on the number of atoms in the entire Universe are on the order of 10 to the 82nd power, or 1 followed by 82 zeroes.
In a chain of just 63 amino acids, there are about as many possible sequences as there are atoms in the universe. Each of those sequences is unique and, with it, likely has a unique structure and unique function (if any).
For sequences of DNA, it takes only 137 base pairs of DNA before we have more possible combinations of that DNA sequence than there are atoms in the known universe.
This doesn’t include additional relevant variables for biology, such as the chemical environment of the amino acids, which can affect how those amino acids fold as well as how those amino acids wiggle over time (proteins aren’t static objects - they move). It doesn’t include additional factors affecting DNA, such as the proteins wrapped around it, methylation sites added to it, and so much more.
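The counting above is easy to check. Here is a minimal Python sketch, assuming the 10^82 upper estimate for atoms quoted earlier; the function names are mine, chosen for illustration. It works with exact integers where it can and with base-10 logarithms where the numbers are easier to read that way.

```python
import math

ATOMS_EXPONENT = 82  # upper estimate used in the text: about 10^82 atoms


def log10_sequences(alphabet_size: int, length: int) -> float:
    """log10 of the number of distinct sequences of the given length."""
    return length * math.log10(alphabet_size)


def min_length_to_exceed(alphabet_size: int, exponent: int) -> int:
    """Smallest length whose sequence count is strictly greater than 10**exponent."""
    length = 1
    # Python integers are arbitrary precision, so this comparison is exact.
    while alphabet_size ** length <= 10 ** exponent:
        length += 1
    return length


protein_log10 = log10_sequences(20, 63)                     # about 81.96
protein_length = min_length_to_exceed(20, ATOMS_EXPONENT)   # 64
dna_length = min_length_to_exceed(4, ATOMS_EXPONENT)        # 137

print(f"20^63 is about 10^{protein_log10:.2f}")
print(f"amino acids needed to exceed 10^82 sequences: {protein_length}")
print(f"DNA bases needed to exceed 10^82 sequences: {dna_length}")
```

At 63 amino acids the count sits just under 10^82 ("about as many"), and one more residue pushes it over; for DNA, 137 bases is the first length that exceeds it.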
The space of biological possibilities is, for all intents and purposes, infinite.
We haven’t even scratched the surface of this.
The RCSB Protein Data Bank, for example, has about 218,000 protein structures available as of 2026. This is the cumulative effort of several generations of scientists doing intensive lab work to express, purify, crystallize/freeze, and infer the structure of these 218,000 proteins. It is simultaneously a wonder of the modern world and an exercise in statistical futility.
Let’s say we got to 1 million proteins. That would be incredible.
The number of possible proteins on the shorter end - 300 amino acids - is about 10^390, so it would still be roughly 10 to the 384th power times LARGER than this hypothetical database, which itself represents nearly 5X the cumulative effort of all protein structure projects so far in human history. That number - 10^384 - is so large it’s hard to find analogies for how big it is. Consider 5 universes, each with their own labelled atoms. If you grab one atom from each universe to create a collection of 5 atoms, each from a different universe, there are about 10^410 such collections, and 10^384 is in that same ballpark.
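For anyone who wants to check the comparison, here is a rough sketch in Python. It assumes the same figures used in this article (20 amino acids, a 300-residue protein, a hypothetical 1-million-structure database, 10^82 atoms per universe) and works in base-10 logarithms to keep the exponents readable. The variable names are illustrative.

```python
import math

# log10 of the number of possible 300-residue sequences: 300 * log10(20)
space_log10 = 300 * math.log10(20)             # about 390.3

# log10 of the hypothetical database size (1 million structures)
database_log10 = math.log10(1_000_000)         # 6.0

# Dividing the space by the database subtracts exponents.
ratio_log10 = space_log10 - database_log10     # about 384.3

# One labelled atom from each of 5 universes: (10^82)^5 collections.
five_universe_log10 = 5 * 82                   # 410

print(f"possible 300-residue proteins: about 10^{space_log10:.1f}")
print(f"space / database:              about 10^{ratio_log10:.1f}")
print(f"5-atom collections:            about 10^{five_universe_log10}")
```

By this arithmetic the sequence space is roughly 10^384 times larger than a 1-million-structure database, a few dozen orders of magnitude short of the 10^410 five-universe collections and far beyond anything a single universe's worth of atoms could index.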
Okay, sure, lab work is slow, but can we explore this space with computers? Our computers run on atoms and their electrons, and this space has vastly more points than the universe has atoms, so there is simply no way compute can scale to properly explore the space of biological possibilities.
We need to re-think the space of biological possibilities
Many common methods used to explore the space of biological possibilities are viewing the mathematics and topology of this space all wrong.
Often, we take a gene or organism of interest and mutate it one teeny, tiny base pair at a time. We do that for a large number of base pairs, say 100,000 base pairs a week or maybe millions of base pairs in a month or two, and slowly explore the neighborhood of sequences close to the one we started with. For each of these sequences, we use some familiar assay to measure its function. Maybe, in rare cases, we commit to studying the protein’s structure itself. In this way, we use the tiny lantern of deep mutational scans to slowly encroach upon the infinite darkness of the biological universe.
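To put a number on how small that lantern is, here is a hedged sketch that counts the sequences within one or two substitutions of a starting protein. The "Hamming ball" framing and the 300-residue example are my illustrative assumptions here, not a description of any specific mutational scanning protocol, and insertions and deletions are ignored.

```python
import math


def hamming_ball_size(length: int, alphabet_size: int, radius: int) -> int:
    """Count sequences differing from a reference at no more than `radius` positions.

    Choose which i positions change (math.comb), then choose one of the
    (alphabet_size - 1) alternative letters at each. i = 0 is the reference itself.
    """
    return sum(
        math.comb(length, i) * (alphabet_size - 1) ** i
        for i in range(radius + 1)
    )


LENGTH = 300    # illustrative protein length (short end of the typical range)
ALPHABET = 20   # standard amino acids

single = hamming_ball_size(LENGTH, ALPHABET, 1)   # 5,701
double = hamming_ball_size(LENGTH, ALPHABET, 2)   # 16,196,551

# Fraction of the full sequence space covered, as a power of ten.
space_log10 = LENGTH * math.log10(ALPHABET)
single_fraction_log10 = math.log10(single) - space_log10
double_fraction_log10 = math.log10(double) - space_log10

print(f"within 1 substitution:  {single:,} sequences")
print(f"within 2 substitutions: {double:,} sequences")
print(f"fraction of space (2 substitutions): about 10^{double_fraction_log10:.0f}")
```

Even an exhaustive scan of every single and double substitution around one 300-residue protein covers about 16 million sequences out of roughly 10^390.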
Imagine if we were blind, if we had locations of 200,000 planets in the universe, and if we committed to exploring space by scanning one foot in every direction from every planet we know. I haven’t done the math, but that’s more or less the intuition in my head for the topological process by which biologists are exploring the biological universe. All the matter in the physical universe is estimated by some to fit into a cube approximately 1,000 light years on each side, whereas the entire expanse of the universe would fit in a cube about 30 billion light years on each side. In other words, by volume, empty space in the universe is about 10^22 (1 followed by 22 zeroes) times more common than matter. That still makes it several universes easier to catalogue all the matter and plausibly find all the planets with this intuitively futile idea of moving one foot in random directions from the planets we know.

While space is structured, with planets clustered together, the layout of interesting biological planets isn’t clear. It varies depending on the protein, but estimates range from 25-40% of mutations in proteins being harmful, so while a mutation creates something new, the result may be toxic to the cell and more difficult to study. Depending on what we want these proteins to do, this could be like the “1-foot in random direction” exploration protocol for the universe, except 25-40% of the steps kill the thing you’re trying to keep alive and make better by changing this protein.
So that’s modern biology behind the AI biodesign revolution in a nutshell. We have sampled the tiniest fraction of planets in a biological universe vastly larger than even our own universe from the perspective of an atom. We take brave and expensive teeny, tiny steps in random directions from the known atoms and hypothesize that we’re going to find useful things in a cost-effective manner.
We simply can’t explore this space if we limit ourselves to thinking topologically the way we’ve been thinking, one teeny, tiny, mutational step at a time.
We need to re-think this space, rethink its topology with the same kind of imagination that gave us the wormhole - a way to travel large distances in space in little time, ideally with “directions” chosen, or attractors designed at the far end of the biological wormholes, to increase the likelihood that our exploration lands on a planet and not at some randomly chosen point in empty, meaningless, cell-killing space.
The Solution
If you’re still reading, and you still think this is cool, please like this article, share it with your friends, and have them like it too. I’m currently working on technologies that solve this problem. I’ve actually been working on this space in my free time for over 15 years, and I’m now seeing my ancient idea become relevant.
As I’m told Albert Einstein said, something like 90-95% of the work in science is posing the problem well. Above is that 90-95%, a gift of what I hope is inspiration for others to think about this problem cleverly.
If 1,000 people like this article, I’ll take time out of work to share the solution I’ve come up with. If not, no worries - I’m going to keep working and you’ll probably hear about it eventually! The purpose of a like-based threshold is not just to ensure I’m spending my time doing something people want; it also serves as a critical hint towards what a solution looks like.
In the meantime, I’ll stop standing athwart the train of hype and allow others to write random letters, maybe mutating the English language one letter at a time in hopes of typing the solution in Arabic.


Excellent statement of the problem. On the human side of it (and imo of all biotech) are a deep ignorance and staggering hubris by people who think computing and biological systems are basically interchangeable, or at least have significant overlap.
In my estimation, that near-infinite biological universe you describe precludes any limited tech “solutions” so I’m very curious to hear yours!
Here’s the word of the day: anthropomorphize.
I’m kind of stretching the definition of it here though because I don’t just mean ascribing human features to non-human things; I mean also ascribing human SCALES (which are necessarily rooted in physical human limitations as well as extremely limited human conceptual frameworks) to non-human things.
Because we are used to figuring out how to go to Italy on vacation, we think we can also figure out how to travel to stars that are hundreds of lightyears away.
Because for the entirety of modern human evolutionary time, every production of sensical linguistic formulation came from a sentient mind, we think that LLMs must be sentient.
Because we can go down to the hardware store and sort through the plumbing fittings until we find the right thing to fix our leaky pipes, we think we can compute our way to fabulous new protein structures that will construct humans that won’t ever get sick and die and, of course, will be so much smarter - and likely more fashionable - than we are.
Repeat after me: we are all idiots.
I don’t mean that in a pejorative way - I mean it in a let’s be honest and serious and adult about it way. We are confronted by an existence-miracle which is so much bigger and more varied than our puny little human scale frameworks that understanding it rationally is not and never will be on the table. Knowing it (knowledge being a much cruder and more limited tool than understanding) is literally impossible.
The very smartest of us are still so limited by comparison with the totality of what IS that it’s just a massive category error to imagine we are ever going to understand, much less know, anything about the tremendum in which we find ourselves.
That’s not to say that we can’t progress - obviously we can do so (have done so) within limits that we are also too limited to see. We should just be serious about the limits that we can see: we’re never physically going to the stars; they are just too far away (and it doesn’t matter how many scifi novels we produce and read). LLMs are not a few steps away from sentience; rearranging linguistic tokens has nothing whatsoever to do with sentience. We’re never going to meaningfully explore the space of possible proteins via computation; there’s too much there there, as you have just pointed out so well.