2025-12-13

The performance of many neural networks improves as the number of model parameters and the amount of training data grow. In technology areas where compute is not the limiting factor for model performance and where satisfactory model architectures have been developed for the problem (such as the transformer architecture for large language models), data will be the limiting factor, and asymmetries in data availability can tilt the scales in tech competition over model performance.

AI biodesign tools are driving a wave of excitement about the possibility that large neural networks could allow us to design drugs or possibly even make virtual twins of cells to generate plausible hypotheses about the impact of drugs on cells. Riding the wave of excitement are policies aiming to accelerate the use of AI for technological advancements, from the Genesis Mission announced by the White House to the 2026 National Defense Authorization Act recognizing the role of biosecurity in shaping national security and carving out, among other things, directions for the Department of Defense to specifically take action in support of artificial intelligence tools used in biology.

As a Quantitative and Computational Biologist with a PhD from Princeton, over 15 years of experience analyzing biological data, and a passion for advancing US strategic interests, I want to weigh in on what I believe will be the biggest challenge for advancing US strategic interests in the Bio-AI revolution: data, and the data deficit that exists across countries.

I have wonky, technical perspectives on how we can improve model training and architectures in biology (e.g., transformers for sequence data can likely be beaten by architectures better suited to the way information is stored in DNA sequences, which contain no spaces between tokens). Both model training and architectures will play a role in improving AI biodesign capabilities, but none of that will help the United States beat China in the Bio-AI race because of the data deficit that exists between the two countries.

By “data deficit”, I intentionally play on words to suggest both that there is not enough data at the moment to do what we want these tools to do, and that there are inequalities between the data we export and the data we import to train AI models. To distinguish the two, let’s call the first challenge “insufficient or inadequate data” and reserve “data deficit” for the second, by direct analogy to trade deficits: an inequality in data exported versus imported across countries is essentially a trade deficit in valuable data.

The US will invest heavily in data generation to tackle the problem of insufficient and inadequate data, and current policies aim to make such data available to train AI tools. This is a noble goal, but if we don’t pay close attention to the data deficits our investments in more and better data may create, we may not fully realize the value we intend these investments to create for US Bio-AI tools.

For example, consider one of the most important biodesign tools on the market: Google DeepMind’s AlphaFold. AlphaFold’s models were trained on, among other data sources, the Protein Data Bank or PDB, an international archive of protein structures that can be used for model training. The PDB is a public good, a place where scientists predominantly from Japan, Europe, and the US submit protein and molecular structures, and where any AI biodesign company can scrape structures for structure-prediction tools.

Figure: PDB structures contributed by the top structural genomics projects.

DeepMind used the PDB to produce its revolutionary advance in protein structure prediction. The PDB thus served as a public good, built on countless hours and resources that lab scientists poured into protein expression, purification, and structure determination through various methods, such as blasting protein crystals with expensive X-ray guns (i.e., X-ray crystallography). The AlphaFold team expressed profound gratitude for the PDB resource used in its model training, and in a way the success of AlphaFold was the success of an entire community of protein biochemists who worked for decades on the hard problem that AlphaFold was able to crack.

Shortly after the success of AlphaFold, Chinese entities that played little to no role in the PDB efforts above entered the scene as fast-followers. Baidu, an internet services company analogous to a Chinese Google, created PaddleHelix, a computational biology platform, and PaddleHelix developed a knock-off AI biodesign tool called (I kid you not) HelixFold, aiming to compare its performance to AlphaFold’s. For a quick pop-science detour to give you an eye-rolling appreciation of the nature of this rip-off: protein structures are built largely from two smaller types of structure, alpha helices and beta sheets, which fold and wrap together to give us the full structure of a protein. Alpha-Fold, thus, even had its name ripped off by Helix-Fold.

Admittedly, in 2023 researchers in China, along with PDB leaders in Japan, Europe, and the US, announced the formation of PDBc, or the Chinese Protein Data Bank, aiming to contribute protein structure data to the broader PDB. However, when I visit their website, I get warnings that my connection isn’t private (to be expected from a surveillance state), so I can’t and won’t proceed to evaluate the quantity or quality of their database. I can assure you, though, that it is too little, too late compared to what has already been done to enable AlphaFold and its rip-off, HelixFold.

The Alpha/Helix-Fold story is an important lesson for US investment in AI biodesign tools. On one hand, scientific public goods are invaluable for scientific advancement, and the free world’s databases, from NCBI to PDB, have proven to be effective accelerators of discovery. On the other hand, where there is serious concern about competition from fast-following countries like China that don’t contribute to the public good, these databases also support the competitors of American, European, and Japanese companies, which are more deserving of the benefits of their governments’ investments in expensive data collection and database maintenance.

As policymakers proceed with these exciting new initiatives, missions, authorization acts, and other vehicles to support AI tools in the US, it’s essential to consider how much data drives AI model performance, and consequently the public goods problem introduced by the data deficit: data we pay to collect and share is used by countries, like China, that don’t play by the same rules and don’t pitch in to the public good they benefit from.