Date: 2025-12-03

The White House recently announced the Genesis Mission, a Manhattan Project-scale coordinated effort to mobilize data, compute, and other resources “to accelerate the application of AI for transformative scientific discovery.”

These are exciting times. Over the past 10 years, artificial intelligence (AI) tools have demonstrated impressive and steadily improving capabilities across domains and data modalities. The infectious hype of AI has spread to every field of science.

As an interdisciplinary scientist, I’ve felt the excitement of AI firsthand. In my work at Sandia National Labs, I helped develop AI tools for bioscience, nuclear security, intelligence analysis, and more. As such, I feel well-positioned to share a few perspectives on Section 4 of the Genesis Mission announcement: the identification of national science and technology challenges. While I would be excited to share ideas about national challenges across all domains, I want to focus my effort on a domain in which I have spent my entire life: biotechnology.

Having also worked in academic, commercial, national lab, and federal government contracting roles, I have some advice for the broader Genesis Mission on how to define good challenges and how to advance US strategic interests in the course of this exciting new effort.

Genesis National Biotechnology Challenges

DeepAI image generated by caption: “a robot is standing in a lab, wearing a lab coat while holding an Erlenmeyer flask in one hand and a pipette in the other.”

Automate research in BSL-4 labs

Lab work is a routine, laborious rite of passage for any biologist, and some biologists (myself included) continue stepping into the lab as required throughout their careers so as never to be too far from the biological systems we study. Automated labs and automated lab systems are cropping up to reduce the tendonitis and ergonomic frustrations of lab work while increasing the scale of data we can collect and letting us loop through iterations of cool experiments that would otherwise be torturous to conduct.

Automated labs could be seen as a more general capability, but a sharper challenge would be to have automated lab technologies enhance biosafety in our handling of high-consequence pathogens studied at the highest biosafety levels. This challenge is more specific and requires that automated technologies cover more ground: they must perform all the essential tasks of lower-security labs while adding a higher level of assurance that accidents won’t happen, that records are complete, and that if any mistakes do happen they are immediately identified, reported, managed, and contained. If an automated lab spills LB broth, nobody cares, but automated labs in BSL-4 environments need to keep records so detailed that we can trace back any weirdness, including accidents, should we need to revisit our lab notebook.

Additionally, BSL-4 work can involve animal models, and while it’s easy to automate liquid handling in 96- or 384-well plates, robotic and automated handling of mice and primates is a challenge that will test the ability of robotics to cope with the unpredictable behavior of live animals. By the time automated lab systems can handle the injection, orbital bleeding, and surgical examination of lab animals infected with high-consequence pathogens, we will have developed new tools for animal handling with veterinary applications, as well as surgical tools useful for medicine.

Automating BSL-4 research can also be a massive innovation in biosafety, taking the human out of the room with dangerous pathogens and fully separating pathogens that could cause a pandemic from biological hosts the pathogens are desperate to infect.

Foundation models for sequence alignment and functional annotation

All life on Earth comes from a common ancestor. We know that because all life on Earth shares essentially the same genetic code, a seemingly arbitrary mapping of nucleic acid sequences to amino acids. If life had originated multiple times, we believe that mapping would have been randomly re-designed each time, leaving different branches of life with different origins and different genetic codes. Because all living things have genetic material and all genetic material uses the same code, genetic sequences contain rich taxonomic and functional information. Sequencing has therefore become a central tool for studying life, generating some of the largest datasets in biology.

There are two central tasks in sequencing for which we still use old algorithms, but for which novel foundation models have hope of transforming biology: alignment and annotation.

Alignment of sequences is as it sounds: finding out where and how well two sequences line up with one another. If I give you a 1000 base-pair sequence, alignment against reference databases can tell you that the sequence comes from a specific region of a monkey genome, and not from E. coli. Alignment is also how we build larger genomes from so-called “shotgun sequencing”, where we blow up the genome into short snippets and sequence many short snippets in parallel. Given many 150 base-pair sequences, alignment tools identify the ones that overlap with one another and can be combined to form a larger, contiguous sequence or “contig”. Assemble enough contigs and you can assemble the entire genome of a previously uncharacterized organism. Alignment is a good challenge because modern alignment still relies on a very old (albeit very good) tool, the Basic Local Alignment Search Tool or BLAST, along with a variety of other methods that, under the hood, are clever yet crude. The challenge of beating BLAST (faster results, less compute, more insights) would be a moonshot in biology.
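To make the contig idea concrete, here is a minimal Python sketch of greedy overlap merging. It is a toy under stated assumptions, not how BLAST or production assemblers work: it only handles exact suffix-to-prefix matches, ignores sequencing errors and reverse complements, and the names overlap_length, greedy_assemble, and min_overlap, as well as the example reads, are made up for illustration.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                length = overlap_length(left, right, min_overlap)
                if length > best_len:
                    best_len, best_i, best_j = length, i, j
        if best_len == 0:
            # No remaining pair overlaps enough: what is left are the contigs.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical short reads cut from one made-up 15-base sequence.
reads = ["ATGGCGTA", "CGTACGTT", "CGTTAGC"]
contigs = greedy_assemble(reads)
print(contigs)
```

The all-against-all comparison in this sketch scales quadratically with the number of reads, which is one reason faster and smarter approaches to finding overlaps matter at real sequencing scale.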

While alignment finds sequences similar to a given snippet of DNA, annotation tells me what a given snippet of DNA does. As an example of functional annotation, people will sometimes scoop up a sample from the environment (look around and pick literally anything from the sink, the air, or the soil outside; there are microbes there), shotgun sequence the microbes in the sample, align the sequences to databases of genes with known or estimated functions, and use those data to categorize the functions present (and, with less certainty, the relative abundances of those functions) in that microbial community.

A really good foundation model for sequence data may be able to capture the relevant information of a DNA sequence, allowing us to both align and functionally annotate any snippet of DNA.

I’ve done some work in this field (albeit, like a lot of my work these days, I haven’t published it because publishing is a chore that kills my curiosity vibe) and have some insights to share here. First, many sequence foundation models are repurposed large language models (LLMs), which operate on tokens (e.g. “where”, “we”, “use”, and “token” plus “s” are the tokens in the 4 words preceding this parenthetical phrase). To repurpose LLMs for sequence data, biologists will treat either every nucleotide or every k-mer (a sequence of length k) as a token. Tokens are an elegant way of representing the information inherent in natural language, but they are an inelegant way to represent the information in DNA sequences because, above all, there are no spaces between blocks of meaning in DNA sequences. Imagine if we didn’t have words and used 3-letter k-mers to process the following:

eornottobethatisthequestionwhethertisnoblerinthemindtosuffertheslingsandarro

You should be able to recognize Hamlet’s famous “to be or not to be” soliloquy, but that’s only because your brain is trained to identify words as patterns, even when those words have different lengths. You trained your brain to identify words because you read posts like this one that, at first, separated the words. DNA doesn’t separate words. To make things more complex, the “context” of information in DNA is sometimes organized like Russian dolls, with snippets of meaning nested inside larger snippets of meaning. The context is also bidirectional (e.g. promoters for downstream genes and vice versa), and sometimes context hops around in weird ways, as DNA encoding functional RNA molecules will cleverly use palindromic sequences to allow the RNA to bind itself.
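The tokenization problem can be illustrated with a few lines of Python. This is a minimal sketch, assuming the simplest possible scheme (non-overlapping 3-letter k-mers); the function name kmer_tokens and its offset parameter are hypothetical, and real sequence models may use overlapping k-mers, single nucleotides, or learned vocabularies instead.

```python
def kmer_tokens(sequence: str, k: int = 3, offset: int = 0) -> list[str]:
    """Split `sequence` into non-overlapping k-mers, starting at `offset`."""
    return [sequence[i:i + k] for i in range(offset, len(sequence) - k + 1, k)]


text = "tobeornottobe"  # "to be or not to be" with the spaces removed

frame0 = kmer_tokens(text, k=3, offset=0)
frame1 = kmer_tokens(text, k=3, offset=1)

print(frame0)  # ['tob', 'eor', 'not', 'tob']
print(frame1)  # ['obe', 'orn', 'ott', 'obe']

# The same text, read one character later, shares no tokens with the first reading.
print(set(frame0) & set(frame1))  # set()
```

Shifting the starting position by a single character changes every token, and none of the tokens line up with the actual words. A model fed fixed k-mers has to learn around that, which is part of why word-style tokens fit DNA poorly.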

Challenging researchers to build the most competitive foundation models for alignment and annotation will force biologists to overcome these obstacles. As a conflict-of-interest disclosure (or as a challenge), I have some of my own solutions for this problem that maybe I’ll write up for broader consumption someday.

Speed up patient processing

This sounds boring, but it is one of the most important tasks for AI in medicine. You don’t feel 100% well - how long does it take you to get a diagnosis? If your condition requires treatment or surgery, how long does it take you to get your treatment? How hard is it for you to book your surgery?

The challenge of speeding up patient processing requires overcoming many smaller challenges, such as getting really good at diagnosis, which alone requires cleverly merging multimodal models of language (e.g. your description of symptoms), vision (when you open your mouth and say “ah”, what does it look like?), biometrics (heart rate, respiratory rate, temperature, blood results, etc.), and more. In addition to getting good at diagnosis, speeding up patient processing will require clever use of agentic AI or large action models capable of supporting clinical decisions (e.g. recommending that you see a specialist or get another test, then scheduling a time for you to see the specialist or get the test, and so on). Speeding up patient processing should also be benchmarked against clinicians’ time: for every hour of seeing a patient, a clinician can spend as much as 2 hours writing up reports about the patient’s condition. If such a clinician never had to write reports, if the reports wrote themselves, then the clinician would in theory be free to see 3x as many patients as they currently see, improving our medical capacity and further speeding up patient processing.
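The 3x figure follows from simple arithmetic, sketched below in Python. This is a back-of-the-envelope model, assuming a clinician’s total working hours are fixed and split only between seeing patients and writing reports; the function name capacity_multiplier and the fraction_automated parameter are hypothetical additions for exploring partial automation.

```python
def capacity_multiplier(report_hours_per_patient_hour: float,
                        fraction_automated: float = 1.0) -> float:
    """Patient-facing hours after automation divided by before, at fixed total hours."""
    if not 0.0 <= fraction_automated <= 1.0:
        raise ValueError("fraction_automated must be between 0 and 1")
    # Total time spent per hour of patient contact, before and after automation.
    before = 1.0 + report_hours_per_patient_hour
    after = 1.0 + report_hours_per_patient_hour * (1.0 - fraction_automated)
    return before / after


full = capacity_multiplier(2.0)        # reports write themselves entirely
half = capacity_multiplier(2.0, 0.5)   # only half of the report time is saved

print(full)  # 3.0
print(half)  # 1.5
```

Under these assumptions, even automating half of the report writing would let the same clinician see 1.5x as many patients, so a benchmark on clinician time does not need full automation to show gains.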

This sounds like a mundane task, but the medical sector is one of the largest sectors in our economy and health is one of the most important things for Americans’ quality of life. If this administration’s Genesis Mission could speed up patient processing, it would improve Americans’ health and their relationship with the medical system, and encourage better care-seeking behavior (e.g. wouldn’t you go to the doctor more if going to the doctor wasn’t such a slog?). That would be one of the most profound accomplishments: a transformation of our medical system into one that prioritizes the experience of the customer. Needless to say, speeding up the process of dealing with insurance claims is also part of this, but there the administration may encounter resistance from an incumbent industry that most Americans hate.

Improve cell functions and external control of cells

A lot of people are talking about “virtual cells” these days, or digital twins of cells that they imagine could be used to do all sorts of wondrous things. I think the concept of a digital twin for cells is missing some critical biology. It sounds more like the imagination of someone who has been doing too much coding and not enough actual biology to appreciate the heterogeneity of a cell, the fact that cell functions occur over a life cycle of cell division, and the fact that a lot of our data collection on cells is destructive (e.g. you can sequence a cell, but you won’t get its metabolites or proteins once you’ve destroyed it for sequencing). Also, there are so many cell types and cell states across the tree of life that the data become impossibly sparse. The concept of a virtual cell is a dream, or a hope that perhaps there are central principles and lower dimensions in these data such that we can learn the essentials of cells. Hopes and dreams are awesome, but not as well-defined and motivating as challenges ought to be.

A better cell biological challenge would be: improve critical functions of a cell, such as improving the ability of E. coli to manufacture insulin, algae to make more biofuels, or Chimeric Antigen Receptor (CAR)-T cells’ ability to identify and destroy cancer cells. These challenges focus on specific biological functions of cells and try to use data collection, insights, and bioengineering (possibly automated) to make those specific, benchmark-able functions better.

Another challenge would be to improve external control of cells, such as improving our ability to make cancer cells die, for example through well-timed external control of cancer cell division, as my mom and I are doing in efforts to beat pancreatic cancer. Another form of external control may focus on automated organ production: can AI serve as the conductor of the symphony of symphonies of multicellular development and oversee the production (including mass-production) of organs used as transplants? This is a cell biological challenge that requires careful control of signaling molecules and environmental conditions to ensure the billions of cells growing into an organ from a seed all do what they need to do and become an organ.

And once AI has made an organ, could AI make a better organ?

General principles of a good challenge, and challenges for the Genesis Mission

For biological challenges, focus on functions and performance, not abstract concepts, models for the sake of models, data for the sake of data, scale for the sake of scale, digital twins for the sake of digital twins, or automation for the sake of automation.

A challenge for the Genesis Mission will be crafting government-conceived challenges in a way that matches the interests and incentives of academic scientists with the realities of commercialization. Technologies supported by the Genesis Mission will be most successful if they draw academics and national lab scientists to work on problems that yield technologies that can be commercialized, allowing new technologies to generate revenues that support long-term R&D. As a scientist who has worked in academic, national lab, and commercial research environments, I can’t overstate how difficult this will be. Academics like to do research that props up their prior research portfolio, so posing a question in a way that people can see is related to things they already do will be a challenge. National labs are largely run by risk-averse managers, not Oppenheimer-esque scientists. Getting both academic and national lab scientists to lean towards commercializing AI tools will be challenging, especially with the difficulties of patenting AI tools (admittedly, that is changing with USPTO director John Squires’ excellent policy changes on patenting AI tools).

My advice: require commercialization plans for every grant and possibly require collaboration with private industry for most grants, and don’t guarantee funding for too long but instead provide incentives for extensions and bonuses for high-performing teams with clear asks illustrating how more funds will lead to multiplicative increases in value creation.

At the heart of the AI dream is a belief, or hope, that the automated processing of data will lower costs, improve products and services, and ideally create a virtuous cycle enabling further data collection, further model improvements, and thereby further commercial advances. A challenge for the Genesis Mission will not only be to organize datasets, provide compute resources, and fund cool research, but to create positive feedback loops whereby new tools generate new data that generate newer and better tools.

Another challenge for the Genesis Mission will be to do so in a way that advances the strategic interests of the United States. This is harder than it sounds, because at present the United States operates what I call a “data deficit” with countries like China able to access data from US grants while not as openly sharing data from their own consumers, funded research, commercial platforms, and more. I first experienced the data deficit when trying to collect business intelligence and do alternative data analyses of Chinese equities: it is remarkably easy to buy data on United States citizens and companies, and painfully difficult to get similar data on Chinese entities. Where the scale of data available (and model size) determines the performance of a model, a data deficit will benefit the country that has access to more data, and the history of US federal agencies’ data publication provides our adversaries with data they didn’t pay for, but can combine with their own, unshared data to make better models.

If done well, the Genesis Mission could indeed be the Manhattan Project of AI. The combined academic prowess, free markets, unbeatable venture capital industry, and national labs of the United States could win the AI race if we play to our strengths. By carefully analyzing the nature of AI technological development, focusing on concrete goals facing specific fields such as biotechnology, and coordinating funding, data sharing, and patent policies in a way that advances US strategic interests, the Genesis Mission has a good shot at succeeding.

Good luck, Godspeed, and I look forward to doing my part as a scientist!