Chemical Space: The 10⁶⁰ Universe of Molecules That Has Never Been Explored

Humanity has synthesized roughly 10⁸ compounds. Drug-like molecular space contains an estimated 10⁶⁰. What lies in the gap — and how generative AI is beginning to chart it — is one of the most profound questions in modern science.

A Number That Resists Comprehension

Begin with an attempt at scale. The Milky Way contains roughly 10¹¹ stars. About 4 × 10¹⁷ seconds have elapsed since the Big Bang. Neither figure comes close to 10⁶⁰. Among the familiar cosmic yardsticks, only the roughly 10⁸⁰ atoms in the observable universe exceed it.

Yet 10⁶⁰ is the most widely cited estimate for the number of small, drug-like molecules that could, in principle, exist: compounds that broadly satisfy the physicochemical criteria for oral bioavailability associated with Lipinski's rule of five, namely molecular weights below 500 daltons, acceptable lipophilicity, and manageable hydrogen-bonding capacity. Published estimates vary by many orders of magnitude depending on the assumptions made. This is not the count of all possible molecules. It is an estimate of the molecules that could plausibly serve as drug candidates. The total chemical universe, unconstrained by druglikeness criteria, is vastly larger still.
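
To make the drug-likeness criteria concrete, here is a minimal sketch of a rule-of-five check. It assumes the four descriptors have already been computed elsewhere (for example by a cheminformatics toolkit). The `Descriptors` class, the function names, and the convention of tolerating at most one violation are illustrative choices, and the aspirin values are approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed physicochemical descriptors (hypothetical container)."""
    mol_weight: float       # daltons
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five thresholds are exceeded."""
    return sum([
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumed convention: tolerate at most one violation."""
    return rule_of_five_violations(d) <= max_violations


# Approximate literature values for aspirin, for illustration only.
aspirin = Descriptors(mol_weight=180.16, log_p=1.2,
                      h_bond_donors=1, h_bond_acceptors=4)
# A made-up large, greasy molecule that breaks all four thresholds.
oversized = Descriptors(mol_weight=820.0, log_p=6.5,
                        h_bond_donors=7, h_bond_acceptors=12)

print(passes_rule_of_five(aspirin), rule_of_five_violations(aspirin))
print(passes_rule_of_five(oversized), rule_of_five_violations(oversized))
```

The rule is a heuristic for oral bioavailability, not a definition of what counts as a drug, so a filter like this narrows the space rather than settling its size.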

Against this backdrop, the entire documented output of synthetic chemistry (every compound ever recorded in the CAS registry, every structure deposited in PubChem, every entry across ChEMBL and ZINC) amounts to approximately 10⁸ molecules. The number sounds large until you set it against 10⁶⁰ and recognize that humanity has explored something on the order of one part in 10⁵² of the space it purports to understand.
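
The "one part in 10⁵²" figure is simple exponent arithmetic, sketched below. Both inputs are the order-of-magnitude estimates quoted in the text, not measured values, and the function name is an illustrative choice.

```python
import math

# Order-of-magnitude estimates quoted in the article.
DRUG_LIKE_SPACE = 10**60    # estimated drug-like molecules
KNOWN_COMPOUNDS = 10**8     # compounds documented to date


def explored_fraction_exponent(known: int, total: int) -> float:
    """Return log10(known / total), computed in log space."""
    return math.log10(known) - math.log10(total)


exponent = explored_fraction_exponent(KNOWN_COMPOUNDS, DRUG_LIKE_SPACE)
print(f"Explored fraction is roughly 10^{exponent:.0f}")
```

Working in logarithms keeps the comparison readable: subtracting exponents (8 minus 60) gives the size of the gap directly.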

The drugs that currently treat cancer, infectious disease, neurological disorders, and metabolic illness were found by searching a fraction of a fraction of available chemical space. The question is not whether better medicines exist elsewhere in that space. The question is whether we can navigate there.

10⁶⁰
Drug-like molecules estimated to exist
Lipinski-compliant chemical space
10⁸
Compounds ever synthesized by humanity
CAS Registry + all databases
10¹¹
Molecules in GDB-17 virtual library
Still 10⁴⁹ short of full space

Why the Gap Has Never Shrunk — Until Now

The traditional approach to drug discovery operates through a structural assumption: synthesize compounds, screen them against biological targets, optimize hits into leads, advance leads through ADMET filtering. This pipeline has produced thousands of approved medicines and represents one of the most successful applied scientific enterprises in history. It has also operated, for its entire existence, as a search algorithm with no map.

High-throughput screening (HTS) — the industrial-scale version of this approach — can process millions of compounds against a target in weeks. A million sounds impressive. Against 10⁶⁰, it is statistically indistinguishable from zero. The compounds being screened are not randomly distributed across chemical space; they cluster tightly around known scaffolds, previously synthesized analogues, and the structural biases of the chemists who designed the library. Decades of HTS have deepened exploration of an already-sampled corner of chemical space while leaving the vast majority untouched and unimagined.
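
A back-of-the-envelope sketch shows why brute-force screening cannot close the gap. The throughput of one million compounds per week is an assumed round number chosen for illustration, not a benchmark of any real facility. The age of the universe is the 4 × 10¹⁷ seconds quoted earlier.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600
AGE_OF_UNIVERSE_S = 4e17          # figure quoted earlier in the article
SPACE_SIZE = 1e60                 # estimated drug-like molecules
ASSUMED_RATE_PER_WEEK = 1e6       # hypothetical HTS throughput


def weeks_to_screen(space_size: float, rate_per_week: float) -> float:
    """Weeks needed to screen every molecule once at a constant rate."""
    return space_size / rate_per_week


def in_universe_ages(weeks: float) -> float:
    """Express a duration in weeks as multiples of the universe's age."""
    return weeks * SECONDS_PER_WEEK / AGE_OF_UNIVERSE_S


weeks = weeks_to_screen(SPACE_SIZE, ASSUMED_RATE_PER_WEEK)
print(f"{weeks:.1e} weeks, or {in_universe_ages(weeks):.1e} universe lifetimes")
```

Even a throughput a billion times higher would change the exponent by only nine, which is why the argument turns on where to search rather than how fast.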

The GDB-17 database, enumerated by Jean-Louis Reymond's group at the University of Bern, represents one of the most ambitious attempts to map what could exist rather than what has been made. It contains 166 billion organic small molecules, built by exhaustively enumerating structures of up to 17 heavy atoms (carbon, nitrogen, oxygen, sulfur, and halogens) that satisfy chemical valence rules. 166 billion is a remarkable number. It is also only about 1.7 × 10¹¹, still separated from 10⁶⁰ by a factor that makes GDB-17 itself look like a rounding error.

The conclusion is uncomfortable but unavoidable: conventional chemistry, even at its most computationally ambitious, is not a tool for exploring chemical space. It is a tool for exploiting the small region of chemical space that prior chemistry has already made accessible.

Generative AI as a Navigation System for the Unexplored

The arrival of generative AI in chemistry did not merely accelerate existing workflows. It changed the question being asked. Rather than asking 'which known compounds are most similar to this hit?', generative models ask 'which molecular structures, drawn from anywhere in chemical space, are most likely to satisfy this set of target criteria?' — and then generate those structures de novo, without reference to a pre-existing library.

The mathematical machinery underlying these models varies significantly. Variational autoencoders (VAEs) encode molecules into continuous latent spaces where interpolation and optimization are mathematically tractable — allowing a model to 'move' through chemical space along gradients of predicted biological activity. Diffusion models apply iterative denoising to molecular graphs, generating coherent structures from noise in a process analogous to the way image-generation models produce photographs from random pixels. GFlowNets — generative flow networks — treat molecule assembly as a sequential decision process guided by a reward signal, sampling molecular structures in proportion to their predicted desirability rather than simply maximizing a single objective.
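
The difference between maximizing a single objective and sampling in proportion to desirability can be shown with a toy example. This is not a GFlowNet, which learns a sequential construction policy. It only illustrates the target distribution such a model aims for. The candidate names and reward values are made up.

```python
import random
from collections import Counter

# Hypothetical candidates with made-up reward scores (higher is better).
REWARDS = {"mol_A": 8.0, "mol_B": 6.0, "mol_C": 4.0, "mol_D": 2.0}


def greedy_choice(rewards: dict) -> str:
    """Pure maximization: always return the single top-scoring candidate."""
    return max(rewards, key=rewards.get)


def sample_proportional(rewards: dict, n: int, rng: random.Random) -> list:
    """Draw n candidates with probability proportional to reward."""
    names = list(rewards)
    weights = [rewards[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


rng = random.Random(0)            # fixed seed so the run is repeatable
n_draws = 10_000
draws = Counter(sample_proportional(REWARDS, n_draws, rng))

print("greedy pick:", greedy_choice(REWARDS))
for name in REWARDS:
    print(name, round(draws[name] / n_draws, 3))
```

The greedy strategy returns the same molecule every time, while proportional sampling still favors the best candidate but keeps returning the others. That diversity matters when the reward is only a prediction and the top-ranked structure may fail at the bench.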

Each architecture represents a different strategy for navigating a space of 10⁶⁰ with computational resources measured in kilowatt-hours rather than geological time. Each has demonstrated, in peer-reviewed benchmarks, the ability to propose structurally novel molecules with predicted activity against validated biological targets — molecules that no chemist had previously conceived and that existing screening libraries did not contain.

CHEMICAL SPACE NAVIGATION: METHODS COMPARED
| Navigational Method | Chemical Space Covered | Key Limitation | Where AI Changes the Equation |
| --- | --- | --- | --- |
| High-throughput screening (HTS) | ~10⁶–10⁷ synthesized compounds | Only accesses what has already been made; misses 10⁵²+ molecules | AI virtual screening extends reach without physical synthesis |
| Fragment-based drug discovery | ~10⁶ fragments × combinatorial linking | Synthesis of linked structures often impractical or undocumented | Generative models propose synthetically tractable linked molecules |
| Virtual combinatorial libraries (e.g. ZINC20, Enamine REAL) | ~10¹⁰–10¹⁵ enumerated structures | Still a negligible fraction of 10⁶⁰; biased to known scaffolds | Latent-space navigation accesses structurally novel regions |
| De novo generative AI (VAEs, diffusion, GFlowNets) | Unconstrained exploration of 10⁶⁰ | Many generated molecules are synthetically inaccessible | SynFormer (PNAS 2025) ensures every output has a viable synthetic route |

The Synthesizability Problem: Why Novelty Without Access Is Worthless

For all the theoretical power of generative molecular AI, the field confronts a constraint that is equal parts practical and philosophical: a molecule that cannot be synthesized is not a drug candidate. It is a point in abstract space with no physical correlate.

Early generative models frequently proposed structures that were chemically novel by every metric and utterly inaccessible by any known synthetic route. The molecules scored beautifully on predicted activity, passed in silico ADMET filters, and had no plausible path from commercially available reagents to a solid in a vial. This failure mode is not an edge case — it is the central tension of the field. Chemical space is vast, but synthesizable chemical space is a strict and much smaller subset, and the boundary between them is not always legible from molecular structure alone.

The 2025 publication of SynFormer in the Proceedings of the National Academy of Sciences represents the most rigorous published attempt to resolve this tension. Rather than generating molecular graphs and retrospectively evaluating their synthetic accessibility, SynFormer generates synthetic pathways directly — every molecule it proposes arrives paired with a viable reaction sequence from available starting materials. Benchmarked against dopamine receptor D2 binding optimization, SynFormer demonstrated that constraining generation to synthesizable space does not preclude the discovery of novel, high-affinity molecules. It simply ensures that the molecules discovered are ones a chemist can actually make.
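
The idea of generating routes rather than structures can be sketched with a toy enumerator. This is not SynFormer's algorithm. It is a deliberately simple illustration in which every output is, by construction, a reaction applied to available starting materials. The building-block names and functional-group labels are hypothetical placeholders.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class BuildingBlock:
    name: str
    groups: frozenset        # functional-group labels carried by the block


@dataclass(frozen=True)
class Reaction:
    name: str
    needs: tuple             # the pair of functional groups it joins


@dataclass(frozen=True)
class Route:
    reaction: str
    reactants: tuple         # ordered to match Reaction.needs


# Hypothetical catalogue of purchasable starting materials.
BLOCKS = [
    BuildingBlock("block_acid", frozenset({"carboxylic_acid"})),
    BuildingBlock("block_amine", frozenset({"amine"})),
    BuildingBlock("block_aryl_halide", frozenset({"aryl_halide"})),
    BuildingBlock("block_boronic_acid", frozenset({"boronic_acid"})),
]
REACTIONS = [
    Reaction("amide_coupling", ("carboxylic_acid", "amine")),
    Reaction("suzuki_coupling", ("aryl_halide", "boronic_acid")),
]


def enumerate_routes(blocks, reactions):
    """Return every one-step route whose reactants supply the needed groups."""
    routes = []
    for a, b in combinations(blocks, 2):
        for rxn in reactions:
            first, second = rxn.needs
            if first in a.groups and second in b.groups:
                routes.append(Route(rxn.name, (a.name, b.name)))
            elif first in b.groups and second in a.groups:
                routes.append(Route(rxn.name, (b.name, a.name)))
    return routes


routes = enumerate_routes(BLOCKS, REACTIONS)
for route in routes:
    print(route.reaction, "<-", " + ".join(route.reactants))
```

Because the search runs over reactions and starting materials, nothing it produces lacks a route. The trade-off is that the reachable space is bounded by the reaction set and the catalogue, which is the constraint the article describes.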

The practical implication for drug discovery programs is profound. A generative model that freely navigates 10⁶⁰ but delivers unsynthesizable proposals wastes the most expensive resource in pharmaceutical R&D: medicinal chemist time spent evaluating proposals that die at the bench. A model that navigates the synthesizable subset of that space — still enormously larger than any screening library ever assembled — delivers proposals that can be immediately routed to synthesis. The bottleneck shifts from ideation to execution, which is exactly where it belongs.

The goal was never to generate the most novel molecule. It was to generate the most novel molecule that a chemist can actually make next Monday morning.

ADMET: The Filter That Separates Active Molecules from Useful Ones

Binding affinity to a biological target is necessary but not sufficient for drug candidacy. A molecule must survive the gauntlet of absorption, distribution, metabolism, excretion, and toxicity — the ADMET properties that determine whether a compound can be administered to a patient, reach its target tissue at therapeutic concentrations, and be cleared without causing off-target harm.

This filter is brutal. Roughly 90% of drug candidates that enter clinical trials fail, and a substantial share of those failures trace to safety or pharmacokinetic liabilities that were not predicted or were underweighted during lead optimization. The historical response has been to apply ADMET filtering retrospectively: generate or screen compounds, then evaluate toxicity. The result is enormous waste: resources invested in optimizing molecules that were never going to survive the human body.

Generative models trained on ADMET endpoints — or coupled to multi-objective optimization frameworks that simultaneously score binding affinity, metabolic stability, aqueous solubility, and hERG cardiotoxicity liability — represent a fundamentally different paradigm. The ADMET filter is not applied after generation. It shapes generation. Molecules emerge from the model already occupying regions of chemical space where the training data indicates favorable multi-parameter profiles. The search is not exhaustive; it is intelligent.
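
One common way to fold several endpoints into a single training signal is a desirability score. The sketch below assumes a linear desirability per property combined by a geometric mean. Every threshold, property value, and name is a hypothetical placeholder rather than a recommended cutoff.

```python
import math


def desirability(value: float, worst: float, best: float) -> float:
    """Map a property linearly onto [0, 1]: worst -> 0, best -> 1, clamped.

    Works in either direction, so 'best' may be lower than 'worst'
    (as for hERG potency, where weaker inhibition is better).
    """
    scaled = (value - worst) / (best - worst)
    return max(0.0, min(1.0, scaled))


def multi_objective_score(scores: list) -> float:
    """Geometric mean: a zero on any single property zeroes the total."""
    if any(s <= 0.0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def score_candidate(props: dict) -> float:
    """Combine four endpoints using assumed, illustrative thresholds."""
    return multi_objective_score([
        desirability(props["affinity_pic50"], worst=5.0, best=9.0),
        desirability(props["half_life_min"], worst=5.0, best=60.0),
        desirability(props["log_solubility"], worst=-6.0, best=-3.0),
        desirability(props["herg_pic50"], worst=6.0, best=4.0),
    ])


# Two made-up candidates: balanced versus potent but hERG-liable.
balanced = {"affinity_pic50": 8.0, "half_life_min": 45.0,
            "log_solubility": -4.0, "herg_pic50": 4.5}
potent_but_toxic = {"affinity_pic50": 9.0, "half_life_min": 60.0,
                    "log_solubility": -3.0, "herg_pic50": 6.5}

print(round(score_candidate(balanced), 3))
print(score_candidate(potent_but_toxic))
```

Used as a reward during generation, a score like this steers the model away from molecules that excel on binding but fail on safety, instead of discarding them after the fact. A weighted sum would behave differently, letting a strong affinity compensate for a toxicity liability.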

From Discovery to Deployment: The Gap Generative AI Cannot Close Alone

There is a dimension of the chemical space problem that receives insufficient attention in the academic literature because it does not belong to the domain of molecular generation: the gap between a computationally validated hit and a manufactured drug substance at commercial scale.

Generative AI can propose a molecule with predicted activity, favorable ADMET properties, and a viable synthetic route. What it cannot do is document that molecule's formulation behavior at scale, predict its polymorphic risk during crystallization from a new solvent system, flag that its synthesis requires a reagent with a 14-week lead time, or cross-reference its impurity profile against existing regulatory filings for structurally similar compounds. These are not computational chemistry problems. They are formulation lifecycle problems — and they are where most promising molecules are actually lost.

The bridge between generative molecular discovery and industrial reality requires a different kind of intelligence: structured knowledge of synthesis history, formulation behavior, regulatory requirements, and manufacturing constraints, organized in ways that allow research teams to act on computational proposals without starting from a blank page every time.

WHERE CHEMCOPILOT IS POSITIONED TO HELP

ChemCopilot's formulation intelligence platform sits precisely at the boundary between what generative AI proposes and what a manufacturing organization can actually deliver.

When a generative model identifies a novel scaffold from unexplored chemical space, ChemCopilot can help research teams evaluate its synthesizability against known reaction databases, structure its experimental history from first synthesis attempt forward, and build the formulation documentation that regulatory review will eventually require.

· The 10⁶⁰ problem is a discovery challenge. The problem that follows it — converting a discovered molecule into a developed, manufacturable, regulated drug substance — is a formulation knowledge problem. ChemCopilot is designed for the second problem, which means it serves every team working seriously on the first.

· We do not claim to navigate chemical space. We claim to be ready for what you find there.

The Map Does Not Yet Exist — But the First Coordinates Are Being Plotted

The history of scientific exploration contains several moments when the known world and the knowable world were revealed to be separated by a gap of almost incomprehensible magnitude — when the first maps of oceanic coastlines were drawn and vast blank spaces labeled simply 'here be dragons.' Chemical space in 2026 is such a moment.

What exists beyond the 10⁸ compounds humanity has synthesized is not chaos. It is structure — an enormous, mostly unmapped landscape of molecular possibility with its own geography of synthesizability, biological activity, and physicochemical behavior. Generative AI has given scientists the first instruments capable of navigating that landscape without physically building every point along the route. The molecules being proposed by today's best generative models are the first coordinates in a map that will take decades to complete.

What makes this moment genuinely extraordinary is not the technology. It is the scale of the opportunity. Many disease targets that have resisted medicinal chemistry for decades may have resisted it, at least in part, because the relevant chemical space was never explored. The relevant molecules exist. They have always existed. Humanity simply lacked the tools to find them. Those tools now exist. The work of finding what was always there has begun.

Shreya Yadav

AI Chemistry Muse
