Why generative chemistry, why now.
Chemical space contains an estimated 10⁶⁰ drug-like molecules. A high-throughput screening campaign covers roughly 10⁶ of them. Generative ML, applied correctly with property constraints and synthesizability modeling, bridges a meaningful fraction of that gap: not all of it, but enough to change the economics of hit identification.
Chemical space is a navigation problem, not a retrieval problem
Lipinski's Rule of Five (Ro5) and CNS multiparameter optimization (MPO) filters shrink the search space, but they still leave roughly 10²⁵ candidates, far more than any enumeration approach can reach. The productive region for any given target is a non-contiguous manifold in that space. Generative methods navigate toward it directionally; enumeration cannot.
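To make the filtering step concrete, here is a minimal Python sketch of a Rule of Five check. It assumes the descriptors have already been computed by a cheminformatics toolkit. The `Descriptors` container, the function names, the example values, and the convention of tolerating at most one violation are all illustrative assumptions, not a description of the filters used on this platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    mol_weight: float       # molecular weight in Da
    logp: float             # calculated octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def ro5_violations(d: Descriptors) -> int:
    """Count how many of the four Rule of Five thresholds are exceeded."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def passes_ro5(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumed convention: a molecule passes with at most one violation."""
    return ro5_violations(d) <= max_violations


# Made-up descriptor values, for illustration only.
small_molecule = Descriptors(mol_weight=180.2, logp=1.2,
                             h_bond_donors=1, h_bond_acceptors=4)
large_macrocycle = Descriptors(mol_weight=1202.6, logp=3.0,
                               h_bond_donors=5, h_bond_acceptors=12)

print(passes_ro5(small_molecule))    # True
print(passes_ro5(large_macrocycle))  # False (weight and acceptor count)
```

A filter like this only accepts or rejects a molecule it is handed. It gives no direction toward better candidates, which is the gap generative methods are meant to fill.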
Graph neural networks + variational generation
The model architecture is not a black box applied to SMILES strings. It operates on molecular graphs — atoms as nodes, bonds as edges — capturing topology rather than surface-level string patterns. This matters for generalization to novel scaffolds.
The graph neural network captures molecular topology (atom types, bond types, ring systems, stereocenters) and encodes it into a fixed-dimension representation. Message-passing layers let information propagate across the molecular graph, so the model can capture long-range dependencies between distant atoms.
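The message-passing idea can be sketched in a few lines of plain Python. This is a deliberately simplified illustration: it uses an unweighted sum over neighbors, whereas a trained graph neural network applies learned weights and nonlinearities at each layer. The function names and the toy three-atom graph are assumptions made for the example.

```python
from typing import Dict, List

Vector = List[float]


def message_passing_round(features: Dict[int, Vector],
                          neighbors: Dict[int, List[int]]) -> Dict[int, Vector]:
    """One round: each atom's new vector is its own vector plus the sum of
    its bonded neighbors' vectors (no learned weights in this toy version)."""
    updated = {}
    for atom, vec in features.items():
        agg = list(vec)
        for nb in neighbors[atom]:
            agg = [a + b for a, b in zip(agg, features[nb])]
        updated[atom] = agg
    return updated


def readout(features: Dict[int, Vector]) -> Vector:
    """Sum-pool all atom vectors into one fixed-dimension molecule vector."""
    dim = len(next(iter(features.values())))
    return [sum(vec[i] for vec in features.values()) for i in range(dim)]


# Heavy atoms of ethanol as a chain C0 - C1 - O2, one-hot over [carbon, oxygen].
features = {0: [1, 0], 1: [1, 0], 2: [0, 1]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}

h1 = message_passing_round(features, neighbors)
h2 = message_passing_round(h1, neighbors)

print(h1[0])        # [2, 0]: after one round, C0 has not yet seen the oxygen
print(h2[0])        # [4, 1]: after two rounds, oxygen information has arrived
print(readout(h2))  # [12, 5]: same length regardless of molecule size
```

Each additional round extends an atom's view by one bond, which is how stacked layers pick up dependencies between atoms that are far apart in the graph.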
The encoder maps molecules to a continuous, smooth latent space. Smoothness is the key property: nearby points in the latent space should correspond to structurally similar molecules with similar property profiles, enabling meaningful interpolation and gradient navigation.
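Interpolation in a smooth latent space can be illustrated with a short sketch. The two-dimensional vectors below are made-up stand-ins for encoded molecules, and `interpolate` is a hypothetical helper. Turning each intermediate point back into a structure would need the trained decoder, which is not shown here.

```python
from typing import List


def interpolate(z_a: List[float], z_b: List[float],
                steps: int) -> List[List[float]]:
    """Return `steps` evenly spaced latent points from z_a to z_b, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        [a + (b - a) * t / (steps - 1) for a, b in zip(z_a, z_b)]
        for t in range(steps)
    ]


# Toy latent vectors standing in for two encoded molecules.
z_start = [0.0, 1.0]
z_end = [1.0, 3.0]

path = interpolate(z_start, z_end, steps=5)
for point in path:
    print(point)
```

If the latent space is smooth in the sense described above, decoding consecutive points on such a path should give a gradual series of related structures rather than abrupt jumps.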
Property prediction models are differentiable with respect to the latent space. During generation, gradient descent moves the latent point toward regions where all property objectives improve simultaneously — or surfaces the Pareto trade-off front where objectives conflict.
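When objectives conflict, no single candidate is best on every axis, and the useful output is the set of non-dominated candidates. The sketch below shows one simple way to extract that Pareto front from scored candidates. It assumes every objective is to be maximized; the function names and scores are illustrative, not platform output.

```python
from typing import List, Sequence, Tuple

Scores = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is at least as good on every objective and
    strictly better on at least one (all objectives maximized)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points: List[Scores]) -> List[Scores]:
    """Keep only the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]


# Made-up (potency score, solubility score) pairs for five candidates.
candidates = [(0.9, 0.2), (0.7, 0.7), (0.3, 0.9), (0.6, 0.6), (0.2, 0.2)]

front = pareto_front(candidates)
print(front)  # [(0.9, 0.2), (0.7, 0.7), (0.3, 0.9)]
```

The pairwise comparison is quadratic in the number of candidates, which is fine for a shortlist but would need a faster method for very large sets.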
The synthetic accessibility (SA) score model is embedded in the gradient loop, not applied as a post-generation filter. The generative engine is penalized for proposing structures with high synthesis complexity, so the output distribution skews toward accessible compounds from the start.
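The difference between an in-loop penalty and a post-hoc filter can be seen in a toy optimization. In the sketch below, both the property score and the synthesis-complexity proxy are simple made-up quadratic functions with hand-written gradients. A real system would use learned, differentiable models and automatic differentiation. All names and parameters here are hypothetical.

```python
from typing import List


def property_grad(z: List[float]) -> List[float]:
    """Gradient of a toy property score -(z0 - 2)^2 - (z1 - 1)^2,
    which peaks at z = (2, 1)."""
    return [-2.0 * (z[0] - 2.0), -2.0 * (z[1] - 1.0)]


def complexity_grad(z: List[float]) -> List[float]:
    """Gradient of a toy synthesis-complexity proxy z0^2 + z1^2,
    which grows with distance from the origin."""
    return [2.0 * z[0], 2.0 * z[1]]


def optimize_latent(z_init: List[float], sa_weight: float,
                    lr: float = 0.1, steps: int = 200) -> List[float]:
    """Gradient ascent on (property score - sa_weight * complexity)."""
    z = list(z_init)
    for _ in range(steps):
        grad = [gp - sa_weight * gs
                for gp, gs in zip(property_grad(z), complexity_grad(z))]
        z = [zi + lr * gi for zi, gi in zip(z, grad)]
    return z


z_unpenalized = optimize_latent([0.0, 0.0], sa_weight=0.0)
z_penalized = optimize_latent([0.0, 0.0], sa_weight=1.0)

print([round(v, 3) for v in z_unpenalized])  # [2.0, 1.0]
print([round(v, 3) for v in z_penalized])    # [1.0, 0.5]
```

With the penalty inside the loop, the search settles at a compromise point instead of reaching the unpenalized optimum and then being discarded by a filter.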
Training data: public bioactivity, multi-lab ADMET, and 180K synthesis routes
~2.1M compounds from ChEMBL with experimental bioactivity measurements. Covers diverse target classes, therapeutic areas, and scaffold types. Used for binding affinity and selectivity model training.
ADMET model training uses curated multi-laboratory assay data from published literature. Multiple measurement sources per property reduce assay-specific systematic bias and improve cross-scaffold generalization.
The synthesis feasibility model is trained on 180K published synthesis routes from peer-reviewed literature and CRO-compatible reaction databases. It predicts SA scores calibrated against real synthesis outcomes, not just theoretical accessibility.
Internal validation only — we don't claim external benchmarks
The figures reported here come from internal retrospective validation on held-out compounds from programs we have run. We do not claim performance against published external benchmarks, because our training data and the benchmark test sets may overlap.
The 74% binding accuracy figure means that on targets we have validated against, 74% of top-20 ranked candidates had a measured IC50 within 1 log unit of the prediction. The remaining 26% showed larger discrepancies. We flag these cases using our confidence interval model, which is why a wide confidence interval in the output is a meaningful signal, not noise to ignore.
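For readers who want the metric spelled out, here is a minimal sketch of how a "within 1 log unit" hit rate could be computed. It assumes potencies are expressed as pIC50 (the negative base-10 logarithm of IC50 in molar units) and that candidates are ranked by predicted pIC50, highest first. The function names and the five data points are made up for illustration and are not our validation data.

```python
import math
from typing import List, Tuple


def pic50_from_nanomolar(ic50_nm: float) -> float:
    """Convert an IC50 in nM to pIC50 = -log10(IC50 in M)."""
    return 9.0 - math.log10(ic50_nm)


def fraction_within_one_log(pairs: List[Tuple[float, float]],
                            top_k: int = 20) -> float:
    """pairs holds (predicted pIC50, measured pIC50) per candidate.
    Rank by predicted value (assumed ranking rule), keep the top_k,
    and return the share whose measurement is within 1 log unit."""
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)[:top_k]
    if not ranked:
        raise ValueError("no candidates to evaluate")
    hits = sum(1 for pred, meas in ranked if abs(pred - meas) <= 1.0)
    return hits / len(ranked)


# Made-up (predicted, measured) pIC50 values.
example = [(8.0, 7.5), (7.5, 6.0), (7.0, 7.2), (6.5, 6.4), (5.0, 9.0)]

print(pic50_from_nanomolar(10.0))                  # 8.0
print(fraction_within_one_log(example, top_k=4))   # 0.75
```

One log unit is a tenfold difference in IC50, so a hit under this definition means the measured potency was within a factor of ten of the prediction.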
The 81% synthesizability figure means that 81% of top candidates could be quoted by a CRO without requiring custom reagent sourcing. These are conservative numbers at an early stage of the platform — and we would rather report them conservatively than inflate them for marketing purposes.
Questions about the methodology?
We're happy to walk through the technical approach in detail. Schedule a methodology call with our computational chemistry team. No sales agenda — just the science.