Scientific Approach

Why generative chemistry, why now.

Chemical space contains an estimated 10⁶⁰ drug-like molecules. A high-throughput screening campaign covers roughly 10⁶ of them. Generative ML, applied with property constraints and synthesizability modeling, bridges a meaningful fraction of that gap: not all of it, but enough to change the economics of hit identification.

The Problem

Chemical space is a navigation problem, not a retrieval problem

Lipinski's Rule of Five (Ro5) and CNS multiparameter optimization (CNS MPO) filters reduce the search space, but still leave ~10²⁵ candidates, far more than any enumeration approach can reach. The productive region for any given target is a non-contiguous manifold in that space. Generative methods navigate toward it directionally; enumeration cannot.
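To make the filtering step concrete, here is a minimal sketch of a Rule of Five check. It assumes the descriptors have already been computed by a cheminformatics toolkit; the `Descriptors` container, the function names, and the example values are hypothetical, and tolerating one violation is a common convention rather than a statement about how the platform applies the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    mol_weight: float       # molecular weight in Da
    logp: float             # calculated octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def ro5_violations(d: Descriptors) -> int:
    """Count how many of the four Rule of Five thresholds are exceeded."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def passes_ro5(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumed convention: a molecule passes with at most one violation."""
    return ro5_violations(d) <= max_violations


# Illustrative values only, not measurements of real compounds.
small_polar = Descriptors(mol_weight=350.0, logp=2.5, h_bond_donors=2, h_bond_acceptors=5)
large_greasy = Descriptors(mol_weight=720.0, logp=6.3, h_bond_donors=6, h_bond_acceptors=12)

print(passes_ro5(small_polar), passes_ro5(large_greasy))
```

A filter like this only says whether a given molecule is acceptable. It does not say where to look next, which is why filtering alone does not solve the navigation problem.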

Model Architecture

Graph neural networks + variational generation

The model architecture is not a black box applied to SMILES strings. It operates on molecular graphs — atoms as nodes, bonds as edges — capturing topology rather than surface-level string patterns. This matters for generalization to novel scaffolds.
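As an illustration of the graph view, the sketch below encodes a molecule as atoms and bonds and runs one round of message passing, the basic operation in most graph neural network encoders. It is a toy under stated assumptions: the three-element vocabulary, the sum aggregation, and the sum readout are chosen for brevity, and a real encoder would add learned weights, nonlinearities, and bond features. None of the names come from the platform.

```python
ELEMENTS = ["C", "N", "O"]  # toy vocabulary, assumed for this example


def one_hot(symbol):
    """Initial atom feature: one-hot encoding of the element."""
    return [1.0 if symbol == e else 0.0 for e in ELEMENTS]


def neighbor_lists(n_atoms, bonds):
    """Turn an undirected bond list into per-atom neighbor lists."""
    adj = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def message_pass(features, adj):
    """One round: new atom vector = own vector + sum of neighbor vectors."""
    updated = []
    for i, own in enumerate(features):
        agg = list(own)
        for j in adj[i]:
            agg = [x + y for x, y in zip(agg, features[j])]
        updated.append(agg)
    return updated


def readout(features):
    """Sum over atoms, giving a fixed-size vector for any molecule size."""
    return [sum(col) for col in zip(*features)]


def encode(atoms, bonds, rounds=1):
    features = [one_hot(a) for a in atoms]
    adj = neighbor_lists(len(atoms), bonds)
    for _ in range(rounds):
        features = message_pass(features, adj)
    return readout(features)


# Ethanol heavy atoms, written in two different atom orders.
ethanol_a = encode(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = encode(["O", "C", "C"], [(0, 1), (1, 2)])
print(ethanol_a, ethanol_b)  # both [5.0, 0.0, 2.0]
```

The two atom orderings correspond to two different ways of writing the same molecule as a string, yet they produce the same vector. That order-independence is the property the graph representation provides.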

GNN
Graph Neural Network Encoder
Captures molecular topology (atom types, bond types, ring systems, stereocenters) and encodes it into a fixed-dimension representation. Stacked message-passing layers let information propagate between distant atoms in the molecular graph.
VAE
Variational Autoencoder Latent Space
The encoder maps molecules to a continuous, smooth latent space. Smoothness is the key property: nearby points in the latent space should correspond to structurally similar molecules with similar property profiles, enabling meaningful interpolation and gradient-based navigation.
MO
Multi-Objective Pareto Gradient Guidance
Property prediction models are differentiable with respect to the latent coordinates. During generation, gradient-based updates move the latent point toward regions where all property objectives improve simultaneously. Where objectives conflict, the procedure surfaces the Pareto trade-off front instead.
SA
Synthetic Accessibility as First-Class Constraint
The SA score model is embedded in the gradient loop — not applied as a post-generation filter. The generative engine is penalized for proposing structures with high synthesis complexity, so the output distribution skews toward accessible compounds from the start.
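The difference between penalizing inside the loop and filtering afterwards can be seen in a toy example. The sketch below is not the platform's implementation: the two-dimensional latent space, the quadratic surrogates `predicted_potency` and `predicted_sa`, and the finite-difference gradient (a stand-in for automatic differentiation) are all assumptions made so the example runs with the standard library alone. It performs gradient ascent on the combined objective, which is equivalent to gradient descent on its negative.

```python
def predicted_potency(z):
    # Hypothetical property surrogate: best potency at z = (2, 1).
    return -((z[0] - 2.0) ** 2) - (z[1] - 1.0) ** 2


def predicted_sa(z):
    # Hypothetical SA surrogate: higher means harder to synthesize.
    return z[0] ** 2


def objective(z, sa_weight):
    """Potency minus a weighted synthesis-complexity penalty."""
    return predicted_potency(z) - sa_weight * predicted_sa(z)


def numerical_gradient(f, z, eps=1e-5):
    """Central finite differences; a stand-in for autodiff in a real model."""
    grad = []
    for i in range(len(z)):
        up = list(z)
        up[i] += eps
        down = list(z)
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def guide(z_start, sa_weight, steps=200, lr=0.05):
    """Move the latent point uphill on the penalized objective."""
    z = list(z_start)
    for _ in range(steps):
        g = numerical_gradient(lambda p: objective(p, sa_weight), z)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
    return z


without_penalty = guide([0.0, 0.0], sa_weight=0.0)  # converges near (2, 1)
with_penalty = guide([0.0, 0.0], sa_weight=1.0)     # converges near (1, 1)
print(without_penalty, with_penalty)
```

With the penalty active, the search settles at a different latent point that trades some predicted potency for lower predicted synthesis complexity. A post-generation filter could only accept or reject the unpenalized optimum; it could not move it.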
Figure: Architecture diagram showing molecular graph encoding and latent space navigation.
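Where objectives conflict, the Pareto trade-off front mentioned above is the set of candidates that no other candidate beats on every objective at once. A minimal sketch of that selection follows; the candidate names and scores are hypothetical, and every objective is assumed to be scaled so that higher is better.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return names of candidates not dominated by any other candidate.

    candidates: dict mapping name -> tuple of objective scores (higher is better).
    """
    return [
        name
        for name, score in candidates.items()
        if not any(dominates(other, score) for other in candidates.values())
    ]


# Hypothetical scores: (predicted potency, predicted metabolic stability).
candidates = {
    "mol_a": (0.9, 0.2),
    "mol_b": (0.6, 0.6),
    "mol_c": (0.5, 0.5),  # worse than mol_b on both objectives
    "mol_d": (0.2, 0.9),
}

front = pareto_front(candidates)
print(front)  # ['mol_a', 'mol_b', 'mol_d']
```

Every molecule on the front is a defensible choice. Which one to advance depends on how the program weighs the objectives, and that remains a project decision rather than a modeling one.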
Training Data

Training data: public bioactivity, multi-lab ADMET, and 180K synthesis routes

~2.1M
ChEMBL Activity Data

~2.1M compounds from ChEMBL with experimental bioactivity measurements. Covers diverse target classes, therapeutic areas, and scaffold types. Used for binding affinity and selectivity model training.

Multi-lab
Published ADMET Assay Data

ADMET model training uses curated multi-laboratory assay data from published literature. Multiple measurement sources per property reduce assay-specific systematic bias and improve cross-scaffold generalization.

180K
Published Synthesis Routes

The synthesis feasibility model is trained on 180K published synthesis routes from peer-reviewed literature and CRO-compatible reaction databases. It predicts SA scores calibrated against real synthesis outcomes, not just theoretical accessibility.

74% of top-20 ranked candidates show predicted binding affinity within 1 log unit of measured IC50 on internal retrospective validation
81% of top candidates confirmed accessible by CRO quote within 6 weeks on synthesizability retrospective validation
1 log unit is the IC50 accuracy threshold we use — not tighter, because we don't want to overstate the precision of computational predictions
Validation

Internal validation only — we don't claim external benchmarks

These numbers come from internal retrospective validation on held-out compounds from programs we have run. We do not claim performance against published external benchmarks where our data and the benchmark test sets may overlap.

The 74% binding accuracy figure means that on targets we have validated against, 74% of top-20 ranked candidates had measured IC50 within 1 log unit of prediction. The remaining 26% showed larger discrepancies — we flag these cases using our confidence interval model, which is why wide confidence intervals in the output are meaningful signal, not noise to ignore.
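For readers who want to reproduce this kind of metric on their own data, the check reduces to comparing predictions and measurements on a log10 scale. The sketch below is an illustration with made-up numbers, not the validation pipeline itself; the function name is hypothetical, and both inputs are assumed to be positive IC50 values in the same units.

```python
import math


def fraction_within_log_units(predicted_ic50, measured_ic50, threshold=1.0):
    """Fraction of pairs whose log10(IC50) values differ by at most `threshold`."""
    if not predicted_ic50 or len(predicted_ic50) != len(measured_ic50):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        abs(math.log10(p) - math.log10(m)) <= threshold
        for p, m in zip(predicted_ic50, measured_ic50)
    )
    return hits / len(predicted_ic50)


# Made-up IC50 values in nM, for illustration only.
predicted = [10.0, 100.0, 5.0, 2000.0]
measured = [30.0, 50.0, 800.0, 1500.0]

print(fraction_within_log_units(predicted, measured))  # 0.75
```

One log unit is a tenfold difference, so a prediction of 10 nM counts as a hit for any measured value between 1 nM and 100 nM. In this example the third pair misses by more than a hundredfold.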

The 81% synthesizability figure means that 81% of top candidates could be quoted by a CRO without requiring custom reagent sourcing. These are conservative numbers at an early stage of the platform — and we would rather report them conservatively than inflate them for marketing purposes.

Get in touch

Questions about the methodology?

We're happy to walk through the technical approach in detail. Schedule a methodology call with our computational chemistry team. No sales agenda — just the science.
