Fragment Expansion with ML Guidance: From 200 Da to Drug-Like

Fragment-based drug discovery (FBDD) starts with a deliberate constraint: find small molecules that bind your target, even weakly, and then grow them. A fragment hit typically sits at 150–300 Da, binds with K_d values in the 100 µM to 1 mM range, and has a ligand efficiency (LE) high enough to justify further investment. The hard part is what comes next: how do you grow a 200 Da fragment into a 450 Da drug-like molecule without destroying the binding geometry that made the fragment useful in the first place?

The conventional approach relies on crystallography, iterative synthesis, and medicinal chemistry intuition. That workflow is not broken. But it is slow, and each synthetic cycle consumes 2–4 weeks. ML-guided expansion is not a replacement for the crystallography or the chemistry judgment—it is a filter that helps you decide which vectors to explore before you run a synthesis.

Why Fragment Expansion Is Different from Standard Hit Optimization

When you're optimizing an existing hit with pIC50 of 6.5, you're working within an established SAR context. Substituent effects are measurable. You know which part of the molecule drives potency and which drives selectivity. Fragment expansion is different: your starting compound may bind at only 500 µM with just a handful of defined contacts. The expansion step is not optimization—it is hypothesis generation. You're asking: given this anchor, which additions could generate additional binding interactions that lower K_d by two to three orders of magnitude?

This framing matters for how ML enters the picture. The model isn't refining local SAR; it's navigating from a sparsely characterized starting point into a large chemical neighborhood. That requires generative capacity, not just regression.

The Expansion Vectors Problem

Most fragments have two to four viable growth vectors: positions on the scaffold where adding atoms won't disrupt binding pose or introduce steric clash with the protein. Identifying those vectors from a crystal structure is manageable. The combinatorial challenge comes when you consider what to attach at each vector. Even a modest pharmacophoric set—say 500 building blocks per vector, across three attachment points—yields 125 million candidate combinations. Scoring them all with docking is feasible in principle but slow; scoring them all with an accurate free energy method is not feasible at this stage.
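The size of that search space is simple multiplication, and a few lines of Python make it easy to re-run for other library shapes. This is a minimal sketch; the function name and the toy substituent labels are illustrative assumptions, not part of any real building-block catalog.

```python
from itertools import product
from math import prod

def library_size(blocks_per_vector):
    """Count full combinations when each vector receives exactly one building block."""
    return prod(blocks_per_vector)

# The case from the text: 500 building blocks at each of three vectors.
full_library = library_size([500, 500, 500])
print(f"{full_library:,} candidate combinations")

# Toy enumeration with hypothetical substituent labels, small enough to list.
toy_vectors = [["-CH3", "-OH"], ["-F", "-Cl", "-NH2"]]
toy_combos = list(product(*toy_vectors))
print(len(toy_combos), "toy combinations, e.g.", toy_combos[0])
```

Dropping one vector from the enumeration shrinks the space by a factor of 500, which is why restricting the first round to one or two vectors is often the cheapest pruning step available.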

ML models trained on activity data can prune this space drastically before docking. In our workflow at Nanolix, we run a two-stage filter: a graph neural network predicts binding likelihood from 2D structure alone, cutting the 125M candidates to roughly 50,000; a docking-plus-rescoring pass then reduces that to 200–400 candidates worth synthesizing in the first round. The GNN pass runs in minutes. Docking 50,000 compounds against a clean crystal structure takes a few hours on a small compute cluster. The net effect is that you hand the chemist a prioritized list of 300 expansion candidates with predicted poses, rather than asking them to enumerate by hand.
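The shape of such a two-stage filter can be sketched in a few lines: rank everything with a cheap score, keep a shortlist, then rank only the shortlist with the expensive score. The sketch below assumes the scoring functions are supplied as plain callables; `cheap_score` and `expensive_score` here are toy stand-ins, not a GNN or a docking program, and all names are hypothetical.

```python
import heapq
from typing import Callable, Iterable, List

def top_k(candidates: Iterable[str], score: Callable[[str], float], k: int) -> List[str]:
    """Keep the k highest-scoring candidates without sorting the whole stream."""
    return heapq.nlargest(k, candidates, key=score)

def two_stage_filter(candidates, cheap_score, expensive_score, k_cheap, k_final):
    """Apply the cheap score to everything, the expensive score only to the shortlist."""
    shortlist = top_k(candidates, cheap_score, k_cheap)
    return top_k(shortlist, expensive_score, k_final)

# Toy demonstration: candidate IDs with arbitrary stand-in scoring functions.
toy_candidates = [f"cand_{i}" for i in range(1000)]

def cheap_score(name: str) -> float:       # stand-in for a fast 2D model
    return int(name.split("_")[1]) % 97

def expensive_score(name: str) -> float:   # stand-in for docking plus rescoring
    return -abs(int(name.split("_")[1]) - 500)

final = two_stage_filter(toy_candidates, cheap_score, expensive_score,
                         k_cheap=100, k_final=10)
print(final)
```

The design point is that the expensive function is called only `k_cheap` times, so its cost is set by the shortlist size rather than by the size of the enumerated library.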

Scaffold Integrity During Expansion

One failure mode we see repeatedly: expansion candidates that look active in silico because the scoring function is rewarding molecular weight rather than binding efficiency. A 420 Da molecule that scores better than a 250 Da molecule simply because it fills more space in a large hydrophobic pocket is not a good hit; it is a size artifact. Tracking ligand efficiency throughout the expansion is therefore critical. We compare each candidate's predicted LE against the parent fragment's and flag any candidate whose predicted LE falls below 0.25 kcal/mol per heavy atom, even if its absolute predicted pIC50 looks acceptable.

This is not saying fragment expansion should preserve original LE perfectly—that's unrealistic. As MW grows, LE will drift downward. The question is rate of drift. A fragment at LE 0.45 that expands to 350 Da and LE 0.32 is probably following a productive trajectory. The same expansion landing at LE 0.18 indicates the added mass isn't translating into binding contacts, and we deprioritize it before it reaches synthesis.
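For readers who want to reproduce the LE numbers, here is a minimal sketch using the standard definition LE = -ΔG / N_heavy with ΔG = RT ln K_d. The temperature of 298.15 K, the function names, and the example compounds are assumptions for illustration; only the 0.25 floor comes from the text above.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)

def ligand_efficiency(kd_molar: float, heavy_atoms: int, temp_k: float = 298.15) -> float:
    """LE in kcal/mol per heavy atom, from a dissociation constant in mol/L."""
    delta_g = R_KCAL * temp_k * math.log(kd_molar)  # negative when Kd < 1 M
    return -delta_g / heavy_atoms

def le_status(le: float, floor: float = 0.25) -> str:
    """Flag candidates whose LE falls below the floor used in the text."""
    return "keep" if le >= floor else "flag"

# Illustrative (hypothetical) compounds: same 1 uM affinity, different sizes.
compact = ligand_efficiency(1e-6, heavy_atoms=20)   # about 0.41
bloated = ligand_efficiency(1e-6, heavy_atoms=50)   # about 0.16
print(f"compact: {compact:.2f} -> {le_status(compact)}")
print(f"bloated: {bloated:.2f} -> {le_status(bloated)}")
```

The two hypothetical compounds have identical potency, yet only the smaller one clears the floor, which is the size artifact described above expressed as arithmetic.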

A Concrete Example: Pyrrolidine Fragment to Kinase Inhibitor Scaffold

In a kinase hit identification program we ran in early 2025—a Ser/Thr kinase with an unusually deep allosteric pocket adjacent to the ATP site—we started from a fragment screen that yielded a pyrrolidine-based hit at 218 Da and K_d of 340 µM. Crystallography confirmed a hydrogen bond to Asp in the DFG motif. Three growth vectors were geometrically accessible.

The unconstrained expansion search flagged 87,000 two-vector combinations worth docking. The GNN filter reduced that to 9,400. After docking and LE filtering, we delivered 180 ranked candidates. Medicinal chemistry review selected 24 for synthesis, prioritizing structural diversity at the expansion points. Of the 24 synthesized, 11 showed measurable activity (IC50 < 100 µM), 4 had IC50 < 1 µM, and 2 passed the initial ADMET filters and qualified as lead candidates for further profiling. Going from a 340 µM fragment to sub-micromolar compounds, an improvement of two to three orders of magnitude, took three synthetic cycles over six weeks. That is not an anomaly. It reflects how much a well-anchored fragment narrows the search.
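The funnel in this example is easier to read as stage-to-stage retention and hit rates. The short script below only restates the counts given above. Comparing the fragment's K_d with a 1 µM IC50 treats the two quantities as roughly interchangeable, which is an approximation that depends on assay conditions.

```python
from math import log10

# Counts taken directly from the case study.
funnel = [
    ("enumerated two-vector combinations", 87_000),
    ("after GNN filter", 9_400),
    ("after docking and LE filter", 180),
    ("selected for synthesis", 24),
]

def stage_retention(stages):
    """Fraction of candidates surviving each step, relative to the previous stage."""
    return [(name, n / prev_n) for (_, prev_n), (name, n) in zip(stages, stages[1:])]

synthesized = 24
outcomes = {"IC50 < 100 uM": 11, "IC50 < 1 uM": 4, "lead candidates": 2}
hit_rates = {label: n / synthesized for label, n in outcomes.items()}

# 340 uM fragment versus a 1 uM compound (Kd and IC50 treated as comparable).
orders_of_magnitude = log10(340 / 1)

for name, frac in stage_retention(funnel):
    print(f"{name}: {frac:.1%} retained")
for label, rate in hit_rates.items():
    print(f"{label}: {rate:.1%} of synthesized compounds")
print(f"potency gain: {orders_of_magnitude:.2f} log units")
```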

Where ML Guidance Does and Doesn't Help

ML-guided expansion performs well when the expansion is local: growing one or two vectors while retaining the original scaffold. It performs less reliably for merging, where you combine two fragments binding at adjacent sites into a single molecule that bridges both pharmacophores. Merging introduces conformational strain that current 2D-based models do not capture accurately, and 3D models need adequate training data near the merge geometry. If your expansion strategy involves merging, we'd recommend treating ML scores as weak signals and leaning more heavily on pose-constrained docking and free energy perturbation (FEP) for the merge step.

Similarly, ML expansion guidance is less useful when the fragment binds in a disordered region with multiple possible poses. If the crystal structure shows two or three alternative binding modes with similar crystallographic evidence, you need to resolve that ambiguity—through additional biophysics or MD simulation—before the expansion model can be trusted. Feeding ambiguous pose data into a property model produces noise, not guidance.

Synthetic Accessibility in the Expansion Loop

A commonly skipped constraint in computational expansion: whether the proposed molecules can actually be made. FBDD campaigns have historically suffered from beautiful in silico expansions that require seven-step syntheses to access. We incorporate the synthetic accessibility (SA) score, on a scale of 1–10 where 1 is most accessible, as a default filter: candidates with an SA score above 4.5 are deprioritized unless the predicted activity improvement is exceptional. Separately, we run a building-block availability check against Enamine REAL and a curated subset of commercial catalogs. If the expansion adds a substituted bicyclic that has no commercial analog within two synthetic steps, we flag it. Chemists can decide whether to include it, but they make that call with cost and time implications visible.
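A triage rule of this kind might look like the sketch below. It assumes SA scores and availability lookups have already been computed elsewhere; the `Candidate` fields, the `exceptional_gain` threshold of 2.0 log units, and the flag wording are all hypothetical choices for illustration. Only the 4.5 cutoff and the two-step limit come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    name: str
    sa_score: float                  # 1 (most accessible) to 10; assumed precomputed
    predicted_gain: float            # hypothetical: predicted pIC50 gain over the parent
    steps_to_analog: Optional[int]   # steps from nearest commercial analog; None if not found

def triage(c: Candidate, sa_cutoff: float = 4.5,
           exceptional_gain: float = 2.0, max_steps: int = 2) -> List[str]:
    """Return human-readable flags; an empty list means nothing to report."""
    flags = []
    # High SA score is overridden only by an exceptional predicted improvement.
    if c.sa_score > sa_cutoff and c.predicted_gain < exceptional_gain:
        flags.append("deprioritize: SA score above cutoff")
    # Availability is flagged, not filtered, so the chemist makes the call.
    if c.steps_to_analog is None or c.steps_to_analog > max_steps:
        flags.append("flag: no commercial analog within step limit")
    return flags

examples = [
    Candidate("easy", sa_score=3.1, predicted_gain=1.0, steps_to_analog=1),
    Candidate("hard", sa_score=5.2, predicted_gain=1.0, steps_to_analog=1),
    Candidate("hard_but_exceptional", sa_score=5.2, predicted_gain=2.5, steps_to_analog=None),
]
for c in examples:
    print(c.name, triage(c))
```

Returning flags rather than dropping candidates keeps the cost and time implications visible to the chemist, in line with the workflow described above.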

The combination of LE tracking, ML activity filtering, and SA score screening means that the 200–400 candidates we deliver to the chemistry team are not just computationally ranked; they are filtered for practical synthesizability. That distinction matters when your synthesis capacity is limited to a few dozen compounds per month.

The Chemist Is Still the Decision Point

We want to be direct about scope: ML-guided expansion changes the cost of generating and filtering hypotheses. It does not change who makes the decision about what to synthesize. A model that flags 300 candidates still requires a medicinal chemist to evaluate them for structural novelty, IP landscape, off-target risk profile, and project-specific constraints that no algorithm encodes. The chemist reviewing that ranked list will typically discard 50–60% of it for reasons the model could not anticipate.

That is not a failure of the model. It reflects the correct division of labor: the algorithm explores adjacencies computationally so that the chemist's time is spent on judgment rather than enumeration. The goal is not automation—it is amplification of expert chemistry by removing the mechanical steps from the workflow.