Getting CROs to Synthesize Computationally Generated Candidates

The gap between "our generative model proposed 300 candidates" and "we have 24 synthesized compounds ready for assay" is where most computational drug discovery programs lose weeks they can't spare. Synthesis CROs are efficient at making molecules from traditional medicinal chemistry output—close analogs of known scaffolds, substituent sweeps, simple SAR matrices. Computationally generated structures often look different: unusual ring systems, non-obvious heteroatom placement, scaffolds with no close commercial analogs. Getting a CRO to synthesize them efficiently requires deliberate preparation on your end.

This is a practical guide based on what we've learned the hard way about the handoff from computational output to physical synthesis, written for teams where the chemist and the computational team are sometimes the same two people.

Sanitize Your SMILES Before Sending Anything

This sounds elementary, but it is the most common source of avoidable delay. Generative models produce SMILES strings that may be chemically valid by graph rules but still carry hidden problems: tautomers in ambiguous states, unspecified stereocenters, radicals left over from mishandled bond disconnections, or notations that different software packages interpret differently. When you send a SMILES to a CRO and their cheminformatics pipeline ingests it differently than yours did, the compound they quote on is not the compound you want.

Before any submission, canonicalize every SMILES with RDKit by parsing it with Chem.MolFromSmiles and writing it back out with Chem.MolToSmiles. Check for unspecified stereocenters: if a compound has a chiral center and you do not specify R/S, you are asking the CRO to quote on a racemate or stereoisomer mixture, and you risk receiving the wrong stereoisomer for a binding mode that is stereospecific. Run a PAINS (pan-assay interference compounds) filter and a reactive-group filter such as REOS on the cleaned list; catching aggregators, redox-active compounds, and reactive functional groups before quoting saves everyone time. Our standard pre-submission pipeline at Nanolix runs: SMILES canonicalization → stereo assignment check → PAINS/REOS substructure filter → removal of structures with SA score > 4.5 → duplicate check against the existing compound registry. A 300-compound generative list typically reduces to 180–220 clean, uniquely specified structures after this pass.
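A minimal sketch of such a pre-submission pass is shown here, assuming RDKit is installed. The function name clean_candidates, the reason strings, and the choice to pass SA scores in as a precomputed dictionary are illustrative assumptions, not a description of any production pipeline. Only the PAINS catalog that ships with RDKit is applied; a REOS-style filter would need its own substructure definitions.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

SA_CUTOFF = 4.5  # structures scoring above this are removed


def _pains_catalog():
    # Build RDKit's bundled PAINS substructure catalog.
    params = FilterCatalogParams()
    params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
    return FilterCatalog(params)


def clean_candidates(smiles_list, sa_scores=None, registry=()):
    """Return (kept, rejected).

    kept     -- canonical SMILES that passed every check, in input order
    rejected -- dict mapping each dropped input SMILES to a reason string
    sa_scores is a hypothetical {input SMILES: SA score} mapping computed
    elsewhere; registry holds canonical SMILES that are already registered.
    """
    sa_scores = sa_scores or {}
    seen = set(registry)
    pains = _pains_catalog()
    kept, rejected = [], {}
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            rejected[smi] = "unparseable"
            continue
        canonical = Chem.MolToSmiles(mol)
        # Unassigned stereocenters are reported with the label "?".
        centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
        if any(label == "?" for _, label in centers):
            rejected[smi] = "unassigned stereocenter"
        elif Descriptors.NumRadicalElectrons(mol) > 0:
            rejected[smi] = "radical"
        elif pains.HasMatch(mol):
            rejected[smi] = "PAINS match"
        elif sa_scores.get(smi, 0.0) > SA_CUTOFF:
            rejected[smi] = "SA score above cutoff"
        elif canonical in seen:
            rejected[smi] = "duplicate"
        else:
            seen.add(canonical)
            kept.append(canonical)
    return kept, rejected
```

Structures rejected for an unassigned stereocenter are candidates for manual stereo assignment rather than permanent removal, which is why the sketch keeps the reason alongside each dropped SMILES.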

Provide a Synthetic Rationale, Not Just Structures

Experienced synthesis CRO chemists will look at a SMILES and immediately form an opinion about feasibility. If your computationally generated compound has a non-obvious but feasible retrosynthetic path, make it explicit. Send a proposed route, even if it is rough. There are two reasons. First, it demonstrates that the structure is not a computational artifact but something you have thought about as a real molecule. Second, it gives the CRO chemist a starting point that may be faster than what they would find from scratch, and CROs will quote lower on structures where they have a clear path than on structures that require significant route-scouting time.

For compounds where you have no good proposed route, say so explicitly in the brief. Ask for a feasibility assessment before full quotation. "Please assess feasibility and estimated step count for the 40 structures flagged in column F before providing a full synthesis quote" is a much more efficient request than asking for a full quote on 40 structures, some of which will come back as infeasible at 8+ steps. A good CRO will appreciate the segmented request; it takes less quoting effort on their end.

Understand the SA Score Threshold Practically

The synthetic accessibility (SA) score is widely used in computational workflows, but it is a heuristic, and it frequently flags unusual-but-feasible structures as hard to make. Macrocycles, spirocycles, and highly substituted heterocycles tend to score poorly not because they are genuinely inaccessible but because they fall outside the fragment frequency distribution the score was derived from. We have had compounds with an SA score of 5.8 that a synthesis CRO quoted on without issue because they had a clear amide coupling + Buchwald-Hartwig route. We have also had compounds with an SA score of 3.2 that came back infeasible because of a specific stereochemistry constraint the SA score does not capture.

Use SA score as a first-pass filter, not a hard gate. Set the cutoff permissively (higher scores mean harder synthesis; we remove structures scoring above 4.5 rather than above 3.5) and let the CRO feasibility assessment handle the boundary cases. Do not reject a compelling compound purely on SA score without asking a chemist to look at the route first.
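One way to encode "first-pass filter, not a hard gate" is to route compounds into buckets instead of returning a plain keep/drop decision. The sketch below is only an illustration: the function name triage_by_sa, the bucket labels, and the 0.5-wide boundary band are assumptions; only the 4.5 cutoff comes from the text.

```python
def triage_by_sa(sa_score, compelling=False, cutoff=4.5, band=0.5):
    """Route a compound by SA score (higher score = harder synthesis).

    compelling -- hypothetical flag set by the team for compounds they do
                  not want to lose on SA score alone
    band       -- assumed width of the boundary zone just below the cutoff
    """
    if sa_score <= cutoff - band:
        return "quote"                    # comfortably below the cutoff
    if sa_score <= cutoff:
        return "feasibility_assessment"   # boundary case: ask the CRO first
    # Above the cutoff: only drop if nobody has argued for the compound.
    return "chemist_review" if compelling else "drop"
```

For example, triage_by_sa(5.8, compelling=True) returns "chemist_review" rather than "drop", so a compound with a high score but a clear route still gets a human look.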

Request Purity Specifications That Match Your Assays

Standard CRO synthesis output is often delivered at ≥90% purity by LCMS. For most binding and cellular assays, this is adequate. For SPR or ITC, where accurate affinity measurement is the point, a 10% impurity can create artifacts in concentration-response analysis, particularly if the impurity is a close structural analog with different potency. Specify your required purity threshold upfront: requesting ≥95% or ≥98% for a subset of priority compounds does not dramatically increase cost, and it keeps the assay data interpretable.

Equally important: request counterion information for any compound that will be tested in salt form. Whether a compound arrives as a hydrochloride salt, a TFA salt, or the free base can affect solubility and assay behavior. For compounds with predicted low aqueous solubility (kinetic solubility < 20 µg/mL in our predictions), we flag them for CRO-side solubility testing before shipping; ask the CRO to include a nephelometry or thermodynamic solubility check at delivery if this is a concern.
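The purity and solubility rules above can be written down as a per-compound delivery spec so that nothing depends on someone remembering to ask. This is a sketch under stated assumptions: the DeliverySpec fields, the delivery_spec function, and the compound IDs in the usage note are hypothetical, and ≥95% is used as the high-purity tier (≥98% would be a parameter change).

```python
from dataclasses import dataclass

AFFINITY_ASSAYS = {"SPR", "ITC"}     # assays where impurities distort affinity
LOW_SOLUBILITY_UG_PER_ML = 20.0      # predicted kinetic solubility flag


@dataclass
class DeliverySpec:
    compound_id: str
    min_purity_pct: int
    report_counterion: bool
    solubility_check: bool


def delivery_spec(compound_id, assays, predicted_solubility_ug_ml,
                  tested_as_salt=False, high_purity_pct=95,
                  default_purity_pct=90):
    """Build the delivery requirements for one compound."""
    needs_high_purity = any(a.upper() in AFFINITY_ASSAYS for a in assays)
    return DeliverySpec(
        compound_id=compound_id,
        min_purity_pct=high_purity_pct if needs_high_purity else default_purity_pct,
        report_counterion=tested_as_salt,
        solubility_check=predicted_solubility_ug_ml < LOW_SOLUBILITY_UG_PER_ML,
    )
```

Calling delivery_spec("NX-001", ["SPR"], 12.0, tested_as_salt=True) yields a spec with 95% minimum purity, a counterion report, and a solubility check at delivery.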

Structuring the Batch for Maximum Learning

One practical recommendation that has significantly improved our hit rates in the first synthesis round: structure your synthesis batch around chemical diversity across the generative candidates rather than around rank-ordered predicted activity. Pure activity-rank selection concentrates your synthesis budget on a narrow chemical region of the predicted landscape. If the model is wrong about that region—and it will sometimes be wrong—you learn nothing from the whole batch. Diversity-selected batches, even if the average predicted score is slightly lower, give you SAR information across multiple chemotypes simultaneously, which accelerates the next round of computational design.

We typically select the top 40 compounds by a coverage-diversity objective: maximize predicted activity while requiring a minimum Tanimoto distance of 0.4 between any two selected compounds. This ensures that the first synthesis round answers multiple structural hypotheses in parallel. When we adopted this selection strategy, our hit rate in round 1 (defined as ≥1 active compound with IC50 < 10 µM per chemotype cluster) went from about 40% of clusters to about 65% of clusters.
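A simple greedy version of this selection can be sketched as follows. It is one possible way to satisfy a minimum-distance constraint, not necessarily the objective we optimize in practice. Fingerprints are represented as plain sets of on-bit indices so the example has no dependencies; in a real workflow they would come from a fingerprinting tool such as RDKit. The names select_diverse and tanimoto_distance are illustrative.

```python
def tanimoto_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of fingerprint on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def select_diverse(candidates, n_select=40, min_distance=0.4):
    """Greedy pick: walk candidates from best to worst predicted score and
    keep one only if it is at least min_distance from everything kept so far.

    candidates -- iterable of (compound_id, predicted_score, fingerprint_bits),
                  where a higher predicted_score means more promising
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    selected = []
    for cid, score, bits in ranked:
        if len(selected) >= n_select:
            break
        if all(tanimoto_distance(bits, kept_bits) >= min_distance
               for _, _, kept_bits in selected):
            selected.append((cid, score, bits))
    return [cid for cid, _, _ in selected]
```

The greedy walk always keeps the top-scoring compound and then trades predicted score for distance, so a near-duplicate of an already selected compound is skipped even when it ranks second overall.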

Handling CRO Rejections Computationally

Expect 15–25% of computationally generated candidates to come back from feasibility assessment as infeasible or cost-prohibitive (>6 steps, no commercial intermediates). This is not a failure—it is information. When a compound is rejected, ask the CRO what the specific obstacle is: the leaving group chemistry on a specific ring position, a stereochemical constraint, an unstable intermediate. That answer feeds back into your generative model's synthetic accessibility filter and makes the next round of proposals more practically buildable.

We maintain a rejected-structures registry annotated with failure reason. Over time, this becomes a de facto CRO-grounded synthesizability constraint that we can encode as a substructure filter or use to fine-tune the SA score weighting in our model. The first time a compound class fails at synthesis, it's unexpected. The fifth time a variant of the same problematic motif fails, it should have been prevented upstream.
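A registry like this does not need much machinery. The sketch below assumes each rejection is tagged by a chemist with a free-text motif label and the CRO's stated reason; the class name, method names, and the default of blocking after four recorded failures are illustrative assumptions.

```python
from collections import defaultdict


class RejectionRegistry:
    """Track CRO rejections per motif and report motifs worth blocking."""

    def __init__(self, block_after=4):
        # block_after: number of recorded failures at which a motif should
        # be filtered upstream (an assumed default, tune to taste).
        self.block_after = block_after
        self._by_motif = defaultdict(list)

    def record(self, smiles, motif, reason):
        """Store one rejected structure with its motif label and reason."""
        self._by_motif[motif].append((smiles, reason))

    def failure_count(self, motif):
        return len(self._by_motif.get(motif, []))

    def motifs_to_block(self):
        """Motifs that have failed often enough to filter before submission."""
        return sorted(m for m, entries in self._by_motif.items()
                      if len(entries) >= self.block_after)
```

The output of motifs_to_block() is the list to translate into substructure filters for the next generative round; the stored reasons are what make that translation possible.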

The IP Consideration at the CRO Stage

Computationally generated structures in novel chemical space may have IP value that is not yet protected. Standard CRO agreements typically include a compound IP clause, so read it carefully. Most synthesis-only CRO agreements stipulate that the client retains IP on the target compounds and that the CRO retains no rights to the structures themselves. However, some CROs provide "design + synthesis" packages where they contribute structural modifications; in those cases, ownership of the derivatives can be ambiguous.

For computationally generated candidates specifically, we recommend ensuring your CRO agreement explicitly covers the structures you submit (not just "compounds synthesized according to client specification"), and that any feasibility assessment or route-scouting work they do does not create shared IP claims on the final synthesis products. This is not an academic concern for generative chemistry output—novel scaffolds from computational design are exactly the class of structures where IP clarity matters.