Predicting ADMET Properties Before You Touch the Bench

The standard medicinal chemistry workflow treats ADMET evaluation as a sequential gate: find binders, then test pharmacokinetics. This made sense when property prediction was unreliable enough that experimental measurement was the only defensible data. That calculus has shifted. Models trained on ChEMBL activity data and multi-lab assay compilations now predict several ADMET endpoints with error comparable to the test-retest variability of the assays themselves. Delaying property evaluation until after synthesis is therefore a resource allocation choice, not a scientific necessity.

We model 11 ADMET properties in our hit identification pipeline. This piece is about how we approach each one, where the models are reliable, and where they are not.

The 11 Dimensions and Why Each Was Chosen

Property selection was not arbitrary. Each of the 11 endpoints was included because it either addresses a dominant cause of clinical attrition (metabolic instability, hERG-mediated cardiac toxicity, poor bioavailability) or provides the mechanistic context needed to interpret other properties. They are:

Aqueous solubility (kinetic) — DMSO stock dilution artifacts are a persistent problem in biochemical assays; predicted solubility flags compounds likely to precipitate before the assay can measure anything meaningful.
Lipophilicity (LogP / LogD at pH 7.4) — Controls membrane permeability, protein binding, and metabolic exposure simultaneously. The single most predictive descriptor for compound tractability.
Human intestinal absorption (Caco-2 P_app) — Predicted efflux ratios above 2 and A-to-B P_app below 5 × 10^-6 cm/s flag permeability concerns early.
Blood-brain barrier penetration (BBB score) — Mandatory for CNS programs; actively filtered against for peripheral targets where CNS exposure would increase liability.
Plasma protein binding (f_u,plasma) — Modeled as fraction unbound. Compounds with f_u < 0.01 are flagged; not excluded, but the pharmacokinetic implication for dose projection needs to be explicit.
Microsomal stability (HLM T_1/2) — Human liver microsome half-life. A predicted half-life below 20 minutes (at 1 µM compound concentration) is a red flag unless there is a clear structural rationale for metabolic blockade.
CYP3A4 inhibition — The highest-volume cytochrome P450 isoform in drug metabolism; predicted IC₅₀ below 3 µM is clinically relevant for DDI risk modeling.
CYP2D6 inhibition — Narrower substrate scope than CYP3A4 but responsible for first-pass metabolism of many CNS drugs; particularly important if co-administration with psychiatric medications is plausible.
hERG channel inhibition — Predicted IC₅₀ at hERG K_v11.1 below 1 µM is a hard flag. QTc prolongation liability has ended more programs in late-stage development than almost any other single mechanism.
Oral bioavailability (F%) — Modeled as a combined function of solubility, permeability, and first-pass extraction. Predictions above 20% are considered acceptable for most oral programs; below 10% triggers a structural review.
Synthetic accessibility (SA score) — Not traditionally grouped with ADMET but treated here as an equivalent gate. An SA score above 5.5 (on the 1–10 RDKit scale, where 10 is least accessible) is associated with poor CRO quote success. We treat this as a first-class filter, not a post-hoc annotation.
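
The flag thresholds quoted in the list above can be gathered into one function. What follows is a minimal sketch, not the pipeline's actual implementation: the `AdmetPrediction` container, its field names, and `flag_compound` are hypothetical names introduced for illustration, and only the numeric thresholds come from the text. It also assumes the two Caco-2 criteria are raised as independent flags.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AdmetPrediction:
    """Hypothetical container for a subset of the predicted endpoints."""
    efflux_ratio: float      # Caco-2 efflux ratio (B-to-A over A-to-B)
    papp_ab_cm_s: float      # Caco-2 A-to-B P_app, in cm/s
    fu_plasma: float         # fraction unbound in plasma
    hlm_t_half_min: float    # human liver microsome half-life, minutes
    cyp3a4_ic50_um: float    # predicted CYP3A4 IC50, µM
    herg_ic50_um: float      # predicted hERG IC50, µM
    oral_f_pct: float        # predicted oral bioavailability, percent
    sa_score: float          # synthetic accessibility, 1 (easy) to 10 (hard)


def flag_compound(p: AdmetPrediction) -> List[str]:
    """Return the name of every threshold from the text that the compound trips."""
    checks = [
        ("caco2_efflux", p.efflux_ratio > 2.0),
        ("caco2_papp", p.papp_ab_cm_s < 5e-6),
        ("ppb_low_fu", p.fu_plasma < 0.01),        # flagged, not excluded
        ("hlm_stability", p.hlm_t_half_min < 20.0),
        ("cyp3a4", p.cyp3a4_ic50_um < 3.0),
        ("herg", p.herg_ic50_um < 1.0),            # hard flag
        ("oral_f_review", p.oral_f_pct < 10.0),    # triggers structural review
        ("sa_score", p.sa_score > 5.5),
    ]
    return [name for name, tripped in checks if tripped]
```

Returning flag names rather than a single pass/fail boolean keeps the distinction the list draws between flags that exclude a compound and flags that only require an explicit justification.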

How the Models Are Built

We use graph neural network (GNN) encoders that operate on molecular graphs — atoms as nodes, bonds as edges, with 3D conformation-derived features where available. The GNN architecture is an attentive message-passing network related to the Attentive FP and MPNN families, though our implementation has been modified to handle multi-task prediction across the 11 endpoints simultaneously rather than training separate models per property.

Multi-task learning matters here because ADMET properties are correlated. LogP is mechanistically upstream of HLM stability, BBB penetration, and protein binding. Training a model that learns these dependencies jointly produces better-calibrated predictions than training 11 independent regressors, particularly for compounds structurally distant from the training distribution.

Training data for each endpoint draws primarily from ChEMBL (release 33), with supplementary data from literature compilations for the toxicity endpoints (hERG, CYP inhibition). Confidence intervals on each prediction are calibrated using conformal prediction. Reporting only model RMSE on a held-out test set is the standard practice, but it gives no per-compound uncertainty information.
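
To make the calibration step concrete, here is a minimal sketch of one conformal variant: split conformal prediction with normalized residuals. The text does not specify which variant the pipeline uses, so this choice is an assumption. The `difficulty` input (for example, an ensemble spread per compound) and all function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # 1-indexed rank into sorted scores
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def calibrate(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    difficulty: Sequence[float],
    alpha: float = 0.1,
) -> float:
    """Compute the scaling factor q from a held-out calibration set.

    Each score is the absolute residual divided by a per-compound
    difficulty estimate (assumed positive), so harder compounds are
    allowed larger errors.
    """
    scores = [abs(t - p) / d for t, p, d in zip(y_true, y_pred, difficulty)]
    return conformal_quantile(scores, alpha)


def predict_interval(y_pred: float, difficulty: float, q: float) -> Tuple[float, float]:
    """Interval for a new compound; its width scales with that compound's difficulty."""
    half_width = q * difficulty
    return (y_pred - half_width, y_pred + half_width)
```

Because the half-width is `q * difficulty`, two compounds with the same point prediction can receive different intervals, which is the per-compound information a single test-set RMSE cannot provide.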

Where the Models Are Reliable and Where They Are Not

Predictions are most reliable for compounds structurally similar to the training data. ECFP4 Tanimoto similarity to the nearest training set neighbor above 0.4 correlates with calibrated confidence intervals in the range we would consider usable for prioritization (typically ± 0.3–0.5 log units for LogP, ± 0.4 log units for HLM T_1/2). Below Tanimoto 0.2, confidence intervals widen substantially and the prediction is better treated as a structural flag than a quantitative estimate.
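
The similarity check described above can be sketched in a few lines. This is an illustration under stated assumptions: fingerprints are represented as sets of on-bit indices (in practice, ECFP4 bits would come from a cheminformatics toolkit such as RDKit), the function names are hypothetical, and the "intermediate" tier for similarities between 0.2 and 0.4 is an assumption, since the text only characterizes the two ends.

```python
from typing import FrozenSet, Iterable


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_neighbor_similarity(
    query: FrozenSet[int], training: Iterable[FrozenSet[int]]
) -> float:
    """Highest Tanimoto similarity between the query and any training compound."""
    return max((tanimoto(query, fp) for fp in training), default=0.0)


def reliability_tier(similarity: float) -> str:
    """Map nearest-neighbor similarity to how the prediction should be read."""
    if similarity > 0.4:
        return "quantitative"      # intervals usable for prioritization
    if similarity < 0.2:
        return "structural_flag"   # treat as a flag, not a number
    return "intermediate"          # assumption: not characterized in the text
```

A linear scan over the training set is fine for a sketch; at ChEMBL scale, a bulk similarity routine or an index would be the practical choice.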

The endpoints with the most reliable predictions across scaffold classes are aqueous solubility, LogP, and hERG inhibition. The endpoints with the most variable performance in scaffold-novel space are oral bioavailability and Caco-2 permeability — both require integrating multiple upstream properties, and errors compound.

We are not claiming the models replace wet assays. They do not. A predicted HLM T_1/2 of 45 minutes is a prioritization signal, not a confirmed value. The appropriate workflow is: use computational ADMET to rank and filter a large candidate set, then run experimental assays on the top 20–30 candidates. The goal is not to replace the assay but to avoid paying for assays on the bottom 95% of the candidate set.

A Concrete Example: Filtering a 2,400-Candidate Output Set

In a recent program targeting a kinase with known selectivity constraints, we generated 2,400 candidates from a generative run constrained to pIC₅₀ > 7.0 against the target kinase and selectivity against a related off-target above 100-fold. After binding affinity ranking, the 2,400 compounds were passed through the 11-property ADMET filter with the following hard cutoffs applied:

LogP: 1.0–4.5 (CNS MPO-informed range for the target tissue)
HLM T_1/2: > 30 minutes predicted
hERG IC₅₀: predicted > 3 µM
CYP3A4 IC₅₀: predicted > 10 µM
SA score: < 4.5
Predicted oral F%: > 15%
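
The cutoffs above translate directly into predicates. The sketch below is illustrative only: the dictionary keys, `HARD_CUTOFFS`, and `apply_hard_filters` are hypothetical names, and each compound is assumed to be a plain dict of predicted values. Alongside the survivors, it tallies compounds that fail exactly one criterion, which is the kind of count that exposes a systematic bias in the generator.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Compound = Dict[str, float]

# Each predicate returns True when the compound PASSES the cutoff.
HARD_CUTOFFS: Dict[str, Callable[[Compound], bool]] = {
    "logp": lambda c: 1.0 <= c["logp"] <= 4.5,
    "hlm_t_half": lambda c: c["hlm_t_half_min"] > 30.0,
    "herg": lambda c: c["herg_ic50_um"] > 3.0,
    "cyp3a4": lambda c: c["cyp3a4_ic50_um"] > 10.0,
    "sa_score": lambda c: c["sa_score"] < 4.5,
    "oral_f": lambda c: c["oral_f_pct"] > 15.0,
}


def apply_hard_filters(compounds: List[Compound]) -> Tuple[List[Compound], Counter]:
    """Return the survivors and a count of compounds removed by a single criterion."""
    survivors: List[Compound] = []
    sole_failures: Counter = Counter()
    for c in compounds:
        failed = [name for name, passes in HARD_CUTOFFS.items() if not passes(c)]
        if not failed:
            survivors.append(c)
        elif len(failed) == 1:
            sole_failures[failed[0]] += 1  # removed by this criterion alone
    return survivors, sole_failures
```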

Of the 2,400 initial candidates, 1,847 were removed by at least one criterion. Of those, 892 failed on LogP alone: the generative engine, constrained on binding affinity and selectivity, had drifted slightly toward higher-lipophilicity scaffolds. After applying all filters, 347 compounds remained. These were re-ranked by Pareto optimality across the remaining five properties, yielding a prioritized set of 48 compounds recommended for synthesis. The top 20 had predicted pIC₅₀ above 7.5, HLM T_1/2 above 60 minutes, and SA scores below 3.8.
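
Pareto re-ranking can be illustrated with a simple front-peeling routine. This is a sketch, not the ranking code used in the program: it assumes every objective has been oriented so that higher is better (for example, by negating a property that should be minimized), and the function names are hypothetical.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_fronts(points: List[Sequence[float]]) -> List[List[int]]:
    """Peel successive non-dominated fronts; returns compound indices, best front first."""
    remaining = list(range(len(points)))
    fronts: List[List[int]] = []
    while remaining:
        # A point belongs to the current front if nothing still remaining dominates it.
        front = [
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        ]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts
```

Compounds in the first front are those for which no other candidate is better on every property at once. Trimming the ranked fronts down to a fixed-size synthesis list needs a tie-breaking rule within a front, which this sketch leaves open.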

That is the workflow. Not a replacement for the assay — a decision about which compounds earn the assay.

The Case Against Over-Filtering Early

One risk in aggressive computational ADMET filtering is over-exclusion. If the property cutoffs are too tight, you can filter out scaffolds that have real promise but score poorly on one endpoint due to model uncertainty rather than genuine liability. We have seen programs where a medicinal chemist's instinct would have retained a compound that the model excluded at hERG — and the experimental hERG IC₅₀ turned out to be 8 µM, not the predicted 2 µM.

This is the right tension to maintain. Computational filters should expand the candidate set you can evaluate, not shrink it below what a skilled medicinal chemist would select by inspection. The practical rule we use: hard filters are reserved for endpoints with the highest model reliability and the highest clinical consequence if wrong (hERG, CYP3A4, solubility). Softer properties like oral bioavailability get penalty weights in Pareto ranking rather than hard exclusion thresholds. The chemist should always review the ranked list before the final synthesis queue is locked.