How to Interpret Confidence Intervals in Computational ADMET Predictions

A common moment in any computational campaign: a generative model proposes a candidate with predicted Caco-2 permeability of 18 nm/s and predicted hERG IC50 of 8.4 µM. Both numbers look acceptable on their face. But the model also reports a 90% confidence interval on hERG of [2.1 µM, 34 µM]. That interval spans more than an order of magnitude. The point estimate is not the signal—the interval is.

This post is about how to read ADMET prediction uncertainty in practice: what the confidence interval actually means for different model types, which ADMET endpoints tolerate wide intervals and which do not, and when a wide CI should stop a candidate rather than just flag it.

What the Confidence Interval Actually Represents

ADMET prediction models generally produce uncertainty estimates through one of three mechanisms: ensemble variance (train N models, report mean ± spread), Bayesian posterior estimation (explicitly model the posterior over predictions), or conformal prediction (provide coverage-guaranteed prediction sets without assuming a specific distribution). These three approaches are not equivalent. Ensemble variance is easy to implement but tends to underestimate uncertainty in sparse chemical regions. Bayesian posteriors can be well-calibrated but require careful prior specification and are computationally expensive for large molecular libraries. Conformal prediction provides the strongest statistical guarantees: as long as new compounds are exchangeable with the calibration data, the stated 90% coverage holds on average across predictions. The price is that the intervals can become wide in unfamiliar chemical space, which is exactly where you most need guidance.
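As a rough illustration of the conformal approach, here is a minimal split-conformal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the calibration pairs are synthetic, the function names are hypothetical, the nonconformity score is the plain absolute residual, and values are in pIC50 units (an IC50 of 8.4 µM is a pIC50 of about 5.08). It is not a description of any particular platform's method.

```python
import math
import random

def conformal_quantile(abs_residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals."""
    n = len(abs_residuals)
    # Split conformal uses the k-th smallest residual, k = ceil((n + 1)(1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(abs_residuals)[k - 1]

def conformal_interval(prediction, q):
    """Symmetric interval around a point prediction."""
    return prediction - q, prediction + q

# ASSUMPTION: synthetic (predicted, measured) pIC50 pairs standing in for a
# held-out calibration set that the model never saw during training.
rng = random.Random(0)
calibration = []
for _ in range(199):
    predicted = rng.uniform(4.0, 7.0)
    measured = predicted + rng.gauss(0.0, 0.5)
    calibration.append((predicted, measured))

residuals = [abs(measured - predicted) for predicted, measured in calibration]
q90 = conformal_quantile(residuals, alpha=0.10)

# Interval for a new compound with predicted hERG pIC50 of 5.08.
low, high = conformal_interval(5.08, q90)
print(f"90% interval in pIC50 units: [{low:.2f}, {high:.2f}]")
```

Note that this basic variant gives every compound the same interval width. Variants that scale the residual by a per-compound difficulty estimate produce adaptive widths, but the coverage guarantee still depends on the exchangeability assumption.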

When a platform reports a 95% CI, the first question to ask is: how was that interval derived, and is it calibrated? Calibration means that, across all predictions for which the model reports a 95% CI, roughly 95% of the measured values fall within the reported interval. A model can report narrow intervals that are wildly overconfident, which looks precise in reports and fails in the wet lab. We report calibration curves alongside our predictions because an interval is only meaningful if its coverage has been validated empirically on held-out data from the same chemical space.
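The calibration check described above comes down to counting how often measured values land inside their reported intervals. The sketch below shows that count on a toy held-out set. The intervals and measured values are invented for illustration, and five compounds is far too few for a real calibration assessment.

```python
def empirical_coverage(intervals, measured):
    """Fraction of measured values that fall inside their reported interval."""
    hits = sum(1 for (low, high), y in zip(intervals, measured) if low <= y <= high)
    return hits / len(measured)

# ASSUMPTION: hypothetical held-out compounds with reported 90% intervals
# on hERG IC50 (uM) and the values later measured for them.
intervals = [(2.1, 34.0), (14.0, 38.0), (3.2, 45.0), (0.5, 4.0), (10.0, 60.0)]
measured = [5.0, 41.0, 12.0, 1.1, 9.0]

coverage = empirical_coverage(intervals, measured)
print(f"nominal 0.90, empirical {coverage:.2f}")
```

If the empirical figure on a reasonably sized held-out set sits well below the nominal level, the intervals are overconfident. Repeating the check at several nominal levels gives the points of a calibration curve.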

ADMET Endpoints Differ in Tolerance for Uncertainty

Not all ADMET properties carry the same decision weight, and the acceptable CI width varies accordingly.

hERG inhibition is the least tolerant of wide uncertainty. Regulatory guidance (ICH S7B) expects hERG effects to be characterized early in development, and a compound with a predicted hERG IC50 of 8 µM and a CI of [2, 34] µM straddles the boundary between safe and unsafe: which side it lands on depends on the realized value. At 2 µM, you have a potential QTc liability. At 34 µM, you have comfortable margin. That ambiguity cannot be resolved without a wet measurement; the compound should go to a patch-clamp assay before progressing rather than advance on the point estimate alone.

Microsomal clearance (intrinsic clearance in liver microsomes, for example via CYP3A4 or CYP2D6) is somewhat more tolerant of uncertainty in early hit identification, because high clearance is a soft flag rather than a hard stop: you can often mitigate it through structural changes if you know which metabolic soft spot is being attacked. A CI spanning moderate to high clearance still tells you something actionable: investigate metabolic stability, starting with the likely site of oxidation.

Aqueous solubility predictions carry intrinsic uncertainty from the measurement protocol itself (kinetic vs. thermodynamic solubility assays differ by 10-fold for some chemotypes), which makes the CI partially irreducible at the prediction stage. Wide CIs here are often less diagnostic than narrow confident predictions in the clearly insoluble range.

BBB permeability is a binary-class endpoint for many models, but where continuous P-glycoprotein efflux or CNS penetration predictions are available, CI width matters significantly—especially for CNS programs where the entire rationale for the drug rests on brain exposure.

Out-of-Distribution Detection Is More Important Than the CI Itself

For any ADMET model, the training data defines the applicability domain. A compound that falls outside the training distribution will produce a CI that is misleading regardless of its stated width: the model is extrapolating, and the calibration guarantee no longer holds. This is a subtle but critical distinction: a well-calibrated conformal predictor is calibrated only for compounds that resemble the data it was calibrated on. Novel chemotypes, macrocycles, and PROTAC-like structures with high MW and unusual topological features may receive seemingly reasonable CIs while actually sitting far outside the applicability domain.

We compute the nearest-neighbor Tanimoto distance (1 minus Tanimoto similarity, on ECFP4 fingerprints) from every query compound to the training set. For a compound whose distance to its closest training analog exceeds 0.6, we add an out-of-distribution flag independent of the CI. That flag carries more weight than the interval width in our internal triage: an OOD compound with a narrow CI is more suspicious than an in-distribution compound with a wide CI, because the narrow CI may be artifactual. Chemists generally grasp this distinction once it is explained: a model can't know what it doesn't know, but it can at least quantify its proximity to what it has seen.
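The nearest-neighbor check can be sketched in a few lines. To stay self-contained, this example represents each fingerprint as a plain Python set of on-bit indices. In a real pipeline those sets would come from a cheminformatics toolkit computing ECFP4-style fingerprints, and the toy bit sets and function names here are purely illustrative assumptions. Only the 0.6 threshold is taken from the text.

```python
def tanimoto_distance(bits_a, bits_b):
    """1 - |A & B| / |A | B| for two sets of fingerprint on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints: treat as identical
    return 1.0 - len(a & b) / union

def nearest_training_distance(query_bits, training_fps):
    """Distance from the query to its closest training-set analog."""
    return min(tanimoto_distance(query_bits, fp) for fp in training_fps)

def is_ood(query_bits, training_fps, threshold=0.6):
    """Flag the query as out-of-distribution if its nearest analog is too far."""
    return nearest_training_distance(query_bits, training_fps) > threshold

# ASSUMPTION: toy on-bit sets standing in for real ECFP4 fingerprints.
training_fps = [{1, 2, 3, 4}, {3, 4, 5, 6}]
familiar_query = {1, 2, 3, 7}   # shares 3 of 5 bits with the first analog
novel_query = {3, 8, 9, 10}     # shares 1 of 7 bits with either analog

print(nearest_training_distance(familiar_query, training_fps), is_ood(familiar_query, training_fps))
print(nearest_training_distance(novel_query, training_fps), is_ood(novel_query, training_fps))
```

The brute-force scan over the training set is fine for a sketch. For large training sets, bulk similarity routines or an approximate nearest-neighbor index would be the more practical choice.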

A Worked Decision Example

Consider two candidates from a CNS program targeting a GPCR, with predicted properties as follows:

Compound A: Predicted pIC50 7.2 [6.9–7.5]; predicted hERG IC50 22 µM [14–38 µM]; predicted Caco-2 permeability 24 nm/s [18–32 nm/s]; Tanimoto distance to training set 0.31 (in-distribution).

Compound B: Predicted pIC50 7.5 [6.8–8.1]; predicted hERG IC50 12 µM [3.2–45 µM]; predicted Caco-2 permeability 21 nm/s [9–51 nm/s]; Tanimoto distance to training set 0.58 (borderline OOD).

On raw point estimates, Compound B looks slightly more potent. But the hERG CI for B has a lower bound (3.2 µM) in concerning territory, and its Caco-2 CI is wide enough that the compound might pass or fail permeability depending on which assay you run and on what day. Combined with the higher Tanimoto distance, that makes Compound B the higher-risk synthesis target. Compound A has tighter intervals, sits comfortably within the applicability domain, and has an hERG margin that erodes meaningfully only at the lower CI bound (14 µM). The decision to synthesize A before B is well supported. B should follow, with the explicit acknowledgment that more of its uncertainty will only be resolved at first assay and that the compound may not behave as predicted.

When to Stop Rather Than Flag

We are not saying wide confidence intervals mean a compound should never be synthesized—that would eliminate most generative candidates, which by definition explore novel chemical space. What we are saying is that a wide CI on a safety endpoint, combined with an OOD flag, is grounds for deprioritization rather than flagging-and-proceeding. The distinction matters: flagging creates a list of cautions that often gets discounted under synthesis pressure. Deprioritization is a queue decision. It keeps the compound in scope but moves it below candidates where the safety signal is more interpretable.

For any compound where the lower bound of the hERG CI falls below 3 µM, we recommend wet hERG data before significant synthetic investment—full stop. The cost of a fluorescence-based hERG counter-screen is a few hundred dollars. Discovering cardiac liability at candidate selection, after 40 analogs have been synthesized around a scaffold, costs orders of magnitude more.
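The rules in this section can be written down as a small queue-decision function, which makes the difference between flagging and deprioritizing explicit. This is a sketch, not our production logic. The 0.6 distance cutoff and the 3 µM hERG floor come from the text, while the definition of a "wide" interval (upper bound more than 10 times the lower bound), the outcome labels, and Compound C are assumptions introduced for illustration.

```python
from dataclasses import dataclass

OOD_DISTANCE = 0.6    # from the text: nearest-neighbor distance above this is OOD
HERG_FLOOR_UM = 3.0   # from the text: lower CI bound below this needs wet hERG data
WIDE_FOLD = 10.0      # ASSUMPTION: "wide" means upper/lower bound ratio above 10

@dataclass
class Candidate:
    name: str
    herg_low_um: float
    herg_high_um: float
    nn_distance: float

def triage(c: Candidate) -> str:
    """Return a queue decision for one candidate (labels are hypothetical)."""
    if c.herg_low_um < HERG_FLOOR_UM:
        return "wet_herg_first"
    wide = c.herg_high_um / c.herg_low_um > WIDE_FOLD
    ood = c.nn_distance > OOD_DISTANCE
    if wide and ood:
        return "deprioritize"  # wide safety CI plus OOD flag: move down the queue
    if wide or ood:
        return "flag"          # one caution only: proceed, but record it
    return "proceed"

candidates = [
    Candidate("A", 14.0, 38.0, 0.31),  # Compound A from the worked example
    Candidate("B", 3.2, 45.0, 0.58),   # Compound B from the worked example
    Candidate("C", 4.0, 60.0, 0.72),   # hypothetical: wide CI and OOD
]
for c in candidates:
    print(c.name, triage(c))
```

Under these assumed thresholds, Compound B ends up flagged rather than deprioritized because its distance of 0.58 sits just under the OOD cutoff, which is a reminder that hard thresholds deserve a second look for borderline compounds.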

Reporting Conventions Matter

One practical note about how intervals are reported in computational ADMET tools: some platforms report prediction intervals (covering a future individual measurement), others report confidence intervals (covering the true mean value). These are not the same. A prediction interval is wider because it also includes measurement variability. For ADMET decision-making, you want prediction intervals, not confidence intervals, because you will be measuring individual compounds and comparing them to individual thresholds. If a platform reports only confidence intervals, the stated uncertainty understates the spread you will actually observe experimentally. Ask your platform how the interval is derived and what it covers. This is a reasonable question, and the answer should be specific.
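To see how much the distinction can matter, the sketch below compares the two half-widths under a simple normal approximation: the confidence interval uses only the uncertainty in the mean prediction, while the prediction interval adds assay variability in quadrature. The two standard deviations (in log10 units) are assumed values chosen for illustration, not measured figures, and a real model may not be well described by this approximation.

```python
from math import sqrt
from statistics import NormalDist

def half_widths(se_mean, sigma_assay, level=0.90):
    """Half-widths of a two-sided confidence and prediction interval.

    se_mean:     uncertainty in the model's mean prediction (log10 units)
    sigma_assay: standard deviation of a single measurement (log10 units)
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    ci_half = z * se_mean
    pi_half = z * sqrt(se_mean ** 2 + sigma_assay ** 2)
    return ci_half, pi_half

# ASSUMPTION: illustrative values, not taken from any real assay or model.
ci_half, pi_half = half_widths(se_mean=0.15, sigma_assay=0.30)

print(f"confidence interval: +/- {ci_half:.2f} log units")
print(f"prediction interval: +/- {pi_half:.2f} log units")
```

With these assumed inputs the prediction interval is a little more than twice as wide as the confidence interval, so a platform quoting only the latter would look considerably more certain than the experiment will bear out.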

The ADMET number on a screen is an estimate. The interval around it is an estimate of how much you don't know. Both are data, and both should drive decisions.