How Accurate Are GNN-Based Molecular Property Predictions in Practice?

Graph neural networks have become the de facto standard for molecular property prediction. On public benchmarks such as MoleculeNet, the Therapeutics Data Commons (TDC), and ChEMBL split evaluations, GNN variants consistently outperform fingerprint-based methods and earlier architectures. Commonly quoted figures include RMSE on logP prediction below 0.3, AUROC on Tox21 above 0.85, and RMSE on blood-brain barrier permeability (BBBP) in the range of 0.2–0.4 log units. The numbers look compelling.

The question worth asking, especially before building a drug discovery workflow around GNN predictions, is whether those numbers translate to deployment accuracy. Benchmark performance and production accuracy are not the same thing, and understanding the gap is essential for using GNN outputs responsibly.

Why Benchmark Performance Overestimates Deployment Accuracy

The core issue is data leakage through chemical similarity between training and test sets. Most molecular property benchmarks use random train/test splits. Random splits put structurally similar compounds in both sets, which inflates test-set performance because the model has effectively "seen" close analogs of the test compounds during training. When you deploy the model on genuinely novel chemical matter—scaffolds that don't appear in ChEMBL, generative candidates from a VAE, macrocycles not represented in the training distribution—the actual prediction accuracy is substantially lower than the benchmark suggests.

Scaffold-split evaluations address this partially: training and test sets are divided so that no Murcko scaffold appears in both. Scaffold splits consistently show lower model performance than random splits, often by 15–30% on AUROC for classification tasks. Time-split evaluations, where the training set contains compounds published before year Y and the test set contains compounds after year Y, are even more stringent and show further degradation. The MoleculeNet leaderboard numbers, largely derived from random splits, are best treated as upper bounds on what you'll see in practice.
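To make the scaffold-split idea concrete, here is a minimal sketch of a group-wise split. It assumes each compound already has a scaffold key (for example, a Murcko scaffold SMILES computed by a cheminformatics toolkit); the function name, the greedy largest-group-first assignment, and the toy data are illustrative assumptions, not a description of how any particular benchmark implements its split.

```python
from collections import defaultdict

def scaffold_split(scaffolds, train_fraction=0.8):
    """Split compound indices so that no scaffold key appears in both sets.

    `scaffolds` holds one precomputed scaffold string per compound
    (assumed to come from a cheminformatics toolkit; not computed here).
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest scaffold groups are offered to training first, so the test
    # set tends to end up holding the rarer scaffolds.
    ordered = sorted(groups.values(), key=len, reverse=True)

    train_cutoff = train_fraction * len(scaffolds)
    train, test = [], []
    for members in ordered:
        # A whole group goes to one side or the other, never both.
        if len(train) + len(members) <= train_cutoff:
            train.extend(members)
        else:
            test.extend(members)
    return train, test

# Toy scaffold keys (illustrative strings, not real data).
scaffolds = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1"] * 2
train_idx, test_idx = scaffold_split(scaffolds, train_fraction=0.8)
train_scaffolds = {scaffolds[i] for i in train_idx}
test_scaffolds = {scaffolds[i] for i in test_idx}
print(len(train_idx), len(test_idx), train_scaffolds & test_scaffolds)
```

Because whole groups are assigned at once, the realized train fraction only approximates the requested one. A time split follows the same pattern with publication year as the grouping key and a cutoff year in place of the size-based rule.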

What GNNs Actually Do Well

For tasks where training and deployment chemical space overlap significantly, GNNs perform as advertised. Aqueous solubility prediction for compounds within the Lipinski-space MW and logP range, trained on large diverse datasets (ChEMBL + literature data), achieves practical RMSE of 0.5–0.8 log units—sufficient to classify compounds reliably into <10 µg/mL, 10–100 µg/mL, and >100 µg/mL bins. Caco-2 permeability classification (permeable / impermeable) achieves 80–85% accuracy on chemically similar compounds. CYP inhibition binary classification (inhibitor / non-inhibitor for CYP3A4 and CYP2D6) reaches AUROC of 0.83–0.90 on within-distribution test compounds.

Quantitative bioactivity prediction (pIC50, pKd) against specific targets performs well when the training set is dense: typically >1,000 data points from ChEMBL for the specific target, with good structural diversity. For well-studied kinases (CDK2, EGFR, BRAF), GNN predictions of pIC50 from structure alone achieve RMSE of 0.6–0.9 in scaffold-split evaluations. For targets with sparse data (fewer than 300 compounds in ChEMBL), RMSE typically exceeds 1.2, at which point the prediction has limited decision utility—a 1.2 log unit RMSE means the model routinely mistakes sub-micromolar compounds for 10 µM ones and vice versa.
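The arithmetic behind that last statement is a one-line conversion: an error of x log units on pIC50 corresponds to a 10^x fold-change in IC50. The helper name below is an assumption made for illustration.

```python
def fold_error(log_units):
    """Convert an error in log10 units (pIC50, pKd) into a concentration fold-change."""
    return 10 ** log_units

# RMSE values quoted in the text for dense and sparse targets.
for rmse in (0.6, 0.9, 1.2):
    print(f"RMSE {rmse} log units is about {fold_error(rmse):.1f}-fold in IC50")
```

An RMSE of 1.2 log units works out to roughly a 16-fold typical error in IC50, which is why a 0.6 µM compound and a 10 µM compound become hard to tell apart.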

Where GNNs Fail Consistently

Three failure modes appear with enough regularity that they should be treated as known limitations rather than edge cases.

Sparse targets. Targets with <500 data points in public databases will not support accurate GNN predictions. The model will appear to train (training loss will decrease), but it is largely memorizing the training set rather than learning structure-activity relationships that transfer. RMSE on held-out scaffold splits for sparse targets typically exceeds 1.5 log units, which makes the predictions unreliable for potency-based triage. For these targets, structure-based methods (docking with reliable scoring functions, FEP where a crystal structure is available) outperform GNN predictions despite being computationally heavier.

Out-of-distribution chemotypes. GNNs encode molecular graphs as fixed-size vectors that aggregate information over k-hop neighborhoods. Novel ring systems, unusual heteroatom combinations, and macrocyclic scaffolds that have minimal structural neighbors in the training set produce embeddings that are interpolations of structurally distant training compounds. The model produces a number, but that number reflects the average behavior of structurally dissimilar molecules more than the actual properties of the query compound. Tanimoto distance monitoring (comparing the query to its nearest training neighbor in ECFP4 space) is a practical way to detect this: at Tanimoto distance > 0.55 to nearest training compound, we treat GNN predictions as order-of-magnitude estimates only.
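The nearest-neighbor check described above can be sketched in a few lines. This version assumes fingerprints are already available as sets of on-bit indices (in practice they would be ECFP4 bit vectors from a toolkit such as RDKit); the function names and the toy bit sets are hypothetical, and only the 0.55 cutoff comes from the text.

```python
def tanimoto_distance(bits_a, bits_b):
    """1 - Tanimoto similarity for fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # two empty fingerprints: treated as identical (a convention)
    return 1.0 - len(bits_a & bits_b) / union

def nearest_training_distance(query_bits, training_fingerprints):
    """Distance from the query to its closest training compound."""
    return min(tanimoto_distance(query_bits, fp) for fp in training_fingerprints)

OOD_THRESHOLD = 0.55  # cutoff quoted in the text

# Toy bit sets standing in for ECFP4 fingerprints (illustrative only).
training = [{1, 2, 3, 4, 5, 6}, {2, 3, 7, 8, 9}]
in_domain_query = {1, 2, 3, 4, 5, 10}
novel_query = {3, 11, 12, 13, 14}

for name, query in [("in-domain", in_domain_query), ("novel", novel_query)]:
    d = nearest_training_distance(query, training)
    label = "order-of-magnitude only" if d > OOD_THRESHOLD else "usable"
    print(name, round(d, 2), label)
```

For a real training set of tens of thousands of compounds, the same logic would normally run through a toolkit's bulk similarity routines rather than a Python loop.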

Endpoint measurement heterogeneity. Many public ADMET datasets aggregate data from multiple assay platforms, cell lines, and experimental protocols. A hERG dataset that mixes patch-clamp IC50 values from 5 different labs with fluorescence-based counter-screen values will have label noise in the 0.5–0.8 log unit range. A GNN trained on this data cannot be more accurate than the measurement noise in the training set. For targets or ADMET endpoints where you know your training data is heterogeneous, the model's theoretical RMSE floor is set by the measurement variance—not by model architecture quality.
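One simple way to see the floor: if the model's error against the true value and the assay noise are independent and zero-mean, their variances add, so the RMSE measured against noisy labels cannot drop below the label noise. The sketch below rests on that independence assumption, and the function name is illustrative.

```python
import math

def observed_rmse(model_rmse, label_noise_sd):
    """RMSE against noisy labels, assuming independent zero-mean errors.

    model_rmse: error of the model against the (unobservable) true value.
    label_noise_sd: standard deviation of the measurement noise.
    """
    return math.sqrt(model_rmse ** 2 + label_noise_sd ** 2)

# Label-noise range quoted in the text for a mixed-protocol hERG dataset.
for noise in (0.5, 0.8):
    perfect = observed_rmse(0.0, noise)   # a hypothetical perfect model
    decent = observed_rmse(0.4, noise)    # a model with 0.4 log units of true error
    print(f"noise {noise}: perfect model scores {perfect:.2f}, decent model {decent:.2f}")
```

Even a hypothetical perfect model scores an RMSE equal to the label noise, so improvements in architecture show up only in the part of the error that sits above that floor.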

A Calibration Check We Run on Every New Target

Before using GNN predictions as a primary filter for any new target program at Nanolix, we run a prospective calibration check: we predict pIC50 for 30–50 ChEMBL compounds not included in our training split, then compare predicted vs. measured values. This is a minimal sanity check, not a rigorous benchmark, but it catches the most dangerous failure modes—systematic bias in the predictions (consistently predicting too high or too low by >0.5 log units) or implausibly narrow spread (the model has collapsed to predicting near the training mean for all compounds).

If calibration RMSE on this prospective set exceeds 1.0 and we cannot trace it to a specific data issue, we switch to docking-plus-rescoring as the primary filter for that target and use the GNN for qualitative flagging only. This happens more than we'd like—maybe 20–25% of targets we encounter have data quality or sparsity issues that make GNN predictions unreliable as a primary signal. Knowing when to distrust your own model is as important as building the model.
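A minimal sketch of what such a check might compute is shown below. The 0.5 log unit bias limit and the 1.0 RMSE limit come from the text; the spread-ratio threshold of 0.5, the function name, and the toy values are assumptions made for illustration, and a real check would use the 30–50 compounds described above rather than five.

```python
import math
import statistics

def calibration_report(predicted, measured,
                       max_bias=0.5, max_rmse=1.0, min_spread_ratio=0.5):
    """Flag systematic bias, high RMSE, and collapsed prediction spread.

    min_spread_ratio is an assumed threshold, not a value from the text.
    """
    if len(predicted) != len(measured) or len(measured) < 2:
        raise ValueError("need two or more paired values")
    errors = [p - m for p, m in zip(predicted, measured)]
    bias = statistics.fmean(errors)
    rmse = math.sqrt(statistics.fmean([e * e for e in errors]))
    measured_sd = statistics.pstdev(measured)
    if measured_sd == 0:
        raise ValueError("measured values have no spread")
    spread_ratio = statistics.pstdev(predicted) / measured_sd
    return {
        "bias": bias,
        "rmse": rmse,
        "spread_ratio": spread_ratio,
        "biased": abs(bias) > max_bias,
        "rmse_too_high": rmse > max_rmse,
        "collapsed": spread_ratio < min_spread_ratio,
    }

# Toy pIC50 values (illustrative only).
measured = [5.0, 6.0, 7.0, 8.0, 9.0]
healthy = calibration_report([5.3, 5.8, 7.2, 7.9, 8.7], measured)
collapsed = calibration_report([6.9, 7.0, 7.0, 7.1, 7.0], measured)
shifted = calibration_report([m + 0.8 for m in measured], measured)

for name, report in [("healthy", healthy), ("collapsed", collapsed), ("shifted", shifted)]:
    flags = [k for k in ("biased", "rmse_too_high", "collapsed") if report[k]]
    print(name, round(report["rmse"], 2), flags)
```

The collapsed case is worth checking separately from RMSE: a model that predicts near the training mean for everything has zero average bias, so only the spread comparison reveals that it is not ranking compounds at all.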

Message Passing Depth and Long-Range Interactions

A less-discussed limitation: standard message-passing GNNs aggregate information from nodes within k hops, where k is the number of layers. With k = 3–5 layers, each atom's representation captures interactions at most k bonds away, that is, 3 to 5 bonds. For most ADMET-relevant properties (logP, solubility, metabolic soft spots), the local chemical environment within 3 bonds dominates, and shallow GNNs work well. For protein-ligand interaction tasks that depend on global conformation (3D binding pose, long-range charge effects), message-passing depth becomes a bottleneck. GNNs that operate on 3D molecular graphs (the rotation-invariant SchNet and DimeNet, the equivariant PaiNN) address this partially, but they require 3D conformer input, which means conformer generation is now in the pipeline, adding its own error source.
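The receptive-field limit is easy to see on a toy graph. The sketch below uses a breadth-first search as a stand-in for k rounds of message passing: the atoms it returns are the only ones whose features can influence the starting atom's representation after k layers. The adjacency-list representation and the 10-atom chain are illustrative assumptions.

```python
from collections import deque

def atoms_within_k_bonds(adjacency, start, k):
    """Breadth-first search: atoms reachable from `start` in at most k bonds."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        if distance[atom] == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency[atom]:
            if neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                queue.append(neighbor)
    return set(distance)

# Toy graph: an unbranched chain of 10 heavy atoms, 0-1-2-...-9.
n_atoms = 10
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < n_atoms] for i in range(n_atoms)}

for k in (3, 5):
    print(k, sorted(atoms_within_k_bonds(chain, 0, k)))
```

With 3 layers, atom 0 sees atoms 0 through 3; with 5 layers, atoms 0 through 5. Atoms 6 through 9 never contribute to atom 0's representation directly, although a graph-level readout still pools all atom representations at the end.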

For our hit identification work, we treat 2D GNN predictions as efficient first-pass filters and 3D structure-based methods as second-pass scorers for the shortlisted candidates. The 2D pass is cheap and fast; the 3D pass is expensive but accurate within its applicability domain. Stacking them this way captures the computational efficiency of GNNs while avoiding decisions based solely on GNN predictions in high-stakes ranking contexts.
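The stacking pattern can be sketched generically. The two scoring callables below are hypothetical placeholders for a 2D GNN predictor and a 3D structure-based scorer, and the scores are made-up values chosen only to show the control flow.

```python
def two_pass_rank(candidates, cheap_score, expensive_score, shortlist_size):
    """Rank with a cheap scorer first, then rescore only the shortlist."""
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:shortlist_size]
    return sorted(shortlist, key=expensive_score, reverse=True)

# Made-up scores for four hypothetical compounds (higher is better).
gnn_scores = {"A": 7.1, "B": 6.4, "C": 5.0, "D": 6.9}
structure_scores = {"A": 6.0, "B": 7.5, "C": 9.0, "D": 6.8}

expensive_calls = []  # records which compounds reach the costly second pass

def structure_based_score(compound):
    expensive_calls.append(compound)
    return structure_scores[compound]

ranked = two_pass_rank(list(gnn_scores), gnn_scores.get, structure_based_score, 3)
print(ranked, expensive_calls)
```

In this toy run, compound C never reaches the second pass even though it would have scored highest there. That is the trade-off of any funnel: first-pass false negatives are not recoverable, so the shortlist size is a budget decision as much as a modeling one.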

The Right Expectation Going In

GNN-based molecular property prediction, deployed on chemically diverse candidates with appropriate applicability domain monitoring, is a practically useful tool for hit identification. It is not a replacement for measurement, and it is not uniformly accurate across all targets and endpoints. The accuracy numbers that matter are not the benchmark leaderboard figures—they are the prospective calibration results on your specific target, your specific chemotype, and your specific training data distribution. That distinction should guide how much decision weight you assign to any individual predicted value.