How Physics-Based Models Improve Drug Design
Molecular docking is the computational problem of predicting how a small molecule fits into a protein’s binding site — the three-dimensional pose, the orientation, and the geometry of the interaction. Get it right, and you can rank millions of drug candidates by predicted binding before synthesizing a single one. The question of how well AI-based docking methods do this, compared to classical physics-grounded approaches, turns out to be more complicated than the benchmark headlines suggest. The peer-reviewed evidence points to real advances and real remaining gaps — and understanding both is what allows a research program to use these tools well. AQBioSim applies physics-grounded simulation at the stages where docking alone is not enough.
The core task is geometrically precise: given a protein target with an accessible binding site and a candidate small molecule, predict the conformation the molecule adopts when it binds — which atoms point where, which contacts form with which protein residues, and whether the pose is physically plausible. This predicted pose is called the binding mode, and evaluating it accurately is what makes docking useful for drug discovery.
The reason this matters at scale is virtual screening. Rather than testing hundreds of thousands of compounds physically — slow, expensive, and limited to what a laboratory can handle — a docking-based virtual screen can rank tens of millions of candidates computationally and direct experimental resources toward the subset most likely to bind. The quality of that ranking directly determines the quality of the hits that reach the lab.
The scale of the opportunity is significant. Roughly 10% of the approximately 20,000 protein-coding genes in the human genome have documented small molecule binders. The remaining 90%, many of them associated with disease, have not been successfully targeted with drugs. Docking is one of the primary computational tools for exploring those untargeted proteins, predicting whether a binding site exists and whether candidate compounds might reach it.

Classical docking tools like AutoDock Vina and CCDC Gold work through search-and-score: a search algorithm samples a large number of candidate poses for the molecule within the binding site, and a scoring function estimates the quality of each pose. The top-ranked poses are returned as predictions.
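The search-and-score loop can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: the pose is reduced to a ligand position inside a cubic box, the search is plain random sampling, and `toy_score` is a hypothetical stand-in for a real scoring function. Production tools use far more sophisticated search strategies and also sample orientation and internal torsions.

```python
import random

def toy_score(pose):
    """Hypothetical scoring function (lower is better), minimum at (1.0, -2.0, 0.5)."""
    x, y, z = pose
    return (x - 1.0) ** 2 + (y + 2.0) ** 2 + (z - 0.5) ** 2

def search_and_score(score_fn, n_samples=5000, box=5.0, top_k=3, seed=0):
    """Sample random ligand positions in a cubic box and keep the best-scoring ones."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_samples):
        # Search step: propose a candidate pose.
        pose = tuple(rng.uniform(-box, box) for _ in range(3))
        # Score step: estimate the quality of that pose.
        scored.append((score_fn(pose), pose))
    scored.sort(key=lambda item: item[0])
    return scored[:top_k]

top_poses = search_and_score(toy_score)
```

Each entry in `top_poses` is a `(score, pose)` pair, ordered from best to worst score.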
The scoring functions in classical docking draw on physical and chemical principles — force field terms that model van der Waals interactions and electrostatics, empirical terms fit to experimental binding data, and knowledge-based potentials derived from structural databases. They are not perfect models of the underlying physics, but they encode meaningful physical constraints: bond lengths, steric clashes, hydrogen bond geometry, and the energetics of protein-ligand contact all contribute to the score in ways that reflect actual chemistry.
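To make the force field terms concrete, here is a minimal sketch of two of them: a 12-6 Lennard-Jones term for van der Waals interactions and a Coulomb term for electrostatics. The parameter values (`epsilon`, `sigma`, a single shared atom type) are illustrative assumptions, and the empirical and knowledge-based terms that real scoring functions add are not shown.

```python
import math

# Approximate Coulomb constant in kcal*angstrom/(mol*e^2), as commonly used
# in molecular mechanics force fields.
COULOMB_CONSTANT = 332.06

def lennard_jones(r, epsilon=0.1, sigma=3.4):
    """12-6 Lennard-Jones energy: strongly repulsive at short range, weakly attractive beyond."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 ** 2 - ratio6)

def coulomb(r, q1, q2):
    """Electrostatic energy between two point charges in vacuum."""
    return COULOMB_CONSTANT * q1 * q2 / r

def pair_score(ligand_atoms, protein_atoms):
    """Sum both terms over all ligand-protein atom pairs.

    Each atom is a (position, partial_charge) tuple; atoms must not overlap exactly.
    """
    total = 0.0
    for pos_l, q_l in ligand_atoms:
        for pos_p, q_p in protein_atoms:
            r = math.dist(pos_l, pos_p)
            total += lennard_jones(r) + coulomb(r, q_l, q_p)
    return total

ligand = [((0.0, 0.0, 0.0), -0.5)]
protein = [((3.8, 0.0, 0.0), 0.5), ((0.0, 4.0, 0.0), 0.0)]
score = pair_score(ligand, protein)
```

In this toy case the opposite charges and the near-optimal contact distances both lower the score, which is the behavior a physically grounded scoring function is meant to reward.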
Classical docking is slower than AI-based methods, which limits how many compounds can be screened in a given compute window. But that physical grounding is also why classical methods remain competitive with — and in documented benchmark comparisons, sometimes outperform — newer deep learning approaches.
The past several years have produced two waves of AI-based docking approaches. The first wave — standalone deep learning docking methods like DiffDock, EquiBind, TankBind, and Uni-Mol — replaced classical search algorithms with neural networks trained on protein-ligand crystal structures. They are substantially faster than classical tools and showed striking performance improvements on the benchmarks used to evaluate them. The second wave — co-folding models like AlphaFold3, Chai-1, Boltz-1, and OpenFold3 — predict the full protein-ligand complex structure jointly from protein sequence and ligand structure, rather than docking a ligand into a pre-existing protein structure. These generally outperform standalone DL docking methods on standard benchmarks.
The more revealing picture comes from two rigorous independent benchmarking studies that tested these methods under conditions closer to real-world drug discovery.
PoseBusters, published in Chemical Science in 2024 by Buttenschoen, Morris, and Deane at the University of Oxford, compared five deep learning docking methods against two classical tools using a benchmark set of crystal structures released after the DL models were trained — so none of the test cases appeared in training data. The findings were specific: across both physical plausibility and generalization to proteins dissimilar from the training set, no deep learning-based method outperformed classical docking tools. The failure modes were concrete. DL methods frequently produced physically implausible poses — wrong bond lengths, steric clashes between atoms, incorrect stereochemistry — that would be immediately rejected by a chemist reviewing the output. The study also found that molecular mechanics force fields contain docking-relevant physics that deep learning methods are missing. Performance on proteins dissimilar to the training set degraded substantially.
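The kind of plausibility check involved can be illustrated with a short sketch. This is not the PoseBusters implementation: the function names, the 25% bond-length tolerance, and the 2.0 Å clash cutoff are all assumptions chosen for illustration, and stereochemistry checks are omitted.

```python
import math
from itertools import combinations

def bond_length_violations(coords, bonds, ideal_lengths, tolerance=0.25):
    """Return bonds whose length deviates from the ideal by more than a fractional tolerance."""
    bad = []
    for i, j in bonds:
        length = math.dist(coords[i], coords[j])
        ideal = ideal_lengths[(i, j)]
        if abs(length - ideal) > tolerance * ideal:
            bad.append((i, j))
    return bad

def steric_clashes(coords, bonds, min_distance=2.0):
    """Return non-bonded atom pairs that sit closer than min_distance (angstroms)."""
    bonded = {frozenset(bond) for bond in bonds}
    clashes = []
    for i, j in combinations(range(len(coords)), 2):
        if frozenset((i, j)) in bonded:
            continue
        if math.dist(coords[i], coords[j]) < min_distance:
            clashes.append((i, j))
    return clashes

# Atoms 0-1-2 form a bonded chain; atom 3 is an unbonded atom placed too close.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 1.0, 0.0)]
bonds = [(0, 1), (1, 2)]
ideal = {(0, 1): 1.5, (1, 2): 1.5}

violations = bond_length_violations(coords, bonds, ideal)
clashes = steric_clashes(coords, bonds)
```

Here the bond lengths pass, but atom 3 is flagged as clashing with atoms 1 and 2, the sort of error a chemist would reject on sight.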
PoseBench, published in Nature Machine Intelligence in December 2025 by Morehead and colleagues, extended this analysis to co-folding models across four benchmark datasets. The results confirmed that co-folding approaches generally outperform standalone DL docking. But the paper’s conclusion was pointed: even state-of-the-art tools, including AlphaFold3, are still challenged by targets with genuinely new protein-ligand binding poses. DL methods consistently struggle to balance structural accuracy with chemical specificity when predicting binding on proteins and ligands far from the training distribution.
These findings do not argue against using AI docking tools — the speed advantages are real and practically valuable. They do argue against treating benchmark RMSD scores as evidence of real-world readiness, particularly for novel targets. The gap between performance on familiar proteins and performance on genuinely unseen ones is the field’s central open problem.
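A short sketch shows why a single headline success rate can hide that gap. It assumes the common convention of counting a pose as successful when its RMSD to the crystal pose is at most 2 Å, and it skips alignment and symmetry handling. The RMSD values are made up for illustration and are not benchmark results.

```python
import math

def pose_rmsd(predicted, reference):
    """RMSD over matched atoms, with no alignment or symmetry correction."""
    if len(predicted) != len(reference):
        raise ValueError("poses must have the same number of atoms")
    squared = sum(math.dist(p, r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))

def success_rate(rmsds, threshold=2.0):
    """Fraction of poses at or below the RMSD threshold (angstroms)."""
    return sum(1 for value in rmsds if value <= threshold) / len(rmsds)

# Made-up RMSD values (angstroms), for illustration only.
familiar_targets = [0.8, 1.2, 1.9, 3.1]
novel_targets = [1.5, 2.6, 4.0, 5.2]

overall = success_rate(familiar_targets + novel_targets)
familiar = success_rate(familiar_targets)
novel = success_rate(novel_targets)
```

With these toy numbers the pooled rate is 50%, while the split is 75% on familiar targets and 25% on novel ones. Reporting the two groups separately is what exposes the generalization gap.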
Underlying much of what makes docking hard is a simplification that almost every method makes: treating proteins as rigid. Real binding is not a static lock-and-key interaction. When a ligand arrives at a binding site, the protein often adjusts — side chains rotate, loops shift, pockets open or close. This induced-fit behavior means the protein structure that exists before binding may look different from the one that exists after, and docking to the pre-binding structure can predict poses that would not actually form.
Classical docking tools work around this by using ensembles of protein conformations or by allowing limited flexibility in selected residues. Most DL docking methods inherit the rigid-protein assumption or make similar approximations. Newer approaches like CarsiDock-Flex and DiffBindFR are beginning to model receptor flexibility explicitly, but these remain at the research stage and have not yet demonstrated broad prospective performance.
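The ensemble idea is simple to express in code: dock the same ligand against several receptor conformations and keep the best result. The sketch below assumes a hypothetical `toy_dock` function in which each conformation is reduced to a single pocket width. A real workflow would call an actual docking engine on full structures.

```python
def ensemble_dock(ligand, receptor_conformations, dock_fn):
    """Dock one ligand against every conformation; return (best_score, conformation_name)."""
    scored = [
        (dock_fn(ligand, conformation), name)
        for name, conformation in receptor_conformations.items()
    ]
    return min(scored)  # lower score is better

def toy_dock(ligand_width, pocket_width):
    """Hypothetical score: penalizes mismatch between ligand and pocket width."""
    return abs(ligand_width - pocket_width)

# Pocket widths in angstroms for three made-up receptor conformations.
conformations = {"closed": 4.0, "intermediate": 5.5, "open": 7.0}
best_score, best_conformation = ensemble_dock(6.8, conformations, toy_dock)
```

A ligand that scores poorly against the closed structure alone is recovered here because the open conformation is in the ensemble.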
The most rigorous treatment of protein flexibility comes from molecular dynamics simulation — physics-based modeling of how molecular systems move over time, capturing the conformational landscape that governs real binding. MD is computationally intensive, which is why it is applied selectively rather than at screening scale. But for high-value targets where a docking prediction needs to be trusted before committing to synthesis, MD provides information that docking — classical or AI — cannot.
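At its core, MD repeatedly integrates Newton's equations of motion over small time steps. The sketch below uses the velocity Verlet scheme on a single particle in a harmonic potential, a deliberately tiny stand-in for a molecular system. The unit mass, spring constant, and time step are arbitrary assumptions. Production MD engines apply the same idea to very large numbers of atoms with full force fields.

```python
def velocity_verlet(x, v, force_fn, mass=1.0, dt=0.01, n_steps=1000):
    """Integrate one-dimensional motion with the velocity Verlet scheme."""
    trajectory = [x]
    a = force_fn(x) / mass
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt   # advance position
        a_new = force_fn(x) / mass           # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # advance velocity with the averaged acceleration
        a = a_new
        trajectory.append(x)
    return trajectory, v

def harmonic_force(x, k=1.0):
    """Restoring force of a spring, a toy stand-in for a bonded term."""
    return -k * x

def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the toy system."""
    return 0.5 * mass * v * v + 0.5 * k * x * x

trajectory, final_v = velocity_verlet(x=1.0, v=0.0, force_fn=harmonic_force)
final_energy = total_energy(trajectory[-1], final_v)
```

The total energy stays very close to its starting value of 0.5 over the run, which is the property that makes this family of integrators suitable for long simulations.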
The most significant recent development in docking-adjacent AI is not an improvement to pose prediction itself but a rethinking of where AI fits in the screening workflow. DrugCLIP, published in Science in January 2026 by researchers at Tsinghua University, is a contrastive learning framework that represents protein pockets and small molecules as vectors in high-dimensional space and ranks candidates by their similarity — bypassing the slow pose-sampling step of traditional docking entirely for the initial triage pass.
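The retrieval step of such a framework can be sketched as a similarity ranking over embeddings. This is not DrugCLIP's code: the vectors below are hand-written placeholders for what trained pocket and molecule encoders would produce, and cosine similarity is an assumed choice of metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(pocket_vec, molecule_vecs, top_k=2):
    """Rank molecules by similarity to the pocket embedding, most similar first."""
    scored = [
        (cosine_similarity(pocket_vec, vec), name)
        for name, vec in molecule_vecs.items()
    ]
    scored.sort(reverse=True)
    return scored[:top_k]

# Hypothetical embeddings; a real system would compute these with trained encoders.
pocket = [1.0, 0.0, 0.0]
molecules = {
    "mol_a": [0.9, 0.1, 0.0],
    "mol_b": [0.0, 1.0, 0.0],
    "mol_c": [0.5, 0.5, 0.0],
}
ranked = rank_by_similarity(pocket, molecules)
```

Because molecule embeddings can be computed once and reused for every pocket, the per-pair cost reduces to a dot product, which is where the speed advantage over pose sampling comes from.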
The speed advantage is substantial: up to 10 million times faster than docking alone for pre-screening. The research team used it to screen approximately 10,000 human proteins against 500 million compounds in a single day — a scale that would take years on conventional docking infrastructure. Wet-lab validation confirmed the approach produces real hits: a 15% hit rate for the norepinephrine transporter, with 12 compounds outperforming bupropion in biochemical assays; serotonin 2A receptor agonists validated at less than 100 nM; and a 17.5% hit rate for thyroid hormone receptor interactor 12 (TRIP12), a target with no prior experimentally determined holo structure or known ligands.
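A quick back-of-the-envelope calculation puts the screening figures in perspective. It uses only the approximate numbers quoted above and assumes the full cross product of proteins and compounds was scored within 24 hours.

```python
proteins = 10_000
compounds = 500_000_000
seconds_per_day = 24 * 60 * 60

# Every protein is scored against every compound.
pairs = proteins * compounds
pairs_per_second = pairs / seconds_per_day

print(f"{pairs:.1e} pairs, about {pairs_per_second:.2e} pairs per second")
```

That is about 5 × 10^12 protein-compound pairs, or roughly 58 million pair evaluations per second in aggregate.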
The authors flag the appropriate caveat: protein pockets are often conserved across protein families, making it difficult to confirm whether a deep learning model has learned general interaction principles or is pattern-matching on structural features shared across its training data. That caveat applies broadly to the field, not only to DrugCLIP. But the genome-scale screening results represent a genuine practical advance — pre-screening at a speed that makes exhaustive coverage of the human proteome computationally tractable for the first time.
Molecular docking — classical or AI — is best understood as a hypothesis generator. It identifies candidates worth investigating, not confirmed binders. The gap between a docking pose and a reliable affinity prediction is where physics-based methods earn their place.
Free energy perturbation calculates binding affinity from thermodynamic first principles at a level of accuracy that docking scoring functions approximate rather than achieve. Molecular dynamics captures protein flexibility and the dynamic behavior of binding in ways that static docking cannot. For more on how these methods complement docking across the pipeline, see the companion articles on computational drug discovery and binding affinity prediction.
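The simplest form of free energy perturbation is the Zwanzig exponential-averaging relation, ΔG = -kT ln⟨exp(-ΔU/kT)⟩, where ΔU is the energy difference between two states evaluated on configurations sampled from the first. The sketch below implements only that single-step estimator on made-up energy differences. Production workflows typically split the transformation into intermediate states and use more robust estimators, none of which is shown here.

```python
import math

KT_KCAL_PER_MOL = 0.593  # approximate kT at 298 K, in kcal/mol

def zwanzig_free_energy(delta_u_samples, kt=KT_KCAL_PER_MOL):
    """Estimate delta G = -kT * ln(mean(exp(-dU / kT))) from sampled energy differences."""
    n = len(delta_u_samples)
    # Factor out the smallest dU so every exponent is <= 0 (numerically stable).
    smallest = min(delta_u_samples)
    total = sum(math.exp(-(du - smallest) / kt) for du in delta_u_samples)
    return smallest - kt * math.log(total / n)

# Made-up energy differences (kcal/mol), for illustration only.
samples = [0.0, 2.0]
delta_g = zwanzig_free_energy(samples)
```

The estimate is always at or below the plain average of the samples, because low-energy configurations dominate the exponential average. That sensitivity to rare favorable configurations is one reason careful sampling matters so much in practice.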
SandboxAQ’s Large Quantitative Models combine physics-based simulation with AI specifically for the cases where docking is a starting point rather than a conclusion — novel targets, flexible binding sites, and first-in-class molecules where the training data dependency of DL docking is most limiting. The practical workflow is sequential: fast AI-based pre-screening narrows the field, docking identifies candidate poses, and physics-based simulation validates and ranks the most promising hits with the accuracy that clinical development decisions require.
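The sequential workflow can be sketched as a generic funnel in which each stage re-scores the survivors of the previous one and keeps fewer of them. Everything here is hypothetical: the compound names, the precomputed scores, and the stage sizes are placeholders, and this is not a description of any specific product pipeline.

```python
def run_funnel(compounds, stages):
    """Apply (name, score_fn, keep_n) stages in order, keeping the lowest scores each time."""
    survivors = list(compounds)
    for _name, score_fn, keep_n in stages:
        survivors = sorted(survivors, key=score_fn)[:keep_n]
    return survivors

# Made-up precomputed scores (lower is better at every stage).
compounds = [
    {"name": "A", "prescreen": 0.1, "dock": -9.0, "sim": -11.0},
    {"name": "B", "prescreen": 0.2, "dock": -7.0, "sim": -12.0},
    {"name": "C", "prescreen": 0.3, "dock": -8.5, "sim": -8.0},
    {"name": "D", "prescreen": 0.4, "dock": -9.5, "sim": -10.0},
    {"name": "E", "prescreen": 0.9, "dock": -10.0, "sim": -13.0},
    {"name": "F", "prescreen": 0.8, "dock": -6.0, "sim": -9.0},
]

stages = [
    ("AI pre-screen", lambda c: c["prescreen"], 4),        # cheapest, widest
    ("docking", lambda c: c["dock"], 2),                   # moderate cost
    ("physics-based simulation", lambda c: c["sim"], 1),   # most expensive, narrowest
]

final_hits = run_funnel(compounds, stages)
```

In this toy data compound E has the best simulation score but never reaches that stage, because the pre-screen discards it. A funnel can only confirm what its early stages let through, which is why the quality of the upstream ranking matters.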
What is molecular docking in drug discovery?
Molecular docking is a computational method that predicts how a small molecule (drug candidate) fits into the binding site of a protein target — the three-dimensional pose, orientation, and geometry of the interaction. It is used in virtual screening to rank large numbers of candidate compounds by predicted binding before synthesis and experimental testing, directing lab resources toward the most promising candidates.
How does AI improve molecular docking?
AI-based docking methods, particularly co-folding models like AlphaFold3 and OpenFold3, predict protein-ligand complex structures faster than classical search-and-score approaches and have improved performance on standard benchmarks. More recently, AI pre-screening frameworks like DrugCLIP (Science, 2026) have enabled virtual screening at genome scale, up to 10 million times faster than docking, by bypassing the pose-sampling step for initial triage. The speed advantages are real and practically valuable for large-scale screening.
What is the difference between classical and AI-based docking?
Classical docking tools use physics-informed scoring functions — force fields, empirical terms, and knowledge-based potentials — to evaluate candidate poses. They are slower but physically grounded. AI-based methods use neural networks trained on protein-ligand crystal structures to predict poses more quickly. Co-folding models predict the full complex structure from sequence rather than docking into a pre-existing structure. The key difference in practice is generalization: classical methods apply consistent physical principles to any system, while AI methods perform best on proteins similar to their training data and degrade on novel targets.
What are the limitations of AI docking methods?
Two primary limitations are documented in peer-reviewed benchmarking. First, physical plausibility: DL docking methods frequently produce poses with incorrect bond lengths, steric clashes, or stereochemistry errors that classical methods avoid (PoseBusters, Chemical Science, 2024). Second, generalization: performance degrades significantly on proteins dissimilar to training data, which is precisely the setting that matters most for novel targets (PoseBench, Nature Machine Intelligence, 2025). The PoseBusters study also found that molecular mechanics force fields contain docking-relevant physics that current DL methods are missing.
What is protein-ligand docking used for?
Protein-ligand docking is used primarily for virtual screening — computationally ranking large compound libraries by predicted binding to a target to identify candidates worth testing experimentally. It is also used in lead optimization to predict how structural modifications affect binding pose and geometry, and in structure-based drug design to guide the synthesis of compounds that fit a target’s binding site. Docking predictions are most reliable when used as hypotheses that downstream experimental or higher-accuracy computational methods can confirm.
What is virtual screening?
Virtual screening is the computational evaluation of large compound libraries against a biological target to identify candidates likely to bind. It is the upstream step that makes docking practically useful at scale: rather than physically testing millions of compounds, virtual screening narrows the field computationally so experimental resources can focus on the most promising candidates. Methods range from fast AI pre-screening (like DrugCLIP) through molecular docking to higher-accuracy physics-based simulation, applied sequentially as a funnel.