Announcing SAIR

Structurally Augmented IC50 Repository

The Largest Publicly Available Binding Affinity Dataset with Cofolded 3D Structures

SAIR (Structurally Augmented IC50 Repository) is the largest public dataset of protein–ligand 3D structures paired with binding potency measurements. SAIR contains over one million protein–ligand complexes (1,048,857 unique pairs) and a total of 5.2 million 3D structures, curated from the ChEMBL and BindingDB databases and cofolded using the Boltz-1x model.

By providing this unprecedented scale of structure–activity data, we aim to enable researchers to train and evaluate new AI models for drug discovery by bridging the historical gap between molecular structure and drug potency prediction.

2.5 TB of publicly available data

>5 million cofolded 3D structures

>1 million unique protein–ligand pairs

Bridging a Gap in AI-Driven Drug Design

Binding affinity prediction is central to drug discovery: it tells us how strongly a candidate molecule (ligand) binds to a target protein, which is key for designing effective drugs. In theory, a ligand’s binding affinity is determined by the 3D interactions in the protein–ligand complex. In practice, deep learning models that use 3D structures have been limited by a scarcity of training data. Very few protein–ligand complexes have both a resolved 3D structure and a measured potency (IC50, Ki, etc.), so most AI approaches have had to rely on indirect data like sequences or 2D chemical structures.
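Potency values such as IC50 span many orders of magnitude, so they are commonly put on a logarithmic scale before being used as a model target. The minimal sketch below shows the standard conversion pIC50 = -log10(IC50 in mol/L). The announcement does not say which units or transforms SAIR uses, so the nanomolar input, the function name, and the example values are all assumptions made for illustration.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in mol/L).

    Assumption: the input is in nM. 1 nM = 1e-9 M, so
    -log10(x * 1e-9) = 9 - log10(x).
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return 9.0 - math.log10(ic50_nm)


# Hypothetical potencies in nM, not taken from SAIR.
examples_nm = {"weak": 10_000.0, "moderate": 100.0, "potent": 1.0}
pic50 = {name: ic50_nm_to_pic50(value) for name, value in examples_nm.items()}

for name, value in pic50.items():
    print(f"{name}: pIC50 = {value:.2f}")
```

On this scale a higher value means a more potent ligand, and each unit corresponds to a tenfold change in IC50.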

One way to overcome this limitation is to generate synthetic training data using predicted structures. Recent advances in protein structure prediction (e.g. AlphaFold) mean we can computationally model protein–ligand complexes and use those for learning. Initial efforts like the PLINDER dataset demonstrated the promise of this approach. SAIR was created to dramatically expand on this idea – providing a massive repository of computationally folded protein–ligand structures with corresponding experimental affinity values. Our goal is to fill the data gap and catalyze more accurate and robust ML models for binding affinity prediction.

Build with SAIR

SAIR is offered under a CC BY-NC-SA 4.0 license and is available on Google Cloud Platform. The data are completely free for non-commercial use. Commercial users may also use the data at no charge after submitting a short form to SandboxAQ.

SAIR can be used as a baseline for benchmarking biofoundation models, or for training and fine-tuning new models that predict binding affinity. We would love to hear any other ideas you have for using this dataset.

In our recent webinar, our team at SandboxAQ, along with a special guest from NVIDIA, presented this breakthrough and explained how to access the data on Google Cloud Platform.

If you would like to collaborate with us to expand SAIR, or would like our help building on top of it, please fill out this form and we’ll be in touch.

For More Information:

Contact us about SAIR

Here at SandboxAQ, we’re releasing SAIR to our customers and the world as just the start of revamping drug discovery. Expect new datasets, AI models, and transformative solutions to follow across the drug development pipeline. If you’re interested in learning more about SAIR, or in seeing how it, or models trained on it, might be expanded to include targets of special interest to your business, we’d love to hear from you.

Contact us at SAIR@sandboxaq.com.
