Innophore Is Folding the Human Proteome Using NVIDIA BioNeMo, Creating a Fused Dataset of Structural Models for Machine Learning Purposes

In the field of human biology, proteins play a delicate yet crucial role in health and disease. Knowledge of their structure on a molecular level is a cornerstone of biological research, offering insights that can transform our understanding of cellular processes while opening new pathways for drug development and off-target search.


Paving the Way for Machine Learning Applications
In the age of artificial intelligence, the quality of the training set largely determines the accuracy of the model. To create a reliable and accurate dataset of the human proteome, we at Innophore collaborated with NVIDIA, using the NVIDIA BioNeMo platform to predict protein structures. Within BioNeMo, we employed AlphaFold 2, OpenFold, and ESMFold to cover different structure prediction methods. Additionally, homology modeling with Innophore’s CavitomiX platform refined the dataset to improve its quality and robustness.

Over 42,000 Structures of Human Proteins
Previously, EBI and DeepMind made great efforts to model the human proteome with AlphaFold 2, providing structural models for 74.32% (32,782 structures) of the human reference proteome UP000005640. The base of this dataset is 81,671 protein sequences from UniProt. These sequences encompass potential proteins and their splicing variants, corresponding to 19,357 distinct (nonsynonymous) human genes. Notably, only about half of these sequences have been experimentally confirmed at the protein level. Using this as a starting point, the structures of 42,042 distinct human proteins, including splicing variants, were calculated using all three methods available in BioNeMo.

Dataset Formats for Everyone and Every Purpose
Recognizing the diverse needs of the scientific community, we present our dataset in two formats: unedited and edited. The unedited version contains the raw structures generated by the different prediction methods, offering a comprehensive view of the modeling approaches. The edited version incorporates refinements, including a specialized dataset that excludes regions of low prediction confidence. Additionally, this version includes structures in complex, both with and without predicted ligands derived from homologs found in the Protein Data Bank (PDB).
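To illustrate what excluding low-confidence regions can look like in practice, here is a minimal Python sketch. It is not the pipeline used to build the edited dataset. It assumes the models are PDB-format text files in which the B-factor column (columns 61 to 66) carries the per-residue confidence score (pLDDT) on a 0 to 100 scale, as in AlphaFold 2 output. The function name `filter_by_plddt` and the cutoff of 70 are illustrative assumptions, and the score scale of any given file should be checked before applying a threshold.

```python
def filter_by_plddt(pdb_text: str, min_plddt: float = 70.0) -> str:
    """Drop ATOM/HETATM records whose B-factor value is below min_plddt.

    Assumption: the B-factor column stores pLDDT on a 0-100 scale.
    The default cutoff of 70 is illustrative, not the dataset's own value.
    """
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # PDB is a fixed-width format: columns 61-66 hold the B-factor.
            try:
                score = float(line[60:66])
            except ValueError:
                continue  # skip atom records without a parsable score
            if score < min_plddt:
                continue  # low-confidence atom: exclude it
        # Non-atom records (HEADER, TER, END, ...) pass through unchanged.
        kept.append(line)
    return "\n".join(kept)
```

A typical call would read a model file into a string, pass it through `filter_by_plddt`, and write the result to a new file. Filtering per atom works here because predicted models usually assign the same score to every atom of a residue. If that does not hold for a given file, residues could end up partially removed.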

Applications and Impact
The applications of this dataset are vast. Structure-based drug design, a field at the forefront of pharmaceutical innovation, stands to benefit immensely from the wealth of structural information provided. Researchers can now explore protein structures in detail and identify potential drug targets with greater precision. Furthermore, the dataset facilitates the prediction of protein function and interactions, enabling a deeper understanding of the molecular mechanisms underlying physiological processes and diseases. This, in turn, drives advances in fields such as personalized medicine, where treatments tailored to individual protein profiles become increasingly feasible.

Access the Dataset: https://figshare.com/s/2d69e0e2fcb00f46fe0e