AI models and computing power have come a long way since the early 2000s.
In 2012, image recognition using convolutional neural networks reached escape velocity by outperforming all other approaches.
Just four years later, a Google research team published the fundamental paper "Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs", showing a meaningful application of machine learning (ML) for assisting medical professionals in identifying diseases.
In the same year, the best GPUs reached a price performance of just 3.5b FLOPS per $.
Today, only 8 years later, we have GPUs providing roughly 20x better price performance.
What’s more, powerful transformer models like Google’s AlphaFold 3 and EvolutionaryScale’s ESM-3, along with frameworks like NVIDIA’s MONAI stack, drive innovation in healthcare and life sciences.
In particular, the field of AI drug discovery has seen tremendous investment over the past years. In 2023, roughly $14b was invested globally in this space ($60b in total over the past 9 years).
It seems reasonable to say that we have the algorithms, resources and computing efficiency to really do better for patients.
But the missing piece is sufficient data.
AI application in drug discovery and collaboration
The integration of AI in drug discovery is receiving significant support from the academic community, as evidenced by numerous published papers such as those by Mak et al., Terranova et al., and Hasselgren et al.
This academic support highlights the potential real-world value AI brings to the field.
An analysis of literature, patents, and expert insights provides an overview of AI use cases in drug discovery.
These use cases span five primary families:
Disease pathology: AI is employed to discover and validate new drug targets. This often involves automating image analysis from phenotypic screens, mining -omics data (e.g., proteomics, genomics) to better understand how specific targets influence disease progression, modeling protein dynamics to explore target-disease pathway interactions, and identifying biomarkers to refine patient segmentation for drug research.
Small molecule design and optimization: AI is applied in two key areas: identifying hit-like or lead-like small molecules and optimizing those hits for desirable properties like binding affinity, toxicity, and synthesis. AI is used both for screening existing chemical libraries and for generative de novo design of compounds.
Vaccine design and optimization: AI use cases in this category focus primarily on the discovery and design of vaccines, with an emphasis on mRNA-based vaccines, though some applications extend to other vaccine types. For example, AI can assist in epitope selection, prediction, and binding across all vaccine platforms, while other use cases, like codon optimization and delivery system improvement for enhanced protein production with minimal toxicity, are specific to mRNA vaccines.
Antibody design and optimization: AI supports a wide range of applications aimed at identifying and optimizing antibody structures, formats, binding characteristics, and other properties. There are two major AI-driven approaches for molecule identification: screening pre-existing libraries and using the more cutting-edge de novo design capabilities. AI is also emerging in the optimization of these molecules, enhancing properties like binding specificity, physicochemical traits, and humanization to ensure high target specificity, affinity, and potency.
Safety and toxicity: AI is used to assess the safety profile of a drug or vaccine candidate. Although AI applications in this area are fewer compared to other use cases due to the specific nature of established toxicity approaches, some solutions exist. These generally fall into three categories: predicting off-target effects, simulating pharmacokinetics and pharmacodynamics, and modeling molecule interactions with biological systems through quantitative systems pharmacology (QSP).
Despite the promising potential of AI in these areas, achieving breakthroughs in drug discovery is not as straightforward as it might seem.
The success of AI models in predicting suitable protein structures, binding affinity, solubility, toxicity, or druggability scores of a molecule heavily relies on the quality and quantity of the data they are trained on.
However, even large and complex models are limited by the available data.
For instance, a leading expert in computational drug discovery pointed out that although there are around 18,000 possible targets within the human body, top pharmaceutical companies usually have data for only a few hundred targets.
As a result of this limitation, “even when presented with a well-defined problem, generative algorithms produce a tremendous amount of nonsense” (Walters, 2024).
So, if one can’t make it alone, why not collaborate?
Collaborative AI for pharmaceutical industry use cases
Collaboration on AI use cases presents a solution to these challenges.
The profound impact of advanced models like AlphaFold 2, which won the "American Nobel Prize" (the Lasker Award) in 2023, demonstrates the significant advancements AI has made in drug discovery.
David Baker, Demis Hassabis and John Jumper winning the Nobel Prize in Chemistry just a few days ago, with Hassabis and Jumper honored for AlphaFold 2, underlines not only the advancement and importance of cross-organizational collaboration but also the impact AI will have on "cracking the code for proteins' amazing structures".
These models' capabilities can be greatly enhanced through collaborative efforts, using shared knowledge and data to overcome the limitations of proprietary datasets.
By working together, the scientific community can harness the full potential of AI in drug discovery, paving the way for more accurate models and ultimately, more effective drugs and vaccines.
Collaborations in the field of pharmacology and among biotech companies are nothing new. Yet the type of collaboration is shifting as the changing approach to drug discovery puts AI and data in focus.
Data science is becoming the third essential pillar of the drug discovery process, alongside biology and chemistry.
With this shift, the need for collaborative work on data and AI models becomes increasingly important.
At some point, all companies will hit a wall in their AI journey when restricting themselves to only leveraging their own data.
At Apheris, we observe three use case archetypes within our customer base:
Pharma-to-model-provider collaborations
Pharma-to-pharma collaborations
Building your own network
Pharma-to-model-provider collaborations
Innovation in the field of AI-driven drug discovery is constantly happening.
Academia is developing new models, and the private sector, in the form of innovative biotechs, is contributing new approaches as well.
Models are often trained on a combination of public and some proprietary data (e.g. AlphaFold 3).
But how well will these models perform on specific use cases in large pharmaceutical companies?
How well a model performs in a public benchmark or on a modeler’s own benchmark is often not enough for large enterprises to license or buy an externally developed model.
Large buyers want to know how a given model performs on their data. This leads to a dilemma.
One side has the data and the other side has the model. Both are sensitive IP, and neither can be shared easily without the risk of disclosure.
Apheris can enable such benchmarking use cases without the risk of disclosing the model or sharing the data.
Governed federated compute in Apheris was developed for such use cases.
With Apheris’ ability to run federation in trusted execution environments, both data and model IP remain confidential.
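The benchmarking pattern described above can be illustrated with a minimal sketch. It is not the Apheris API: `run_benchmark_in_enclave`, `BenchmarkReport`, the toy model and the toy dataset are all hypothetical names and values invented for illustration. The idea it shows is that the model and the data meet only inside a protected function, and only aggregate metrics leave it.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class BenchmarkReport:
    """The only object that leaves the protected environment."""
    n_samples: int
    mean_absolute_error: float


def run_benchmark_in_enclave(
    model: Callable[[Sequence[float]], float],
    features: Sequence[Sequence[float]],
    labels: Sequence[float],
) -> BenchmarkReport:
    """Evaluate the provider's model on the pharma company's data.

    Hypothetical stand-in for code running in a protected environment:
    neither model internals nor raw rows are returned, only aggregates.
    """
    if len(features) != len(labels) or not labels:
        raise ValueError("features and labels must be non-empty and aligned")
    errors = [abs(model(x) - y) for x, y in zip(features, labels)]
    return BenchmarkReport(len(errors), sum(errors) / len(errors))


# Toy stand-ins (both invented): a "proprietary" model and a "private" dataset.
def provider_model(x: Sequence[float]) -> float:
    return 2.0 * x[0] + 1.0


private_features = [[0.0], [1.0], [2.0], [3.0]]
private_labels = [1.0, 3.5, 5.0, 6.0]

report = run_benchmark_in_enclave(provider_model, private_features, private_labels)
print(report)  # n_samples=4, mean_absolute_error=0.375
```

In a real deployment the isolation would come from the execution environment itself, not from a Python function boundary. The sketch only shows which information crosses that boundary.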
Pharma-to-pharma collaborations
As mentioned above, individual pharma companies often do not have enough data for training the most advanced models.
For instance, having data for just 200 of the 18,000 possible targets within human biology is likely not enough to train a model to the accuracy level needed to generate less nonsense.
Hence, pooling data without sharing it is a sensible path forward.
Within our customer base, this archetype of collaborative AI use case is a direct collaboration among multiple pharma companies.
Here, an external model provider develops a model that is trained on the pooled data of multiple pharma companies and shared with the group after successful training.
The data stays within the secure environment of each pharma company, and the model is trained safely behind the firewall without disclosing sensitive IP.
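To make the training setup above more concrete, here is a minimal sketch of federated averaging, one common way to train a shared model while data stays with each participant. It is an illustration under simplifying assumptions, not a description of how any specific consortium trains its models: the toy linear model, the invented datasets and the names `local_update` and `federated_average` are all hypothetical.

```python
from typing import List, Tuple

Weights = Tuple[float, float]        # (slope, intercept) of a toy linear model
Dataset = List[Tuple[float, float]]  # (x, y) pairs that never leave the client


def local_update(weights: Weights, data: Dataset,
                 lr: float = 0.05, steps: int = 20) -> Weights:
    """Run a few gradient-descent steps on one participant's private data."""
    w, b = weights
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


def federated_average(updates: List[Tuple[Weights, int]]) -> Weights:
    """Combine client weights, weighted by the number of local samples."""
    total = sum(n for _, n in updates)
    w = sum(weights[0] * n for weights, n in updates) / total
    b = sum(weights[1] * n for weights, n in updates) / total
    return w, b


# Invented toy data: three "companies" observe the same relation y = 3x + 1.
clients: List[Dataset] = [
    [(0.0, 1.0), (1.0, 4.0)],
    [(2.0, 7.0), (3.0, 10.0), (1.5, 5.5)],
    [(0.5, 2.5), (2.5, 8.5)],
]

global_weights: Weights = (0.0, 0.0)
for _ in range(200):
    # Only weights and sample counts are exchanged, never the (x, y) pairs.
    updates = [(local_update(global_weights, data), len(data)) for data in clients]
    global_weights = federated_average(updates)

print(global_weights)  # close to slope 3 and intercept 1
```

Each round, participants start from the current global weights, train locally, and send back only their updated weights and sample counts. Real-world setups typically add secure aggregation, governance checks and far larger models on top of this basic loop.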
Building your own network
The three clinical trial phases are probably the most cost-intensive step within the drug development process (Sertkaya et al., 2024).
Identifying the right patient cohorts, selecting the best sites and evaluating large-scale efficacy is currently a highly labor-intensive process.
Gathering data for regulatory approval that shows superior efficacy of a developed drug compared to existing treatments is also not a trivial task.
Regulators requiring clinical trials to be performed within their country, to determine efficacy and toxicity for the genomic profile of their population, drive cost and might delay access to the best possible treatment for patients in need.
An individual, on-demand data research network for drug discovery and treatment evaluation is a core use case with Apheris.
For computations to be compliant, they have to adhere to the local regulatory frameworks. This is achieved via Apheris Compute Gateways and Computational Governance, which keep data local while allowing compliant computations for analytics and ML to run within a hospital's environment.
Our products are currently deployed in German, other European and North American hospitals, allowing our customers to have a powerful data network at their fingertips.
All these different approaches to collaboration in drug discovery and across the pharmacological community are important to reduce the time and cost of bringing drugs to patients.
But before we have the best data for a given research question at our fingertips, there are still a few challenges to overcome.
Challenges in collaborative AI for pharma
Collaborations for improved drug discovery are complex endeavors.
Federated learning is emerging as a promising foundation.
Since the initial paper on federated learning in 2016, federation has been adopted across sectors.
In industries with a high degree of data growth like healthcare, manufacturing and financial services, the number of papers published monthly is growing rapidly.
On arXiv, you’ll find over 5,700 papers exploring federation and its characteristics for various applications and models.
The healthcare industry and life sciences especially show an accelerating interest in federation.
As shown in the graph below, nearly 40 new papers about federated learning have been indexed per month on PubMed in 2024 - the highest monthly average rate since 2017.
The scientific community has a good understanding of federation and its applicability. This knowledge is now being used in real-world industry scenarios.
The very first large scale collaborative project in drug discovery ran from 2019-2022 as a public-private partnership. MELLODDY used the technology and models available at that time.
The sentiment I gathered in many conversations at events is that it was great to see that such a project can work, but the limitations of the model and federation technology available in 2019 left participants wanting more.
Three main caveats were mentioned to me:
Labor investment of participating pharmaceutical companies was quite high
Approvals for computations and governance took very long, so only a very limited number of experiments could be run
Predictive models available in 2019 didn’t have the capabilities of today's models, so the impact of more training data wasn’t as significant as participants anticipated
What has changed since then?
The first paper on federated learning was published in late 2016.
Much has happened since then.
The very same is true for the advancements in AI model capabilities, as shown, for example, by the release cycles of the AlphaFold models (AF1 2020, AF2 2021 and AF3 2024).
For collaborating on highly sensitive data with advanced AI models, many capabilities have to be in place to ensure IP is protected in a federated setting.
For example:
Data should always stay home with no direct access to raw data
Strong reliability and security of the underlying federation technology
Capability to fine-tune model permissions and control on the algorithmic level
Configurable privacy-enhancing technologies as add-ons
Traceability for monitoring and demonstrating compliance with regulators and internal compliance teams
In addition, our customers highly appreciate the neutral role of Apheris.
We provide the middleware in the form of the Compute Gateway and act as a neutral, yet enabling entity.
Examples of how AI and federation help in drug discovery
A perfect example of a promising collaboration based on more advanced technology is the AI Structural Biology consortium, a novel collaboration aimed at helping to transform drug discovery and development.
Apheris provides the technology for secure, private and governed ML & analytics on distributed data.
Leveraging federated learning and computational governance, this technology enables trust among parties and ensures the highest standard for data confidentiality and protection.
In the product view below, a data custodian controls a statistical analysis using the Apheris Statistics package. Here, a data custodian can define exactly which functions are allowed and which privacy-enhancing technologies should be applied for compliant analysis.
With Apheris, you add, supervise and manage such fine-grained controls with any data-driven algorithm, from simple statistics to the latest deep learning models.
This approach is called Computational Governance.
Computational governance is a method to control, supervise, and track all aspects of computations on data.
It works by automatically evaluating incoming computation requests, enforcing privacy and security properties and overseeing the release of results.
This allows data custodians to enable privacy-preserving analysis of any data.
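As a rough illustration of the idea, the sketch below evaluates incoming computation requests against a custodian-defined policy and records every decision for traceability. It is a conceptual example only and does not reflect the actual Apheris product or its API: `ComputationRequest`, `GovernancePolicy`, the function names and the requester identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class ComputationRequest:
    requester: str
    function: str  # e.g. "mean", "count", "export_rows"
    dataset: str


@dataclass
class GovernancePolicy:
    """Hypothetical policy a data custodian might define for one dataset."""
    dataset: str
    allowed_requesters: Set[str]
    allowed_functions: Set[str]
    audit_log: List[str] = field(default_factory=list)

    def evaluate(self, request: ComputationRequest) -> bool:
        """Approve or reject a request and record the decision."""
        approved = (
            request.dataset == self.dataset
            and request.requester in self.allowed_requesters
            and request.function in self.allowed_functions
        )
        verdict = "APPROVED" if approved else "REJECTED"
        self.audit_log.append(
            f"{verdict}: {request.requester} -> {request.function}({request.dataset})"
        )
        return approved


policy = GovernancePolicy(
    dataset="trial_cohort",
    allowed_requesters={"analyst@pharma-a"},
    allowed_functions={"mean", "count"},
)

ok = policy.evaluate(ComputationRequest("analyst@pharma-a", "mean", "trial_cohort"))
blocked = policy.evaluate(ComputationRequest("analyst@pharma-a", "export_rows", "trial_cohort"))

print(ok, blocked)  # True False
print("\n".join(policy.audit_log))
```

The allow-list design means anything not explicitly permitted is rejected by default, and the audit log gives compliance teams a record of both approved and rejected requests.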
With Apheris, this approach is combined with federated learning methods.
This eliminates the need to move data, as computations are sent to the data and get executed within the data custodian's environment.
Once a computation reaches a data custodian's setup, computational governance is applied and only results (aggregates or weights) are returned.
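A simple federated statistic shows what "only aggregates are returned" can look like. The sketch below is an assumption-laden toy, not the Apheris Statistics package: the site names, the measurements, the `MIN_COHORT_SIZE` release rule and both function names are invented for illustration.

```python
from typing import Dict, List, Optional, Tuple

MIN_COHORT_SIZE = 3  # assumed release rule: suppress results from tiny cohorts


def local_sum_and_count(values: List[float]) -> Optional[Tuple[float, int]]:
    """Runs inside a custodian's environment and returns aggregates only."""
    if len(values) < MIN_COHORT_SIZE:
        return None  # nothing is released for this site
    return sum(values), len(values)


def federated_mean(site_data: Dict[str, List[float]]) -> float:
    """Combine per-site aggregates into a global mean without pooling raw rows."""
    # In this simulation the sites live in one process; in practice each call
    # to local_sum_and_count would execute at a different organization.
    released = [r for r in map(local_sum_and_count, site_data.values()) if r is not None]
    if not released:
        raise ValueError("no site released a result")
    total = sum(s for s, _ in released)
    count = sum(n for _, n in released)
    return total / count


# Invented measurements held at three sites; hospital_c is too small to release.
sites = {
    "hospital_a": [4.0, 6.0, 8.0],
    "hospital_b": [10.0, 12.0, 14.0, 16.0],
    "hospital_c": [100.0],
}

print(federated_mean(sites))  # 10.0
```

The central party sees only two numbers per site, a sum and a count, which is enough to compute the exact global mean over all released cohorts.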
Collaborative AI for the pharmaceutical industry: discuss your drug discovery program
The best data is rarely publicly available, especially not for drug discovery and other pharmacological use cases.
Apheris has proven to be a great solution and has deep experience in deploying Compute Gateways into the most regulated and secure environments, like German hospitals, and in making collaborations among leading therapeutic organizations happen.