The potential of advanced models for drug discovery
I recently attended the Drug Discovery Innovation Summit in Berlin.
This was a small, focused conference that aims to bring together participants, mainly working in Pharma and BioTech, who want to realise the potential of AI in Drug Discovery.
As a long-term AI practitioner, I feel genuinely excited by the potential to use the newly available and powerful tools in AI for something other than targeted ads and digital assistants.
I was eager to hear how leaders in the field such as Laura Matz and John Wise see the potential of AI playing out at the cutting edge of drug discovery.
I have worked previously as a consultant to organizations, helping them design ‘AI roadmaps’, using the ubiquitous ‘value vs cost’ matrix.
However, the truth is that in many cases, AI is akin to snake oil, and data is not ‘the new oil’ that these organizations have been promised.
For many, the accessible data simply isn’t rich enough to produce AI models that outperform the status quo when ‘outperform’ is defined holistically - as in, total benefit and total cost to the organization.
At a fundamental level, no machine learning engineer or novel model architecture, no matter how smart, can compensate for a shortcoming in the available data.
This is not, however, the case for drug discovery, as evidenced by the heavy investment into the sector in recent years - 4.7 billion USD in 2021 alone.
This is because the potential for AI to accelerate the pace of drug discovery - target identification, molecule design, virtual screening, success forecasting, clinical trial optimisation - is huge.
All the necessary ingredients are present: large amounts of relevant data exist (genomic, proteomic, chemical compound libraries, and clinical trial records). More importantly, there is a hugely expensive and painful problem to be solved: the time to market for new drugs, typically 10-15 years.
So why has it not yet taken off? Why has no AI-designed drug candidate progressed beyond Phase 2 clinical trials (the furthest along being INS018_055, developed by Insilico Medicine)?
Is it just that it’s too early to really see the end-benefits yet, or are there more fundamental challenges that need to be addressed?
Addressing the challenges in the drug discovery process
Unsurprisingly for a group of people with backgrounds in science, the discussion in Berlin rapidly focused not on the successes but on the challenges faced in AI for Drug Discovery.
The elephant in the room is that, despite the promises and heavy investment, AI-driven drug discovery and in silico trials have not been proven out, yet.
In the words of Najat Khan, Chief R&D Officer and Chief Commercial Officer at Recursion: “Doubters remain to be convinced. While AI early adopters have high hopes for their drug discovery tools, solid proof of progress remains scarce. The share prices of public AI-first companies have been battered in recent years, with both Recursion and Exscientia down around 70% from their 2021 initial public offering prices.”
LLMs really took off in 2018 when the first foundation model - BERT - arrived on the scene, but several years on, and unlike what is currently playing out in the digital assistant space, the endgame in drug discovery has not started: the market is not currently flooded with AI-designed drugs, nor is it likely to be in the immediate future. Why?
Firstly, pharmaceutical and biotech companies are often limited by the amount of data available within their own organizations.
Drug discovery models cannot be trained solely on internet content.
The lack of available data within single organizations can prevent Pharma and Biotech AI models from reaching the accuracy levels needed to produce outcomes that are actually an improvement on what those organizations achieve without AI.
Quoting the RoseTTAFold team: “The primary factor limiting accuracy is the relatively small size of the training set; whereas there are over 21,000 distinct protein-only structure clusters in the PDB, there are only 6,016 distinct sequence clusters with protein-small molecule complexes”.
There was plenty of lively debate in Berlin on how important it is to carefully curate this data to avoid ‘garbage in, garbage out’, although even this was put under scrutiny - what if you have a lot of garbage?
As was pointed out, ChatGPT was trained on the internet, after all.
Unlike analytics and lightweight ML models, deep learning excels at automatic feature generation, and noisy data can be averaged out - provided enough data exists to do this.
Rather than carefully curating smaller ‘artisanal’ datasets, could more be possible if a much larger amount of ‘garbage’ could be worked with as a single entity across organizational boundaries?
The jury is most certainly still out on that one for drug discovery, but if the lively debate amongst the experts is anything to go by, it seems to be worth a look.
Some pointed to publicly available pre-trained protein structure models - AlphaFold, ESM2, RoseTTAFold, such as those found in NVIDIA’s BioNeMo suite - which have been the source of so much close attention (even described as Pharma’s ‘AlphaGo’ moment) over recent years.
However - the majority of such offerings have been trained by big tech companies on the same corpus of publicly available data such as the Protein Data Bank.
The risk is that such models would be expected to eventually plateau on accuracy gains. To compound this problem, generic (foundation) models don’t work well for tasks unique to healthcare and life sciences. The pharmaceutical industry requires models trained on disease-specific, biologically valid data.
For example, for a model to recognise rare, emerging or complex diseases, 360 degree data relating to these conditions is required, including patient data.
Such data does to some extent exist, but it often resides within organizational or regulatory walls.
So it’s not just increasing the raw volume of data that will help here, it’s the right kind of data - and public data, or data held within individual organizations, is not enough.
Collaboration between organizations would help in generating datasets that are large and diverse enough to succeed in these efforts.
Such datasets would improve not just model accuracy but also generalisability to novel compounds, biological variability, and so on.
Biotechs want to collaborate with Pharmas to pool their data. Pharmas want to collaborate with other Pharmas.
In addition, there are other AI-first companies whose product is their actual model architectures, who would like to collaborate with data owners such as Pharmas, while retaining control of their model IP.
So, the concept of collaboration is great, but at the end of the day most data owners (and model providers for that matter) also have a business to run, salaries to pay, and are visibly nervous at the idea of giving away their organization’s IP or competitive advantage.
How can trusted collaborations be formed in this competitive environment?
Previous attempts have been made in Drug Discovery to pool resources through collaboration.
For example, the MELLODDY project - a collaboration across a number of European PharmaCos training models across distributed, proprietary data - was a pioneering attempt in this space.
However, such collaborations face a number of difficulties, mainly centered on the time that compliance, IP protection and other regulatory challenges add to the collaboration.
Such time sinks reduce the number of experiments that can be run, which in turn limits the performance benefit these collaborations can deliver. Ultimately, trust, and not technology, is the bottleneck in building these potentially powerful collaborations.
Successfully using federated learning for drug discovery
The motto of the conference was, ‘For our patients, we must do better’.
Most people who choose to work in Drug Discovery as a discipline believe in this and strive to build the best solutions possible.
Key to achieving this is to set up collaborations where the incentives are aligned and the appropriate safeguards are in place.
Easier said than done, perhaps, but the simple things - such as audit trail logging, purpose restriction and clear separation of collaborative IP - if handled via technology and not red tape, have the potential to speed these complex collaborations up greatly.
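As a purely illustrative sketch of what ‘handled via technology’ could look like, the Python snippet below shows a hypothetical governance wrapper that checks each training request against a pre-agreed purpose list and writes it to a tamper-evident audit trail. All names here (AuditLog, approve_round, the purpose strings) are invented for this example and don’t refer to any specific product.

```python
import hashlib
import json
import time

# Hypothetical governance wrapper: every request to use a partner's data is
# checked against an agreed purpose list and written to a hash-chained,
# append-only audit trail before any training round is allowed to start.

ALLOWED_PURPOSES = {"target-identification", "virtual-screening"}  # agreed up front

class AuditLog:
    def __init__(self):
        self.entries = []
        self._prev_hash = "genesis"

    def record(self, event: dict) -> None:
        # Chain each entry to the previous one so tampering is detectable.
        payload = json.dumps({**event, "prev": self._prev_hash}, sort_keys=True)
        entry_hash = hashlib.sha256(payload.encode()).hexdigest()
        self.entries.append({"event": event, "hash": entry_hash, "ts": time.time()})
        self._prev_hash = entry_hash

def approve_round(requester: str, purpose: str, audit: AuditLog) -> bool:
    """Purpose restriction: only pre-agreed uses of the shared models are allowed."""
    approved = purpose in ALLOWED_PURPOSES
    audit.record({"requester": requester, "purpose": purpose, "approved": approved})
    return approved

audit = AuditLog()
if approve_round("biotech_A", "virtual-screening", audit):
    print("Training round may proceed")   # the request is logged either way
print(f"Audit entries so far: {len(audit.entries)}")
```

The point is simply that this kind of bookkeeping can be automated and enforced in code, rather than renegotiated for every experiment.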
Clearing regulatory hurdles faster increases the rate at which experiments can be run, which nearly always results in better models.
AI is often touted as a ‘fits all purposes’ solution. It is not.
Like choosing the right tool for your DIY project, AI is only useful for certain purposes and often best used in conjunction with other techniques.
In particular AI can lead to great strides forward where very large corpuses of data are available.
Again, quoting Najat Khan: “There are a lot of data sets within pharma companies that we might be able to pull together under a third party that could help shape and move the entire industry forward”.
So it’s really crucial to expand the size and diversity of the available data as much as possible, and this is where collaborative data-pooling technologies such as federated learning - especially when administered by a third party - have the potential to really improve outcomes.
This is particularly true in areas of drug discovery where the most relevant data is sparse: rare diseases, pediatric disease, novel targets, early-stage discovery.
How federated learning holds promise for advancing drug discovery
Federated learning holds serious potential for drug discovery. It allows collaborators to train on each other’s data without the raw data ever leaving its owner’s infrastructure, effectively creating a much larger training dataset without exposing data or model IP. If a robust governance layer is also in the mix - for example audit trails and clear IP separation - collaborations can run more experiments faster and ultimately reach a better result, without tripping over compliance and security hurdles.
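For readers curious about the mechanics, here is a minimal sketch of the core idea - federated averaging - in Python with NumPy. Each partner fits a model on its own private data and only the resulting weights are shared and averaged. This is a toy illustration under simplified assumptions (a linear model, a single trusted coordinator, three made-up partner datasets), not a production setup, which would typically add secure aggregation, differential privacy and the governance layer described above.

```python
import numpy as np

# Minimal federated averaging (FedAvg) sketch: each organization fits a simple
# linear model on its own private data; only the weight vectors are shared and
# averaged. Raw data never leaves the owner's infrastructure.

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])  # stand-in for the "signal" hidden in the data

def make_private_dataset(n: int):
    X = rng.normal(size=(n, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    return X, y

def local_update(w, X, y, lr=0.1, epochs=20):
    # Plain gradient descent on the local (private) dataset.
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

partners = [make_private_dataset(n) for n in (50, 80, 30)]  # three organizations
w_global = np.zeros(3)

for round_ in range(5):
    # Each partner trains locally, starting from the current global weights...
    local_weights = [local_update(w_global.copy(), X, y) for X, y in partners]
    sizes = np.array([len(y) for _, y in partners])
    # ...and the coordinator averages the updates, weighted by dataset size.
    w_global = np.average(local_weights, axis=0, weights=sizes)

print("recovered weights:", np.round(w_global, 2))  # close to true_w
```

Even in this toy setting, the averaged model recovers the shared signal without any partner ever revealing a single row of its data.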
If you’re curious about how you could use federated learning and governance to collaborate across organizations, reach out!