What is a Federated Data Network and How Does it Support Cross-institutional Research?

A federated network is defined as a network that spans across geographical or organizational boundaries.

Such a federated network contains interconnected nodes that are operationally independent, yet centrally managed for efficiency and ease of use. A federated data network is a network set up specifically to make data available for use, e.g. sensitive health data for research questions.

Of course, the value of a federated data network depends mainly on the amount of high-quality data that is available across the network.

Let’s look at examples to clarify this more.

What is an example of a federated network?

The Intonate-MS project studies which factors drive the progression of Multiple Sclerosis, especially in the early stages of the disease. Data for this type of research is rare (data scientists call that “sparse”), but “ … a federated network can [...] overcome research bottlenecks that impede data-driven science” (Oh, Smolders, Buijs, et al. 2023). So to get reliable results, data from different sources needs to be combined.

An architecture diagram showing data sources from various hospitals being used in a federated learning setup. Data is harmonized, and intermediate training results are aggregated securely via Apheris to produce new research output. Federated Data Network Architecture, Intonate-MS Consortium (image excerpt from Oh, Smolders, Buijs, et al. 2023)

However, country-specific legal requirements for data protection as well as administrative hurdles make this cross-institutional data collaboration difficult.

A federated data network infrastructure provided by Apheris is expected to overcome these hurdles and thereby support a better understanding of Multiple Sclerosis.

A project at an earlier stage is Optima, which aims to build a federated data network across European data providers and will thereby span geographical, governmental and organizational boundaries.

Architecture diagram showing the structure of the OPTIMA consortium for cancer research. Source: https://www.optima-oncology.eu/

The Optima network specializes in cancer data, to inform patients and clinicians about their treatment options. Moreover, pharma companies and healthcare payers will be able to use the network to learn how their drugs perform in a real-world setting, outside of controlled research studies.

Now, let’s discuss the importance of federated data networks, using health data as a running example.

Importance of Federated Data Networks in Healthcare

The examples above clarify what Federated Data Networks are, but how helpful can they be for healthcare and research?

The summary paper by Hunger et al.[1] looked into that for oncology research specifically. The authors systematically screened which types of research questions were answered using real-world evidence data from multiple sources in Federated Data Networks.

Real-world evidence data is healthcare data collected in the “real world”, that is, outside of research studies, for example during hospital visits. The authors found 40 publications, and data from up to 5 million patients was used in these studies.

However, most Federated Data Networks were rather small: half the studies had five or fewer data partners.

Nevertheless, the small networks were sufficient for the research studies they found. The authors concluded that Federated Data Networks indeed have the “potential to answer a variety of different research questions in oncology”.

Those studies aim at understanding risk factors, incidence, survival of the patients as well as subgroups of patients that may need different forms of treatments for a better outcome.

We expect their findings to generalize to other diseases beyond cancer.

With bigger Federated Data Network projects like Optima or Intonate-MS mentioned above, even more data will become available, supporting further research. So it’s fair to conclude that indeed Federated Data Networks have the potential to support research in healthcare.

Key Concepts in Federated Data Networks

In Federated Data Networks, data stays in the network nodes where it was originally produced or stored.

Computational algorithms access data where needed and only transfer (intermediate) results over the network. Or, as Hallock et al.[2] phrase it, queries and algorithms “visit” the data in Federated Data Networks.

Distributed Data Analysis

First, let’s look into queries that “visit” data and which types of research questions can be answered with those.

The oncology summary paper[1] contains multiple examples for statistical queries such as counting to determine the frequency of rare cancers as well as treatment patterns and incidence of cancer types.

The authors find that four of the 40 studies use only such statistical queries. So even though distributed statistical analysis looks simple, it is enough to gain valuable insights.

The Apheris Federated Data Network supports distributed statistical queries and even has the most commonly used queries implemented and ready to use in the statistics package.

The documentation provides an example statistics workflow, and the full reference for the statistics package is available as well.
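To make the idea of a query that “visits” the data concrete, here is a minimal sketch of a distributed count and mean. It does not use the Apheris statistics package; the function names, the record layout and the toy data are all assumptions made for illustration. Each node computes only aggregates, and the coordinator combines them without ever seeing a patient-level row.

```python
from dataclasses import dataclass


@dataclass
class LocalSummary:
    """Aggregates a node returns; it contains no patient-level rows."""
    n_patients: int
    n_cases: int
    age_sum: float


def summarize_locally(records, diagnosis):
    """Runs inside a node: scan local records, return only aggregates."""
    cases = [r for r in records if r["diagnosis"] == diagnosis]
    return LocalSummary(
        n_patients=len(records),
        n_cases=len(cases),
        age_sum=float(sum(r["age"] for r in cases)),
    )


def combine(summaries):
    """Runs at the coordinator: merge per-node aggregates into global statistics."""
    n_patients = sum(s.n_patients for s in summaries)
    n_cases = sum(s.n_cases for s in summaries)
    age_sum = sum(s.age_sum for s in summaries)
    return {
        "n_patients": n_patients,
        "n_cases": n_cases,
        "frequency": n_cases / n_patients if n_patients else 0.0,
        "mean_age_of_cases": age_sum / n_cases if n_cases else None,
    }


# Hypothetical toy data held by two separate nodes.
node_a = [
    {"diagnosis": "rare_cancer", "age": 60},
    {"diagnosis": "other", "age": 45},
    {"diagnosis": "other", "age": 52},
]
node_b = [
    {"diagnosis": "rare_cancer", "age": 70},
    {"diagnosis": "other", "age": 38},
]

result = combine([summarize_locally(node, "rare_cancer") for node in (node_a, node_b)])
print(result)
```

Note that the global mean is computed from per-node sums and counts rather than by averaging per-node means, which would give the wrong answer whenever nodes hold different numbers of cases.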

Federated Machine Learning

Machine learning (ML) algorithms are a more complex way to “visit” the data in a Federated Data Network.

While the concrete implementation can vary from ML model to model, almost all types of models implement the following pattern:

1. The ML model is trained locally in each participating network node, using a suitable part of the data available there.
2. The resulting model parameters (often referred to as model weights) are shared via the network, often with a centralized server.
3. The global ML model is updated based on the locally computed model parameters. In simple implementations, model weights are averaged, but more complex update mechanisms can be implemented.
4. The resulting global ML model is used as a basis for further local training in the network nodes.
5. The process of local training and global aggregation is repeated until some stopping criterion is reached. That can be as simple as a fixed number of rounds, a given time limit, or until the model no longer makes significant improvements (also referred to as “convergence”).
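The pattern above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the Apheris or NVIDIA FLARE implementation: the model is a one-dimensional linear regression, the update rule is simple weighted averaging of weights, and all function names, hyperparameters and data points are made up for the example.

```python
def local_train(weights, data, lr=0.05, epochs=5):
    """Step 1: gradient descent on one node's data for the model y = w*x + b."""
    w, b = weights
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def aggregate(updates, sizes):
    """Step 3: average the local weights, weighted by local dataset size."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)


def federated_training(nodes, max_rounds=500, tol=1e-6):
    global_weights = (0.0, 0.0)
    for round_no in range(1, max_rounds + 1):
        # Steps 1 and 2: each node trains locally and shares only its weights.
        updates = [local_train(global_weights, data) for data in nodes]
        # Step 3: the server computes the new global model.
        new_weights = aggregate(updates, [len(d) for d in nodes])
        change = max(abs(a - b) for a, b in zip(new_weights, global_weights))
        # Step 4: the new global model seeds the next round of local training.
        global_weights = new_weights
        # Step 5: stop once the model no longer changes noticeably.
        if change < tol:
            break
    return global_weights, round_no


# Hypothetical data: both nodes hold points from y = 2x + 1, but for different x.
node_1 = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
node_2 = [(3.0, 7.0), (4.0, 9.0)]

weights, rounds_used = federated_training([node_1, node_2])
print(weights, rounds_used)
```

Only the two numbers `w` and `b` leave each node per round; the data points themselves are never sent to the server.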

The Apheris Federated Data Network is built with federated learning in mind. A concrete example workflow for one particular ML model can be found in the documentation.

An Example Collaboration Across Institutions

Apheris provides the Federated Data Network for the AI Structural Biology consortium which focuses on drug discovery. The goal is to use federated machine learning to deepen our current understanding of molecular structures and their interactions. That is an essential first step towards finding molecules that are safe and effective for usage in future drugs.

The biggest blocker in that task is the insufficient availability of public data to train the ML model.

Pharma companies have larger and/or more diverse proprietary datasets that they of course cannot directly share. More concretely, the amount of data about protein-small molecule interactions which is siloed in a few large pharma companies is an order of magnitude larger than the publicly available data.

Moreover, the private data covers a broader chemical space. That means training on such non-public data has the potential to increase model quality in general and may help the model generalize to areas where no public data is available.

Next, let's consider antibody-antigen interaction data, which describe how antigen-binding proteins in the immune system attach specifically to foreign molecules (called "antigens") to mark them for destruction, helping the body defend itself against infections or toxins.

For antibody-antigen interaction data, there is more public than private data available. However, the private data is more diverse, so adding it to a machine learning training dataset is likely to guide structure prediction models towards better generalization.

Although they cannot share their data directly, the pharma companies that have joined the consortium are in general willing to make their proprietary data “visitable” by some federated learning algorithms, after careful evaluation of privacy and security risks and their mitigation using technical, legal and contractual safeguards.

In short, the Federated Data Network connects proprietary data that otherwise wouldn’t be available and has the potential to advance knowledge in drug discovery.

Cross-institutional Research is Simplified by Federated Data Networks

As we have described with the example above, in cases where there isn’t enough public data available, Federated Data Networks enable participants to use more data and thereby enable cross-institutional research.

We will explore in this section some of the biggest roadblocks for cross-institutional research and how Federated Data Networks reduce these roadblocks.

Quality Research Depends on Access to Quality Data

Intuitively, it is clear that more data is better, but let’s dig a bit deeper into the theory behind it.

Hunger et al.[1] point out that more data means larger sample sizes. This in turn increases the statistical power of the entire analysis and leads to enhanced generalizability of the results.

In other words, the more data is used, the lower the risk that the results of the study fail to generalize.

Especially when studying rare diseases, one has to start with a large dataset to observe the rare event often enough to draw reliable insights about the disease.

Rare types of cancer are examples of such diseases.

The exploration of adverse events (such as adverse drug effects) also needs a large sample size to begin with. Unfortunately, the amount of data is not the only factor to consider when gathering data for research; the data also needs to be in the same format.
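A small calculation illustrates why pooling matters for rare events. The prevalence and site sizes below are hypothetical numbers chosen for the example, and the standard error uses the textbook formula for a proportion, sqrt(p(1-p)/n). It is a rough sketch, not a power analysis.

```python
import math


def expected_cases(prevalence, n):
    """Expected number of patients with the rare condition among n patients."""
    return prevalence * n


def prob_at_least_one_case(prevalence, n):
    """Chance that a cohort of n patients contains the condition at all."""
    return 1.0 - (1.0 - prevalence) ** n


def standard_error(prevalence, n):
    """Standard error of the estimated prevalence for a sample of size n."""
    return math.sqrt(prevalence * (1.0 - prevalence) / n)


prevalence = 1e-4                       # hypothetical: 1 case per 10,000 patients
site_sizes = [20_000, 30_000, 50_000]   # hypothetical cohort sizes per site
pooled_size = sum(site_sizes)

for n in site_sizes + [pooled_size]:
    print(
        f"n={n:>7}: expected cases={expected_cases(prevalence, n):5.1f}, "
        f"P(at least one)={prob_at_least_one_case(prevalence, n):.3f}, "
        f"standard error={standard_error(prevalence, n):.2e}"
    )
```

Because the standard error shrinks with the square root of the sample size, the pooled cohort of 100,000 patients yields an estimate that is sqrt(5), roughly 2.2 times, more precise than the smallest site alone.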

Addressing the Lack of Data Standardization

As soon as we need one algorithm to work on different datasets, these datasets need to be in the same format. For example, the same events of interest need to be referred to in the same way.

A workshop organized by the DNV[3] with participants from different Federated Health Data Networks across academia and industry also identified the lack of standards as one of the main challenges for collaboration.

Especially in the EU, there is a lack of standards for collecting and storing electronic health records, leading to unstructured datasets that, even if available in a Federated Data Network, cannot be readily used.

Even though there are several potential standards already available, they are not (yet) used consistently.

The workshop paper[3] mentions i2b2, FAIR and OHDSI; we would add SNOMED CT to that list as well.

Where data does not yet adhere to a required standard, data providers as well as data users will need to harmonize datasets. In federated data networks based on Apheris technology, users have three different options to harmonize data for their needs. None of these options requires data to be moved, which appears to be a strict requirement for many use cases:

Pre-canned preprocessing pipelines: Use existing algorithms in Apheris, like the statistics package, to deal with non-uniform data.
Custom harmonization jobs: Define your own workflows based on your technology stack (anything that can run in Docker works in Apheris).
Integrate with 3rd-party tools: Apheris provides a modular API approach for interfacing with the technology or vendor of your choosing.

Data formats and standards might differ across all sites or be shared among some of them. Therefore, Apheris allows users to run harmonization jobs either custom to an individual site or as the same job federated across multiple sites.

We believe that the more data is used and potentially even monetized in Federated Data Networks, the higher the return on investment for following those existing data standards will become. This will convince more data owners to implement them. The Apheris Federated Data Network is one way for Data Custodians to monetize their data.
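As an illustration of what a harmonization job might do, the sketch below applies the same function at two sites with site-specific mappings, so that the output of both follows one shared schema. Everything here is hypothetical: the field names, the diagnosis labels and the `harmonize` function are invented for the example and are not part of the Apheris product or of any of the standards mentioned above.

```python
COMMON_FIELDS = ("patient_id", "diagnosis", "age_years")


def harmonize(rows, field_map, diagnosis_map):
    """Runs at the site: rename fields and map local diagnosis labels to shared ones."""
    harmonized = []
    for row in rows:
        # Keep only fields that have a counterpart in the shared schema.
        renamed = {field_map[k]: v for k, v in row.items() if k in field_map}
        label = renamed["diagnosis"].strip().lower()
        # Unknown labels are flagged for review rather than silently guessed.
        renamed["diagnosis"] = diagnosis_map.get(label, "UNMAPPED")
        harmonized.append(renamed)
    return harmonized


# Site A: English field names and abbreviations (hypothetical).
site_a_rows = [{"pid": "a-1", "dx": "MS", "age": 34, "ward": "N3"}]
site_a_fields = {"pid": "patient_id", "dx": "diagnosis", "age": "age_years"}
site_a_diagnoses = {"ms": "multiple_sclerosis"}

# Site B: German field names and free-text labels (hypothetical).
site_b_rows = [
    {"patient": "b-7", "diagnose": "Multiple Sklerose ", "alter": 41},
    {"patient": "b-8", "diagnose": "unklar", "alter": 52},
]
site_b_fields = {"patient": "patient_id", "diagnose": "diagnosis", "alter": "age_years"}
site_b_diagnoses = {"multiple sklerose": "multiple_sclerosis"}

harmonized_a = harmonize(site_a_rows, site_a_fields, site_a_diagnoses)
harmonized_b = harmonize(site_b_rows, site_b_fields, site_b_diagnoses)
print(harmonized_a)
print(harmonized_b)
```

The function is identical across sites, while the mappings are site-specific. That mirrors the choice between running one job federated across sites and tailoring a job to an individual site.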

Operational Efficiency

To understand why operational efficiency is crucial for research especially across institutions, we first discuss the effect of lacking efficiency.

Learned et al. describe the barriers they faced when accessing omics datasets for pediatric cancer.

That data includes DNA (genomics) and further biologically related data such as RNA. And while that topic seems niche at first, we believe that many of the problems described therein do occur elsewhere as well.

The authors requested access to datasets that are, in theory, available upon request.

They observed that in practice, relevant data is scattered across various data providers, which already makes the initial discovery time-consuming.

Once relevant datasets were found, gaining access took them 2-3 months on average, and longer if the data needed to travel across national borders.

Finally, the data needed to be downloaded. Genomics datasets are huge, so that took time as well.

Moreover, different data providers used different, often very specialized tools for the downloads, which the authors needed to learn how to use.

In some cases, human errors in labeling the data rendered the downloaded data useless for the research purpose, which is especially frustrating after going through all the time-consuming steps mentioned above.

Clearly, that process does not scale.

Federated Data Networks are much more efficient in achieving the same goal. First of all, a time-consuming data download is no longer needed. Instead, as described above, intermediate and final computation results are transmitted over the network.

Depending on the task and algorithm, that can be a lot less information to transmit compared to raw data downloading.
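A back-of-the-envelope comparison shows what “depending on the task and algorithm” can mean in practice. All numbers below are hypothetical and only meant to convey orders of magnitude: a dataset of 1,000 samples at 1 GB each, 32-bit model weights, and 100 training rounds in which a node uploads its weights and downloads the new global model once per round.

```python
def raw_download_bytes(n_records, bytes_per_record):
    """Traffic needed to copy a raw dataset out of a node."""
    return n_records * bytes_per_record


def model_update_traffic_bytes(n_parameters, n_rounds, bytes_per_parameter=4):
    """Traffic for one node in federated training: one upload and one download per round."""
    return 2 * n_parameters * bytes_per_parameter * n_rounds


# Hypothetical scenario, chosen only to illustrate orders of magnitude.
raw = raw_download_bytes(n_records=1_000, bytes_per_record=1_000_000_000)
small_model = model_update_traffic_bytes(n_parameters=1_000_000, n_rounds=100)
large_model = model_update_traffic_bytes(n_parameters=1_000_000_000, n_rounds=100)

print(f"raw download:        {raw / 1e9:10.1f} GB")
print(f"small model updates: {small_model / 1e9:10.1f} GB")
print(f"large model updates: {large_model / 1e9:10.1f} GB")
```

In this made-up scenario the small model moves about 0.8 GB instead of 1,000 GB, while the very large model moves 800 GB, which is in the same range as the raw download. The saving is therefore substantial for statistics and compact models but not automatic for every algorithm.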

The initial step of finding relevant data is also expected to be much simpler and more efficient in a Federated Data Network, because the centralized management structure can advertise the nodes’ respective datasets.

Moreover, gaining access to an existing Federated Data Network is most likely easier and faster than negotiating with all the relevant data providers one by one.

Setting up contracts with network nodes might still be required. However, the process is repeatable and therefore most likely already streamlined by the network management.

Finally, let’s not forget that each Federated Data Network node is still operationally independent, which may result in different hardware and software being used across the network as the DNV workshop paper[3] points out.

Apheris Compute Gateway acts as the Federated Data Network’s centralized management entity and supports the discovery of relevant data.

More concretely, Data Custodians register datasets and decide whom to grant computational access to the data, which may or may not require a contractual agreement between Data Custodian and data user.

The data user can then simply list datasets available to them and understand the datasets’ relevance to their task based on the Data Custodians’ descriptions.
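The register, grant and list flow described above can be pictured with a small in-memory model. This is a conceptual sketch only. The `Registry` class, its methods and the example identifiers are hypothetical and do not represent the Apheris Compute Gateway API.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    dataset_id: str
    custodian: str
    description: str
    allowed_users: set = field(default_factory=set)


class Registry:
    """Central catalogue: stores descriptions and permissions, never the data itself."""

    def __init__(self):
        self._datasets = {}

    def register(self, dataset_id, custodian, description):
        self._datasets[dataset_id] = Dataset(dataset_id, custodian, description)

    def grant(self, dataset_id, custodian, user):
        dataset = self._datasets[dataset_id]
        # Only the custodian who registered the dataset may grant access to it.
        if dataset.custodian != custodian:
            raise PermissionError(f"{custodian} does not own {dataset_id}")
        dataset.allowed_users.add(user)

    def list_for(self, user):
        """What a data user sees: only datasets they were granted access to."""
        return [
            (d.dataset_id, d.description)
            for d in self._datasets.values()
            if user in d.allowed_users
        ]


registry = Registry()
registry.register("ds-ms-1", "hospital_a", "MS cohort, imaging and clinical scores")
registry.register("ds-onc-1", "hospital_b", "Oncology registry extract")
registry.grant("ds-ms-1", "hospital_a", "alice")

print(registry.list_for("alice"))
print(registry.list_for("bob"))
```

The registry holds metadata and permissions only. The datasets stay in the nodes, and the description is what lets a data user judge relevance before requesting any computation.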

Moreover, Compute Gateway minimizes the need for node-specific tools and abstracts away the concrete hardware and software used in network nodes.

This is achieved by wrapping around common ML frameworks like NVIDIA FLARE. Further details can be found in our guide to porting an NVIDIA FLARE model to Apheris.

Addressing Data Security and Privacy Concerns

Hallock et al.[2] observe that privacy legislation such as the GDPR, which was intended to improve access to health data and personal information across borders, has not yet shown the intended effect.

Even worse, the fear of misinterpretation has resulted in even more siloing of health data.

This effect is amplified by the different user groups having varying levels of knowledge. Some groups lack the information to make informed decisions on whether or not and which state-of-the-art privacy-enhancing technologies to use for their concrete use case, as the DNV workshop paper[3] points out.

While data security and privacy in general are a challenge both for direct data sharing as well as Federated Data Networks, the latter might be a first step towards reducing silos of health data.

The DNV workshop paper envisions Federated Data Networks offering documentation and expert advice on which legal requirements to comply with and which technological options to choose for implementing them.

More concretely, Hallock et al.[2] suggest that each network node implement sufficient authentication and authorization protocols to support controlled “data visiting” by algorithms in compliance with local legal requirements.

But not all relevant cybersecurity and privacy measures can be implemented on a per-node basis. As a result, they propose to use an independent orchestrator to coordinate and govern computation activities in the network while mitigating security and privacy risks.
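The division of labor between per-node controls and an orchestrator can be sketched as follows. This is a simplified illustration, not the design proposed by Hallock et al. or the Apheris implementation. The class names, the policy format and the audit log are assumptions, and authentication of the user is assumed to have happened before `dispatch` is called.

```python
class Node:
    """Holds data, a local access policy, and the computations it has approved."""

    def __init__(self, name, records, policy):
        self.name = name
        self._records = records
        self._policy = policy               # user -> set of permitted computation names
        self._approved = {"count": len}     # vetted code lives on the node itself

    def run(self, user, computation_name):
        # Authorization is enforced locally, so each node keeps the final say.
        if computation_name not in self._policy.get(user, set()):
            raise PermissionError(f"{user} may not run {computation_name} on {self.name}")
        return self._approved[computation_name](self._records)


class Orchestrator:
    """Coordinates computations across nodes and records what happened."""

    def __init__(self, nodes):
        self._nodes = nodes
        self.audit_log = []

    def dispatch(self, user, computation_name):
        results = {}
        for node in self._nodes:
            try:
                results[node.name] = node.run(user, computation_name)
                outcome = "ok"
            except PermissionError:
                outcome = "denied"
            self.audit_log.append((user, computation_name, node.name, outcome))
        return results


# Hypothetical setup: alice is permitted to count records at node_a only.
node_a = Node("node_a", records=[1, 2, 3], policy={"alice": {"count"}})
node_b = Node("node_b", records=[4, 5], policy={"alice": set()})
orchestrator = Orchestrator([node_a, node_b])

results = orchestrator.dispatch("alice", "count")
print(results)
print(orchestrator.audit_log)
```

Users request computations by name and never submit arbitrary code, so a node only ever executes logic it has approved. The orchestrator adds the network-wide view, here reduced to an audit log, which no single node could provide on its own.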

Apheris has implemented such an orchestrator to coordinate and govern computations in its Federated Data Network.

Moreover, expert advice for users can be requested to help them comply with legal and organizational requirements.

Over time, the plan is to replace this advice with privacy and security controls and privacy-enhancing technologies that are easily applicable to a wide range of tasks and use cases, so that future users can select suitable controls and measures for compliance on their own.

Federated Data Networks: Build Your Own or Join an Existing One

In case this article has motivated you to build your own Federated Data Network, we recommend reading the DNV workshop results[3] next. The paper gives an overview of the challenges Federated Data Networks face and provides recommendations on how to overcome them.

Want to join an existing Federated Data Network instead? Then you might want to check out

Optima for contributing to cancer research,
EU-ADR for pharmacoepidemiological and drug safety research
or the AISB Consortium for collaborative AI to predict protein complex structures.
