One of the biggest challenges for enterprises is to operationalize technology to drive innovation. While DevOps has accelerated software development, MLOps aims to enable data science at scale. With the advent of Privacy-enhancing Technologies, we can now fundamentally change how companies collaborate on data and AI, but this requires an evolution of our previous approaches.
Federated MLOps is a framework and toolset that manages the complexity of working with third parties on sensitive data for artificial intelligence. Entering secure data collaborations with multiple parties at scale requires an approach that covers data science, data engineering, compliance, and business needs holistically.
Apheris has collaborated with many companies across all industries that have scaled their data science initiatives across company borders and now operate successfully with partners on AI and sensitive data. The following white paper describes the challenges our customers have overcome, and what you can expect when you embark on this journey.
Executive Summary
Companies that implement MLOps into their data science practices have been able to deploy and maintain machine learning models in production more reliably and efficiently at scale. Over time, this has led to a massive improvement of their core business through AI. Today, AI models are used to develop new life-saving drugs, to improve patient treatments in healthcare, to optimize production processes in manufacturing or to enhance the customer experience in financial services.
But today, enterprise data science teams are still being bottlenecked when trying to expand MLOps beyond their own corporate boundaries, for example, when sharing data or operationalizing AI across several companies or geographies.
Many companies in value chains own complementary data sets that, when combined, offer considerable potential for innovation and new business opportunities. However, the risks involved in these pursuits can still outweigh the benefits. Regulation, compliance, and competitive concerns can make it seem impossible to achieve alignment across a multitude of different stakeholders.
Privacy-enhancing Technologies (PETs) aim to mitigate the risks, but alone, they are not sufficient. The implementation of such technologies into enterprise environments is extremely complex. It requires expert knowledge in many different disciplines, such as data science, data engineering, IT architectures, compliance, and legal.
Platforms that offer federated and privacy-preserving data science work to address these complexities, foster innovation by driving consistency and governability between collaborating companies, and improve productivity with integrated workflows across data science tools, infrastructure, and Privacy-enhancing Technologies.
The ability to enter secure data collaborations with third parties at scale will be a future key requirement for enterprises that have anchored AI into their corporate strategy and vision. For this reason, it is of paramount importance that data leaders find a scalable, governable, and future-proof solution today before business opportunities and competitive advantage are lost.
A Quick Primer on MLOps
Not all debt is bad, but unpaid debt compounds.
To appreciate the advances MLOps has enabled within enterprises, we need to take a step back to 1992. Back then, the American computer programmer Ward Cunningham coined the term “Technical Debt” to explain to product stakeholders the need to refactor large-scale software systems:
Technical debt is the ongoing cost of expedient decisions made during code implementation. It is caused by shortcuts taken to give short-term benefits for earlier software releases and a faster time-to-market. Technical debt tends to compound. Deferring the work to pay it off results in increasing costs, system brittleness, and reduced rates of innovation.
To counter this, DevOps emerged as a best practice. Companies across the globe have adopted its methodology and have benefited from shortened development cycles, increased deployment velocity, and greater transparency and governability of the entire process.
Data Science Lifecycles Require an Adaptive Way of Working
However, the phenomenon of technical debt is not limited to software development. It plays an even larger role in Artificial Intelligence and Machine Learning, which is highly experimental in its nature and where data and models are in a continuous feedback loop.
To solve a problem, data scientists must first extract and prepare the data from different sources and then experiment with different features, algorithms, modeling techniques, and parameter configurations to find and develop a solution that works best. Apart from the data acquisition, data preparation, and model development stage, there are other complex workflows, such as testing, deployment, and monitoring. This lifecycle essentially never ends once you have it in production. In 2015, Google researchers published a paper about “Hidden Technical Debt in Machine Learning Systems” and pointed out the following:
To connect the different workflows required for an ML system, data scientists must write large masses of “glue code” and add calibration layers. Changes in the external world can influence the systems’ behavior in unintended ways and make the ML system very fragile.
Because of this tightly woven and, in the best case, automated structure, technical debt becomes extremely dangerous if left unpaid in production environments. As ML is used for more and more mission-critical systems, risk mitigation, reduction of complexity, and increased reliability become of paramount importance.
The Rise of MLOps
Following these realizations, there was a need for a standardized approach. Not surprisingly, machine learning practitioners borrowed the successful DevOps methodology from software engineering and customized it for machine learning. This marked the emergence of MLOps, which has enabled companies to accomplish several objectives:
Higher success rate: better scoping and execution of AI projects, as MLOps takes the entire ML lifecycle into consideration
Reproducibility: maintainable and reproducible ML, including better abstractions, testing methodologies, and design patterns
Transparency: alignment across business owners, engineers, and legal, and improvement in communications across technical and non-technical teams
Governability: mitigation of potential challenges and prevention of unintended outcomes and surprises
Innovation: enhancement of rapid innovation through more robust ML lifecycles
With the MLOps methodology, many companies have made great strides in the field of AI and machine learning and have further expanded their competitive advantage.
Enterprise Data Science No Longer Works Without MLOps
Today, enterprise data science is hard to imagine without MLOps. Within less than five years, an entire industry around AI and ML tools has evolved. Most of the open-source MLOps tools are listed by the Linux Foundation's LF AI & Data Foundation.
Each of these tools covers specific steps in the ML workflow, but there is also a whole plethora of end-to-end MLOps platforms. A key benefit of using a unified, managed platform is that it eliminates the need to stitch together multiple tools at different stages of the ML pipeline, allowing for more rapid adoption of MLOps within a business. Additionally, data science productivity and user-friendliness have become top features of end-to-end solutions, which abstract away many of the complex configuration parameters under the hood. MLOps platform vendors market this as the “Democratization of Enterprise AI.”
Facing the Bottleneck of MLOps - The Challenges of Data Quality and Data Acquisition
With MLOps, enterprises have found an operational blueprint to scale data science within their organization. But even the most digitally mature companies with advanced enterprise MLOps processes will soon hit the limits of their own organizational boundaries when it comes to data availability. No company alone has access to all the data needed to build high-quality machine learning models, nor can a company afford to build and maintain all ML pipelines itself. Most data science teams come to a point where the available internal datasets prove insufficient for training robust and revenue-generating AI models, for several reasons:
Data is very narrow, not diverse enough, and biased
Accessible data sets are too small
Data quality is too low
Although many companies collect large volumes of data, a very high percentage of it does not meet the criteria to be considered “production-grade” and is therefore unusable for AI. Most high-value datasets lie outside a company's current reach, specifically under the sovereignty of other organizations.
The Untapped Potential of Complementary Data
To highlight just one example of how third-party data sets can drive value for a consuming organization (the ‘Data Consumer’), let us look at the case of the battery production process and lifecycle, which is made up of a number of steps and manufacturing parties (all potential ‘Data Providers’):
Data is being generated at each production step:
From mining the raw materials,
to the composition of a battery cell,
to the integration into a car,
and its reuse and recycling.
This valuable data exists in silos held by different owners in the supply chain. Nevertheless, every dataset describes part of one single, albeit very broad, process that spans multiple organizations. The individual data sets complete one another; they are, so to speak, complementary data.
If one could now combine this complementary data into a large data set, one would obtain a complete picture and create ML models and data products that could optimize the entire production chain. This would deliver massive value for all participants. Examples from other industries would be pharma or healthcare, where sensitive data from patients could be linked to clinical trials to improve drug efficacy and safety.
But if the potential is so great, why aren't more organizations doing this today?
Implicit in this data collaboration process is the idea that data science teams from different companies would need to work together to optimize the data quality or to train ML models and, most importantly, would have to share data across company boundaries. This exact situation represents one of the most complex problems in data science today.
Today, enterprises are being held back from collaboration and accessing high value and high potential data sets due to risk and uncertainty regarding:
Regulation, privacy, and compliance
Competitive concerns and fear of losing Intellectual Property
Lack of trust and understanding between different parties
To get access to the most needed but sensitive data, companies are setting up very complex data sharing agreements and workarounds with partners. While this may work for a single use case, it cannot be scaled or automated for future collaborations when new opportunities arise.
Privacy-enhancing Technologies as an Enabler to Data Science
In recent years, massive advancements in Privacy-enhancing Technologies (PETs) have raised hopes for replacing previously manual processes. PETs aim to facilitate data sharing by mitigating risk, enhancing value, and reducing the various sources of friction. Some of the most interesting PETs, such as Differential Privacy, Federated Learning, or Synthetic Data have achieved sensational results in various use cases across industries. Where there was previously an imbalance in the risk vs. benefit ratio of data sharing, PETs aim to shift that status quo.
When applied appropriately, PETs can provide strong guarantees on the level of privacy or security of the data they protect and thus can greatly reduce the risk of sensitive data being disclosed. This can help tip the scales in favor of sharing or processing data, enabling innovation that may otherwise have remained untapped.
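To make the idea of a quantifiable privacy guarantee more concrete, the following is a minimal sketch of one well-known building block of Differential Privacy, the Laplace mechanism, applied to a simple counting query. The function names and the choice of query are illustrative assumptions and are not taken from any specific product or library; a production system would also track the privacy budget across queries.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw zero-mean Laplace noise as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the items that satisfy `predicate`.

    A counting query has sensitivity 1: adding or removing one record
    changes the result by at most 1, so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical example: how many patients are older than 60?
ages = [34, 71, 65, 48, 80, 59]
noisy = private_count(ages, lambda age: age > 60, epsilon=1.0, rng=random.Random(7))
```

A smaller epsilon adds more noise and gives a stronger guarantee, which illustrates why each privacy control has to be weighed against the usability of the result.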
Unfortunately, there is no PET silver bullet. It is not possible to apply a specific PET to one use case and then expect the same technology to solve every similar challenge. Real-life scenarios call for an orchestration of different PETs, combined with access control and additional security layers.
Here are only some of the key questions that you need to answer in advance before building ML pipelines that are augmented with PETs:
Which use case do I plan to solve; how many partners are involved?
What is the scale of the planned environment, and how do I future-proof it?
How sensitive is the data that needs to be protected?
What is the trust model between each participant?
Which kind of legislation and regulations do I have to consider?
Designing a secure collaboration environment that includes different PETs must be done with the highest level of precision to establish governance, privacy, and security, as well as to maintain data quality and usability. This not only requires deep technical expertise and strong operational discipline but can also quickly become a huge overhead for data scientists, whose time would be better spent on more value-generating tasks.
History Repeats Itself - Hidden Debts in Collaborative Machine Learning
Just as companies started by building their own internal ML systems, we are now seeing a similar trend when it comes to establishing collaborative ML pipelines with third parties. To avoid the overly complicated and manual workarounds for accessing third-party data, data scientists and engineers are experimenting with the implementation of individual PETs in their ML workflows.
The past has shown that this approach is not necessarily the right one. Instead, we are entering the era of hidden debts in collaborative machine learning:
Organizations still lack a holistic and standardized process that encompasses the entire data collaboration lifecycle. The situation is intensified by the need for tools and governance that span multiple businesses. This can lead to bespoke and loosely connected processes, tools, and frameworks, ranging from data sharing agreements to different open-source ML software to single PETs.
All steps and frameworks are customized and patched together for individual use cases, resulting in very high maintenance costs. The risk of errors and non-compliance associated with such bespoke approaches is extremely high.
Knowledge and best practices about implementing PETs are only built up in specific teams or belong to an individual data scientist, which increases dependencies and risk.
This results in massive technical debt and unmanageable complexity.
Collaboration on Sensitive Data at Scale Requires More Than PETs
If we recall the earlier example of real-world ML systems, we quickly realize that the illustration is only focused on the requirements for internal ML workflows. If we consider the types of processes, tools and privacy-enhancing technologies required to collaborate on AI with third parties effectively, the picture becomes even more complex.
Elements required for secure data collaboration
In a perfect world, all systems would be integrated with ML pipelines and enable the necessary feedback loop between data and models to allow experimentation and operationalization.
Let us have a look at some of the required systems in detail:
Federated Infrastructure
A federated infrastructure that supports federated learning frameworks allows for collaboration on data without having to move data or centralize it. There are multiple factors that influence which type of federated learning framework is optimal.
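As an illustration of what collaborating without moving data can look like, here is a minimal sketch of federated averaging on a one-parameter linear model. It is a toy example under stated assumptions: the data, function names, and hyperparameters are hypothetical, and real federated learning frameworks add secure communication, scheduling, and further privacy controls on top of this idea.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs that never leave the party


def local_update(weight: float, data: Dataset, lr: float = 0.1, epochs: int = 5) -> float:
    """Run a few gradient steps for y ≈ weight * x on one party's private data."""
    for _ in range(epochs):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight


def federated_round(global_weight: float, parties: List[Dataset]) -> float:
    """One round of federated averaging.

    Each party trains locally; only the updated weight and the number of
    samples are shared, and the coordinator forms a sample-weighted average.
    """
    updates = [(local_update(global_weight, data), len(data)) for data in parties]
    total_samples = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total_samples


# Three hypothetical parties whose data all follow y = 2x.
parties = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(0.5, 1.0), (1.5, 3.0), (2.5, 5.0)],
]
weight = 0.0
for _ in range(20):
    weight = federated_round(weight, parties)
```

In this sketch the coordinator never sees a single (x, y) pair, yet the shared weight converges toward the slope that fits all parties' data.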
Asset & Security Policies
Adequate governance is critical when handling sensitive data. Requirements regarding user roles, audit logs, and granular asset permissions of all collaborating parties need to be reflected.
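The sketch below shows one way such requirements could be expressed in code: a per-asset policy that maps roles to permitted operations, with every authorization decision written to an audit log. All class, role, and asset names are hypothetical and serve only to illustrate the concept.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set


@dataclass
class AssetPolicy:
    asset_id: str
    owner_org: str
    # Maps a role name to the operations that role may run on the asset.
    permissions: Dict[str, Set[str]]


@dataclass
class PolicyEngine:
    policies: Dict[str, AssetPolicy]
    audit_log: List[dict] = field(default_factory=list)

    def authorize(self, user: str, role: str, asset_id: str, operation: str) -> bool:
        """Check a request against the asset's policy and record the decision."""
        policy = self.policies.get(asset_id)
        allowed = policy is not None and operation in policy.permissions.get(role, set())
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "asset": asset_id,
            "operation": operation,
            "allowed": allowed,
        })
        return allowed


# Hypothetical policy: external data scientists may train models and compute
# summary statistics on the asset, but nothing else.
engine = PolicyEngine(policies={
    "cell-test-data": AssetPolicy(
        asset_id="cell-test-data",
        owner_org="battery-maker",
        permissions={"external_data_scientist": {"train_model", "summary_stats"}},
    )
})
```

Denied requests are logged as well as granted ones, so that the data owner can later review who attempted which operation.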
Isolated & Confidential Environments
All participants must be able to keep their data protected in isolated environments. Data scientists must still be able to analyze data that is not directly accessible to them, so the full data science workflow must be supported.
Privacy-enhancing Technologies & IP Protection
The implementation of PETs is critical, yet each privacy control comes at a cost. Available PETs must be intelligently combined and coupled to the infrastructure and business requirements.
Compliance & Audit
Any contractual framework should be based on a sound logging solution that ensures an untampered history of platform operations. Additionally, insights or predictive models produced by a collaboration must be provable and reproducible to auditors or regulatory bodies.
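One common technique for making a history of operations tamper-evident is a hash chain, in which every log entry includes the hash of its predecessor. The following is a minimal sketch under that assumption; the function names and event fields are hypothetical, and a real deployment would additionally need to protect the latest hash, for example by sharing it with all collaborating parties.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder predecessor for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with the serialized event."""
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: list, event: dict) -> None:
    """Append an event and link it to the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append({
        "event": event,
        "prev_hash": prev_hash,
        "hash": _entry_hash(prev_hash, event),
    })


def verify(log: list) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


log: list = []
append_entry(log, {"actor": "org-a", "action": "train_model", "asset": "cell-test-data"})
append_entry(log, {"actor": "org-b", "action": "summary_stats", "asset": "cell-test-data"})
```

Note that this makes changes detectable rather than impossible, which is usually what an auditor needs to establish.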
Common Data Model Definition
Efficient collaboration on data requires a common data model or tooling to define common data catalogs.
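As a simple illustration, a common data model can be as small as a shared schema that every party validates its records against before joining a collaboration. The field names below are hypothetical and loosely follow the battery example from earlier; real data models would also cover units, value ranges, and semantics.

```python
# Hypothetical shared schema: field name -> expected Python type.
COMMON_SCHEMA = {
    "cell_id": str,
    "capacity_ah": float,
    "cycle_count": int,
}


def validate_record(record: dict, schema: dict = COMMON_SCHEMA) -> list:
    """Return a list of human-readable problems; an empty list means the record conforms."""
    errors = []
    for name, expected_type in schema.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            actual = type(record[name]).__name__
            errors.append(f"wrong type for {name}: expected {expected_type.__name__}, got {actual}")
    for name in record:
        if name not in schema:
            errors.append(f"unexpected field: {name}")
    return errors


good = {"cell_id": "C-001", "capacity_ah": 4.8, "cycle_count": 120}
bad = {"cell_id": "C-002", "capacity_ah": "4.8", "supplier": "unknown"}
good_errors = validate_record(good)
bad_errors = validate_record(bad)
```

Each party can run such a check locally, so schema problems surface before any computation is sent to the data.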
All the previously listed systems must be extended and customized depending on the use case and how many partners you want to collaborate with. Considering the available data management platforms, MLOps tools, and PET frameworks, companies are facing almost infinite possibilities of design choices.
The following illustration depicts a reference architecture with the array of capabilities, services, and privacy-enhancing technologies necessary. Successful collaboration on AI between multiple parties requires a complete suite of tools and services that are fully integrated across companies and designed to work together.
Building a Secure Data Collaboration Capability That Scales
As Gartner has put it: ‘Operationalization will make or break the future of AI’. Companies that find a standardized and consistent way to collaborate with partners, establish trust and operationalize AI today will have a massive competitive advantage.
Capgemini quantifies this in their report "Data Sharing Masters": Data ecosystems have the potential to generate 2-9% of annual revenue as cumulative financial benefits over the next five years for an organization of $10 billion in revenue. In addition, the value of data and model IP is massively increased when being able to collaborate with partners.
The best solution to simplify, govern, and manage the prevailing complexity is to employ Federated MLOps. This toolkit and framework allows for the set-up and future-proofing of AI and analytics pipelines across organizations.
With this framework, enterprises can securely connect data and ML pipelines across enterprises and close the loop between data acquisition, experimentation, and model deployment. The principles of Federated MLOps are simple, yet comprehensive:
Data does not move and stays behind the firewalls of each company in secure environments
Computations are brought to where the data resides
Privacy and security are guaranteed across the entire ML lifecycle; PETs are automatically enforced and abstracted away from the user
There is seamless compatibility with all tools and frameworks that data scientists are used to
Data science productivity remains the focus, while governance and standardization are ensured
By following these principles, Federated MLOps unlocks the capability within enterprises to securely collaborate with partners on the most sensitive data assets at scale.
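The principles above can be sketched in a few lines: a computation is sent to each data-holding node, a privacy control is enforced by the node itself rather than by the data scientist, and only aggregates travel back. This is a toy illustration under stated assumptions. The class name, the minimum-record threshold, and the example values are hypothetical, and a real platform would also restrict what a submitted computation is allowed to return.

```python
from typing import Callable, List, Sequence, Tuple

MIN_RECORDS = 3  # hypothetical threshold below which a node refuses to answer


class DataNode:
    """Holds one organization's records; the raw records are never returned directly."""

    def __init__(self, name: str, records: Sequence[float]):
        self.name = name
        self._records = list(records)

    def run(self, computation: Callable[[List[float]], Tuple[float, int]]) -> Tuple[float, int]:
        # The privacy control is enforced here, on the data owner's side.
        if len(self._records) < MIN_RECORDS:
            raise PermissionError(f"{self.name}: too few records to release an aggregate")
        return computation(self._records)


def local_sum_and_count(records: List[float]) -> Tuple[float, int]:
    """The computation that is brought to the data: it returns only aggregates."""
    return sum(records), len(records)


def federated_mean(nodes: List[DataNode]) -> float:
    """Combine per-node aggregates into a global mean without pooling the data."""
    partials = [node.run(local_sum_and_count) for node in nodes]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


nodes = [DataNode("org-a", [1.0, 2.0, 3.0]), DataNode("org-b", [4.0, 5.0, 6.0, 7.0])]
mean = federated_mean(nodes)
```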
Implementation of Federated MLOps
As with MLOps, there is still the possibility for capable data science and engineering teams to design, implement, and maintain the required infrastructure for federated MLOps on their own. However, it is important to keep in mind that we are talking about technology and data flows across organizations, which requires close collaboration and a high degree of trust between parties.
To keep everyone happy, it can be tempting to allow each team to choose its own preferred frameworks for federated learning along with its favorite tools for data transformation and model training. However, as we highlighted earlier, the glue between these components remains a huge challenge and risk factor. As these networks grow and more partners enter the ecosystem, the maintenance of such systems can very quickly get out of hand.
The best-of-breed approach for a safer, more rapid, and managed implementation is to use open platforms for federated and privacy-preserving data science. The core capabilities of such a platform provide an operational blueprint for entering secure data collaborations, while maintaining flexibility by supporting existing tech stacks, pipelines, data lakes, and data products.
Delivering on the Vision of Secure Data Ecosystems
The tools to create breakthrough innovation with AI are already available. We just need to use them in a responsible and governable way. With Apheris, a federated machine learning and analytics platform, enterprises can finally:
Increase quality and effectiveness of collaborative ML with standard processes
Drive consistency with integrated workflows across all tools and infrastructure, without sacrificing flexibility
Manage privacy-preserving data science as an enterprise capability
Safely and universally manage privacy-enhancing technologies and data science with enterprise controls across companies
Unlock privacy-preserving and federated MLOps at enterprise scale
Enable secure, continuous, and scalable feedback loops between testing and operationalizing ML models or AI-enabled data products
As we develop Apheris, we plan to continuously extend it to other data models, databases, data science, and analytics tools, as well as the latest open-source privacy frameworks. Ultimately, federated MLOps sets the foundation for organizations to build and orchestrate federated data ecosystems on a global scale.