
Data Harmonization🔗

Apheris provides support for data harmonization through a combination of built-in, privacy-preserving statistical analysis tools and customizable code workflows as models in the Apheris Model Registry. Our governance approval mechanisms and advanced storage capabilities, including the derived datasets feature, help ensure the security and integrity of your data.

Understanding that data harmonization often requires flexibility, Apheris offers a solution that allows Data Science Users to perform necessary analyses and transformations while creating a well-structured dataset that remains within the secure environment of the Data Custodian.

This documentation will outline how Apheris can improve your data harmonization pipelines and workflows.

Features🔗

The following features support industry data analysis and processing workflows while protecting data privacy and keeping all parties compliant in a federated environment.

Statistics Functions🔗

Our platform offers functions that enable the comparison of meta-parameters between different datasets. For instance, you can use our built-in statistical capabilities, such as TableOne, to compare a local distribution against global metrics such as mean and variance. Researchers can run this analysis in a federated manner, enabling thorough comparison while preserving data privacy.
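The comparison above can be illustrated with a minimal, self-contained sketch. The site names and age values below are hypothetical, and this plain-Python summary stands in for the platform's federated TableOne-style statistics; in practice the raw values stay on each Compute Gateway and only the aggregate statistics are shared.

```python
from statistics import mean, variance

# Hypothetical per-site measurements; in a federated run the raw values
# never leave the Compute Gateway, only the summary statistics do.
site_data = {
    "site_a": [61.2, 58.4, 70.1, 65.3, 59.8],
    "site_b": [55.0, 52.3, 60.7, 57.9, 54.1],
}

# TableOne-style summary: one row of descriptive statistics per dataset,
# which can then be compared against the pooled/global metrics.
summary = {
    site: {
        "n": len(vals),
        "mean": round(mean(vals), 2),
        "variance": round(variance(vals), 2),
    }
    for site, vals in site_data.items()
}

for site, stats in summary.items():
    print(site, stats)
```

Comparing these per-site rows side by side is often enough to spot scale or unit mismatches before any model training starts.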

Derived Datasets🔗

Data Science Users can write intermediate datasets to the Compute Gateway as part of a pre-processing pipeline. During feature engineering, additional attributes such as experimental conditions, different chemistries, concentrations, or proteins can be engineered as new features, or as normalizations of existing ones, which are then taken into account when the model is trained.

You can use these derived datasets to pass pre-processed data into another downstream model.
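As a minimal sketch of the feature-engineering step described above: the records and column names are hypothetical, and the actual write-back to the Compute Gateway uses the Apheris derived-datasets mechanism rather than anything shown here. The example derives a log-scaled version of a raw concentration so a downstream model sees a normalized feature.

```python
import math

# Hypothetical raw rows; in practice this data lives on the Compute Gateway.
records = [
    {"compound": "c1", "concentration_um": 10.0},
    {"compound": "c2", "concentration_um": 100.0},
    {"compound": "c3", "concentration_um": 1000.0},
]

# Derived feature: log-scale the concentration so values spanning several
# orders of magnitude become comparable for model training.
for row in records:
    row["log_concentration"] = round(math.log10(row["concentration_um"]), 3)

# The enriched rows would then be persisted as a derived dataset and
# consumed by a downstream training job.
```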

Data Pre-processing Pipelines🔗

In the context of Federated Learning, the actual federated run will integrate all pre-processed derived datasets.

However, the pre-processing and generation of these derived datasets can occur independently and within separate codebases. Each dataset intended for the Federated Learning process may undergo distinct computation runs that apply tailored transformations to ensure all datasets harmonize according to their specific requirements. This approach is particularly beneficial for evaluating molecular data sourced from diverse origins.

This is achieved by combining our custom model features with the ability to launch computations across multiple gateways concurrently.
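The idea of per-dataset tailored transformations can be sketched as follows. The site names, units, and transform functions are hypothetical; on the platform each transform would run as its own computation on the owning gateway, but the pattern — one transformation per source, converging on a shared schema — is the same.

```python
# Two sites recorded the same quantity in different units; each gets its
# own pre-processing step so the harmonized outputs share one schema.

def celsius_to_kelvin(rows):
    return [{"sample": r["sample"], "temp_k": r["temp_c"] + 273.15} for r in rows]

def fahrenheit_to_kelvin(rows):
    return [{"sample": r["sample"], "temp_k": (r["temp_f"] - 32) * 5 / 9 + 273.15} for r in rows]

# Registry mapping each dataset to its tailored transformation.
pipelines = {
    "lab_eu": celsius_to_kelvin,
    "lab_us": fahrenheit_to_kelvin,
}

harmonized = {
    "lab_eu": pipelines["lab_eu"]([{"sample": "s1", "temp_c": 25.0}]),
    "lab_us": pipelines["lab_us"]([{"sample": "s2", "temp_f": 77.0}]),
}
```

After these per-site runs, every dataset exposes the same `temp_k` column, so the federated training step can treat them uniformly.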

Aggregation Customization🔗

In addition to our provided aggregation functions, dedicated aggregation functions can be written during the initial experimentation phase to account for the data's distribution. Our statistics functions can also detect this distribution without direct access to the data, supporting the data exploration phase.
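As an illustration of why a dedicated aggregation function can matter, here is a minimal sketch of a sample-size-weighted mean. The inputs are hypothetical; in a real federated run only each site's local aggregate and count, never the raw data, would reach the aggregator.

```python
def weighted_mean_aggregate(site_results):
    """Aggregate (local_mean, n_samples) pairs into a global mean,
    weighting each site by its sample count."""
    total_n = sum(n for _, n in site_results)
    return sum(m * n for m, n in site_results) / total_n

# Three sites with very different sample counts: a naive unweighted average
# would let the 10-sample site pull the result as hard as the 1000-sample one.
global_mean = weighted_mean_aggregate([(0.50, 1000), (0.80, 100), (0.20, 10)])
print(round(global_mean, 4))
```

The weighting scheme is exactly the kind of distribution-aware choice that the initial exploration phase, supported by the statistics functions, should inform.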

Native Open Source Support🔗

Our data science interface includes open-source (OSS) integration from the outset. We natively integrate with both NVFlare and Flower, enabling users to leverage innovations aimed at addressing challenges related to concept and data drift. Additionally, libraries for detecting concept and data drift, such as Frouros, can be integrated into your workflow.
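To make the drift-detection idea concrete, below is a minimal pure-Python sketch of a two-sample Kolmogorov-Smirnov check, the kind of test that libraries such as Frouros package as production-ready detectors. The reference/shifted samples are synthetic and the threshold is illustrative, not a recommendation.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the two empirical CDFs, evaluated at every observed value."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of observations <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in sorted(set(a) | set(b)))

random.seed(0)
reference = [random.gauss(0, 1) for _ in range(500)]  # training-time data
shifted = [random.gauss(1, 1) for _ in range(500)]    # drifted live data

drift_score = ks_statistic(reference, shifted)
# A statistic near 0 means similar distributions; a large value flags drift.
print(drift_score > 0.3)
```

In a federated setup, each site could run such a check locally against its reference distribution and report only the drift score.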

Data Quality Scenarios and Use Cases🔗

Many scenarios require unique solutions when dealing with data harmonization.

Dataset Generation Metadata🔗

One example is capturing differences in how datasets were generated. With our pre-processing pipeline, you can run this as a tailored job per dataset. Where such differences are not captured in the metadata, agreeing on a shared data-generation protocol among the collaborating parties can be a viable alternative.

Data Distribution Details🔗

To validate differences in data distribution, first use our pre-built statistics libraries or write custom code to compare distributions. Next, use custom code to create features that capture these differences. You can also pre-normalize the data as needed before training a model; the results of these activities can be persisted within the Data Custodian environment using our derived datasets.
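The pre-normalization step can be sketched as a per-site z-score transform. The values below are hypothetical (the same quantity recorded on two different scales); on the platform the normalized output would be persisted as a derived dataset within the Data Custodian environment.

```python
from statistics import mean, pstdev

def z_normalize(values):
    """Standardize a feature: subtract the local mean, divide by the
    local (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

site_a = [10.0, 12.0, 14.0]     # quantity measured on one scale
site_b = [100.0, 120.0, 140.0]  # same quantity, different unit/scale

norm_a, norm_b = z_normalize(site_a), z_normalize(site_b)
# After normalization both sites share the same distribution shape,
# so a model trained across them sees comparable feature values.
print([round(v, 3) for v in norm_a] == [round(v, 3) for v in norm_b])
```

Whether to normalize per site or against global statistics is itself a harmonization decision, best made after the comparison step above.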

Concept Drift🔗

To address variations in medical practices, identify data that reliably reflects these differences, such as tables that show how the definition of "normal" blood pressure varies across regions. Implement a site-specific pre-processing pipeline to manage this data effectively. This approach may involve writing custom code and creating derived datasets. Additionally, you can leverage NLP capabilities to handle unstructured data. You can bring any combination of these workflows into Apheris using our custom model functionality.

Summary🔗

In summary, Apheris provides a comprehensive framework for data harmonization, integrating advanced statistical analysis, customizable workflows, and privacy-preserving features. With tools like derived datasets and tailored pre-processing pipelines, users can efficiently manage and analyze data while maintaining compliance and security.

The open-source support offers flexibility in addressing unique data challenges. Using these capabilities, Data Science Users can optimize workflows and create well-structured datasets, making Apheris an essential tool for modern data-driven projects.