Persisting Derived Datasets

Derived Dataset

A Derived Dataset is a new dataset produced by a computation that processes, transforms, or combines one or more existing datasets.

Derived Datasets can be created on Compute Gateways that support them. A Derived Dataset is stored on the Compute Gateway and can be reused in future computations.

Some use cases include:

  • Long-running pre-processing tasks
  • Data cleaning/formatting for use in downstream jobs

This approach lets you build richer data representations for analysis, modeling, and other tasks while still adhering to the permissions inherited from the parent datasets.

Key distinctions of a Derived Dataset compared to a regular dataset:

  • It includes a list of parent datasets, indicating the sources from which it was generated.
  • It inherits asset policies and permissions from its parent datasets.
  • It cannot serve as a parent to another Derived Dataset.
  • Both parent and Derived Datasets are accessible within the same Compute Gateway.
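The structural rules above can be illustrated with a small local model. This is only a sketch: the `Dataset` class and the `make_derived` function are hypothetical names invented for illustration and are not part of the Apheris API. Policy and permission inheritance is left out on purpose, because this page does not describe how it is evaluated.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Hypothetical stand-in for a dataset record; not an Apheris type."""

    name: str
    gateway: str
    parents: tuple[Dataset, ...] = ()

    @property
    def is_derived(self) -> bool:
        # In this model, a dataset is "derived" if it lists any parents.
        return len(self.parents) > 0


def make_derived(name: str, gateway: str, parents: list[Dataset]) -> Dataset:
    """Build a derived dataset record, enforcing the rules listed above."""
    if not parents:
        raise ValueError("a Derived Dataset needs at least one parent")
    for parent in parents:
        # A Derived Dataset cannot serve as a parent to another one.
        if parent.is_derived:
            raise ValueError(f"{parent.name} is derived and cannot be a parent")
        # Parents and the Derived Dataset live on the same Compute Gateway.
        if parent.gateway != gateway:
            raise ValueError(f"{parent.name} is on a different Compute Gateway")
    return Dataset(name=name, gateway=gateway, parents=tuple(parents))
```

For example, `make_derived("clean", "gw1", [Dataset("raw", "gw1")])` succeeds in this model, while passing the resulting dataset as a parent of another one raises a `ValueError`.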

Adding a New Derived Dataset

You can persist your pre-processed data on the Compute Gateway through the Apheris APIs by sending an HTTP request to the DAL, using the URL and token from the environment:

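import os

import requests

# The DAL URL and token are read from the environment.
dal_url = os.environ["APH_API_DAL_URL"]
dal_token = os.environ["APH_DAL_TOKEN"]

...
# Read the file to persist as raw bytes.
with open("file_path", "rb") as f:
    contents = f.read()
headers = {"Authorization": f"Bearer {dal_token}"}
# Upload the contents to `path` inside the Derived Dataset `name`.
r = requests.put(f"http://{dal_url}/dataset/{name}/{path}", data=contents, headers=headers)
r.raise_for_status()  # raise an exception if the request was rejected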

...

  • name is the unique name you choose for the Derived Dataset.
  • path is the path where you want to persist your content.

Note: Ensure that name is unique and consists only of alphanumeric characters, with no spaces or special characters.
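Checking the name locally before sending the request can catch mistakes early. The following is a minimal sketch, assuming that "alphanumeric" means ASCII letters and digits only. The helpers `is_valid_dataset_name` and `build_upload_url` are hypothetical names, not part of the Apheris API, and stripping a leading slash from path is an illustrative choice rather than documented behavior. Uniqueness cannot be checked locally and is not covered here.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits only.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9]+")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if name is non-empty and purely alphanumeric."""
    return _NAME_PATTERN.fullmatch(name) is not None


def build_upload_url(dal_url: str, name: str, path: str) -> str:
    """Build the URL used in the upload request shown above."""
    if not is_valid_dataset_name(name):
        raise ValueError(f"invalid Derived Dataset name: {name!r}")
    # Illustrative choice: drop leading slashes so the URL has no "//".
    return f"http://{dal_url}/dataset/{name}/{path.lstrip('/')}"
```

The resulting URL can be passed to the request in place of the inline f-string, so an invalid name fails before any data is sent.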

Viewing the Derived Dataset

  • Apheris CLI:
    $  apheris datasets list
    
    +-----+-----------------------------------------------------+---------------+------------------------------+
    | idx |                       dataset_id                    |  organization |        data custodian        |
    +-----+-----------------------------------------------------+---------------+------------------------------+
    |  1  |              medical_task01_Org 1_Gateway 1         |     Org 1     |          John Smith          |
    +-----+-----------------------------------------------------+---------------+------------------------------+
    
  • Governance Portal: Navigate to the dataset's page within the Governance Portal to view the Derived Datasets.