Register & unregister datasets

Datasets are registered in the Governance Portal. To register a dataset, you need the following:

Owner or Data Steward role: Only users with the Owner or Data Steward role can add datasets. Please see Managing users and roles for more details.

Real data: The Apheris product supports any file format and file extension. For details, see Supported file formats.

Dummy data: Although this is optional, Apheris strongly recommends adding dummy data. For more information, see Uploading dummy data.

Register a dataset

First, log in to the Governance Portal, which runs in your browser.

Select the Datasets tab on the left. To register a new dataset, click Register Dataset at the top and fill out the relevant fields on the registration form. Note that real data is never uploaded: you only have to specify a path to the dataset within your environment.

Describing the dataset

Under Describe dataset, provide a Name and Description:

Name: The name of the dataset has to be unique within the Compute Gateway's organization, so that users within your organization can refer to the dataset by name. Names cannot be changed, to avoid confusion or issues with Asset Policies governing a dataset.

Description: Adding a meaningful description of the dataset is important. A good description helps data consumers to identify relevant datasets and thereby reduces friction. The description of a dataset can be viewed via the Apheris CLI, so it should not include sensitive data. Please see describing datasets for guidelines and best practices.

Adding dummy data

After describing the dataset, you can also add a dummy data file in the Dummy data section.

Important

A dummy data file is ALWAYS saved to the Apheris product regardless of the Compute Gateway setup. Make sure the dummy data file does not contain sensitive data.

Note

All dummy data files are scanned for malware content once uploaded. Potential alerts are handled by the Apheris security team.

Although attaching dummy data is optional, dummy data that is representative of the real data has the following benefits:

It helps users to understand and use the real data.

It allows a Data Scientist to understand the characteristics and structure of the data without disclosing sensitive data points.

It helps Data Scientists to prepare their analyses and training runs, which reduces the load on the data custodian's environment.

For more information about dummy data, see Creating dummy data.
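As a rough illustration of what a dummy data file can look like, the following sketch builds a small CSV that keeps the column structure of a real file but fills it with randomly drawn values. It assumes a CSV with numeric columns; the column names, value ranges, and helper functions (make_dummy_rows, to_csv) are hypothetical and are not part of the Apheris product.

```python
import csv
import io
import random


def make_dummy_rows(columns, n_rows, seed=0):
    """Build n_rows of synthetic records for the given column spec.

    columns maps a column name to a (low, high) integer range. Every value
    is drawn at random, so no real record is copied into the dummy data.
    """
    rng = random.Random(seed)
    return [
        {name: rng.randint(low, high) for name, (low, high) in columns.items()}
        for _ in range(n_rows)
    ]


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical schema: same column names as the real file, invented ranges.
schema = {"age": (18, 90), "systolic_bp": (90, 180)}
dummy_csv = to_csv(make_dummy_rows(schema, n_rows=5))
print(dummy_csv)
```

Because the values are generated rather than sampled from the real file, the result preserves the structure of the data without exposing any real data point. Whether the chosen ranges are representative enough is a judgment the data custodian has to make.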

Adding real data

Cloud setup

Make sure that you have uploaded the real data file to the S3 bucket of your organization. For more information, see Uploading objects in the AWS documentation.

Once your data is uploaded, under Real data, provide the file location in the S3 bucket.

For example, s3://bucket-name/path/to/file.csv.
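Before pasting a location into the form, it can help to check that it has the expected s3://bucket/key shape. The following is a minimal sketch using only the Python standard library; split_s3_uri is a hypothetical helper, not an Apheris or AWS function, and it checks only the format of the string, not whether the object exists.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Return (bucket, key) for an s3://bucket/key URI, or raise ValueError."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got: {uri!r}")
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    if not bucket or not key:
        raise ValueError(f"URI must name both a bucket and an object key: {uri!r}")
    return bucket, key


print(split_s3_uri("s3://bucket-name/path/to/file.csv"))
# ('bucket-name', 'path/to/file.csv')
```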

On-prem setup

Make sure that you have uploaded the real data file to your on-prem server under the /home/datastore folder, which was created and set as the mount point for your Compute Gateway during installation.

As soon as you have your data uploaded, under Real data, provide a file or folder location on your on-prem server. For example:

File: file:///home/datastore/path/to/file.csv

Folder: file:///home/datastore/path/to/folder/

Note the three slashes in the file:///home/datastore prefix, and make sure to point to the correct path.

After you have specified the path to the real data file, click Register to finish the dataset registration. The dataset now appears in the list of datasets, and a dataset ID is created.
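The role of the third slash can be seen by parsing the location: with only two slashes, the first path segment is read as a host name instead of a folder. The sketch below checks a file URI against the /home/datastore folder using the Python standard library. The helper datastore_path is hypothetical and not part of the Apheris product, and it validates the string only, without touching the file system.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

DATASTORE_ROOT = PurePosixPath("/home/datastore")


def datastore_path(uri):
    """Return the path inside /home/datastore that a file:/// URI points to."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError(f"expected a file:/// URI, got: {uri!r}")
    if parsed.netloc:
        # "file://home/datastore/..." (two slashes) parses "home" as a host.
        raise ValueError(f"use three slashes after 'file:' in {uri!r}")
    path = PurePosixPath(parsed.path)
    inside_root = DATASTORE_ROOT in (path, *path.parents)
    if ".." in path.parts or not inside_root:
        raise ValueError(f"path must be under {DATASTORE_ROOT}: {uri!r}")
    return path


print(datastore_path("file:///home/datastore/path/to/file.csv"))
# /home/datastore/path/to/file.csv
```

A folder location such as file:///home/datastore/path/to/folder/ passes the same check; the trailing slash is dropped in the returned path.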

List of datasets

Three filters are available for the list of datasets:

Registered by me: Shows all datasets you registered yourself.

Registered by [Your Organization name]: Shows all datasets within your organization to which you have access.

Shared by another organization: Shows all datasets you have been granted access to by being a beneficiary in an asset policy of another organization.

The list of datasets includes the following information:

Name of the dataset

Name of the dataset creator

Name of the Data Custodian organization

Days since the last update

Dataset ID

The ID of a dataset is located in the dataset overview. Click a dataset in the list to open the dataset overview. The dataset ID is in the Details section.

Editing a dataset

Once you have registered a dataset, you can make changes to the dataset by clicking Edit dataset. You can modify the following:

Change the description

Register a new real data file

Upload a new dummy data file

However, you cannot change the name or the ID of the dataset.

Unregistering a dataset

If you have the Owner role, you can unregister any dataset within your organization. If you have the Data Steward role, you can only unregister datasets that you registered yourself.

To unregister a dataset, navigate to the Datasets page and select the dataset you want to unregister. After you click Unregister Dataset, you are prompted for confirmation.
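The role rule above can be summarized as a small decision function. This is only an illustration of the rule as described on this page; the Dataset class, its fields, and can_unregister are hypothetical names, not part of the Apheris product, and the sketch assumes the dataset already belongs to the user's organization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    registered_by: str  # user who registered the dataset


def can_unregister(user, role, dataset):
    """Apply the rule: Owners may unregister any dataset in the organization,
    Data Stewards only the datasets they registered themselves."""
    if role == "Owner":
        return True
    if role == "Data Steward":
        return dataset.registered_by == user
    return False  # other roles cannot unregister datasets


trial_data = Dataset(name="trial-data", registered_by="alice")
print(can_unregister("bob", "Owner", trial_data))           # True
print(can_unregister("bob", "Data Steward", trial_data))    # False
print(can_unregister("alice", "Data Steward", trial_data))  # True
```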

Rename a dataset

As a security measure to ensure asset policy integrity, renaming a dataset is not supported. If you need to change the name of a dataset, you must unregister it and then re-register it under the new name.

Important

When you unregister a dataset, the associated asset policies are lost and need to be re-created.

To change the name of a dataset:

Unregister the dataset
Re-register the dataset with the new name

Datasets in organizations with multiple gateways

When your organization has multiple gateways, the gateway is shown in the dataset overview. The lock icon next to the gateway indicates whether this gateway requires signing of asset policies.

To register a dataset, you need to specify the gateway on which your dataset is located. This selection cannot be edited later because it is part of the dataset's identifier.