Register & unregister datasets

Datasets are registered in the Governance Portal. To register a dataset, you need the following:

  • Owner or Data Steward role: Only users with the Owner or Data Steward role can add datasets. See Managing users and roles for more details.
  • Real data: The Apheris product supports any file format and file extension. For details, see Supported file formats.
  • Dummy data: Although this is optional, Apheris strongly recommends adding dummy data. For more information, see Uploading dummy data.

Register a dataset

First, log in to the Governance Portal, which runs in the browser (https://web.{{ no such element: dict object['app'] }}).

Select the Datasets tab on the left. To register a new dataset, click Register Dataset at the top and fill out the relevant fields on the registration form.

datasets-overview-page.png

Note that real data is never uploaded. You only have to specify a path to the dataset within your environment.

register-datasets.png

Describing the dataset

Under Describe dataset, provide a Name and a Description:

  • Name: The name of the dataset has to be unique within the Compute Gateway's organization. Choose a name that lets users within your organization refer to the dataset easily. Names cannot be changed, to avoid confusion or issues with the Asset Policies governing a dataset.
  • Description: Adding a meaningful description of the dataset is important. A good description helps data consumers identify relevant datasets and thereby reduces friction. The description of a dataset can be viewed via the Apheris CLI, so it should not include sensitive data. See describing datasets for guidelines and best practices.

Adding dummy data

After describing the dataset, you can also add a dummy data file.

In the section Dummy data, do the following:

Important

A dummy data file is ALWAYS saved to the Apheris product regardless of the Compute Gateway setup. Make sure the dummy data file does not contain sensitive data.

Although attaching dummy data is optional, it can have the following benefits:

  • It is representative of the real data.
  • It helps users understand and use the real data.
  • It allows a Data Scientist to understand the characteristics and structure of the data without disclosing sensitive data points.
  • It helps Data Scientists prepare their analyses and training runs, which reduces the load on a data custodian's environment.

For more information about dummy data, see Creating dummy data.
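As an illustration of what a non-sensitive dummy file could look like, the following minimal sketch generates a small CSV that mirrors the column layout of a real dataset but contains only synthetic values. The function `make_dummy_csv`, the column names, and the type tags are hypothetical and not part of the Apheris product; adapt them to the structure of your own data.

```python
import csv
import io
import random


def make_dummy_csv(columns, n_rows=5, seed=0):
    """Return CSV text with synthetic values for the given columns.

    `columns` maps a column name to a hypothetical type tag:
    "int", "float", or "category:a|b|c".
    """
    rng = random.Random(seed)  # fixed seed makes the output reproducible
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(columns.keys())  # header mirrors the real schema
    for _ in range(n_rows):
        row = []
        for kind in columns.values():
            if kind == "int":
                row.append(rng.randint(0, 100))
            elif kind == "float":
                row.append(round(rng.uniform(0.0, 1.0), 3))
            elif kind.startswith("category:"):
                choices = kind.split(":", 1)[1].split("|")
                row.append(rng.choice(choices))
            else:
                raise ValueError(f"unknown column kind: {kind}")
        writer.writerow(row)
    return buf.getvalue()


if __name__ == "__main__":
    # Illustrative schema only; no real values are read or copied.
    print(make_dummy_csv({"age": "int", "score": "float", "group": "category:a|b"}))
```

Because the values are drawn at random rather than copied from the real file, the result keeps the structure without exposing actual data points. Review the file for sensitive content before uploading it regardless.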

Adding real data

Cloud setup

Make sure that you have uploaded the real data file to the S3 bucket of your organization. For more information, see Uploading objects in the AWS documentation.

Once the data is uploaded, provide the file location in the S3 bucket under Real data.

For example, s3://bucket-name/path/to/file.csv.
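Before pasting a location into the form, it can help to check that it has the expected `s3://bucket/key` shape. The helper below is a minimal sketch using only the Python standard library; `split_s3_uri` is a hypothetical name and not part of any Apheris or AWS tooling, and it checks only the format, not whether the object exists.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key), raising ValueError if malformed."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected s3://<bucket>/<key>, got: {uri!r}")
    key = parsed.path.lstrip("/")  # the object key has no leading slash
    if not key:
        raise ValueError(f"missing object key in: {uri!r}")
    return parsed.netloc, key


if __name__ == "__main__":
    bucket, key = split_s3_uri("s3://bucket-name/path/to/file.csv")
    print(bucket, key)
```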

On-prem setup

Make sure that you have uploaded the real data file to the /home/datastore folder on your on-prem server. This folder was created and set as the mount point for your Compute Gateway during installation.

Once the data is uploaded, provide a file or folder location on your on-prem server under Real data. For example:

  • File: file:///home/datastore/path/to/file.csv
  • Folder: file:///home/datastore/path/to/folder/

Note the three slashes ("///") in the file:///home/datastore prefix, and make sure to point to the correct path.
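A common mistake is to write `file://home/...` with only two slashes, which makes `home` parse as a host name instead of part of the path. The sketch below checks a location string against the rules described above. The function `datastore_path` is a hypothetical helper, and the root `/home/datastore` is taken from the example paths; it validates the string only and does not touch the file system.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Mount point named in the on-prem instructions above.
DATASTORE_ROOT = PurePosixPath("/home/datastore")


def datastore_path(uri):
    """Return the path inside the datastore, raising ValueError if malformed."""
    if not uri.startswith("file:///"):
        raise ValueError(f"expected a file:/// prefix (three slashes), got: {uri!r}")
    path = PurePosixPath(urlparse(uri).path)
    # Reject ".." segments and anything outside the datastore root.
    if ".." in path.parts or DATASTORE_ROOT not in (path, *path.parents):
        raise ValueError(f"path must be located under {DATASTORE_ROOT}: {uri!r}")
    return path


if __name__ == "__main__":
    print(datastore_path("file:///home/datastore/path/to/file.csv"))
    print(datastore_path("file:///home/datastore/path/to/folder/"))
```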

After you have specified the path to the real data file, click Register to finish the dataset registration. The dataset now appears in the list of datasets, and a dataset ID is created.

List of datasets

Three filters are available for the list of datasets:

  • Registered by me: Shows all datasets you registered yourself.
  • Registered by [Your Organization name]: Shows all datasets within your organization to which you have access.
  • Shared by another organization: Shows all datasets you have been granted access to by being a beneficiary in an asset policy of another organization.

list of datasets

The list of datasets includes the following information:

  • Name of the dataset
  • Name of the dataset creator
  • Name of the Data Custodian organization
  • Days since last update

Dataset ID

The ID of a dataset is located in the dataset overview. Click a dataset in the list to open the dataset overview. The dataset ID is in the Details section.

dataset-id-view.png

Editing a dataset

Once you have registered a dataset, you can make changes to the dataset by clicking Edit dataset. You can modify the following:

  • Change the description
  • Register a new real data file
  • Upload a new dummy data file

However, you cannot change the name or the ID of the dataset.

Unregistering a dataset

If you have the Owner role, you can unregister any dataset within your organization. If you have the Data Steward role, you can only unregister the datasets you registered yourself.
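The permission rule above can be written as a short decision function. This is only an illustrative model of the rule as stated on this page, not the Governance Portal's implementation; the `User` and `Dataset` classes and the `can_unregister` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    role: str  # "Owner" or "Data Steward"; any other role gets no access here
    organization: str


@dataclass(frozen=True)
class Dataset:
    name: str
    organization: str
    registered_by: str  # name of the user who registered the dataset


def can_unregister(user, dataset):
    """Apply the rule: Owners may unregister any dataset in their organization,
    Data Stewards only the datasets they registered themselves."""
    if user.organization != dataset.organization:
        return False
    if user.role == "Owner":
        return True
    if user.role == "Data Steward":
        return dataset.registered_by == user.name
    return False


if __name__ == "__main__":
    ds = Dataset(name="trial-a", organization="org-1", registered_by="sam")
    print(can_unregister(User("kim", "Owner", "org-1"), ds))
    print(can_unregister(User("lee", "Data Steward", "org-1"), ds))
```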

To unregister a dataset, navigate to the Datasets page and select the dataset you want to unregister. After clicking "Unregister Dataset", you will be prompted for confirmation.

unregister-dataset.png

Datasets in organizations with multiple gateways

When your organization has multiple gateways, the gateway is shown in the dataset overview. The lock icon next to the gateway indicates whether this gateway requires signing of asset policies.

Dataset list with different gateways

To register a dataset, you need to specify the gateway on which your dataset is located.

Registering datasets for organizations with multiple gateways

This selection cannot be edited later because it is part of the dataset's identifier.

Editing datasets for organizations with multiple gateways