Managing datasets

Datasets are registered to a Compute Gateway, which is tied to a single organization. Registering a dataset connects the data to the Apheris product.

A dataset consists of several pieces of information:

A path to the real data: either an AWS S3 location, e.g. s3://bucket-name/path/to/file.csv, or an on-premises location, e.g. file:///home/datastore/path/to/file.csv
A path to dummy data (optional): dummy data is uploaded to the central Apheris product via the Governance Portal.
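
As an illustration of the two path styles above, the following sketch classifies a path by its URI scheme. The helper name `describe_data_path` and the returned dictionary layout are hypothetical and not part of any Apheris API; only the `s3://` and `file://` examples are taken from this page.

```python
from urllib.parse import urlparse


def describe_data_path(uri: str) -> dict:
    """Classify a real-data path by its URI scheme (hypothetical helper)."""
    parsed = urlparse(uri)
    if parsed.scheme == "s3":
        # For s3://bucket-name/path/to/file.csv the bucket is the network location.
        return {
            "kind": "s3",
            "bucket": parsed.netloc,
            "key": parsed.path.lstrip("/"),
        }
    if parsed.scheme == "file":
        # For file:///home/... the network location is empty and the path is absolute.
        return {"kind": "on-premises", "path": parsed.path}
    raise ValueError(f"unsupported scheme: {parsed.scheme!r}")


print(describe_data_path("s3://bucket-name/path/to/file.csv"))
print(describe_data_path("file:///home/datastore/path/to/file.csv"))
```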

Important

Please note that a federated computation across multiple datasets requires that each dataset resides in a different Gateway.

Supported file formats

One dataset corresponds to one real data file or data folder and (optionally) one dummy data file (or folder).

In case of multiple files, you can either

bundle multiple files into a .zip archive or
add a path to a data folder.

If the total amount of data is large (e.g. image data files), we recommend using the folder approach instead of bundling the files into a .zip archive.

Bundling files into a .zip archive

For example, you can specify a single .zip archive with the following content as the real data file:

clinical_data.zip
├── patients.csv
└── physicians.csv

When adding dummy data, ensure that the file structure and file names inside the .zip archive are the same as in the real data archive:

clinical_data_dummy.zip
├── patients.csv
└── physicians.csv
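
One way to check this locally before registering the dataset is to compare the member names of the two archives. The sketch below uses only Python's standard `zipfile` module; the helper names (`write_archive`, `archive_members`, `structure_diff`) are hypothetical and this check is not part of the Apheris product.

```python
import zipfile


def write_archive(zip_path, files):
    """Create a .zip archive from a mapping of member name -> text content."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)


def archive_members(zip_path):
    """Return the names of all files (directories excluded) inside an archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return {info.filename for info in archive.infolist() if not info.is_dir()}


def structure_diff(real_zip, dummy_zip):
    """Return (missing_in_dummy, extra_in_dummy) as sorted lists of member names."""
    real = archive_members(real_zip)
    dummy = archive_members(dummy_zip)
    return sorted(real - dummy), sorted(dummy - real)
```

If `structure_diff("clinical_data.zip", "clinical_data_dummy.zip")` returns two empty lists, both archives contain the same file names at the same relative paths.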

Using folders

A dataset can also be a path to a folder. In this case, asset policies for this dataset cover every object in that folder. However, if a specific policy is created for one of the objects within that folder, the asset policy for that individual data file takes precedence.
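
The precedence rule can be pictured with a small lookup sketch. This is only an illustration of the behaviour described above, not how Apheris implements it: the function `resolve_policy`, the policy names, and the plain dictionary of paths are all hypothetical, and picking the longest matching folder when several folders match is an added assumption.

```python
from typing import Optional


def resolve_policy(object_path: str, policies: dict) -> Optional[str]:
    """Return the policy that applies to object_path (hypothetical lookup)."""
    # A policy defined for the individual object takes precedence.
    if object_path in policies:
        return policies[object_path]
    # Otherwise fall back to a policy defined on a containing folder.
    folders = [
        path for path in policies
        if path.endswith("/") and object_path.startswith(path)
    ]
    if not folders:
        return None
    # Assumption: with nested folders, the most specific (longest) one wins.
    return policies[max(folders, key=len)]


policies = {
    "s3://bucket-name/images/": "folder-policy",
    "s3://bucket-name/images/scan_001.npz": "file-policy",
}
print(resolve_policy("s3://bucket-name/images/scan_001.npz", policies))
print(resolve_policy("s3://bucket-name/images/scan_002.npz", policies))
```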

File types and extensions

Apheris supports any file format and file extension.

For selected file extensions, Apheris offers convenience features that help Data Scientists explore the data in a privacy-preserving manner:

Tabular .csv data files:
CSV files should be submitted uncompressed with the following parameters:
- Delimiter should be , (comma)
- Missing/NA data fields should be blank
- A header row is required
- Encoding should be utf-8 without a BOM
- If quote characters are required, they will be inferred, and should be " (double-quotes)
- Decimal separators should be . (period)
The following features are enabled:
- Automated extraction of metadata:
  - Row count
  - Column information:
    - ID
    - NaN count
    - Unique value count
    - 5th and 95th percentiles
- Automated loading of dummy data as pandas DataFrame
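
A minimal sketch of writing a CSV file that follows the parameters listed above, using only Python's standard `csv` module. The helper names `write_csv` and `column_summary` are hypothetical. The summary is a rough local approximation of the row, missing-value and unique-value counts; how Apheris computes its metadata (for example, whether blank fields count towards unique values) is not stated here, so treat that part as an assumption.

```python
import csv
import io


def write_csv(rows, fieldnames) -> bytes:
    """Serialize rows as comma-delimited, double-quoted, UTF-8 CSV without a BOM."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        delimiter=",",
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        lineterminator="\n",
    )
    writer.writeheader()  # a header row is required
    for row in rows:
        # Missing values are written as blank fields.
        writer.writerow({k: "" if row.get(k) is None else row[k] for k in fieldnames})
    # "utf-8" does not emit a BOM (unlike "utf-8-sig").
    return buffer.getvalue().encode("utf-8")


def column_summary(data: bytes) -> dict:
    """Count rows and, per column, blank fields and distinct non-blank values."""
    reader = csv.DictReader(io.StringIO(data.decode("utf-8"), newline=""))
    rows = list(reader)
    summary = {"row_count": len(rows), "columns": {}}
    for name in reader.fieldnames or []:
        present = [row[name] for row in rows if row[name] != ""]
        summary["columns"][name] = {
            "missing": len(rows) - len(present),
            "unique": len(set(present)),
        }
    return summary
```

Writing the returned bytes to disk with `open(path, "wb")` keeps the encoding and line endings exactly as produced.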
NumPy .npz image data files:
NPZ files should be submitted uncompressed, as per the numpy.savez() function.
The following features are enabled:
- Automated extraction of metadata:
  - Image count
  - Resolution
  - Layers
  - Data type
  - Min and max values
  - Unique value count
  - NaN count
- Automated loading of dummy data as NumPy array
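
A minimal sketch of producing an uncompressed NPZ file with `numpy.savez()` and computing comparable statistics locally. The array key `images`, the `(count, height, width, layers)` axis order, and the helper names are assumptions made for illustration; the layout Apheris expects and the exact way it derives its metadata are not specified on this page.

```python
import io

import numpy as np


def save_npz(images: np.ndarray) -> bytes:
    """Serialize an image array with np.savez, which stores arrays uncompressed."""
    buffer = io.BytesIO()
    np.savez(buffer, images=images)  # np.savez_compressed would compress instead
    return buffer.getvalue()


def summarize_images(data: bytes) -> dict:
    """Load the (assumed) 'images' array and compute simple summary statistics."""
    with np.load(io.BytesIO(data)) as archive:
        images = archive["images"]
    # Assumed axis order: (count, height, width, layers).
    count, height, width, layers = images.shape
    nan_mask = np.isnan(images)
    return {
        "image_count": count,
        "resolution": (height, width),
        "layers": layers,
        "dtype": str(images.dtype),
        "min": float(np.nanmin(images)),
        "max": float(np.nanmax(images)),
        "unique_count": int(np.unique(images[~nan_mask]).size),
        "nan_count": int(nan_mask.sum()),
    }
```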