Managing datasets

Datasets are registered to a Compute Gateway, which is tied to a single organization. Registering a dataset connects the data to the Apheris product.

A dataset consists of several pieces of information:

  • A path to the real data: either a location in an AWS S3 bucket (e.g. s3://bucket-name/path/to/file.csv) or an on-premises location (e.g. file:///home/datastore/path/to/file.csv)
  • A path to dummy data (optional): dummy data is uploaded to the central Apheris product via the Governance Portal.
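
The two path styles above differ in their URI scheme. As a minimal sketch, using only the Python standard library, a path can be classified before it is registered. The helper name describe_data_path is hypothetical and not part of the Apheris product:

```python
from urllib.parse import urlparse


def describe_data_path(path: str) -> dict:
    """Split a dataset path into its storage kind and location parts.

    Hypothetical helper for illustration only; not an Apheris API.
    """
    parsed = urlparse(path)
    if parsed.scheme == "s3":
        # s3://bucket-name/path/to/file.csv -> bucket name plus object key
        return {"kind": "s3", "bucket": parsed.netloc, "key": parsed.path.lstrip("/")}
    if parsed.scheme == "file":
        # file:///home/datastore/... -> absolute path on the local file system
        return {"kind": "on-prem", "path": parsed.path}
    raise ValueError(f"Unsupported path scheme: {parsed.scheme!r}")
```

Any other scheme raises a ValueError, which makes typos such as a missing slash in file:/// easier to spot early.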

Important

A federated computation across multiple datasets requires each dataset to reside in a different Compute Gateway.

Supported file formats

One dataset corresponds to one real data file or data folder and (optionally) one dummy data file (or folder).

If a dataset comprises multiple files, you can either

  • bundle multiple files into a .zip archive or
  • add a path to a data folder.

If the total amount of data is large (e.g. image data files), we recommend using the folder approach instead of bundling the files into a .zip archive.

Bundling files into a .zip archive

For example, you can specify a single .zip archive with the following content as the real data file:

clinical_data.zip
├── patients.csv
└── physicians.csv

When adding dummy data, ensure that the file structure and file names inside the dummy .zip archive match those of the real data archive:

clinical_data_dummy.zip
├── patients.csv
└── physicians.csv
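
Before registering, the two archives can be compared locally. The following is a minimal sketch using Python's standard zipfile module; the function names are hypothetical and this check is not part of the Apheris product:

```python
import zipfile


def archive_members(zip_path: str) -> set:
    """Return the file names (including relative paths) inside a .zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Directory entries end with "/" and are skipped.
        return {name for name in archive.namelist() if not name.endswith("/")}


def structure_mismatch(real_zip: str, dummy_zip: str) -> dict:
    """Report files that are present in only one of the two archives."""
    real = archive_members(real_zip)
    dummy = archive_members(dummy_zip)
    return {"missing_in_dummy": real - dummy, "extra_in_dummy": dummy - real}
```

For the example above, structure_mismatch("clinical_data.zip", "clinical_data_dummy.zip") would return two empty sets when both archives contain exactly patients.csv and physicians.csv.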

Using folders

A dataset can also be a path to a folder. In this case, asset policies for the dataset cover every object in that folder. However, if a specific policy is created for one of the objects within the folder, the asset policy for that individual data file takes precedence.
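
The precedence rule can be illustrated with a small sketch. The data model here (a plain dictionary mapping paths to policy names) and the function resolve_policy are assumptions made for illustration; they do not reflect how Apheris stores or evaluates asset policies internally:

```python
from typing import Optional


def resolve_policy(object_path: str, policies: dict) -> Optional[str]:
    """Pick the policy that applies to object_path.

    `policies` maps a folder path or a file path to a policy name.
    Hypothetical data model, assuming at most one folder policy covers an object.
    """
    # A policy defined on the individual file takes precedence.
    if object_path in policies:
        return policies[object_path]
    # Otherwise fall back to the policy of a folder that contains the object.
    for folder, policy in policies.items():
        if object_path.startswith(folder.rstrip("/") + "/"):
            return policy
    return None
```

With a folder policy on s3://bucket-name/images/ and a file policy on s3://bucket-name/images/scan_001.npz, the file policy is returned for scan_001.npz and the folder policy for every other object in the folder.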

File types and extensions

Apheris supports any file format and file extension.

For selected file extensions, Apheris offers convenience features that help Data Scientists explore the data in a privacy-preserving manner:

  • Tabular .csv data files:
    CSV files should be submitted uncompressed with the following parameters:
    • Delimiter should be , (comma)
    • Missing/NA data fields should be blank
    • A header row is required
    • Encoding should be utf-8 without a BOM
    • If quote characters are required, they will be inferred, and should be " (double-quotes)
    • Decimal separators should be . (period)

    The following features are enabled:
    • Automated extraction of metadata:
      • Row count
      • Column information:
        • ID
        • NaN count
        • Unique value count
        • 5th and 95th percentiles
    • Automated loading of dummy data as pandas DataFrame
  • NumPy .npz image data files:
    NPZ files should be submitted uncompressed, as per the numpy.savez() function.
    The following features are enabled:
    • Automated extraction of metadata:
      • Image count
      • Resolution
      • Layers
      • Data type
      • Min and max values
      • Unique value count
      • NaN count
    • Automated loading of dummy data as NumPy array
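
As a minimal sketch, files that follow the CSV and NPZ conventions listed above could be produced with Python's standard csv module and NumPy. The function names and the array key images are assumptions made for illustration; the source does not specify which array names are expected inside an .npz file:

```python
import csv

import numpy as np


def write_csv(path, header, rows):
    """Write a comma-delimited, UTF-8 (no BOM) CSV file with a header row."""
    # encoding="utf-8" writes no BOM (unlike "utf-8-sig").
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
        writer.writerow(header)
        for row in rows:
            # Missing values are written as blank fields.
            writer.writerow(["" if value is None else value for value in row])


def write_npz(path, images):
    """Store an image array uncompressed with numpy.savez()."""
    # np.savez writes an uncompressed archive; np.savez_compressed would compress it.
    # The key name "images" is a hypothetical choice.
    np.savez(path, images=images)
```

Python formats floats with a period as the decimal separator regardless of locale, so numeric columns written this way satisfy the decimal separator requirement.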