Managing datasets

Datasets are registered to a Compute Gateway, which is tied to a single organization. Registering a dataset connects the data to the Apheris product.

A dataset consists of several pieces of information:

  • A path to the real data: either an AWS S3 location, e.g. s3://bucket-name/path/to/file.csv, or an on-premises location, e.g. file:///home/datastore/path/to/file.csv
  • A path to dummy data (optional): dummy data is uploaded to the central Apheris product via the Governance Portal.
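
As a quick local sanity check before registering a dataset, the two path styles shown above can be told apart with Python's standard library. This is only a sketch: the function name `describe_data_path` and the returned dictionary layout are hypothetical and not part of any Apheris API, and it assumes that `s3://` and `file://` are the only schemes in use because those are the two shown here.

```python
from urllib.parse import urlparse

# Assumption: only the two schemes shown in the examples above are accepted.
SUPPORTED_SCHEMES = {"s3", "file"}


def describe_data_path(path: str) -> dict:
    """Split a dataset path into its parts, or raise ValueError if it looks wrong."""
    parsed = urlparse(path)
    if parsed.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported scheme {parsed.scheme!r} in {path!r}")
    if parsed.scheme == "s3":
        if not parsed.netloc:
            raise ValueError(f"missing bucket name in {path!r}")
        # For s3://bucket-name/path/to/file.csv the bucket is the network location.
        return {"kind": "s3", "bucket": parsed.netloc, "key": parsed.path.lstrip("/")}
    # For file:///home/... the network location is empty and the path is absolute.
    return {"kind": "local", "path": parsed.path}


if __name__ == "__main__":
    print(describe_data_path("s3://bucket-name/path/to/file.csv"))
    print(describe_data_path("file:///home/datastore/path/to/file.csv"))
```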

Important

A federated computation across multiple datasets requires each dataset to reside in a different Gateway.

Supported file formats

One dataset corresponds to one real data file or data folder and (optionally) one dummy data file (or folder).

If you have multiple files, you can either:

  • bundle multiple files into a .zip archive or
  • add a path to a data folder.

If the total amount of data is large (e.g. image data files), we recommend the folder approach over bundling into a .zip archive.

Bundling files into a .zip archive

For example, you can specify one .zip archive with the following content as the real data file:

clinical_data.zip
├── patients.csv
└── physicians.csv

When adding dummy data, ensure that the file structure and file names inside the .zip archive match those of the real data archive:

clinical_data_dummy.zip
├── patients.csv
└── physicians.csv
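
The requirement that both archives share the same structure can be checked locally before uploading. The sketch below uses only Python's standard `zipfile` module. The helper names `archive_members` and `structure_mismatch` are hypothetical, and the check compares member names only, not file contents or column layouts.

```python
import zipfile


def archive_members(archive_path):
    """Return the sorted file names stored in a .zip archive, skipping directory entries."""
    with zipfile.ZipFile(archive_path) as archive:
        return sorted(name for name in archive.namelist() if not name.endswith("/"))


def structure_mismatch(real_zip, dummy_zip):
    """Return (names only in the real archive, names only in the dummy archive)."""
    real = set(archive_members(real_zip))
    dummy = set(archive_members(dummy_zip))
    return sorted(real - dummy), sorted(dummy - real)


if __name__ == "__main__":
    # File names follow the example above; adjust them to your own archives.
    missing_in_dummy, extra_in_dummy = structure_mismatch(
        "clinical_data.zip", "clinical_data_dummy.zip"
    )
    if missing_in_dummy or extra_in_dummy:
        print("Missing in dummy archive:", missing_in_dummy)
        print("Unexpected in dummy archive:", extra_in_dummy)
    else:
        print("Archive structures match.")
```

Two empty lists mean that every file in the real archive has a counterpart with the same name and relative path in the dummy archive.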

Using folders

A dataset can also be a path to a folder. In this case, asset policies for this dataset cover every object in that folder. However, if a specific policy is created for one of the objects within that folder, the asset policy for that individual object takes precedence.
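
The precedence rule can be pictured as a "most specific path wins" lookup. The following is purely an illustration of that rule and not how Apheris implements or exposes asset policies: the `resolve_policy` function, the dictionary of policies keyed by path, and the policy names are all hypothetical, and the fallback through nested parent folders is an assumption added for the sketch.

```python
from pathlib import PurePosixPath


def resolve_policy(object_path, policies):
    """Return the policy for the most specific matching path, or None if nothing matches.

    `policies` is a hypothetical mapping from a folder or object path to a policy name.
    """
    normalized = {str(PurePosixPath(key)): value for key, value in policies.items()}
    path = PurePosixPath(object_path)
    # A policy attached to the object itself takes precedence.
    if str(path) in normalized:
        return normalized[str(path)]
    # Otherwise fall back to the nearest enclosing folder that has a policy.
    for parent in path.parents:
        if str(parent) in normalized:
            return normalized[str(parent)]
    return None


if __name__ == "__main__":
    policies = {
        "data/clinical/": "folder-policy",
        "data/clinical/patients.csv": "patients-policy",
    }
    print(resolve_policy("data/clinical/patients.csv", policies))
    print(resolve_policy("data/clinical/physicians.csv", policies))
```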

File types and extensions

Apheris supports any file format and file extension.

For selected file extensions, Apheris offers convenience features that help Data Scientists explore the data in a privacy-preserving manner:

  1. Tabular .csv data files:

    CSV files should be submitted uncompressed with the following parameters:

    * Delimiter should be , (comma)

    * Missing/NA data fields should be blank

    * A header row is required

    * Encoding should be utf-8 without a BOM

    * Quote characters are inferred where required and should be " (double quotes)

    * Decimal separators should be . (period)

    The following features are enabled:

    * Automated extraction of metadata:

      * Row count
      * Column information:
        * ID
        * NaN count
        * Unique value count
        * 5th and 95th percentiles

    * Automated loading of dummy data as pandas DataFrame

  2. NumPy .npz image data files:

    NPZ files should be submitted uncompressed, as per the numpy.savez() function.

    The following features are enabled:

    * Automated extraction of metadata:

      * Image count
      * Resolution
      * Layers
      * Data type
      * Min and max values
      * Unique value count
      * NaN count

    * Automated loading of dummy data as NumPy array
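
The format requirements above can be met with Python's standard `csv` module and NumPy. The sketch below is one possible way to produce conforming files, not an Apheris-provided tool: the helper names `write_csv` and `write_npz`, the file names, and the array key `images` are hypothetical choices made for illustration.

```python
import csv

import numpy as np


def write_csv(path, header, rows):
    """Write a comma-delimited, UTF-8 (no BOM) CSV with a header row and blank missing values."""
    # encoding="utf-8" writes no byte order mark (unlike "utf-8-sig").
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(
            handle,
            delimiter=",",
            quotechar='"',
            quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
            lineterminator="\n",
        )
        writer.writerow(header)
        for row in rows:
            # Missing values (None) become blank fields; floats keep "." as decimal separator.
            writer.writerow(["" if value is None else value for value in row])


def write_npz(path, **arrays):
    """Store arrays uncompressed; np.savez_compressed would compress them instead."""
    np.savez(path, **arrays)


if __name__ == "__main__":
    write_csv(
        "patients.csv",
        ["patient_id", "age", "note"],
        [[1, 54.5, "stable, no change"], [2, None, "ok"]],
    )
    write_npz("images.npz", images=np.zeros((2, 4, 4), dtype=np.uint8))
```

Because dummy data should mirror the real data, the same helpers can be reused to write the dummy files with identical column names and array keys.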