Managing datasets
Datasets are registered to a Compute Gateway, which is tied to a single organization. Registering a dataset connects the data to the Apheris product.
A dataset consists of several pieces of information:
- A path to the real data: either to an AWS S3 bucket, e.g. `s3://bucket-name/path/to/file.csv`, or to an on-premises location, e.g. `file:///home/datastore/path/to/file.csv`
- A path to dummy data (optional): Dummy Data is uploaded to the central Apheris product via the Governance Portal.
Important
Please note that a federated computation across multiple datasets requires that each dataset resides in a different Gateway.
Supported file formats
One dataset corresponds to one real data file or data folder and (optionally) one dummy data file (or folder).
If you have multiple files, you can either:
- bundle them into a `.zip` archive, or
- add a path to a data folder.
If the total amount of data is comparatively large (e.g. image data files), we recommend using the folder approach instead of bundling into a `.zip` file.
Bundling files into a .zip archive
For example, you can specify a single `.zip` archive as the real data file with the following content:
clinical_data.zip
├── patients.csv
└── physicians.csv
When adding dummy data, ensure that the file structure and naming inside the `.zip` archive match the real data files:
clinical_data_dummy.zip
├── patients.csv
└── physicians.csv
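Such archives can be assembled with Python's standard `zipfile` module. The sketch below is illustrative only: it assumes the real CSV files sit in the working directory and that their dummy counterparts live in a local `dummy/` folder (both assumptions, not Apheris requirements). The key point is that the dummy archive mirrors the internal file names of the real one.

```python
import zipfile

# Bundle the real data files into one archive (file names follow the example above).
with zipfile.ZipFile("clinical_data.zip", "w") as zf:
    zf.write("patients.csv", arcname="patients.csv")
    zf.write("physicians.csv", arcname="physicians.csv")

# The dummy archive must use the same internal structure and file names,
# even though the source files live somewhere else (here: a hypothetical dummy/ folder).
with zipfile.ZipFile("clinical_data_dummy.zip", "w") as zf:
    zf.write("dummy/patients.csv", arcname="patients.csv")
    zf.write("dummy/physicians.csv", arcname="physicians.csv")
```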
Using folders
A dataset can also be a path to a folder. In this case, asset policies for this dataset will cover any object in that folder. However, should a specific policy be created for one of the objects within that folder, then the asset policy for the individual data file will take precedence.
File types and extensions
Apheris supports any file format and file extension.
For selected file type extensions, Apheris offers convenience features to aid Data Scientists with exploring the data in a privacy-preserving manner:
- Tabular `.csv` data files: CSV files should be submitted uncompressed with the following parameters:
    * Delimiter should be `,` (comma)
    * Missing/NA data fields should be blank
    * A header row is required
    * Encoding should be `utf-8` without a BOM
    * If quote characters are required, they will be inferred, and should be `"` (double-quotes)
    * Decimal separators should be `.` (period)

  The following features are enabled:
    * Automated extraction of metadata:
        * Row count
        * Column information:
            * ID
            * NaN count
            * Unique value count
            * 5th and 95th percentiles
    * Automated loading of dummy data as a pandas DataFrame

  A sketch of writing a conforming CSV with pandas is shown after this list.
- NumPy `.npz` image data files: NPZ files should be submitted uncompressed, as per the `numpy.savez()` function.

  The following features are enabled:
    * Automated extraction of metadata:
        * Image count
        * Resolution
        * Layers
        * Data type
        * Min and max values
        * Unique value count
        * NaN count
    * Automated loading of dummy data as a NumPy array

  A sketch of creating such a file with `numpy.savez` is shown after this list.
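For the `.csv` guidance above, a conforming file can be written with pandas. The snippet below is a minimal sketch using made-up column names and values; the metadata dictionary at the end is only a rough local approximation of the kind of information listed above, not the Apheris implementation.

```python
import pandas as pd

# Hypothetical example table; column names and values are illustrative only.
df = pd.DataFrame(
    {
        "patient_id": [1, 2, 3],
        "age": [54, 61, None],           # missing value stays blank in the CSV
        "weight_kg": [72.5, 80.1, 65.0],
    }
)

# Write a CSV matching the parameters listed above: comma delimiter, blank NA fields,
# header row, UTF-8 without a BOM, double-quote quoting, period as decimal separator.
df.to_csv(
    "patients.csv",
    sep=",",
    na_rep="",          # missing/NA fields are left blank
    header=True,        # header row is required
    index=False,
    encoding="utf-8",   # plain UTF-8; "utf-8-sig" would add a BOM, which is not wanted here
    quotechar='"',
    decimal=".",
)

# Rough local approximation of the extracted metadata (row count plus per-column statistics).
metadata = {
    "row_count": len(df),
    "columns": {
        col: {
            "nan_count": int(df[col].isna().sum()),
            "unique_values": int(df[col].nunique()),
            "p5": df[col].quantile(0.05),
            "p95": df[col].quantile(0.95),
        }
        for col in df.select_dtypes("number").columns
    },
}
```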
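Similarly, an uncompressed `.npz` image file can be created with `numpy.savez`, as referenced above. The array names, shapes, and dtype below are made up for illustration; note that `numpy.savez_compressed` would produce a compressed archive, which the guidance above advises against.

```python
import numpy as np

# Two hypothetical grayscale images (height x width); names, shapes, and dtype are illustrative.
image_a = np.random.randint(0, 256, size=(128, 128), dtype=np.uint8)
image_b = np.random.randint(0, 256, size=(128, 128), dtype=np.uint8)

# numpy.savez writes an uncompressed .npz archive, one array per keyword argument.
np.savez("images.npz", image_a=image_a, image_b=image_b)

# Loading the archive back yields the arrays under the names used above.
with np.load("images.npz") as data:
    print(list(data.keys()))        # ['image_a', 'image_b']
    print(data["image_a"].shape)    # (128, 128)
    print(data["image_a"].dtype)    # uint8
```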