Managing datasets
Datasets are registered to a Compute Gateway, which is tied to a single organization. Registering a dataset connects the data to the Apheris product.
A dataset consists of several pieces of information:
- A path to the real data, either in an AWS S3 bucket, e.g. s3://bucket-name/path/to/file.csv, or in an on-premises location, e.g. file:///home/datastore/path/to/file.csv
- A path to dummy data (optional): dummy data is uploaded to the central Apheris product via the Governance Portal.
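The two path styles above follow the usual URI layout, which the following minimal sketch illustrates with Python's standard library. The helper name describe_data_path is hypothetical and is not part of any Apheris API; it only shows how the scheme, bucket, and file path can be read from each example.

```python
from urllib.parse import urlparse


def describe_data_path(path: str) -> dict:
    """Split a dataset path into its kind, bucket (S3 only), and key or file path.

    Hypothetical helper for illustration; not an Apheris function.
    """
    parsed = urlparse(path)
    if parsed.scheme == "s3":
        # For s3:// URIs the bucket is the network location; the key is the path
        # without its leading slash.
        return {"kind": "s3", "bucket": parsed.netloc, "key": parsed.path.lstrip("/")}
    if parsed.scheme == "file":
        # For file:/// URIs the host is empty and the path is the absolute local path.
        return {"kind": "file", "bucket": None, "key": parsed.path}
    raise ValueError(f"Unsupported scheme: {parsed.scheme!r}")


print(describe_data_path("s3://bucket-name/path/to/file.csv"))
print(describe_data_path("file:///home/datastore/path/to/file.csv"))
```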
Important
Please note that a federated computation across multiple datasets requires that each dataset resides in a different Gateway.
Supported file formats
One dataset corresponds to one real data file or data folder and (optionally) one dummy data file (or folder).
If a dataset comprises multiple files, you can either:
- bundle the files into a .zip archive, or
- add a path to a data folder.
If the total amount of data is large (e.g. image data files), we recommend using the folder approach instead of bundling the files into a .zip file.
Bundling files into a .zip archive
For example, you can specify a single .zip archive with the following content as the real data file:
clinical_data.zip
├── patients.csv
└── physicians.csv
When adding dummy data, ensure that the file structure and file names inside the .zip archive are the same as in the real data archive:
clinical_data_dummy.zip
├── patients.csv
└── physicians.csv
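One way to check this before registering a dataset is to compare the member names of the two archives. The sketch below uses Python's standard zipfile module; the function names are illustrative assumptions, not part of the Apheris product.

```python
import zipfile


def archive_members(zip_path: str) -> set:
    """Return the file names inside a .zip archive, ignoring directory entries."""
    with zipfile.ZipFile(zip_path) as archive:
        return {name for name in archive.namelist() if not name.endswith("/")}


def structures_match(real_zip: str, dummy_zip: str) -> bool:
    """True if both archives contain exactly the same relative file paths."""
    return archive_members(real_zip) == archive_members(dummy_zip)
```

For the example above, structures_match("clinical_data.zip", "clinical_data_dummy.zip") would be expected to return True when both archives contain patients.csv and physicians.csv at the top level. The check covers names and layout only, not the columns inside each file.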
Using folders
A dataset can also be a path to a folder. In this case, asset policies for this dataset cover every object in that folder. However, if a specific policy is created for one of the objects within that folder, the asset policy for that individual object takes precedence.
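The precedence rule can be pictured with a small lookup. This is a conceptual sketch only: the policies mapping, the policy names, and resolve_policy are hypothetical and do not reflect how Apheris stores or evaluates asset policies. Falling back to the longest matching folder prefix is an additional assumption made here for nested folders.

```python
from typing import Optional


def resolve_policy(object_path: str, policies: dict) -> Optional[str]:
    """Pick the policy that applies to an object (illustrative model only).

    `policies` maps either a folder path ending in "/" or an individual
    object path to a policy name.
    """
    # A policy defined for the individual object takes precedence.
    if object_path in policies:
        return policies[object_path]
    # Otherwise fall back to a folder policy that covers the object.
    # Assumption: the longest (most specific) folder prefix is used.
    folders = [p for p in policies if p.endswith("/") and object_path.startswith(p)]
    if not folders:
        return None
    return policies[max(folders, key=len)]


policies = {
    "s3://bucket-name/images/": "folder-policy",
    "s3://bucket-name/images/scan_001.png": "file-policy",
}
print(resolve_policy("s3://bucket-name/images/scan_001.png", policies))
print(resolve_policy("s3://bucket-name/images/scan_002.png", policies))
```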
File types and extensions
Apheris supports any file format and file extension.
For selected file extensions, Apheris offers convenience features that help Data Scientists explore the data in a privacy-preserving manner:
- Tabular .csv data files:
  CSV files should be submitted uncompressed with the following parameters:
  - Delimiter should be , (comma)
  - Missing/NA data fields should be blank
  - A header row is required
  - Encoding should be utf-8 without a BOM
  - If quote characters are required, they will be inferred, and should be " (double quotes)
  - Decimal separators should be . (period)
  The following features are enabled:
  - Automated extraction of metadata:
    - Row count
    - Column information:
      - ID
      - NaN count
      - Unique value count
      - 5th and 95th percentiles
  - Automated loading of dummy data as pandas DataFrame
- NumPy .npz image data files:
  NPZ files should be submitted uncompressed, i.e. written with the numpy.savez() function rather than numpy.savez_compressed().
  The following features are enabled:
  - Automated extraction of metadata:
    - Image count
    - Resolution
    - Layers
    - Data type
    - Min and max values
    - Unique value count
    - NaN count
  - Automated loading of dummy data as NumPy array
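The CSV requirements listed above can be checked locally before a file is registered. The following is a minimal pre-flight sketch using only Python's standard library; check_csv and summarize_csv are hypothetical helpers, not Apheris functions, and the summary is only a rough approximation of the kind of per-column metadata described above (it assumes blank fields are excluded from the unique count and omits percentiles).

```python
import codecs
import csv
import io


def check_csv(path: str) -> list:
    """Return a list of detected problems; an empty list means none were found."""
    problems = []
    with open(path, "rb") as fh:
        raw = fh.read()
    # The encoding should be UTF-8 without a byte order mark.
    if raw.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM")
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8"]
    # Parse with a comma delimiter and double-quote quoting.
    rows = [r for r in csv.reader(io.StringIO(text, newline=""), delimiter=",", quotechar='"') if r]
    if not rows:
        return problems + ["file is empty, so the header row is missing"]
    header = rows[0]
    if len(set(header)) != len(header) or any(not name.strip() for name in header):
        problems.append("header row has empty or duplicate column names")
    # Every record should have as many fields as the header.
    for record_no, row in enumerate(rows[1:], start=1):
        if len(row) != len(header):
            problems.append(f"record {record_no} has {len(row)} fields, expected {len(header)}")
    return problems


def summarize_csv(path: str) -> dict:
    """Compute the row count plus blank and unique value counts per column."""
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        records = list(reader)
        columns = reader.fieldnames or []
    summary = {"row_count": len(records), "columns": {}}
    for name in columns:
        values = [record.get(name) for record in records]
        present = [v for v in values if v not in (None, "")]
        summary["columns"][name] = {
            "blank_count": len(values) - len(present),
            "unique_count": len(set(present)),
        }
    return summary
```

A file passes this sketch when check_csv returns an empty list. The check cannot tell whether the first row is really a header rather than data, so that still needs a manual look.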