Dummy data
Dummy data is a synthetic version of the real data, designed to resemble it and serve as a substitute for it. Beneficiaries can view the dummy data to discover what a dataset contains and to simulate statistical or ML computations. Therefore, dummy data files should not include sensitive data.
Meaningful dummy data is a key ingredient of federated analytics and machine learning collaborations. Good dummy data not only helps a Data Scientist understand the real dataset and prepare the analysis, it also benefits data custodian organizations by:
- Supporting good science: Dummy data allows Data Consumers to test their algorithms and models. When these results are shared, the data custodian organization can assess the validity of the chosen algorithms and ensure they are appropriate for the research purpose.
- Minimizing system load: A well-prepared compute spec, tested on dummy data, minimizes the number of iterations a Data Scientist needs on the real data. This in turn minimizes the data processing needed on a data custodian's system.
What is meaningful dummy data
Meaningful dummy data should ideally be representative of the real data.
For tabular data, dummy data should have the following properties:
- Same data schema as real data: Dummy data should follow the same relational schema as the real data. For tabular data, this means that the dummy data has the same number of columns and the same column names as the real data.
- Same data types as real data: Dummy data should contain the same data types. For categorical data types, make sure all categories are represented.
- Representation of data quality: If the real data has missing values or NaN values, the dummy data should capture this. This shows beneficiaries which data quality issues to expect in the real data.
- Sufficient number of records: Provide enough example records in the dummy data. For tabular data, we recommend adding at least 30-50 rows.
- (Optional) Similar statistical properties: Meaningful dummy data should have statistical properties similar to those of the real data.
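The non-optional properties above can be checked automatically before a dummy data file is shared. The following is a minimal sketch using pandas, not part of the Apheris product: the function name `check_dummy_data` and the `min_rows` parameter are hypothetical, and it assumes that categorical columns use the pandas `category` dtype.

```python
import pandas as pd


def check_dummy_data(real: pd.DataFrame, dummy: pd.DataFrame, min_rows: int = 30) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []

    # Same schema: same column names in the same order.
    if list(real.columns) != list(dummy.columns):
        problems.append("column names differ from the real data")
        return problems  # the remaining checks need matching columns

    for col in real.columns:
        real_is_cat = isinstance(real[col].dtype, pd.CategoricalDtype)
        dummy_is_cat = isinstance(dummy[col].dtype, pd.CategoricalDtype)

        # Same data types per column (categoricals are compared by kind only,
        # so that a missing category is reported by the check below).
        if real_is_cat != dummy_is_cat or (
            not real_is_cat and real[col].dtype != dummy[col].dtype
        ):
            problems.append(f"{col}: dtype {dummy[col].dtype} != {real[col].dtype}")
            continue

        # Every category observed in the real data should be represented.
        if real_is_cat:
            missing = set(real[col].dropna().unique()) - set(dummy[col].dropna().unique())
            if missing:
                problems.append(f"{col}: categories not represented: {sorted(map(str, missing))}")

        # Data quality: missing values in the real data should show up in the dummy data.
        if real[col].isna().any() and not dummy[col].isna().any():
            problems.append(f"{col}: real data has missing values, dummy data has none")

    # Sufficient number of records.
    if len(dummy) < min_rows:
        problems.append(f"only {len(dummy)} rows, expected at least {min_rows}")

    return problems
```

Run the check on the data custodian's side only, since it reads the real data. The returned messages contain category labels from the real data, so treat them as internal output.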
For images (in this example medical images), you can follow these guidelines:
- Create synthetic images that closely resemble real medical data of the type contained in the dataset, but exclude identifiable patient details.
- Ensure the dummy data reflects the diversity and variability of actual medical scenarios.
- Remove all personal identifiers from images to comply with privacy laws and regulations like HIPAA or GDPR.
- Use tools or algorithms designed for generating synthetic medical imagery to maintain data integrity.
- Validate the realism and relevance of the dummy data with experts in medical imaging.
Creating meaningful dummy data
To create meaningful dummy data, you can use one of the following approaches:
Manual creation
Use the data schema of your real data and manually add entries to it. Check that each column conforms to the properties of "meaningful dummy data" described above.
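As an illustration, a hand-written dummy table could be assembled with pandas as sketched below. The schema (`patient_id`, `age`, `sex`, `bmi`) and all values are made up for this example; replace them with the columns of your own dataset.

```python
import pandas as pd

n_rows = 30  # at least the recommended 30-50 rows
sexes = ["F", "M", "other"]  # hypothetical categories; list every category of the real column

dummy = pd.DataFrame({
    # Obviously artificial identifiers, so nobody mistakes them for real ones.
    "patient_id": [f"DUMMY-{i:03d}" for i in range(n_rows)],
    # Plausible integer ages between 20 and 79.
    "age": [20 + (i * 7) % 60 for i in range(n_rows)],
    # Cycle through the categories so that each one is represented.
    "sex": pd.Categorical([sexes[i % 3] for i in range(n_rows)], categories=sexes),
    # Leave some values missing to mirror gaps in the real data.
    "bmi": [None if i % 10 == 0 else round(18.5 + (i % 15) * 0.9, 1) for i in range(n_rows)],
})

# To save the table as a file for upload, for example:
# dummy.to_csv("dummy_data.csv", index=False)
```

Because every value is written by hand or derived from a simple pattern, no record of the real data can end up in the file.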
Create synthetic data
Synthetic data is typically the most meaningful dummy data, as it is very close to the real data. There are many open-source tools available to help you create synthetic data. Apheris can provide hands-on support to help you create dummy data.
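To give an idea of what such tools do, here is a deliberately simple sketch that samples each column independently from summary statistics of the real data. It is not a substitute for a dedicated synthetic data tool: the function names are hypothetical, numeric columns are assumed to be roughly normal and are returned as floats, correlations between columns are lost, and the approach offers no formal privacy guarantee (it reproduces category labels and clips to the real minimum and maximum). Review the output before sharing it.

```python
import numpy as np
import pandas as pd


def synthesize_column(real: pd.Series, n_rows: int, rng: np.random.Generator) -> pd.Series:
    """Sample one column from simple summary statistics of the real column."""
    observed = real.dropna()
    if pd.api.types.is_numeric_dtype(real):
        # Assumption: a normal distribution with the real mean and standard deviation.
        values = rng.normal(observed.mean(), observed.std(ddof=0), size=n_rows)
        values = np.clip(values, observed.min(), observed.max())
    else:
        # Sample category labels according to their observed frequencies.
        freqs = observed.value_counts(normalize=True)
        values = rng.choice(freqs.index.to_numpy(), size=n_rows, p=freqs.to_numpy())
    synthetic = pd.Series(values, name=real.name)
    # Reproduce the share of missing values found in the real column.
    missing = rng.random(n_rows) < real.isna().mean()
    return synthetic.mask(missing)


def synthesize_table(real: pd.DataFrame, n_rows: int = 50, seed: int = 0) -> pd.DataFrame:
    """Build a synthetic table with the same columns as the real one."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({col: synthesize_column(real[col], n_rows, rng) for col in real.columns})
```

Calling `synthesize_table(real_df, n_rows=50)` on a real DataFrame `real_df` returns a table with the same column names, which you can then inspect and export.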
Uploading dummy data
When you upload a dummy data file during dataset registration, be aware that the file is always uploaded to the Apheris product and that beneficiaries defined in asset policies can view it.