Describing datasets

When you register a dataset via the Apheris Governance Portal, you will be asked to describe it. A good description helps Data Consumers find your dataset, make sense of it, and interpret the data correctly.

Why a description is important

Data Consumers will never have direct access to the real data. It is therefore important to help them understand the data, its standards, and its encoding. This helps a Data Consumer identify the right dataset and prepare the analysis.

Even with access to good dummy data, certain aspects of a dataset can remain difficult or downright impossible to understand without a good description. A few examples:

- Origins and scope of the data
- Important data licensing aspects
- Previously applied transformations or pre-processing
- Domain-specific jargon and abbreviations

A good dataset description, also known as a "data dictionary" or "readme", clarifies these aspects and complements the dummy data. Together, a good description and dummy data make the real data meaningful and valuable for beneficiaries.

Tip

Markdown is supported in dataset descriptions.

What makes a good description

A good dataset description contains the following elements where relevant:

Overall definition

Abstract: a brief narrative describing the data, its origins, scope and purpose or intended use

Methodologies: assumptions made while collecting the data, transformations or calculations applied to the raw data (if any), date range when the data was collected

Dataset details:
- file format,
- creator or owner of the data,
- publication date,
- identifier (DOI, PURL, handle),
- license,
- version

Data standards: any known content or terminology standards the dataset follows, and where it deviates from them (healthcare examples: HL7, SNOMED CT)
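Before submitting a description, it can help to check a draft against the elements listed above. The following is a minimal sketch of such a checklist, not part of the Apheris tooling: the function name `missing_elements`, the dictionary layout, and the key names are all assumptions chosen for illustration. Since the elements apply only where relevant, the result is a reminder list rather than a validation error.

```python
# Hypothetical checklist helper; names and structure are illustrative assumptions.
RECOMMENDED_SECTIONS = ["abstract", "methodologies", "data standards"]
RECOMMENDED_DETAILS = [
    "file format",
    "creator or owner",
    "publication date",
    "identifier",
    "license",
    "version",
]


def missing_elements(description: dict) -> list:
    """Return the recommended elements that are absent or empty in a draft."""
    # Top-level narrative sections: missing if the key is absent or blank.
    missing = [
        name
        for name in RECOMMENDED_SECTIONS
        if not str(description.get(name, "")).strip()
    ]
    # Dataset details live in a nested dict under "dataset details".
    details = description.get("dataset details", {})
    missing += [
        f"dataset details: {name}"
        for name in RECOMMENDED_DETAILS
        if not str(details.get(name, "")).strip()
    ]
    return missing


if __name__ == "__main__":
    draft = {
        "abstract": "Protocorm counts from amended seed packets.",
        "dataset details": {"file format": "CSV", "license": "CC BY 4.0"},
    }
    for item in missing_elements(draft):
        print("Consider adding:", item)
```

Elements that genuinely do not apply to a dataset can simply be ignored in the output.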

Component / data element descriptions

For image data: describe how the images are organized, and where detailed metadata can be found.

For tabular data: describe each column (field) and what it contains.

Tip

For tabular data we recommend the following structure to describe important characteristics for each column:

Explanation and examples of recommended column characteristics:

| Column characteristic | Explanation | Examples |
| --- | --- | --- |
| Column header | Column identifier used in the raw data | full_name |
| Description | Human-readable description | The full name of the patient |
| Units of measure | Units of measure and precision | Measured in meters, rounded up to the nearest 0.01 meter |
| Formats | Data type or format | 64-bit float |
| Values | All valid or allowed values | M, F; any integer between 1 and 1000 |
| Meaning | Any non-trivial codes, symbols or abbreviations used in the values themselves | 1 = survived, 0 = deceased; LPFV = last patient first visit |
| Additional comments | Additional relevant information, e.g. whether the column is required or can be empty | optional column |

Example of a good description

An example of a well-written description of tabular data, taken from the Smithsonian Data Management Best Practices guide "Describing Your Data: Data Dictionaries":

File 1: Amendment seed packets and fungi_all.txt

This CSV includes the numbers of protocorms recovered from seedpackets exposed to amendment with different organic amendments, compared to no amendment. Data were collected 2010-04-02 and 2010-04-08 with results published in the paper “title of paper.” Missing data are indicated by a “.”. Data were collected by M------- and R-------. Questions should be directed to M--------.

Column headings:

Species: The orchid species of seeds added to the plot in seedpacket. Goodyera=Goodyera pubescens; Liparis=Liparis liliifolia; Tipularia=Tipularia discolor

Site: Designated numerically 1-6. All sites are forest stands at the Smithsonian Environmental Research Center, Edgewater, Maryland, USA. Sites 1-3 are old stands and 4-6 are young stands (see Siteage, below).

Subplot: Designates the subplot location within each site. Thirty-six subplots were arranged in a square with columns labeled A-F and rows labeled 1-6.

Siteage: Old=120-150 year old forest. Young=50-70 year old forest.

Treatment: The amendment added to a subplot (Leaves=tulip poplar leaf litter; Wood=chipped fresh tulip poplar wood). Subplots with no amendment added are designated Control.

Inoculated?: Designates whether mycorrhizal host fungi were inoculated into the subplot.

fungusyn: Indicates whether appropriate host fungi were detected (1) or not (0) using PCR amplification of the soil in the subplot.

fungusInt: A semi-quantitative measure of the abundance of appropriate host fungi. The intensity of fluorescence by a post-PCR gel band 0=no band visible to 3=intensely bright fluorescence.

fung2YN: For Tipularia discolor, indicates whether an appropriate host fungus was detected (1) or not (0) using PCR amplification of the soil in the subplot using a second primer set (TipC2F/TipR) that detects an appropriate host fungus not detected by the first primer set (TipC1F/TipR).