Describing datasets

When you register a dataset in the Apheris Governance Portal, you are asked to describe it. A good description helps Data Consumers find your dataset, make sense of it, and interpret the data correctly.

Why a description is important

Data Consumers never have direct access to the real data, so it is important to help them understand the data, its standards, and its encoding. This helps a Data Consumer identify the right dataset and prepare the analysis.

Even with access to good dummy data, certain aspects of a dataset can remain difficult or downright impossible to understand without a good description. A few examples:

  • Origins and scope of the data
  • Important data licensing aspects
  • Previously applied transformations or preprocessing
  • Domain-specific jargon and abbreviations

A good dataset description, also known as a "data dictionary" or "readme", clarifies these aspects and complements the dummy data. Together, a good description and dummy data make the real data meaningful and valuable for beneficiaries.

What makes a good description

A good dataset description contains the following elements where relevant:

Overall definition

  • Abstract: a brief narrative describing the data, its origins, scope and purpose or intended use
  • Methodologies: assumptions made while collecting the data, transformations or calculations applied to the raw data (if any), date range when the data was collected
  • Dataset details:

    * file format,

    * creator or owner of the data,

    * publication date,

    * identifier (DOI, PURL, handle),

    * license,

    * version

  • Data standards: state any content or terminology standards the dataset follows and where it deviates from them (examples of healthcare standards: HL7, SNOMED CT)

Component / data element descriptions

  • For image data: describe how the images are organized, and where detailed metadata can be found.
  • For tabular data: describe each column (field) and what it contains.

Tip

For tabular data we recommend the following structure to describe important characteristics for each column:

Column header: Description | Units of measure | Formats | Values | Meaning | Additional comments

Explanation and examples of recommended column characteristics:

| Column characteristic | Explanation | Examples |
|---|---|---|
| Column header | Column identifier used in the raw data | full_name |
| Description | Human-readable description | The full name of the patient |
| Units of measure | Units of measure and precision | Measured in meters, rounded up to the nearest 0.01 meter |
| Formats | Data type or format | 64-bit float |
| Values | All valid or allowed values | M, F; any integer between 1 and 1000 |
| Meaning | Any non-trivial codes, symbols or abbreviations used in the values themselves | 1 = survived, 0 = deceased; LPFV = last patient first visit |
| Additional comments | Additional relevant information, e.g. whether the column is required or can be empty | optional column |
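If the data is available as a CSV file, a first draft of such a table can be generated and then completed by hand. The following is a minimal sketch using only the Python standard library, not an Apheris feature. The names `FIELDS`, `infer_format` and `draft_data_dictionary` are hypothetical, and the type inference is deliberately coarse. Descriptions, units and meanings cannot be inferred and are left empty for the author to fill in.

```python
import csv
import io

# Hypothetical field names mirroring the recommended column characteristics.
FIELDS = ["Column header", "Description", "Units of measure", "Formats",
          "Values", "Meaning", "Additional comments"]


def infer_format(values):
    """Guess a coarse data type from the non-empty string values of a column."""
    if not values:
        return "unknown"
    for label, cast in (("integer", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return label
        except ValueError:
            continue
    return "string"


def draft_data_dictionary(csv_text, max_listed_values=10):
    """Return one partially filled entry per column of a CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for name in reader.fieldnames or []:
        raw = [(row[name] or "").strip() for row in rows]
        present = [value for value in raw if value]
        distinct = sorted(set(present))
        entry = dict.fromkeys(FIELDS, "")
        entry["Column header"] = name
        entry["Formats"] = infer_format(present)
        # Only list the values when the column looks categorical.
        if len(distinct) <= max_listed_values:
            entry["Values"] = ", ".join(distinct)
        entry["Additional comments"] = (
            "can be empty" if len(present) < len(raw) else "no empty values observed"
        )
        entries.append(entry)
    return entries


if __name__ == "__main__":
    # Made-up dummy rows, for illustration only.
    sample = "full_name,height_m,survived\nAda,1.70,1\nBob,,0\n"
    for entry in draft_data_dictionary(sample):
        print(" | ".join(entry[field] for field in FIELDS))
```

The generated entries are only a starting point: a column that happens to contain no empty values in one file may still be optional, so the comments should be reviewed before the description is published.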

Example of a good description

An example of a well-written description of tabular data, from Smithsonian Data Management Best Practices - Describing Your Data: Data Dictionaries:

File 1: Amendment seed packets and fungi_all.txt

This CSV includes the numbers of protocorms recovered from seedpackets exposed to amendment with different organic amendments, compared to no amendment. Data were collected 2010-04-02 and 2010-04-08 with results published in the paper “title of paper.” Missing data are indicated by a “.”. Data were collected by M------- and R-------. Questions should be directed to M--------.

Column headings:

  • Species: The orchid species of seeds added to the plot in seedpacket. Goodyera=Goodyera pubescens; Liparis=Liparis liliifolia; Tipularia=Tipularia discolor
  • Site: Designated numerically 1-6. All sites are forest stands at the Smithsonian Environmental Research Center, Edgewater, Maryland, USA. Sites 1-3 are old stands and 4-6 are young stands (see Siteage, below).
  • Subplot: Designates the subplot location within each site. Thirty-six subplots were arranged in a square with columns labeled A-F and rows labeled 1-6.
  • Siteage: Old=120-150 year old forest. Young=50-70 year old forest.
  • Treatment: The amendment added to a subplot (Leaves=tulip poplar leaf litter; Wood=chipped fresh tulip poplar wood). Subplots with no amendment added are designated Control.
  • Inoculated?: Designates whether mycorrhizal host fungi were inoculated into the subplot.
  • fungusyn: Indicates whether appropriate host fungi were detected (1) or not (0) using PCR amplification of the soil in the subplot.
  • fungusInt: A semi-quantitative measure of the abundance of appropriate host fungi. The intensity of fluorescence by a post-PCR gel band 0=no band visible to 3=intensely bright fluorescence.
  • fung2YN: For Tipularia discolor, indicates whether an appropriate host fungus was detected (1) or not (0) using PCR amplification of the soil in the subplot using a second primer set (TipC2F/TipR) that detects an appropriate host fungus not detected by the first primer set (TipC1F/TipR).
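A description this precise can also be turned into a simple consistency check that a data owner runs before registering the dataset. The sketch below is illustrative only and rests on several assumptions: the file is comma-delimited, `fungusInt` takes only the integer steps 0 to 3, and the names `ALLOWED` and `find_violations` are hypothetical. Columns whose exact value format the description does not state (Subplot, Inoculated?) are left unchecked.

```python
import csv
import io

MISSING = "."  # the description states that missing data are marked "."

# Allowed values transcribed from the column descriptions above.
# Assumption: fungusInt only takes the integer steps 0, 1, 2 and 3.
ALLOWED = {
    "Species": {"Goodyera", "Liparis", "Tipularia"},
    "Site": {str(n) for n in range(1, 7)},
    "Siteage": {"Old", "Young"},
    "Treatment": {"Leaves", "Wood", "Control"},
    "fungusyn": {"0", "1"},
    "fungusInt": {"0", "1", "2", "3"},
    "fung2YN": {"0", "1"},
}


def find_violations(csv_text):
    """Return (line_number, column, value) for each value outside its allowed set."""
    reader = csv.DictReader(io.StringIO(csv_text))
    violations = []
    for row in reader:
        for column, allowed in ALLOWED.items():
            if column not in row:
                continue  # column absent from this file, nothing to check
            value = (row[column] or "").strip()
            if value != MISSING and value not in allowed:
                violations.append((reader.line_num, column, value))
    return violations


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    text = (
        "Species,Site,Siteage,Treatment,fungusyn,fungusInt,fung2YN\n"
        "Goodyera,1,Old,Leaves,1,3,.\n"
        "Tipularia,7,Young,Bark,0,2,1\n"
    )
    for line_number, column, value in find_violations(text):
        print(f"line {line_number}: unexpected {column} value {value!r}")
```

A check like this also works in the other direction: if it flags many values, the description may be the part that is incomplete.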