Biodata Management Guide Reference

Working Draft, 22 August 2013 by Dan Randow. Licensed under CC BY 3.0, with financial assistance from the Terrestrial and Freshwater Biodiversity Information System (TFBIS) Programme (Project 263: Biodata Management Framework: Phase Two), and support from Horizons Regional Council and Dataversity.

This document explains and provides overall maturity criteria for the following three key concepts that are used in the Guide.

Maturity Levels

This Guide defines five levels of maturity of biodata management.

Data Management Maturity

Mature data management aims to achieve both of the following goals.

The data available should be useful for the purpose for which it was collected, and for other purposes that may arise. At the same time, the costs of managing data must be minimised.

At lower levels of maturity, effectiveness is more important than efficiency. There is no point in using technology to automate management that is not fit for purpose. Good management can provide useful data, even if basic technology is used. Once fitness for purpose is consistently being achieved, investments in greater maturity aim to increase efficiency.

Why Mature Data Management is Important

The short video Data Sharing and Management Snafu in 3 Short Acts provides a good illustration of the problems that can be caused by poor data management.

Other Data Management Guides

In The Zen of Open Data, Chris McDowall provides a poetic explanation of the principles of good data management.

The Data Management Rollout at Oxford (DaMaRO) Project Research Data Management Training Materials provide training and reference materials for improving research data management.

Fragmented

[Description]

Fragmented Data Management across Activities

Fragmented Data Management across Maturity Factors

Improvised

[Description]

[Level] Data Management across Activities

[Level] Data Management across Maturity Factors

Managed

[Description]

[Level] Data Management across Activities

[Level] Data Management across Maturity Factors

Automated

[Description]

[Level] Data Management across Activities

[Level] Data Management across Maturity Factors

Integrated

[Description]

[Level] Data Management across Activities

[Level] Data Management across Maturity Factors

Data Management Activities

This Guide defines five biodata management activities for which maturity can be assessed.

Together, these activities account for the entire data life-cycle. The activities may be carried out in linear or cyclic ways. For example, analysis may produce new data that is ingested.

Maturity of biodata management should be consistent across all data management activities.

Capture

Definition

Record data in the field.

Description

Most biodata is captured in the field. Some data is captured from samples, or from aerial or satellite images.

Data is captured in ways ranging from methodical regional surveys to casual observations of an individual of a species. Location, for example, may be described by directions, by an address, by a single set of coordinates, or by a detailed extent defined by multiple coordinates.
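
As an illustration only, the following sketch shows one way a field observation record might accommodate locations captured at different levels of detail. The field names and example values are assumptions for illustration, not a schema prescribed by this Guide.

    # A sketch of a field observation record in which location may be captured
    # as a description, a single coordinate pair, or a multi-coordinate extent.
    # All names and values here are illustrative only.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Location:
        description: Optional[str] = None                 # directions or an address
        point: Optional[tuple[float, float]] = None       # a single lat/lon pair
        extent: list[tuple[float, float]] = field(default_factory=list)  # polygon vertices

    @dataclass
    class Observation:
        species: str
        observer: str
        observed_on: str                     # ISO 8601 date, e.g. "2013-08-22"
        location: Location
        survey_method: Optional[str] = None  # left blank for casual observations

    # A casual observation located only by a description of the place.
    casual = Observation("Porphyrio melanotus", "A. Observer", "2013-08-22",
                         Location(description="Wetland north of the car park"))

    # A survey record located by a detailed extent and a stated methodology.
    survey = Observation("Porphyrio melanotus", "A. Observer", "2013-08-22",
                         Location(extent=[(-40.35, 175.61), (-40.35, 175.62), (-40.36, 175.62)]),
                         survey_method="Systematic wetland bird survey")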

The survey method used is outside the scope of this Guide. Whatever method is used, the aim of the data capture process is to ensure that the data that is acquired is recorded without loss.

Field systems should support the methodical, comprehensive capture of data and of data about the survey itself, and should provide data that facilitates the survey process.

Systems for data capture should enable Field Staff to achieve the following.

User Stories

Maturity Criteria

[Activity] across Maturity Levels

[Activity] across Maturity Factors

Useful Resources

Ingest

Definition

Introduce data into primary repository.

Description

Ingestion is the process of introducing data into a primary repository. The data may be ingested with the intention of long-term storage, or for the purpose of a specific task. Even in the latter case, a persistent record of the data that was used means that the analysis can be audited, if required.

The data being ingested may be newly captured, or it may be legacy data migrated from another repository.

The data being ingested may continue to reside in a remote repository. Data from a remote repository may be duplicated when it is ingested. Alternatively, the data itself may not be introduced at all but metadata describing it and pointing to its location may be introduced. Finally, in some cases a dynamic connection may be established with a dataset in a remote repository. Such a connection may allow new data that is added to the dataset to automatically become available to the local system.

The ingestion process provides an opportunity to enhance the data. Enhancements can include validation, “cleaning up” and standardisation of data content, standardising the data definitions and structure, adding metadata, and expert verification of the data.
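
As an illustrative sketch only, the following code shows the kind of validation and standardisation that an ingestion step might apply before a record enters the repository. The required fields, date layouts and error handling shown are assumptions, not requirements of this Guide.

    # A sketch of ingestion-time enhancement: checking required fields and
    # standardising dates to ISO 8601 before a record enters the repository.
    # The field names and accepted date layouts are illustrative assumptions.
    from datetime import datetime

    REQUIRED_FIELDS = {"species", "observed_on", "location"}

    def standardise_date(value: str) -> str:
        """Convert common date layouts to ISO 8601 (YYYY-MM-DD)."""
        for layout in ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y"):
            try:
                return datetime.strptime(value, layout).strftime("%Y-%m-%d")
            except ValueError:
                continue
        raise ValueError(f"Unrecognised date: {value!r}")

    def ingest(record: dict) -> dict:
        """Validate and standardise one record; raise an error if it is unusable."""
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"Missing fields: {sorted(missing)}")
        cleaned = dict(record)
        cleaned["observed_on"] = standardise_date(record["observed_on"])
        return cleaned

    # Example: a record captured with a non-standard date layout.
    print(ingest({"species": "Anas superciliosa", "observed_on": "22/08/2013",
                  "location": "Example wetland"}))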

Data may be ingested using manual processes and ad hoc tools, such as spreadsheets. More sophisticated systems have an ingestion workflow integrated with the primary data repository and catalogue.

Whatever the source of the data and the means of ingesting it, the aim of ingesting data is to make it easy to find and use the data.

Systems for ingesting data should meet the following criteria.

User Stories

Maturity Criteria

[Activity] across Maturity Levels

[Activity] across Maturity Factors

Useful Resources

Store

Definition

Retain data for the long term.

Description

Storage refers to the long-term retention of data. It requires a data repository and a data catalogue.

The data repository stores the data. It can be made up of a number of separate repositories. Ideally, a limited and explicitly defined set of repositories is designated for data storage.

A single data catalogue should contain records of each dataset that is stored in the repository, with information about the dataset and how it can be used.

The aim of data storage is to ensure that data and metadata are available when they are needed.

The technology used for the data repository could be shelves or filing cabinets, or folders in a file system or in an online repository such as Google Drive or DropBox. It could be an EDMS, custom database, digital asset management system or a combination of these.

The data catalogue may use a spreadsheet or a specialised data catalogue tool, or it may be integrated into the data repository.
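
As an illustration, a catalogue entry might look something like the sketch below, whether it is held in a spreadsheet or a specialised tool. The column names and values are assumptions about the kind of information a catalogue could record, not a schema required by this Guide.

    # A sketch of one data catalogue entry, written to a spreadsheet-compatible
    # CSV file. The column names and values are illustrative only.
    import csv

    catalogue_entry = {
        "dataset_id": "wetland-birds-2013",
        "title": "Wetland bird survey 2013",
        "custodian": "Example Regional Council",
        "repository_location": "file-share/biodata/wetland-birds-2013/",
        "format": "CSV",
        "licence": "CC BY 3.0",
        "date_range": "2013-01-01/2013-03-31",
        "contact": "biodata@example.org",
    }

    with open("catalogue.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(catalogue_entry.keys()))
        writer.writeheader()
        writer.writerow(catalogue_entry)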

Permissions may be handled by keeping data files in separate folders or by using a specialised database.

Data storage systems should meet the following criteria.

Maturity Criteria

[Activity] across Maturity Levels

[Activity] across Maturity Factors

Useful Resources

Share

Definition

Make data available to people and systems.

Description

Data sharing refers to making data available to people within or outside the organisation managing the data. Data may be shared through human-readable interfaces, or shared directly with other computer systems.

Data sharing may be carried out using manual processes and ad hoc tools such as email. More sophisticated systems have human- and machine-readable interfaces for data sharing integrated with the primary data repository and catalogue.

The aim of data sharing is to ensure that data is easily available to people who are authorised to access it.
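
As a sketch under the assumption of a very simple dataset, the following code renders the same records in a human-readable form (an HTML table) and a machine-readable form (JSON). The record fields and values are illustrative only.

    # A sketch of presenting the same records to people (HTML table) and to
    # other systems (JSON). The records themselves are illustrative.
    import json
    from html import escape

    records = [
        {"species": "Anas superciliosa", "count": 12, "observed_on": "2013-08-22"},
        {"species": "Porphyrio melanotus", "count": 3, "observed_on": "2013-08-22"},
    ]

    def as_json(rows: list) -> str:
        """Machine-readable form, suitable for other computer systems."""
        return json.dumps(rows, indent=2)

    def as_html_table(rows: list) -> str:
        """Human-readable form, suitable for a web page or report."""
        headers = list(rows[0].keys())
        head = "".join(f"<th>{escape(h)}</th>" for h in headers)
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(r[h]))}</td>" for h in headers) + "</tr>"
            for r in rows
        )
        return f"<table><tr>{head}</tr>{body}</table>"

    print(as_json(records))
    print(as_html_table(records))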

Data sharing systems should meet the following criteria:

User Stories

Maturity Criteria

[Activity] across Maturity Levels

[Activity] across Maturity Factors

Useful Resources

Analyse

Definition

Combine, compare and summarise data.

Description

User Stories

Maturity Criteria

[Activity] across Maturity Levels

[Activity] across Maturity Factors

Maturity Factors

This Guide defines six factors that determine the maturity of biodata management:

The following sections provide a description of each factor, along with maturity criteria for each factor across the data management activities and maturity levels.

Processes

Definition

The maintenance and adoption of standard procedures for managing data.

Description

Data quality can be maintained by consistently managing data using good practices. Good practices must be developed and documented, and should ideally be under constant, controlled review. The practices must be adopted, and should include mechanisms both to ensure that they are followed consistently and to measure whether they are.

User Stories

Maturity Criteria

[Factor] across Maturity Levels

[Factor] across data management Activities

Useful Resources

Tools

Definition

The technical tools that are used to manage data.

Description

Four main technical tools are used for biodata management:

  1. Field Data Capture Tools – Paper or digital tools to support field surveys, capture of the data collected and delivery of that data for ingestion.
  2. A Data Repository – One or more physical or digital containers where data resides permanently.
  3. A Metadata Repository – A catalogue of datasets.
  4. Analysis Tools – Tools for combining, comparing and summarising data.

The data repository, data catalogue and analysis tools may be separate or integrated. Both the data repository and data catalogue are used in the Ingest, Store and Share activities.

Mature data management technology satisfies the following criteria.

The following considerations are also important.

User Stories

Maturity Criteria

[Factor] across Maturity Levels

[Factor] across data management Activities

Useful Resources

Formats

Definition

The way that data is arranged for storage.

Description

The format in which data is stored determines how easily the data can be used and exchanged. Data in paper notebooks is not as easily used as data on structured paper forms. Data in obsolete formats, or in formats that can only be accessed using proprietary tools, is difficult to share and can easily be lost. Data that is published in PDF format is not as easily re-used as data published in a proprietary spreadsheet format such as Microsoft Excel. Data published in an open format such as a CSV file is even more easily re-used.
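
As a small illustration of why open formats are easy to re-use, the sketch below reads a CSV file with standard tooling and no proprietary software. The file name and columns are assumptions for the example.

    # A sketch of re-using data published as CSV: standard tooling can read it
    # without proprietary software. The file name and columns are illustrative.
    import csv

    with open("wetland-birds-2013.csv", newline="") as f:
        for row in csv.DictReader(f):
            print(row["species"], row["count"])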

User Stories

Maturity Criteria

[Factor] across Maturity Levels

[Factor] across data management Activities

Useful Resources

Licensing

Definition

Explicit statement of permitted uses of the data.

Description

Almost all biodata is of potential value in developing national and international understanding of biodiversity state and trends. Most biodata can be shared with the public without constraints.

In some cases, however, it is necessary to restrict access to biodata. Data may be collected under an agreement with a land-owner that restricts sharing. Some data relates to the occurrence of species that could be the target of rare species trafficking or other types of over-exploitation. Some data may be sensitive to stakeholders in ways that are not anticipated.

The following permissions information should be associated with all datasets.

Where the canonical instance of a dataset is contributed to a third party for permanent curation, a data licence should cover uses that are permitted by the Copyright Owner.

Data Provision (or Sharing) Agreements should make the licensing explicit.

Types of licence include the following.

Where the data itself cannot be disclosed, the existence of the data should be discoverable.
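
As an illustration only, the sketch below attaches licence and restriction information to each dataset and shows how the existence of a restricted dataset can remain discoverable without disclosing the data itself. The field names and licence wording are assumptions, not terms prescribed by this Guide.

    # A sketch of per-dataset permission metadata, and of exposing only the
    # existence of a restricted dataset. All names and values are illustrative.
    datasets = [
        {"dataset_id": "wetland-birds-2013", "licence": "CC BY 3.0",
         "restricted": False,
         "records": [{"species": "Anas superciliosa", "count": 12}]},
        {"dataset_id": "restricted-occurrences-2013",
         "licence": "Restricted under landowner agreement",
         "restricted": True,
         "records": [{"species": "example occurrence", "count": 1}]},
    ]

    def public_view(dataset: dict) -> dict:
        """Return full data for open datasets, but only discovery metadata for restricted ones."""
        if dataset["restricted"]:
            return {"dataset_id": dataset["dataset_id"],
                    "licence": dataset["licence"],
                    "note": "Data held but not publicly disclosed; contact the custodian."}
        return dataset

    for d in datasets:
        print(public_view(d))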

User Stories

Maturity Criteria

[Factor] across Maturity Levels

[Factor] across data management Activities

Useful Resources

Reliability

Definition

Ensuring that the user knows how reliable the data is.

Description

A key requirement is that the user is able to determine the provenance of the data.

The reliability of data determines how useful it is. It depends on the method used to collect the data and the rigour with which that method was followed. The reliability of data as an indicator of current state also depends on the age of the data. Data that is highly volatile (e.g. bird counts) loses reliability faster than data of low volatility (e.g. GPS references).

The system should allow data of varying reliability to be managed. The reliability of the data must be recorded when data is imported. The system should maintain data reliability indices as they change over time. It should make the reliability of the data easily discoverable. Only data of higher reliability should be shown to users who are not qualified to interpret data of low reliability. Where data of low reliability is not shown, the existence of the data should be discoverable.

The data depreciation model developed by Bay of Plenty Regional Council functions as follows.

Each dataset has a current reliability value denoting its value to non-specialists for determining current state. The system uses the current reliability value to determine the sort order of datasets or to filter out data of low value.

When each dataset is ingested, it is assigned an initial reliability value. A casual observation of a bird in a wetland has a lower reliability than a survey carried out by the same person using a rigorous methodology.

The dataset is also assigned a depreciation profile that determines the rate, and changes to the rate, at which the dataset depreciates over time. Data does not depreciate in a straight line. It may depreciate slowly for the first ten years, and then more quickly for the ten years after that.

Data depreciation profiles are determined by technical experts for specific dataset types depending on the species group, purpose, methodology and conditions of the collection. Datasets must be matched to the correct depreciation profile when they are imported.

Some systems use crowd-sourced ways to improve reliability, for example where experts can verify citizen observations (NatureWatch), or where data users can report errors (ALA).

Current reliability values for all datasets are recalculated nightly.
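
The following sketch illustrates the general idea of the model described above: an initial reliability value, a stepped depreciation profile, and a periodic recalculation of current reliability. The profile, rates and values are assumptions for illustration, not the figures used by Bay of Plenty Regional Council.

    # A sketch of data depreciation: an initial reliability value is reduced
    # over time according to a stepped profile, and current values are
    # recalculated periodically. All rates and values here are illustrative.
    from datetime import date

    # Depreciation profile: (years since ingestion, annual depreciation rate).
    # Slow depreciation for the first ten years, faster after that.
    STEPPED_PROFILE = [(10, 0.02), (float("inf"), 0.08)]

    def current_reliability(initial: float, ingested: date, today: date,
                            profile=STEPPED_PROFILE) -> float:
        """Apply the profile's annual rates, period by period, to the initial value."""
        years = (today - ingested).days / 365.25
        value, elapsed = initial, 0.0
        for limit, rate in profile:
            span = min(years, limit) - elapsed
            if span <= 0:
                break
            value -= rate * span
            elapsed += span
        return max(value, 0.0)

    # Nightly recalculation for every dataset, e.g. run from a scheduled job.
    datasets = [
        {"id": "casual-bird-observation", "initial": 0.4, "ingested": date(2005, 3, 1)},
        {"id": "rigorous-bird-survey", "initial": 0.9, "ingested": date(2005, 3, 1)},
    ]
    for d in datasets:
        d["current"] = current_reliability(d["initial"], d["ingested"], date.today())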

At the least:

Factors that determine initial reliability:

User Stories

Maturity Criteria

[Factor] across Maturity Levels

[Factor] across data management Activities

Useful Resources

Standards

Definition

The terms that are chosen to describe the data.

Description

In order to facilitate data sharing with other NZ agencies and via international mechanisms, the system should expose all data for sharing in a form that complies with applicable international data standards.

Field protocols and methodology, while important for gathering data of high quality, are excluded from the scope of this Guide.

The following sets of data standards are important for biodata management.

User Stories

Maturity Criteria

[Factor] across Maturity Levels

[Factor] across data management Activities

Useful Resources