Just came across this paper on bioRxiv.

A Measure of Open Data: A Metric and Analysis of Reusable Data Practices in Biomedical Data Resources

by Seth Carbon, Robin Champieux, Julie McMurry, Lilly Winfree, Letisha R Wyatt, and Melissa Haendel


Data is the foundation of science, and there is an increasing focus on how data can be reused and enhanced to drive scientific discoveries. However, most seemingly open data do not provide legal permissions for reuse and redistribution. Not being able to integrate and redistribute our collective data resources blocks innovation, and stymies the creation of life-improving diagnostic and drug selection tools. To help the biomedical research and research support communities (e.g. libraries, funders, repositories, etc.) understand and navigate the data licensing landscape, the (Re)usable Data Project (RDP, http://reusabledata.org) assesses the licensing characteristics of data resources and how licensing behaviors impact reuse. We have created a ruleset to determine the reusability of data resources and have applied it to 56 scientific data resources (i.e. databases) to date. The results show significant reuse and interoperability barriers. Inspired by game-changing projects like Creative Commons, the Wikipedia Foundation, and the Free Software movement, we hope to engage the scientific community in the discussion regarding the legal use and reuse of scientific data, including the balance of openness and how to create sustainable data resources in an increasingly competitive environment.

They outline five criteria that they propose for measuring data reusability:

  • Criteria A: Is the license or terms of use in an easy-to-find location? Is there one, unambiguous license, as opposed to multiple, conflicting versions? Is the license standard?
  • Criteria B: Does the license clearly define the terms of continuing reuse without need for negotiation with the data creators or resource curators? Does the license have a complete scope that covers all of the data and not just a portion?
  • Criteria C: Does the resource provide its data in a reasonable good-faith location, and is there a reasonable and transparent method of accessing that data in bulk?
  • Criteria D: Are all types of reuse (copying, editing, building upon, remixing, distributing) allowable, with or without attribution?
  • Criteria E: Can any type of user group reuse the data?
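To make the checklist concrete, here is a minimal sketch of how one might record yes/no answers to the five criteria for a single resource in a machine-readable form. This is my own illustration and not the RDP's actual ruleset or scoring method: the short criterion summaries, the `Evaluation` class, and the `example-db` resource are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict

# Short paraphrases of the five criteria listed above (my wording, not the paper's).
CRITERIA = {
    "A": "License is easy to find, unambiguous, and standard",
    "B": "License defines terms of reuse and covers all of the data",
    "C": "Data is accessible in bulk from a reasonable location",
    "D": "All types of reuse are allowed",
    "E": "Any type of user group can reuse the data",
}


@dataclass
class Evaluation:
    """Hypothetical record of one resource's answers to criteria A-E."""
    resource: str   # name of the data resource (made up here)
    answers: dict   # criterion letter -> True (met) / False (not met)

    def unmet(self):
        # Criteria that are missing or answered False, in A-E order.
        return [c for c in CRITERIA if not self.answers.get(c, False)]

    def to_json(self):
        # Serialize so the evaluation itself can be shared and reused.
        return json.dumps(asdict(self), sort_keys=True)


# Assumed example values, not taken from the paper.
example = Evaluation(
    resource="example-db",
    answers={"A": True, "B": True, "C": False, "D": True, "E": False},
)
print(example.unmet())
print(example.to_json())
```

A plain yes/no per criterion is the simplest possible choice here; the paper's ruleset may well be more fine-grained than this.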

They then test these criteria on 56 data sources.

I looked into the paper to see what those data sources were and found a bit of a bug in the paper. Where it reads:

Complete evaluations can be viewed on the RDP website and the RDP Github repository.

That sentence was supposed to contain web links, but they were not there.

I then went to their website and found what appear to be the data resources they tested. I am a bit surprised that this information is not in the paper itself, and furthermore that it is not shared in a repository of some kind (just in case their website goes away or breaks). It seems a bit strange that their data about data reusability is not itself particularly easy to find or reuse: all I could find was a web page summarizing what they found, and there does not appear to be a place to download the data behind their findings. Maybe the data is available in the GitHub repository in some way.

Anyway – seems like a useful project and worth checking out.




