Analyzing the licenses of all 11,000+ GBIF registered datasets

How much GBIF-mediated data can be used easily and legally? A collaborative analysis.

• Peter Desmet

Image by Peter Desmet

In my previous post, I highlighted the legal issues of showing 13,297 American bullfrog records downloaded from GBIF on a map. 96% of those records had no license or a non-standard one, making data use legally cumbersome.

But how much of this applies to all 417+ million occurrence records in GBIF? How challenging is GBIF’s 2014 mission to provide a machine-readable, standard license for all datasets? Fellow Datafable¹ member Bart Aelterman and I tried to find out.

Methodology

We used the GBIF registry API to obtain the metadata for all 11,000+ GBIF registered datasets and in particular the rights field, which is where data publishers can provide the license under which the dataset is published. We then created a unique list of all licenses used, which we annotated with parameters such as use allowed and attribution required. This information was joined back with the dataset information to get an idea of the distribution of certain types of licenses over all datasets and occurrence records. We also documented the guidelines we used for annotating these licenses.
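The annotate-and-join step described above can be sketched roughly as follows. This is a minimal illustration and not necessarily how the code in the repository works: the sample records, the `records` count field, the annotation keys, and the helper names `clean`, `unique_rights`, and `summarize` are all assumptions made for the example. Only the `rights` field comes from the dataset metadata.

```python
from collections import defaultdict

# Hypothetical sample of dataset metadata. In the real analysis these
# records would come from the GBIF registry API; the "records" count
# is an assumed field added here for illustration.
SAMPLE_DATASETS = [
    {"key": "d1", "rights": "CC0", "records": 100},
    {"key": "d2", "rights": "", "records": 300},
    {"key": "d3", "rights": "Free for research use, please cite", "records": 50},
    {"key": "d4", "rights": None, "records": 20},
    {"key": "d5", "rights": "CC0 ", "records": 30},
]

# Hypothetical manual annotations, keyed by the cleaned rights statement.
SAMPLE_ANNOTATIONS = {
    "CC0": {"type": "CC0", "attribution_required": False},
}


def clean(rights):
    """Normalize a rights value: None becomes "", whitespace is trimmed."""
    return (rights or "").strip()


def unique_rights(datasets):
    """Return the sorted list of distinct, non-empty rights statements."""
    return sorted({clean(d.get("rights")) for d in datasets} - {""})


def summarize(datasets, annotations):
    """Join annotations back to datasets; total datasets and records per type."""
    totals = defaultdict(lambda: {"datasets": 0, "records": 0})
    for d in datasets:
        rights = clean(d.get("rights"))
        if not rights:
            label = "No license"
        else:
            # Statements without an annotation are treated as non-standard here.
            label = annotations.get(rights, {}).get("type", "Non-standard license")
        totals[label]["datasets"] += 1
        totals[label]["records"] += d["records"]
    return dict(totals)


if __name__ == "__main__":
    print(unique_rights(SAMPLE_DATASETS))
    for label, counts in summarize(SAMPLE_DATASETS, SAMPLE_ANNOTATIONS).items():
        print(label, counts)
```

Keeping the annotations in a separate lookup keyed by the rights statement means each distinct license text only has to be interpreted once, however many datasets use it.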

In total we analyzed 11,974 datasets², representing 415,927,654 occurrences. The first thing we noticed is that only 10% of those datasets (26% of the occurrences) have a license. This is problematic (see below), but it had the welcome side effect that we “only” had to annotate 432 different licenses.

All code and data³ for this project are available on GitHub. #openresearch #ftw

Results

Overview of the licenses used

| License | # of datasets | # of records | % of records | GBIF practice? | Open data? |
| CC0 | 105 | 2,155,108 | 0.5% | yes | yes |
| CC BY | 8 | 2,240,674 | 0.5% | yes | yes |
| ODC-By | 11 | 567,675 | 0.1% | yes | yes |
| CC BY-SA | 16 | 450,421 | 0.1% | no | yes |
| ODbL & DbCL | 3 | 864 | 0.0% | no | yes |
| CC BY-NC | 10 | 4,308,627 | 1.0% | expected by some | no |
| CC BY-NC-SA | 17 | 569,040 | 0.1% | no | no |
| CC BY-NC-ND | 1 | 26,132 | 0.0% | no | no |
| Non-standard license | 1,069 | 100,062,731 | 24.1% | ? | ? |
| No license | 10,734 | 305,546,382 | 73.5% | ? | ? |
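The shares quoted in this post can be recomputed from the table. The sketch below copies the dataset and record counts from the rows above; the `share` helper and the grouping labels are names made up for this example. The rows add up to the 11,974 datasets and 415,927,654 occurrences we analyzed.

```python
# Figures copied from the table above:
# (license, datasets, records, is_standard_license)
ROWS = [
    ("CC0", 105, 2_155_108, True),
    ("CC BY", 8, 2_240_674, True),
    ("ODC-By", 11, 567_675, True),
    ("CC BY-SA", 16, 450_421, True),
    ("ODbL & DbCL", 3, 864, True),
    ("CC BY-NC", 10, 4_308_627, True),
    ("CC BY-NC-SA", 17, 569_040, True),
    ("CC BY-NC-ND", 1, 26_132, True),
    ("Non-standard license", 1_069, 100_062_731, False),
    ("No license", 10_734, 305_546_382, False),
]

TOTAL_DATASETS = sum(r[1] for r in ROWS)
TOTAL_RECORDS = sum(r[2] for r in ROWS)


def share(rows):
    """Return (% of all datasets, % of all records) covered by `rows`."""
    datasets = sum(r[1] for r in rows)
    records = sum(r[2] for r in rows)
    return 100 * datasets / TOTAL_DATASETS, 100 * records / TOTAL_RECORDS


if __name__ == "__main__":
    groups = {
        "Any license": [r for r in ROWS if r[0] != "No license"],
        "Standard license": [r for r in ROWS if r[3]],
        "CC0": [r for r in ROWS if r[0] == "CC0"],
    }
    for name, rows in groups.items():
        ds, rec = share(rows)
        print(f"{name}: {ds:.1f}% of datasets, {rec:.1f}% of records")
```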

Standard licenses

Ignoring for a moment that CC0 is the only sensible license for data, a standard license (Creative Commons or Open Data Commons) is at least standardized and easy to understand. However, only 1.4% of all datasets (2% of all occurrences) are published with a standard license.

Data dedicated to the public domain under CC0 represents an even smaller percentage: 0.9% of all datasets (0.5% of all occurrences). The silver lining is that most data publishers who choose a standard license choose CC0 (105 of 171 datasets).

Interpreting the other licenses

All other data are provided with no or a non-standard license, with a percentage similar to the bullfrog sample (98% vs 96% of the occurrences). These data are in a legal gray zone: it’s a mixture of legalese, norms, restrictions, agreements, or in most cases no information at all. It is up to every data user to figure out the details.

We tried to lift some of that burden by interpreting all these licenses and extracting some characteristics, but it should be clear that this is an attempt⁴ that should only be used with caution. The results are presented in the charts below.

[Chart: Datasets]

[Chart: Occurrences]

Conclusion

Our analysis of the licenses of all 11,000+ GBIF registered datasets paints a bleak picture. Very few GBIF registered datasets can be easily and legally used, let alone without restrictions. This is mainly due to data being published with no license or a non-standard one.

Fixing this is crucial, and GBIF’s 2014 mission to provide a machine-readable, standard license for all datasets is a step in the right direction. We hope our analysis (which can be run again) and guidelines already help with:

The Secretariat would review existing metadata provisionally to assign⁵ each current data set to one of these categories and would then communicate with data publishers to confirm the assignment. [source]

More importantly, this mission should be used as an opportunity to make the rights field mandatory, require CC0, and shift the discussion about ethical data use (including attribution) to norms rather than ill-suited legal tools.

  1. To combine our skills and organize some of our extracurricular activities, we started a team of open data enthusiasts called Datafable. The results of our first project were published by GBIF last week. 

  2. These include checklist and occurrence datasets. Obviously, only occurrence datasets are represented in the results for occurrences. 

  3. Additional legal issue: what license applies to the metadata of GBIF registered datasets? Can we publish even part of it on a GitHub repository? Note that metadata does include creative content, and some of it is even published as data papers. 

  4. We considered an alternative interpretation, taking into account the GBIF data use agreement (DUA). Jonathan A. Rees pointed out, however, that a DUA can only add restrictions or conditions, but never grant permissions (only copyright holders have the legal standing to do so). In other words, the GBIF DUA does not solve the situation of having no license: users still have to figure out the legal implications. See this issue for the whole discussion. 

  5. The characteristics we assigned to the licenses (commercial use allowed, notification required, etc.) could even be provided as machine tags on the GBIF portal, allowing users to already get some indication of what is allowed/required.
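
As a rough illustration of the idea in footnote 5, the sketch below turns a set of license characteristics into tag-like records. It assumes machine tags are simple namespace/name/value triples. The namespace, the characteristic names, and the `to_machine_tags` helper are hypothetical and do not refer to anything that exists on the GBIF portal.

```python
# Hypothetical namespace, chosen for this example only.
NAMESPACE = "license.example.org"


def to_machine_tags(characteristics, namespace=NAMESPACE):
    """Turn a dict of license characteristics into namespace/name/value tags."""
    tags = []
    for name in sorted(characteristics):
        value = characteristics[name]
        # Encode booleans as lowercase strings so every value is plain text.
        if isinstance(value, bool):
            value = "true" if value else "false"
        tags.append({"namespace": namespace, "name": name, "value": str(value)})
    return tags


if __name__ == "__main__":
    # Hypothetical characteristics for one annotated license.
    example = {"commercial_use_allowed": True, "notification_required": False}
    for tag in to_machine_tags(example):
        print(tag)
```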