Analyzing the licenses of all 11,000+ GBIF registered datasets

How much GBIF-mediated data can be easily and legally used? A collaborative analysis.

November 22, 2013 • Peter Desmet

Image by Peter Desmet

Methodology

We used the GBIF registry API to obtain the metadata for all 11,000+ GBIF registered datasets and in particular the rights field, which is where data publishers can provide the license under which the dataset is published. We then created a unique list of all licenses used, which we annotated with parameters such as use allowed and attribution required. This information was joined back with the dataset information to get an idea of the distribution of certain types of licenses over all datasets and occurrence records. We also documented the guidelines we used for annotating these licenses.
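The deduplicate, annotate, and join-back steps described above can be sketched roughly as follows. This is a minimal illustration, not the code from our repository: the sample records, the field names (`rights`, `occurrences`, `license_type`, `commercial_use_allowed`, `attribution_required`) and the `summarize` helper are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample of dataset metadata (not real GBIF records).
datasets = [
    {"key": "a", "rights": "CC0", "occurrences": 1000},
    {"key": "b", "rights": "Free for non-commercial use; please cite us.", "occurrences": 250},
    {"key": "c", "rights": "", "occurrences": 4000},
    {"key": "d", "rights": "CC0", "occurrences": 50},
]

# Step 1: build the unique list of license texts (empty string = no license).
unique_rights = sorted({d["rights"].strip() for d in datasets})

# Step 2: annotate each unique license by hand (illustrative values only).
annotations = {
    "": {
        "license_type": "no license",
        "commercial_use_allowed": None,
        "attribution_required": None,
    },
    "CC0": {
        "license_type": "standard",
        "commercial_use_allowed": True,
        "attribution_required": False,
    },
    "Free for non-commercial use; please cite us.": {
        "license_type": "non-standard",
        "commercial_use_allowed": False,
        "attribution_required": True,
    },
}


def summarize(datasets, annotations):
    """Join annotations back to datasets and total them per license type."""
    totals = defaultdict(lambda: {"datasets": 0, "occurrences": 0})
    for d in datasets:
        license_type = annotations[d["rights"].strip()]["license_type"]
        totals[license_type]["datasets"] += 1
        totals[license_type]["occurrences"] += d["occurrences"]
    return dict(totals)


summary = summarize(datasets, annotations)
for license_type, counts in summary.items():
    print(license_type, counts)
```

Annotating the unique list rather than every dataset is what keeps the manual work manageable: each distinct license text is interpreted once and then reused for every dataset that carries it.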

In total, we analyzed 11,974 datasets², representing 415,927,654 occurrences. The first thing we noticed is that only 10% of those datasets (26% of the occurrences) have a license. This is problematic (see below), but it had the welcome side effect that we “only” had to annotate 432 different licenses.

All code and data³ for this project are available on GitHub. #openresearch #ftw

Results

Overview of the licenses used

License	# of datasets	# of records	% of records	GBIF practice?	Open data?
CC0	105	2,155,108	0.5%	yes	yes
CC BY	8	2,240,674	0.5%	yes	yes
ODC-By	11	567,675	0.1%	yes	yes
CC BY-SA	16	450,421	0.1%	no	yes
ODbL & DbCL	3	864	0.0%	no	yes
CC BY-NC	10	4,308,627	1.0%	expected by some	no
CC BY-NC-SA	17	569,040	0.1%	no	no
CC BY-NC-ND	1	26,132	0.0%	no	no
Non-standard license	1,069	100,062,731	24.1%	?	?
No license	10,734	305,546,382	73.5%	?	?

Standard licenses

Ignoring for a moment that CC0 is the only sensible license for data, a standard license (Creative Commons or Open Data Commons) is at least standardized and easy to understand. However, only 1.4% of all datasets (2% of all occurrences) are published with a standard license.

Data dedicated to the public domain under CC0 represent an even smaller percentage: 0.9% of all datasets (0.5% of all occurrences). The silver lining is that most data publishers who choose a standard license choose CC0 (105 datasets).
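These shares can be recomputed from the overview table. The snippet below is a small sketch that hard-codes the table's figures for the standard licenses; the variable names and the `share` helper are ours for illustration. Computed this way, the occurrence share of standard licenses comes out at roughly 2.5%, which the text above gives as 2%.

```python
# Totals as reported in the text.
TOTAL_DATASETS = 11_974
TOTAL_OCCURRENCES = 415_927_654

# (number of datasets, number of records) per standard license, from the table.
standard_licenses = {
    "CC0": (105, 2_155_108),
    "CC BY": (8, 2_240_674),
    "ODC-By": (11, 567_675),
    "CC BY-SA": (16, 450_421),
    "ODbL & DbCL": (3, 864),
    "CC BY-NC": (10, 4_308_627),
    "CC BY-NC-SA": (17, 569_040),
    "CC BY-NC-ND": (1, 26_132),
}


def share(part, total):
    """Return part/total as a percentage rounded to one decimal."""
    return round(100 * part / total, 1)


std_datasets = sum(n for n, _ in standard_licenses.values())
std_records = sum(r for _, r in standard_licenses.values())

print("Standard licenses:",
      share(std_datasets, TOTAL_DATASETS), "% of datasets,",
      share(std_records, TOTAL_OCCURRENCES), "% of occurrences")

cc0_datasets, cc0_records = standard_licenses["CC0"]
print("CC0:",
      share(cc0_datasets, TOTAL_DATASETS), "% of datasets,",
      share(cc0_records, TOTAL_OCCURRENCES), "% of occurrences")
```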

Interpreting the other licenses

All other data are provided with no license or a non-standard one, a percentage similar to that of the bullfrog sample (98% vs 96% of the occurrences). These data are in a legal gray zone: a mixture of legalese, norms, restrictions, agreements, or in most cases no information at all. It is up to every data user to figure out the details.

We tried to lift some of that burden by interpreting all these licenses and extracting some characteristics, but it should be clear that this is an attempt⁴ that should only be used with caution. The results are presented in the charts below.

[Charts: license characteristics by number of datasets and by number of occurrences]

Conclusion

Our analysis of the licenses of all 11,000+ GBIF registered datasets paints a bleak picture. Very few GBIF registered datasets can be easily and legally used, let alone without restrictions. This is mainly due to data being published with no license or a non-standard one.

Fixing this is crucial, and GBIF’s 2014 mission to provide a machine-readable, standard license for all datasets is a step in the right direction. We hope our analysis (which can be run again) and guidelines already help with:

The Secretariat would review existing metadata provisionally to assign⁵ each current data set to one of these categories and would then communicate with data publishers to confirm the assignment. [source]

More importantly, this mission should be used as an opportunity to make the rights field mandatory, require CC0, and shift the discussion about ethical data use (including attribution) to norms rather than ill-suited legal tools.

¹ To combine our skills and organize some of our extracurricular activities, we started a team of open data enthusiasts called Datafable. The results of our first project were published by GBIF last week.
² These include checklist and occurrence datasets. Obviously, only occurrence datasets are represented in the results for occurrences.
³ Additional legal issue: what license applies to the metadata of GBIF registered datasets? Can we publish even part of it on a GitHub repository? Note that metadata does include creative content, and some of it is even published as data papers.
⁴ We considered an alternative interpretation, taking into account the GBIF data use agreement (DUA). Jonathan A. Rees pointed out, however, that a DUA can only add restrictions or conditions, but never grant permissions (only copyright holders have the legal standing to do so). In other words, the GBIF DUA does not solve the situation of having no license: users still have to figure out the legal implications. See this issue for the whole discussion.
⁵ The characteristics we assigned to the licenses (commercial use allowed, notification required, etc.) could even be provided as machine tags on the GBIF portal, allowing users to already get some indication of what is allowed or required.
