During the last decade, we have seen an explosive increase in interest in openly available datasets of species observations. We even coined the term “Biodiversity Informatics” for the discipline that helps to collect, mobilize, digitize, curate, store, visualize and analyse this immense amount of data. We have previously discussed how different stakeholders set out the Global Biodiversity Informatics Outlook (GBIO), paving the road for the development of the discipline. The GBIO sets up a framework describing the many problems and potentials of biodiversity data, including the culture in which the data is embedded, the data itself, the evidence required to refine, structure and evaluate it, and finally, the understanding acquired after analysing it. Here, I want to open up the discussion about the data itself.

Primary Biodiversity Data (PBD) is the confirmed occurrence of a species; that is, a data point with spatial coordinates, a date, and some taxonomic identification. Errors and varying degrees of accuracy can be found in each of these dimensions. The word primary refers to the fact that it has not been elaborated in any way. It is the simplest raw data point that can be collected to sample biodiversity. Simple, but not quite. There is a lot more that can be said about this data point (metadata). Information about the way it was observed or collected, who did it, how many specimens were there, whether DNA samples are available, and many more variables can accompany each observation. All this is regulated by standards defining the fields to fill in and the possible values each field can take. For PBD, the standard that counts is Darwin Core (http://rs.tdwg.org/dwc/).
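As a rough illustration of what such a data point can look like in practice, the sketch below models a single PBD record in Python. The field names are borrowed from Darwin Core terms (scientificName, decimalLatitude, decimalLongitude, eventDate, recordedBy, individualCount, coordinateUncertaintyInMeters). The class name, the helper function and the checks themselves are assumptions made for this example; they are not part of the standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Occurrence:
    """One PBD data point. Field names follow Darwin Core terms;
    the class itself is a hypothetical container for this example."""
    # The core of a PBD record: what, where and when.
    scientificName: str
    decimalLatitude: float
    decimalLongitude: float
    eventDate: date
    # Optional metadata that can accompany the observation.
    recordedBy: Optional[str] = None
    individualCount: Optional[int] = None
    coordinateUncertaintyInMeters: Optional[float] = None


def basic_problems(rec: Occurrence) -> List[str]:
    """Return a list of obvious problems in the three core dimensions.
    These checks are illustrative, not an official validation routine."""
    issues = []
    if not rec.scientificName.strip():
        issues.append("missing scientificName")
    if not -90.0 <= rec.decimalLatitude <= 90.0:
        issues.append("decimalLatitude out of range")
    if not -180.0 <= rec.decimalLongitude <= 180.0:
        issues.append("decimalLongitude out of range")
    if rec.eventDate > date.today():
        issues.append("eventDate in the future")
    return issues


if __name__ == "__main__":
    # A made-up observation, used only to show the call.
    rec = Occurrence("Anas platyrhynchos", 59.85, 17.64, date(2015, 5, 1),
                     recordedBy="observer 1", individualCount=2)
    print(basic_problems(rec))
```

Checks like these only catch gross errors; the subtler problems of accuracy and bias discussed below need a statistical treatment.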

Fig. 1: Primary Biodiversity Data (PBD) is the currency of biodiversity informatics. It is basically a date, a spatial coordinate, and a species name… but much more can be said about each data point (i.e. metadata). There are many different sources and types of data with very different qualities. Therefore, many terms have been used to refer to different aspects of it, and many of them overlap. With time, we came to realize that we have been talking about the same data, but not quite.

However, in the big data era, many new sources of PBD have emerged. The new sources are mainly non-systematically collected, a.k.a. opportunistic, data. Many new terms have also emerged to refer to these sources of information or types of data (Fig. 1), but their definitions are not always clear because the terms overlap, and none is fully contained by another. In Fig. 1 we can see that three major terms refer to the storage of the data. Is it centralized in a database, or do you need to systematically search for each observation on e.g. Twitter? A clear overlap between Biodiversity Information Systems and crowdsourcing is e.g. Artportalen.se or eBird.org, where the effort of lots of individual observers gets stored and curated in an open and standardized database. Is the data lying in an explorer’s notebook? It can also happen that data is still in a non-standardized Excel file on a single computer, out of reach of the general public. The data is digitized but, in practice, still nonexistent.

The next big category is opportunistic data, a term that is often used as a synonym of citizen-science data. Opportunistic simply refers to the lack of an unbiased sampling design (non-systematic), which is what often happens when lay people report their observations to biodiversity information systems. But this is also the case for most data stored in museums, collected by explorers who gathered observations as they went along new paths. Conversely, many citizen-science programs do have carefully designed sampling schemes and protocols. It is interesting to see how the level of freedom volunteers are given positively affects their permanence in the program; but that is for another chapter. One of the major challenges nowadays is that opportunistic data shares the floor with standardized and (relatively) bias-free data, which until a decade ago was widely considered the only basis for doing science. The point is that there is potential in both approaches, but the analyst should clearly know how the data was collected and whether she or he is handling mixed sources of data.

Fig. 2: The vast amounts of species observations accumulated offer us new opportunities to understand nature. However, malt is not whisky! We need to know how to extract the essence out of it. The modern data distillery allows us to concentrate the information present in Primary Biodiversity Data, which often comes from different sources of very different quality; it is just like “Moonshine” for biodiversity conservation. The model, analogous to the alembic, is the container we give to the data, allowing only the desired essence to escape. The flame, in our case, is the Bayesian fire, a statistical approach that makes the data turn and boil until the signal is identified and isolated. Other statistical procedures exist, but Bayesian statistics is the state of the art, and we prefer it because it allows us to account for the many sources of uncertainty. (Image modified from generic alembic drawing)

For many years the analytic approach used to handle biased data was to filter “good” from “bad” data. However, the biodiversity informatics community quickly realised that all data holds valuable information; it is just a matter of learning how to deal with its error. Today, there is a great effort to put all PBD to the service of nature conservation and environmental education (“Connecting Earth observation to high-throughput biodiversity data”) while accounting for the diversity of biases inherent to it. We can see it as a modern data distillery (Fig. 2). PBD is the impure raw element, where the impurities are errors and sampling biases. An occupancy-detection model is a way of extracting the relevant information from many observations of different quality and concentrating it into a refined product that can be used with confidence in several different applications. We have captured the “information essence” of all observations. For example, let’s assume we have many presence-only observations of birds at a site. Some of them are incidental and made without much effort; others are very detailed, from observers who report everything they see. Now let’s consider two species. One is a passerine that is only observed during the early morning walks of observer 1, who only reports those birds because she is fascinated by them. The other is a very common duck that is only reported by those who systematically report everything they see. Under the “filtering” approach many observations would simply have been ignored, including those unique observations of the early morning bird. In the occupancy-detection model, by contrast, all observations are weighted based on their quality (there are many ways to describe quality) and the likelihood of a species being missed is calculated per observation. Hence, we can obtain, with varying certainty, the probability that a species is truly absent from a site, a very valuable piece of information when handling opportunistic data.
All this could be done before further analyses (obtaining the refined product) or simultaneously with the prediction of species presence at unvisited sites based on complementary information about the environment.
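To make the idea of “the probability of a species being truly absent” more concrete, here is a minimal sketch of the textbook single-site calculation behind occupancy-detection thinking. It is not the model described in this post: it assumes a known prior occupancy probability (psi), independent visits, no false positives, and a per-visit detection probability that stands in for observation quality. All names and numbers are hypothetical.

```python
from typing import Sequence


def prob_absent_given_no_detections(psi: float,
                                    detection_probs: Sequence[float]) -> float:
    """Posterior probability that a species is truly absent from a site,
    given that none of the visits detected it.

    psi             -- assumed prior probability that the site is occupied
    detection_probs -- assumed probability, per visit, of detecting the
                       species if it is present (higher for thorough visits)
    """
    if not 0.0 <= psi <= 1.0:
        raise ValueError("psi must be between 0 and 1")
    # Probability that every visit misses the species although it is present.
    miss_all = 1.0
    for p in detection_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("detection probabilities must be between 0 and 1")
        miss_all *= 1.0 - p
    truly_absent = 1.0 - psi
    present_but_missed = psi * miss_all
    total = truly_absent + present_but_missed
    if total == 0.0:
        raise ValueError("non-detection is impossible under these inputs")
    # Bayes' rule: P(absent | no detections).
    return truly_absent / total


if __name__ == "__main__":
    # Three incidental visits versus three systematic visits (made-up values).
    incidental = prob_absent_given_no_detections(0.5, [0.1, 0.1, 0.1])
    systematic = prob_absent_given_no_detections(0.5, [0.6, 0.6, 0.6])
    print(incidental, systematic)
```

With the same number of visits, the systematic observers give much stronger evidence of absence than the incidental ones, yet the incidental visits still contribute something instead of being filtered out. A full occupancy-detection model estimates psi and the detection probabilities from the data rather than assuming them.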

At Greensway, we are doing our own bit to improve this data distillery. Together with Tomas Pärt and his research group at SLU, we have developed a model that exploits the potential of high densities of opportunistic data. In this way, we are now able to refine the temporal resolution of our models and predictions from years to days, opening the doors to many new research questions. For example, we can now rethink the definition of species presence and accurately filter local species lists, keeping only species that fulfil certain phenological criteria. For a species to be included in a local species list, we can now check how many days it has been present at a site, and disregard those that have been there only anecdotally. The variability in diversity among sites is then amplified, allowing us to better understand what drives the assembly of communities.
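The filtering step just described can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the input is a list of (species, day) pairs marking the days on which a species is considered present at one site, and the minimum number of days is an arbitrary threshold. In the actual work, daily presence would come from the model's estimates, not from raw records, and the function name and threshold here are hypothetical.

```python
from collections import defaultdict
from datetime import date
from typing import Iterable, List, Tuple


def local_species_list(daily_presence: Iterable[Tuple[str, date]],
                       min_days: int) -> List[str]:
    """Return the species present at a site on at least `min_days`
    distinct days, sorted alphabetically."""
    days_per_species = defaultdict(set)
    for species, day in daily_presence:
        # A set ignores repeated reports of a species on the same day.
        days_per_species[species].add(day)
    return sorted(species for species, days in days_per_species.items()
                  if len(days) >= min_days)


if __name__ == "__main__":
    # Made-up records for one site.
    records = [
        ("Anas platyrhynchos", date(2015, 5, 1)),
        ("Anas platyrhynchos", date(2015, 5, 2)),
        ("Anas platyrhynchos", date(2015, 5, 3)),
        ("Luscinia luscinia", date(2015, 5, 2)),
    ]
    print(local_species_list(records, min_days=3))
```

Raising or lowering the threshold changes which species count as part of the local community, which is exactly why the choice of phenological criteria matters.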

What exciting times we live in! Let’s make this count for nature conservation.