Introduction

This document is intended as a primer for the new Neotoma R package, neotoma2. The neotoma2 package is available from GitHub and can be installed in R using the devtools package:

devtools::install_github('NeotomaDB/neotoma2')
library(neotoma2)

In this tutorial you will learn how to:

  • Search for sites using site names and geographic parameters
  • Filter results using temporal and spatial parameters
  • Obtain sample information for the selected datasets
  • Perform basic analysis including the use of climate data from rasters

Accessing and Manipulating Data with neotoma2

For this workbook we use several packages, including leaflet, sf and others. We load the packages using the pacman package, which will automatically install any packages that are not already present in your R library. (pacman itself can be installed with install.packages("pacman").)

options(warn = -1)
pacman::p_load(neotoma2, dplyr, ggplot2, sf, geojsonsf, leaflet, raster, DT)

Note that R is sensitive to the order in which packages are loaded. Using neotoma2:: tells R explicitly that you want to use the neotoma2 package to run a particular function. So, for a function like filter(), which exists in other packages such as dplyr, you may see an error that looks like:

Error in UseMethod("filter") : 
  no applicable method for 'filter' applied to an object of class "sites"

In that case it’s likely that the wrong package is trying to run filter(), and so explicitly adding dplyr:: or neotoma2:: in front of the function name (i.e., neotoma2::filter()) is good practice.
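
For example, a minimal sketch (my_sites is a hypothetical sites object):

## Ambiguous: with dplyr loaded after neotoma2, this may dispatch to
## dplyr::filter() and fail on a sites object.
# filter(my_sites, datasettype == "pollen")

## Explicit: always calls the neotoma2 method.
neotoma2::filter(my_sites, datasettype == "pollen")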

Getting Help with Neotoma

If you’re planning on working with Neotoma, please join us on Slack, where we manage a channel specifically for questions about the R package. You may also wish to join our Google Groups mailing list; please contact us to be added.

Site Searches

get_sites()

There are several ways to find sites in neotoma2, but we think of sites primarily as spatial objects. They have names, locations, and are found within the context of geopolitical units, but within the API and the package, the site itself does not have associated information about taxa, dataset types or ages. It is simply the container into which we add that information. So, when we search for sites we can search by any of the following (a short sketch follows the list):

  • siteid
  • sitename
  • location
  • altitude (maximum and minimum)
  • geopolitical unit
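
A few minimal sketches of these searches (the values shown are hypothetical; parameter names follow the Neotoma API):

neotoma2::get_sites(siteid = 9606)                 # by site identifier
neotoma2::get_sites(altmin = 1000, altmax = 2500)  # by elevation range (m)
neotoma2::get_sites(gpid = "Czech Republic")       # by geopolitical unit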

Site names: sitename="%Lait%"

We may know exactly what site we’re looking for (“Lac Mouton”), or have an approximate guess for the site name (for example, we know it’s something like “Lait Lake”, or “Lac du Lait”, but we’re not sure how it was entered specifically).

We use the general format: get_sites(sitename="XXXXX") for searching by name.

PostgreSQL (and the API) uses the percent sign as a wildcard. So "%Lait%" would pick up “Lac du Lait” for us (and would pick up “Lake Lait” and “This Old Laity Hei-dee-ho Bog” if they existed). Note that the search query is also case insensitive, so you could simply write "%lait%".

Code
spo_sites <- neotoma2::get_sites(sitename = "%Lait%")
plotLeaflet(spo_sites)
Result

Location: loc=c()

The neotoma package used a bounding box for locations, structured as a vector of latitude and longitude values: c(xmin, ymin, xmax, ymax). The neotoma2 R package supports this simple bounding box as well as more complex spatial objects, using the sf package. Using the sf package allows us to work more easily with raster and polygon data in R, and to select sites from more complex spatial objects. The loc parameter works with the simple vector, WKT, geoJSON objects and native sf objects in R. Note, however, that the neotoma2 package is a wrapper around a simple API call using a URL (api.neotomadb.org), and URL strings can only be 1028 characters long, so the API cannot accept very long or complex spatial objects.
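
If a polygon is too detailed to fit within that limit, one workaround (a sketch, assuming you have an sf polygon called complex_poly) is to simplify it before making the call:

## Reduce the vertex count before sending the polygon to the API.
## The dTolerance value here is a hypothetical choice; its units depend on the CRS.
simple_poly <- sf::st_simplify(complex_poly, dTolerance = 1000)
simple_sites <- neotoma2::get_sites(loc = simple_poly)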

Let’s look for sites using a location. Here we create three representations of the Czech Republic as a list with three elements: a geoJSON string, a WKT string and a bounding box. We then transform the cz$geoJSON element into an object for the sf package. Any of these four spatial representations works with the neotoma2 package.

cz <- list(geoJSON = '{"type": "Polygon",
        "coordinates": [[
            [12.40, 50.14],
            [14.10, 48.64],
            [16.95, 48.66],
            [18.91, 49.61],
            [15.24, 50.99],
            [12.40, 50.14]]]}',
        WKT = 'POLYGON ((12.4 50.14, 
                         14.1 48.64, 
                         16.95 48.66, 
                         18.91 49.61,
                         15.24 50.99,
                         12.4 50.14))',
        bbox = c(12.4, 48.64, 18.91, 50.99))

cz$sf <- geojsonsf::geojson_sf(cz$geoJSON)[[1]]

cz_sites <- neotoma2::get_sites(loc = cz$geoJSON, all_data = TRUE)
## Your search returned 89 objects.

You can always simply plot() the sites objects, but you will lose some of the geographic context. The plotLeaflet() function returns a leaflet() map, and allows you to further customize it, or add additional spatial data (like our original bounding polygon, cz$sf, which works directly with the R leaflet package):

Code
neotoma2::plotLeaflet(cz_sites) %>% 
  leaflet::addPolygons(map = ., 
                       data = cz$sf, 
                       color = "green")
Result

Site Helpers

Neotoma R Package UML diagram.

If we look at the UML diagram for the objects in the neotoma2 R package, we can see that there is a set of functions that can operate on sites. As we add to sites objects, using get_datasets() or get_downloads(), we are able to use more of these helper functions. As it is, we can take advantage of functions like summary() to get a more complete sense of the types of data we have in this set of sites. The following code gives the summary table. We do some R magic here to change the way the data are displayed (turning the output into a datatable() object), but the main piece is the summary() call.

Code
neotoma2::summary(cz_sites)
Result

We can see that there are no chronologies associated with the site objects. This is because, at present, we have not pulled in the dataset information we need. All we know from get_sites() are the kinds of datasets we have.

Searching for datasets:

We know that collection units and datasets are contained within sites: a sites object contains collectionunits, which in turn contain datasets. From the table above we can see that some of the sites we’ve looked at contain pollen records. That said, we still only have the sites; it’s just that (for convenience) the sites API returns some information about datasets to make it easier to navigate the records.

With a sites object we can directly call get_datasets() to pull in more metadata about the datasets. At any time we can use datasets() to get more information about any datasets that a sites object may contain. Compare the output of datasets(cz_sites) to the output of a similar call using the following:

Code

cz_datasets <- neotoma2::get_datasets(cz_sites, all_data = TRUE)

datasets(cz_datasets)

Result

Filter Records

If we choose to pull in information about only a single dataset type, or if there is additional filtering we want to do before we download the data, we can use the filter() function. For example, if we only want pollen records, and want records with known chronologies, we can filter:

Code

cz_pollen <- cz_datasets %>% 
  neotoma2::filter(datasettype == "pollen" & !is.na(age_range_young))

neotoma2::summary(cz_pollen)

Result

We can see now that the data table looks different, and there are fewer total sites.

Pulling in samples() data.

Because sample data adds a lot of overhead (for the Czech pollen data, the object that includes the dataset with samples is 20 times larger than the dataset alone), we try to call get_downloads() after we’ve done our preliminary filtering. After get_datasets() you have enough information to filter based on location, time bounds and dataset type. When we move to get_downloads() we can do more fine-tuned filtering at the analysis unit or taxon level.

The following call can take some time, but we’ve frozen the object as an RDS data file. You can run this command on your own, and let it run for a bit, or you can just load the object in.

## This line is commented out because we've already run it for you.
## cz_dl <- cz_pollen %>% get_downloads(all_data = TRUE)
cz_dl <- readRDS('data/czDownload.RDS')
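
As a quick sanity check (a sketch using base R's object.size()), you can compare the size of the downloaded object against the dataset-only object to see the overhead that samples add:

format(object.size(cz_pollen), units = "Mb")  # datasets only
format(object.size(cz_dl), units = "Mb")      # with samples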

Once we’ve downloaded, we have information for each site about all the associated collection units, the datasets, and, for each dataset, all the associated samples. To extract all the samples we can call:

allSamp <- samples(cz_dl)

When we’ve done this, we get a data.frame that is 130889 rows long and 37 columns wide. The table is returned in a long format, so each row is a single observation that carries all the contextual information you should need to properly interpret it. We can list the columns with a call like:

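colnames(allSamp)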
##  [1] "age"             "agetype"         "ageolder"        "ageyounger"     
##  [5] "chronologyid"    "chronologyname"  "units"           "value"          
##  [9] "context"         "element"         "taxonid"         "symmetry"       
## [13] "taxongroup"      "elementtype"     "variablename"    "ecologicalgroup"
## [17] "analysisunitid"  "sampleanalyst"   "sampleid"        "depth"          
## [21] "thickness"       "samplename"      "datasetid"       "siteid"         
## [25] "sitename"        "lat"             "long"            "area"           
## [29] "sitenotes"       "description"     "elev"            "collunitid"     
## [33] "database"        "datasettype"     "age_range_old"   "age_range_young"
## [37] "datasetnotes"

For some dataset types or analyses, some of these columns may not be needed; for other dataset types they may be critically important. To make the neotoma2 package as useful as possible for the community, we’ve included as many as we can.

Extracting Taxa

If you want to know which taxa we have in the records, you can use the helper function taxa() on the sites object. The taxa() function gives us not only the unique taxa, but also two additional columns, sites and samples, that tell us how many sites and how many samples each taxon appears in, to help us better understand how common individual taxa are.

Code
neotomatx <- neotoma2::taxa(cz_dl)
Results

The taxonid values can be linked to the taxonid column in the samples() output. This allows us to build taxon harmonization tables if we choose to. You may also note that the taxon name is in the field variablename. Individual sample counts are reported in Neotoma as variables. A “variable” may be a species, a laboratory measurement, or a non-organic proxy, like charcoal or XRF measurements, and includes the units of measurement and the value.
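
For example, a minimal sketch of that link, joining the samples back to the taxon table by taxonid (using the neotomatx and allSamp objects defined above):

allSamp %>%
  dplyr::inner_join(dplyr::select(neotomatx, taxonid, taxonname = variablename),
                    by = "taxonid") %>%
  dplyr::select(siteid, sampleid, taxonid, taxonname, value) %>%
  head()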

Simple Harmonization

Let’s say we want all Plantago taxa reported in these samples to be grouped together into one pseudo-taxon called Plantago. There are several ways of doing this: directly, by exporting the file and editing each individual cell, or by creating an external “harmonization” table (which we did in the prior neotoma package).

Programmatically, we can harmonize taxon by taxon using matching and transformation. We’re using dplyr-style coding here to mutate() the column variablename so that any time we detect (str_detect()) a variablename that starts with Plantago (the .* represents a wildcard for any character [.], zero or more times [*]) we replace() it with the character string "Plantago". Note that this changes Plantago in the allSamp object, but if we were to call samples() again, the taxonomy would return to its original form.

We’re going to filter the ecological groups to include only UPHE (upland/heath) and TRSH (trees and shrubs). More information about ecological groups is available from the Neotoma Online Manual.

allSamp <- allSamp %>% 
  dplyr::filter(ecologicalgroup %in% c("UPHE", "TRSH")) %>%
  mutate(variablename = replace(variablename, 
                                stringr::str_detect(variablename, "Plantago.*"), 
                                "Plantago"))

There were originally 15 different taxa identified as being within the genus Plantago (including Plantago, Plantago major, and Plantago alpina-type). The above code reduces them all to a single taxonomic group Plantago.
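
As a quick check (a sketch using the allSamp object from above), we can confirm that only the single pseudo-taxon remains:

allSamp %>%
  dplyr::filter(stringr::str_detect(variablename, "Plantago")) %>%
  dplyr::distinct(variablename)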

If we want to have an artifact of our choices, we can use an external table. For example, a table of pairs (what we want changed, and the name we want it replaced with) can be generated, and it can include regular expressions (if we choose):

original     replacement
Abies.*      Abies
Vaccinium.*  Ericaceae
Typha.*      Aquatic
Nymphaea     Aquatic
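
If you’d rather build the table in code than edit a CSV, here is a minimal sketch using tibble::tribble(). Note that the join further below expects the key column to be named variablename, and that an exact inner_join() matches these strings literally rather than as regular expressions:

translation <- tibble::tribble(
  ~variablename, ~replacement,
  "Abies.*",     "Abies",
  "Vaccinium.*", "Ericaceae",
  "Typha.*",     "Aquatic",
  "Nymphaea",    "Aquatic")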

We can get the list of original names directly from the taxa() call, applied to a sites object, and then export it using write.csv().

Code
taxaplots <- taxa(cz_dl)
# Save the taxon list to file so we can edit it subsequently.
readr::write_csv(taxaplots, "data/mytaxontable.csv")
Result

Figure. A plot of the number of sites a taxon appears in, against the number of samples a taxon appears in.

The plot is mostly for illustration, but we can see, as a sanity check, that the relationship is as we’d expect.
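
A sketch of the code that would produce a figure like the one above, using the taxaplots table (column names as returned by taxa()):

ggplot(taxaplots, aes(x = sites, y = samples)) +
  geom_point(alpha = 0.5) +
  labs(x = "Number of sites", y = "Number of samples") +
  theme_bw()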

You can then export either one of these tables and add a column with the replacement names; you could also add extra contextual information, such as the ecologicalgroup or taxongroup columns, to help you out. Once you’ve cleaned up the translation table you can load it in and apply the transformation:

translation <- readr::read_csv("data/taxontable.csv")

You can see we’ve changed some of the taxon names in the taxon table (don’t look too far, I just did this as an example). To replace the names in the samples() output, we’ll join the two tables using an inner_join() (meaning the variablename must appear in both tables for the result to be included), and then we’re going to select only those elements of the sample tables that are relevant to our later analysis:

allSamp <- samples(cz_dl)

allSamp <- allSamp %>%
  inner_join(translation, by = c("variablename" = "variablename")) %>% 
  dplyr::select(!c("variablename", "sites", "samples")) %>% 
  group_by(siteid, sitename, replacement,
           sampleid, units, age,
           agetype, depth, datasetid,
           long, lat) %>%
  summarise(value = sum(value), .groups='keep')

Simple Analytics

Stratigraphic Plotting

We can use packages like rioja to do stratigraphic plotting for a single record, but first we need to do some different data management. Although we could do the harmonization again, we’re going to simply take the ten most common taxa at a single site and plot them in a stratigraphic diagram.

We use the arrange() call to sort by the number of samples in which each taxon appears within the core. This way we can take our samples and select only the taxa that appear in the first ten rows of the plottingTaxa data.frame.

# Get a particular site, select only taxa identified from pollen (and only trees/shrubs)
plottingSite <- cz_dl[[1]]

plottingTaxa <- taxa(plottingSite) %>%
  filter(ecologicalgroup %in% c("TRSH")) %>%
  filter(elementtype == "pollen") %>%
  arrange(desc(samples)) %>% 
  head(n = 10)

# Clean up. Select only pollen measured using NISP.
# We repeat the filters for pollen & ecological group on the samples
shortSamples <- samples(plottingSite) %>% 
  filter(variablename %in% plottingTaxa$variablename) %>% 
  filter(ecologicalgroup %in% c("TRSH")) %>%
  filter(elementtype == "pollen") %>%
  filter(units == "NISP")

# Transform to proportion values.
onesite <- shortSamples %>%
  group_by(age) %>%
  mutate(pollencount = sum(value, na.rm = TRUE)) %>%
  group_by(variablename) %>% 
  mutate(prop = value / pollencount) %>% 
  arrange(desc(age))

# Spread the data to a "wide" table, with taxa as column headings.
widetable <- onesite %>%
  dplyr::select(age, variablename, prop) %>% 
  mutate(prop = as.numeric(prop))

counts <- tidyr::pivot_wider(widetable,
                             id_cols = age,
                             names_from = variablename,
                             values_from = prop,
                             values_fill = 0)

This appears to be a fairly long set of commands, but the code is pretty straightforward, and it provides you with significant control over the taxa, units and other elements of your data before you get them into the wide matrix (depth by taxon) that most statistical tools, such as the vegan and rioja packages, use.
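
Once the data are in this wide form, they drop straight into standard ordination tools. A minimal sketch, assuming the vegan package is installed:

# Detrended correspondence analysis on the taxon proportions (age column dropped).
ord <- vegan::decorana(counts[, -1])
plot(ord)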

To plot we can use rioja’s strat.plot(), sorting the taxa using weighted averaging scores (wa.order). I’ve also added a CONISS plot to the edge of the plot, to show how the new wide data frame works with distance metric functions.

clust <- rioja::chclust(dist(sqrt(counts)),
                        method = "coniss")

plot <- rioja::strat.plot(counts[,-1] * 100, yvar = counts$age,
                  title = cz_dl[[1]]$sitename,
                  ylabel = "Calibrated Years BP",
                  xlabel = "Pollen (%)",
                  y.rev = TRUE,
                  clust = clust,
                  wa.order = "topleft", scale.percent = TRUE)

rioja::addClustZone(plot, clust, 4, col = "red")

Change in Time Across Sites

We now have site information across the Czech Republic, with samples and with taxon names. I’m interested in looking at the distributions of taxa across time, based on their presence/absence. I’m going to pick the top 20 taxa (based on the number of sites at which they appear) and look at their distributions in time:

plottingTaxa <- taxa(cz_dl) %>%
  filter(ecologicalgroup %in% c("TRSH")) %>%
  filter(elementtype == "pollen") %>%
  arrange(desc(sites)) %>% 
  head(n = 20)

# For each taxon, count the number of sites at which it appears in each
# 500-year bin: round(age * 2, -3) / 2 rounds 2 * age to the nearest 1000,
# which bins age to the nearest 500 years.
taxabyage <- samples(cz_dl) %>% 
  filter(variablename %in% plottingTaxa$variablename) %>% 
  group_by(variablename, "age" = round(age * 2, -3) / 2) %>% 
  summarise(n = length(unique(siteid)), .groups = 'keep')

# Count the total number of sites with any of the target taxa in each bin.
samplesbyage <- samples(cz_dl) %>% 
  filter(variablename %in% plottingTaxa$variablename) %>% 
  group_by("age" = round(age * 2, -3) / 2) %>% 
  summarise(samples = length(unique(siteid)), .groups = 'keep')

groupbyage <- taxabyage %>%
  inner_join(samplesbyage, by = "age") %>% 
  mutate(proportion = n / samples)

ggplot(groupbyage, aes(x = age, y = proportion)) +
  geom_point() +
  geom_smooth(method = 'gam', 
              method.args = list(family = 'binomial')) +
  facet_wrap(~variablename) +
  coord_cartesian(xlim = c(20000, 0), ylim = c(0, 1)) +
  scale_x_reverse(breaks = c(10000, 20000)) +
  xlab("Calibrated Years BP") +
  ylab("Proportion of Sites with Taxon") +
  theme_bw()

We can see clear patterns of change, and the smooths are modeled using Generalized Additive Models (GAMs) in R, so we can have more or less control over the actual modeling using the gam or mgcv packages. Depending on how we divide the data we can also look at shifts in altitude, latitude or longitude to better understand how species distributions and abundances changed over time in this region.
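
We can refit any single panel ourselves for finer control. A minimal sketch using mgcv (the choice of Pinus here is hypothetical; substitute any taxon present in groupbyage):

library(mgcv)

onetaxon <- groupbyage %>% 
  dplyr::filter(variablename == "Pinus")

# A proportion response with a binomial family; `weights` supplies the number
# of sites (trials) behind each proportion.
mod <- gam(proportion ~ s(age), data = onetaxon,
           family = binomial, weights = samples)
summary(mod)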

Distributions in Climate (July max temperature) from Rasters

We are often interested in the interaction between taxa and climate, assuming that time is a proxy for changing environments. The development of large-scale global datasets for climate has made it relatively straightforward to access data from the cloud in raster format. R provides a number of tools (in the sf and raster packages) for managing spatial data, and providing support for spatial analysis of data.

The first step is taking our sample data and turning it into a spatial object using the sf package in R:

modern <- samples(cz_dl) %>% 
  filter(age < 50) %>% 
  filter(ecologicalgroup == "TRSH" & elementtype == "pollen" & units == "NISP")

spatial <- sf::st_as_sf(modern, 
                        coords = c("long", "lat"),
                        crs = "+proj=longlat +datum=WGS84")

The data are effectively the same: sf builds an object called spatial that is a data.frame with all the information from samples(), plus a column (geometry) that contains the spatial data.

We can use the getData() function in the raster package to get climate data from WorldClim. The operations that follow here can be applied to any sort of raster data, provided it is loaded into R as a raster object.

Here we pull in the raster data, at a 10 minute resolution for the \(T_{max}\) variable, maximum monthly temperature. The raster itself has 12 layers, one for each month. With the extract() function we just get information for the seventh month, July.

worldTmax <- raster::getData('worldclim', var = 'tmax', res = 10)
spatial$tmax7 <- raster::extract(worldTmax, spatial)[,7]

This adds a column to the data.frame spatial that contains the maximum July temperature for each taxon at each site (all taxa at a site will share the same value). We’ve already filtered to the TRSH taxa, but that still leaves us with many distinct taxon names. We’re going to use dplyr’s mutate() function to extract just the genus:

spatial <- spatial %>%
  mutate(variablename = stringr::str_replace(variablename, "[[:punct:]]", " ")) %>% 
  mutate(variablename = stringr::word(variablename, 1)) %>% 
  group_by(variablename, siteid) %>% 
  summarise(tmax7 = max(tmax7), .groups = "keep") %>%
  group_by(variablename) %>% 
  filter(n() > 3)

Setting the Background

We want the background distribution of July temperatures across the sites in the Czech Republic, to plot our taxon distributions against. We do this by taking the maximum temperature value at each site; since all values at a site are the same (because we used a spatial overlay), the maximum is simply the July temperature at that site.

maxsamp <- spatial %>% 
  dplyr::group_by(siteid) %>% 
  dplyr::summarise(tmax7 = max(tmax7), .groups = 'keep')

Now we’re going to plot it out, using facet_wrap() to plot each taxon in its own panel:

# WorldClim temperature rasters are stored as °C x 10, hence the division by 10.
ggplot() +
  geom_density(data = spatial,
               aes(x = round(tmax7 / 10, 0)), col = 2) +
  facet_wrap(~variablename) +
  geom_density(data = maxsamp, aes(x = tmax7 / 10)) +
  xlab("Maximum July Temperature (°C)") +
  ylab("Kernel Density")

Conclusion

So, we’ve done a lot in this example. We’ve (1) searched for sites using site names and geographic parameters, (2) filtered results using temporal and spatial parameters, (3) obtained sample information for the selected datasets and (4) performed basic analysis including the use of climate data from rasters. Hopefully you can use these examples as templates for your own future work, or as a building block for something new and cool!