This document is intended to act as a primer for the use of the new Neotoma R package, neotoma2. The neotoma2 package is available from GitHub and can be installed in R using the devtools package:
devtools::install_github('NeotomaDB/neotoma2')
library(neotoma2)
In this tutorial you will learn how to search for sites, filter and download datasets, and perform some basic analyses using the neotoma2 package.
For this workbook we use several packages, including leaflet, sf and others. We load the packages using the pacman package, which will automatically install any packages that do not currently exist in your set of packages.
options(warn = -1)
pacman::p_load(neotoma2, dplyr, ggplot2, sf, geojsonsf, leaflet, raster, DT)
Note that R is sensitive to the order in which packages are loaded. Using neotoma2:: tells R explicitly that you want to use the neotoma2 package to run a particular function. So, for a function like filter(), which also exists in other packages such as dplyr, you may see an error that looks like:
Error in UseMethod("filter") :
no applicable method for 'filter' applied to an object of class "sites"
In that case it's likely that the wrong package is trying to run filter(), so explicitly adding dplyr:: or neotoma2:: in front of the function name (i.e., neotoma2::filter()) is good practice.
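As a minimal, runnable sketch of why the namespace matters (the data frame here is made up, and my_datasets is a hypothetical object used only for illustration):
# dplyr's filter() works on ordinary data.frames:
df <- data.frame(datasettype = c("pollen", "diatom"), n = c(10, 5))
dplyr::filter(df, datasettype == "pollen")
# neotoma2's filter() is the method to use on neotoma2 sites objects, e.g.
# neotoma2::filter(my_datasets, datasettype == "pollen")  # hypothetical object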
If you're planning on working with Neotoma, please join us on Slack, where we manage a channel specifically for questions about the R package. You may also wish to join our Google Groups mailing list; please contact us to be added.
get_sites()
There are several ways to find sites in neotoma2, but we think of sites primarily as spatial objects. They have names, locations, and are found within the context of geopolitical units, but within the API and the package the site itself does not have associated information about taxa, dataset types or ages. It is simply the container into which we add that information. So, when we search for sites we can search by:
sitename="%Lait%"
We may know exactly what site we're looking for ("Lac Mouton"), or have an approximate guess for the site name (for example, we know it's something like "Lait Lake", or "Lac du Lait", but we're not sure how it was entered specifically). We use the general format get_sites(sitename="XXXXX") for searching by name.
PostgreSQL (and the API) uses the percent sign as a wildcard, so "%Lait%" would pick up "Lac du Lait" for us (and would pick up "Lake Lait" and "This Old Laity Hei-dee-ho Bog" if they existed). Note that the search query is also case insensitive, so you could simply write "%lait%".
spo_sites <- neotoma2::get_sites(sitename = "%Lait%")
plotLeaflet(spo_sites)
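If we already know the exact site name, we can also search for it directly; a small sketch (this assumes the site is entered under exactly this name):
# Search using an exact site name rather than a wildcard.
mouton_sites <- neotoma2::get_sites(sitename = "Lac Mouton")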
loc=c()
The original neotoma package used a bounding box for locations, structured as a vector of latitude and longitude values: c(xmin, ymin, xmax, ymax). The neotoma2 R package supports this simple bounding box, but also more complex spatial objects, using the sf package.
Using the sf package allows us to work more easily with raster and polygon data in R, and to select sites from more complex spatial objects. The loc parameter works with the simple vector, WKT, geoJSON objects and native sf objects in R. Note, however, that the neotoma2 package is a wrapper for a simple API call using a URL (api.neotomadb.org), and URL strings can only be 1028 characters long, so the API cannot accept very long or complex spatial objects.
Looking for sites using a location: we're putting three representations of the Czech Republic here as part of a list with three elements, a geoJSON, WKT and bounding box representation. We've also transformed the cz$geoJSON element to an object for the sf package. Any of these four spatial representations works with the neotoma2 package.
cz <- list(geoJSON = '{"type": "Polygon",
"coordinates": [[
[12.40, 50.14],
[14.10, 48.64],
[16.95, 48.66],
[18.91, 49.61],
[15.24, 50.99],
[12.40, 50.14]]]}',
WKT = 'POLYGON ((12.4 50.14,
14.1 48.64,
16.95 48.66,
18.91 49.61,
15.24 50.99,
12.4 50.14))',
bbox = c(12.4, 48.64, 18.91, 50.99))
cz$sf <- geojsonsf::geojson_sf(cz$geoJSON)[[1]]
cz_sites <- neotoma2::get_sites(loc = cz$geoJSON, all_data = TRUE)
## Your search returned 89 objects.
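The same search can be run with any of the other representations. For example, a sketch using the bounding box element (the rectangle covers a slightly larger area than the polygon, so the number of sites returned may differ):
# Search using the simple bounding box vector instead of the geoJSON polygon.
cz_bbox_sites <- neotoma2::get_sites(loc = cz$bbox, all_data = TRUE)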
You can always simply plot() the sites objects, but you will lose some of the geographic context. The plotLeaflet() function returns a leaflet() map, and allows you to further customize it, or add additional spatial data (like our original bounding polygon, cz$sf, which works directly with the R leaflet package):
neotoma2::plotLeaflet(cz_sites) %>%
leaflet::addPolygons(map = .,
data = cz$sf,
color = "green")
If we look at the UML diagram for the objects in the neotoma2 R package, we can see that there is a set of functions that can operate on sites. As we add to sites objects, using get_datasets() or get_downloads(), we are able to use more of these helper functions. As it is, we can take advantage of functions like summary() to get a more complete sense of the types of data we have as part of this set of sites. The following code gives the summary table. We do some R magic here to change the way the data is displayed (turning it into a datatable() object), but the main piece is the summary() call.
neotoma2::summary(cz_sites)
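A sketch of the "R magic" mentioned above: the summary table can be passed to DT::datatable() (the DT package was loaded through pacman earlier) to render it as an interactive table. This assumes the summary is a plain data.frame, which datatable() accepts:
# Render the summary as an interactive, sortable table.
neotoma2::summary(cz_sites) %>%
  DT::datatable(rownames = FALSE)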
We can see that there are no chronologies associated with the site objects. This is because, at present, we have not pulled in the dataset information we need. All we know from get_sites() are the kinds of datasets we have.
We know that collection units and datasets are contained within sites: a sites object contains collectionunits, which in turn contain datasets. From the table above we can see that some of the sites we've looked at contain pollen records. That said, we only have the sites; it's just that (for convenience) the sites API returns some information about datasets to make it easier to navigate the records.
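As a small illustration of the container idea, a sites object can be indexed like a list, so we can pull out a single site and inspect its metadata (a quick sketch):
# Look at the first site in the sites object.
cz_sites[[1]]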
With a sites object we can directly call get_datasets() to pull in more metadata about the datasets. At any time we can use datasets() to get more information about any datasets that a sites object may contain. Compare the output of datasets(cz_sites) to the output of a similar call using the following:
cz_datasets <- neotoma2::get_datasets(cz_sites, all_data = TRUE)
datasets(cz_datasets)
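A sketch for inspection: the dataset metadata can also be flattened to a data.frame, which is easier to browse or export (this assumes an as.data.frame() method is available for the datasets object):
# Flatten the dataset metadata for easier inspection.
head(as.data.frame(datasets(cz_datasets)))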
If we choose to pull in information about only a single dataset type, or if there is additional filtering we want to do before we download the data, we can use the filter() function. For example, if we only want pollen records, and only records with known chronologies, we can filter:
cz_pollen <- cz_datasets %>%
neotoma2::filter(datasettype == "pollen" & !is.na(age_range_young))
neotoma2::summary(cz_pollen)
We can see now that the data table looks different, and there are fewer total sites.
sample() data
Because sample data adds a lot of overhead (for the Czech pollen data, the object that includes the dataset with samples is 20 times larger than the dataset alone), we try to call get_downloads() after we've done our preliminary filtering. After get_datasets() you have enough information to filter based on location, time bounds and dataset type. When we move to get_downloads() we can do more fine-tuned filtering at the analysis unit or taxon level.
The following call can take some time, but we’ve frozen the object as an RDS data file. You can run this command on your own, and let it run for a bit, or you can just load the object in.
## This line is commented out because we've already run it for you.
## cz_dl <- cz_pollen %>% get_downloads(all_data = TRUE)
cz_dl <- readRDS('data/czDownload.RDS')
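If you do run the download yourself, you can freeze the result the same way so you don't have to repeat the call in later sessions (a sketch, following the commented-out style above):
## saveRDS(cz_dl, 'data/czDownload.RDS')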
Once we've downloaded, we have information for each site about all the associated collection units and datasets, and, for each dataset, all the associated samples. To extract all the samples we can call:
allSamp <- samples(cz_dl)
When we've done this, we get a data.frame that is 130,889 rows long and 37 columns wide. The reason the table is so wide is that we are returning data in a long format. Each row contains all the information you should need to properly interpret it:
## [1] "age" "agetype" "ageolder" "ageyounger"
## [5] "chronologyid" "chronologyname" "units" "value"
## [9] "context" "element" "taxonid" "symmetry"
## [13] "taxongroup" "elementtype" "variablename" "ecologicalgroup"
## [17] "analysisunitid" "sampleanalyst" "sampleid" "depth"
## [21] "thickness" "samplename" "datasetid" "siteid"
## [25] "sitename" "lat" "long" "area"
## [29] "sitenotes" "description" "elev" "collunitid"
## [33] "database" "datasettype" "age_range_old" "age_range_young"
## [37] "datasetnotes"
For some dataset types or analyses, some of these columns may not be needed; however, for other dataset types they may be critically important. To allow the neotoma2 package to be as useful as possible for the community, we've included as many as we can.
If you want to know what taxa we have in the record you can use the helper function taxa() on the sites object. The taxa() function gives us not only the unique taxa, but also two additional columns, sites and samples, that tell us how many sites and how many samples each taxon appears in, to help us better understand how common individual taxa are.
neotomatx <- neotoma2::taxa(cz_dl)
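A quick sketch to see which taxa are most widespread, sorting the taxon table by the number of sites each taxon occurs at:
# Show the ten taxa found at the greatest number of sites.
neotomatx %>%
  dplyr::arrange(desc(sites)) %>%
  head(n = 10)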
The taxonid values can be linked to the taxonid column in the samples() output. This allows us to build taxon harmonization tables if we choose to. You may also note that the taxon name is in the field variablename. Individual sample counts are reported in Neotoma as variables.
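A quick consistency check (a sketch) that the two tables can in fact be linked on taxonid:
# Every taxonid in the samples table should also appear in the taxa() table.
all(samples(cz_dl)$taxonid %in% neotomatx$taxonid)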
A "variable" may be a species, a laboratory measurement, or a non-organic proxy such as charcoal or XRF measurements, and it includes the units of measurement and the value.
Let's say we want all samples in which Plantago taxa have been reported to be grouped together into one pseudo-taxon called Plantago. There are several ways of doing this: either directly, by exporting the file and editing each individual cell, or by creating an external "harmonization" table (which we did in the prior neotoma package).
Programmatically, we can harmonize taxon by taxon using matching and transformation. We're using dplyr-style coding here to mutate() the column variablename so that any time we detect (str_detect()) a variablename that starts with Plantago (the .* represents a wildcard for any character [.], zero or more times [*]) we replace() it with the character string "Plantago". Note that this changes Plantago only in the allSamp object; if we were to call samples() again, the taxonomy would return to its original form.
We're going to filter the ecological groups to include only UPHE (upland/heath) and TRSH (trees and shrubs). More information about ecological groups is available from the Neotoma Online Manual.
allSamp <- allSamp %>%
dplyr::filter(ecologicalgroup %in% c("UPHE", "TRSH")) %>%
mutate(variablename = replace(variablename,
stringr::str_detect(variablename, "Plantago.*"),
"Plantago"))
There were originally 15 different taxa identified as being within the genus Plantago (including Plantago, Plantago major, and Plantago alpina-type). The above code reduces them all to a single taxonomic group Plantago.
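A quick check (a sketch) that the harmonization worked as intended:
# After harmonization, only the single pseudo-taxon "Plantago" should remain
# among the Plantago names in allSamp.
allSamp %>%
  dplyr::filter(stringr::str_detect(variablename, "Plantago")) %>%
  dplyr::distinct(variablename)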
If we want to have an artifact of our choices, we can use an external table. For example, a table of pairs (what we want changed, and the name we want it replaced with) can be generated, and it can include regular expressions (if we choose):
| original    | replacement |
|-------------|-------------|
| Abies.*     | Abies       |
| Vaccinium.* | Ericaceae   |
| Typha.*     | Aquatic     |
| Nymphaea    | Aquatic     |
| …           | …           |
We can get the list of original names directly from the taxa() call, applied to a sites object, and then export it using write.csv().
taxaplots <- taxa(cz_dl)
# Save the taxon list to file so we can edit it subsequently.
readr::write_csv(taxaplots, "data/mytaxontable.csv")
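The workbook includes a quick plot of this taxon table at this point; a minimal sketch of such a sanity check (plotting the number of samples against the number of sites for each taxon) might look like this:
# Sanity-check sketch: taxa found at more sites should, broadly, also be
# found in more samples.
ggplot(taxaplots, aes(x = sites, y = samples)) +
  geom_point() +
  labs(x = "Number of sites", y = "Number of samples") +
  theme_bw()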
The plot is mostly for illustration, but we can see, as a sanity check, that the relationship is as we’d expect.
You can then export either one of these tables and add a column with the counts; you could also add extra contextual information, such as the ecologicalgroup or taxongroup, to help you out.
Once you’ve cleaned up the translation table you can load it in, and
then apply the transformation:
translation <- readr::read_csv("data/taxontable.csv")
You can see we've changed some of the taxon names in the taxon table (don't look too far, I just did this as an example). To replace the names in the samples() output, we'll join the two tables using an inner_join() (meaning the variablename must appear in both tables for the result to be included), and then we're going to select only those elements of the sample tables that are relevant to our later analysis:
allSamp <- samples(cz_dl)
allSamp <- allSamp %>%
inner_join(translation, by = c("variablename" = "variablename")) %>%
dplyr::select(!c("variablename", "sites", "samples")) %>%
group_by(siteid, sitename, replacement,
sampleid, units, age,
agetype, depth, datasetid,
long, lat) %>%
summarise(value = sum(value), .groups='keep')
We can use packages like rioja to do stratigraphic plotting for a single record, but first we need to do some different data management. Although we could do harmonization again, we're going to simply take the ten most common taxa at a single site and plot them in a stratigraphic diagram.
We're using the arrange() call to sort by the number of times that each taxon appears within the core. This way we can select, from the samples, only the taxa that appear in the first ten rows of the plottingTaxa data.frame.
# Get a particular site, select only taxa identified from pollen (and only trees/shrubs)
plottingSite <- cz_dl[[1]]
plottingTaxa <- taxa(plottingSite) %>%
filter(ecologicalgroup %in% c("TRSH")) %>%
filter(elementtype == "pollen") %>%
arrange(desc(samples)) %>%
head(n = 10)
# Clean up. Select only pollen measured using NISP.
# We repeat the filters for pollen & ecological group on the samples
shortSamples <- samples(plottingSite) %>%
filter(variablename %in% plottingTaxa$variablename) %>%
filter(ecologicalgroup %in% c("TRSH")) %>%
filter(elementtype == "pollen") %>%
filter(units == "NISP")
# Transform to proportion values.
onesite <- shortSamples %>%
group_by(age) %>%
mutate(pollencount = sum(value, na.rm = TRUE)) %>%
group_by(variablename) %>%
mutate(prop = value / pollencount) %>%
arrange(desc(age))
# Spread the data to a "wide" table, with taxa as column headings.
widetable <- onesite %>%
dplyr::select(age, variablename, prop) %>%
mutate(prop = as.numeric(prop))
counts <- tidyr::pivot_wider(widetable,
id_cols = age,
names_from = variablename,
values_from = prop,
values_fill = 0)
This appears to be a fairly long set of commands, but the code is pretty straightforward, and it provides you with significant control over the taxa, units and other elements of your data before you get them into the wide matrix (depth by taxon) that most statistical tools, such as the vegan package or rioja, use.
To plot we can use rioja's strat.plot(), sorting the taxa using weighted averaging scores (wa.order). I've also added a CONISS plot to the edge of the plot, to show how the new wide data frame works with distance metric functions.
clust <- rioja::chclust(dist(sqrt(counts)),
method = "coniss")
plot <- rioja::strat.plot(counts[,-1] * 100, yvar = counts$age,
title = cz_dl[[1]]$sitename,
ylabel = "Calibrated Years BP",
xlabel = "Pollen (%)",
y.rev = TRUE,
clust = clust,
wa.order = "topleft", scale.percent = TRUE)
rioja::addClustZone(plot, clust, 4, col = "red")
We now have site information across the Czech Republic, with samples and with taxon names. I'm interested in looking at the distributions of taxa across time, based on their presence or absence in records. I'm going to pick the top 20 taxa (based on the number of times they appear in the records) and look at their distributions in time:
plottingTaxa <- taxa(cz_dl) %>%  # use the full set of downloaded records here, not a single site
filter(ecologicalgroup %in% c("TRSH")) %>%
filter(elementtype == "pollen") %>%
arrange(desc(sites)) %>%
head(n = 20)
taxabyage <- samples(cz_dl) %>%
filter(variablename %in% plottingTaxa$variablename) %>%
group_by(variablename, "age" = round(age * 2, -3) / 2) %>%
summarise(n = length(unique(siteid)), .groups = 'keep')
samplesbyage <- samples(cz_dl) %>%
filter(variablename %in% plottingTaxa$variablename) %>%
group_by("age" = round(age * 2, -3) / 2) %>%
summarise(samples = length(unique(siteid)), .groups = 'keep')
groupbyage <- taxabyage %>%
inner_join(samplesbyage, by = "age") %>%
mutate(proportion = n / samples)
ggplot(groupbyage, aes(x = age, y = proportion)) +
geom_point() +
geom_smooth(method = 'gam',
method.args = list(family = 'binomial')) +
facet_wrap(~variablename) +
coord_cartesian(xlim = c(20000, 0), ylim = c(0, 1)) +
scale_x_reverse(breaks = c(10000, 20000)) +
xlab("Proportion of Sites with Taxon") +
theme_bw()
We can see clear patterns of change, and the smooths are modeled using Generalized Additive Models (GAMs) in R, so we can have more or less control over the actual modeling using the gam or mgcv packages. Depending on how we divide the data we can also look at shifts in altitude, latitude or longitude to better understand how species distributions and abundances changed over time in this region.
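For example, a minimal sketch of fitting one of these smooths directly with mgcv (this requires the mgcv package, and assumes a taxon such as Pinus appears in the groupbyage table):
# Fit a binomial GAM for a single taxon directly, which gives finer control
# over the smooth than geom_smooth() provides.
library(mgcv)
pinus <- dplyr::filter(groupbyage, variablename == "Pinus")
mod <- gam(cbind(n, samples - n) ~ s(age),
           data = pinus, family = binomial)
summary(mod)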
We are often interested in the interaction between taxa and climate, assuming that time is a proxy for changing environments. The development of large-scale global climate datasets has made it relatively straightforward to access data from the cloud in raster format. R provides a number of tools (in the sf and raster packages) for managing spatial data and supporting spatial analysis.
The first step is taking our sample data and turning it into a spatial object using the sf package in R:
modern <- samples(cz_dl) %>%
filter(age < 50) %>%
filter(ecologicalgroup == "TRSH" & elementtype == "pollen" & units == "NISP")
spatial <- sf::st_as_sf(modern,
coords = c("long", "lat"),
crs = "+proj=longlat +datum=WGS84")
The data is effectively the same: sf makes an object called spatial that is a data.frame with all the information from samples(), plus a column (geometry) that contains the spatial data.
We can use the getData() function in the raster package to get climate data from WorldClim. The operations that follow can be applied to any sort of raster data, provided it is loaded into R as a raster object.
Here we pull in the raster data, at a 10 minute resolution, for the \(T_{max}\) variable, maximum monthly temperature. The raster itself has 12 layers, one for each month. With the extract() function we just get information for the seventh month, July.
worldTmax <- raster::getData('worldclim', var = 'tmax', res = 10)
spatial$tmax7 <- raster::extract(worldTmax, spatial)[,7]
This adds a column to the data.frame spatial that contains the maximum July temperature for each taxon at each site (all taxa at a site will share the same value). We've already filtered to TRSH taxa, but that still leaves us with a large number of distinct taxon names. We're going to use dplyr's mutate() function to extract just the genus:
spatial <- spatial %>%
mutate(variablename = stringr::str_replace(variablename, "[[:punct:]]", " ")) %>%
mutate(variablename = stringr::word(variablename, 1)) %>%
group_by(variablename, siteid) %>%
summarise(tmax7 = max(tmax7), .groups = "keep") %>%
group_by(variablename) %>%
filter(n() > 3)
We want to get the background distribution of July temperatures in the Czech Republic, to plot our taxon distributions against. We do this by taking the maximum temperature value at each site; since all values at a site are the same (because we used a spatial overlay), the maximum is simply the July temperature at the site.
maxsamp <- spatial %>%
dplyr::group_by(siteid) %>%
dplyr::summarise(tmax7 = max(tmax7), .groups = 'keep')
Now we're going to plot it out, using facet_wrap() to plot each taxon in its own panel:
ggplot() +
geom_density(data = spatial,
aes(x = round(tmax7 / 10, 0)), col = 2) +
facet_wrap(~variablename) +
geom_density(data = maxsamp, aes(x = tmax7 / 10)) +
xlab("Maximum July Temperature") +
ylab("Kernel Density")
So, we’ve done a lot in this example. We’ve (1) searched for sites using site names and geographic parameters, (2) filtered results using temporal and spatial parameters, (3) obtained sample information for the selected datasets and (4) performed basic analysis including the use of climate data from rasters. Hopefully you can use these examples as templates for your own future work, or as a building block for something new and cool!