This document is intended to act as a primer for the use of the Neotoma R package, `neotoma2`, and is the companion to the Introduction to Neotoma presentation. Some users may be working with this document as part of a workshop for which there is a Binder instance. The Binder instance will run RStudio in your browser, with all the required packages installed.
If you are using this workflow on its own, or want to use the package directly, the stable `neotoma2` package is available on CRAN (`install.packages("neotoma2")`). To install the development version used here, run:

install.packages("remotes")
remotes::install_github("NeotomaDB/neotoma2@dev")
library(neotoma2)

Your version should be at or above 1.0.6.
This workshop will also require other packages. To maintain the flow of this document we’ve placed instructions at the end of the document in the section labelled “Installing packages on your own”. Please install these packages, and make sure they are at the latest version.
In this tutorial you will learn how to:

* search for sites using site names and geographic parameters
* filter results using temporal and spatial parameters
* obtain sample information for the selected datasets
* perform basic analysis, including stratigraphic plotting
If you’re planning on working with Neotoma, please join us on Slack where we manage a channel specifically for questions about the R package (the #it_r channel, or #it_r_es for R help in Spanish and #it_r_jp in Japanese). You may also wish to join the Neotoma community through our Google Groups mailing lists; please see the information on our website to be added.
Data in the Neotoma database itself is structured as a set of linked relationships to express different elements of paleoecological analysis.
These relationships can be complex because paleoecology is a broad and evolving discipline. As such, the database itself is highly structured, and normalized, to allow new relationships and facts to be added, while maintaining a stable central data model. If you want to better understand concepts within the database, you can read the Neotoma Database Manual, or take a look at the database schema itself.
In this workshop we want to highlight two key structural concepts: the way records are structured within the Neotoma database, and the way those structures are represented in the `neotoma2` R package. Data in Neotoma is organized as a hierarchy (see the sketch below):

* Data in Neotoma is associated with sites – specific locations with latitude and longitude coordinates.
* Within a site, there may be one or more collection units – locations at which samples are physically collected within the site. Collection units may have higher resolution GPS locations than the site location, but are considered to be part of the broader site.
* Data within a collection unit is collected at various analysis units.
* Any data sampled within an analysis unit is grouped by the dataset type (charcoal, diatom, dinoflagellate, etc.) and aggregated into a sample. The set of samples for a collection unit of a particular dataset type is then assigned to a dataset.
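As a rough illustration of this hierarchy, here is a minimal sketch using `neotoma2` helper functions. The object name `my_sites` is a placeholder, and the calls assume a live connection to the Neotoma API:

```r
library(neotoma2)

# A sites object can hold one or more sites:
my_sites <- get_sites(sitename = "%cave%")
one_site <- my_sites[[1]]   # pull out a single site
collunits(one_site)         # the collection units within that site
datasets(one_site)          # the datasets within those collection units
```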
A `neotoma2` `sites` object has a property `sites` that is a list. The function `plotLeaflet()` can be used on a `sites` object.

If we look at the UML diagram for the objects in the `neotoma2` R package we can see that the data structure generally mimics the structure within the database itself. As we will see in the Site Searches section, we can search for these objects, and begin to manipulate them (in the Simple Analysis section).

It is important to note: within the `neotoma2` R package, most objects are `sites` objects; they just contain more or less data. There is a set of functions that can operate on `sites`. As we add to `sites` objects, using `get_datasets()` or `get_downloads()`, we are able to use more of these helper functions.
## `get_sites()`
There are several ways to find sites in `neotoma2`, but we think of sites as being primarily spatial objects. They have names, locations, and are found within the context of geopolitical units, but within the API and the package, the site itself does not have associated information about taxa, dataset types or ages. It is simply the container into which we add that information. So, when we search for sites we can search by:
| Parameter | Description |
| --- | --- |
| `sitename` | A valid site name (case insensitive) using `%` as a wildcard. |
| `siteid` | A unique numeric site id from the Neotoma Database. |
| `loc` | A bounding box vector, geoJSON or WKT string. |
| `altmin` | Lower altitude bound for sites. |
| `altmax` | Upper altitude bound for site locations. |
| `database` | The constituent database from which the records are pulled. |
| `datasettype` | The kind of dataset (see `get_table("datasettypes")`). |
| `datasetid` | Unique numeric dataset identifier in Neotoma. |
| `doi` | A valid dataset DOI in Neotoma. |
| `gpid` | A unique numeric identifier, or text string, identifying a geopolitical unit in Neotoma. |
| `keywords` | Unique sample keywords for records in Neotoma. |
| `contacts` | A name or numeric id for individuals associated with sites. |
| `taxa` | Unique numeric identifiers or taxon names associated with sites. |
All sites in Neotoma contain one or more datasets. It’s worth noting that the results of these search parameters may be slightly unexpected. For example, searching for sites by sitename, latitude, or altitude will return all of the datasets for the particular site. Searching for terms such as datasettype, datasetid or taxa will return the site, but the only datasets returned will be those matching the dataset-specific search terms. We’ll see this later.
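For example, a hedged sketch of this difference (the site name and dataset type here are illustrative):

```r
# Searching by site name returns matching sites with ALL of their datasets:
by_name <- neotoma2::get_sites(sitename = "%Wonderwerk%")

# Adding a dataset-specific term returns the same sites, but only the
# datasets matching that term:
by_type <- neotoma2::get_sites(sitename = "%Wonderwerk%",
                               datasettype = "pollen")
```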
### `sitename = "%cave%"`
We may know exactly what site we’re looking for (“Wonderwerk Cave”), or have an approximate guess for the site name (for example, we know it’s something like “Wonderwerk”, but we’re not sure how it was entered specifically), or we may want to search all sites that have a specific term, for example, cave.

We use the general format `get_sites(sitename = "%cave%")` for searching by name. PostgreSQL (and the API) uses the percent sign as a wildcard, so `"%cave%"` would pick up “Wonderwerk Cave” for us (and also picks up “Equus Cave” and “Spring Cave shelter”). Note that the search query is also case insensitive.
# This points the package at the development API endpoint; omit this line
# to use the production Neotoma API.
Sys.setenv(APIPOINT="dev")
cave_sites <- neotoma2::get_sites(sitename = "%cave%")
plotLeaflet(cave_sites)
### `loc = c()`
The original `neotoma` package used a bounding box for locations, structured as a vector of latitude and longitude values: `c(xmin, ymin, xmax, ymax)`. The `neotoma2` R package supports this simple bounding box, but also more complex spatial objects, using the `sf` package. Using the `sf` package allows us to more easily work with raster and polygon data in R, and to select sites from more complex spatial objects. The `loc` parameter works with the simple vector, WKT, geoJSON objects and native `sf` objects in R.
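As a quick illustration of the simple bounding-box form, a minimal sketch (the coordinates are illustrative, roughly covering southern Africa):

```r
# loc = c(xmin, ymin, xmax, ymax): longitude bounds, then latitude bounds.
bbox_sites <- neotoma2::get_sites(loc = c(10, -35, 40, -15))
```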
As an example of searching for sites using a location, we’ve created a rough representation of Africa as a polygon. To work with this spatial object in R we also transformed the geoJSON element to an object for the `sf` package. There are many other tools to work with spatial objects in R. Regardless of how you get the data into R, `neotoma2` works with almost all objects in the `sf` package.
geoJSON <- '{"type": "Polygon",
"coordinates": [[
[-7.030, 36.011],
[-18.807, 23.537],
[-19.247, 10.282],
[-9.139, -0.211],
[18.370, -37.546],
[35.069, -36.352],
[49.571, -27.097],
[58.185, 0.755],
[53.351, 13.807],
[43.946, 12.008],
[31.202, 33.629],
[18.897, 34.648],
[12.393, 35.583],
[11.075, 38.184],
[-7.030, 36.011]
]
]}'
africa_sf <- geojsonsf::geojson_sf(geoJSON)
# Note here we use the `all_data` flag to capture all the sites within the polygon.
# We're using `all_data` here because we know that the site information is relatively small
# for Africa. If we were working in a new area or with a new search we would limit the
# search size.
africa_sites <- neotoma2::get_sites(loc = africa_sf, all_data = TRUE)
You can always simply `plot()` the `sites` objects, but you will lose some of the geographic context. The `plotLeaflet()` function returns a `leaflet()` map, and allows you to further customize it, or add additional spatial data (like our original bounding polygon, `africa_sf`, which works directly with the R `leaflet` package):
Note the use of the `%>%` pipe here. If you are not familiar with this symbol, check our “Piping in R” section of the Appendix.
neotoma2::plotLeaflet(africa_sites) %>%
leaflet::addPolygons(map = .,
data = africa_sf,
color = "green")
## `site` Object Helpers

If we look at the data structure diagram for the objects in the `neotoma2` R package we can see that there is a set of functions that can operate on `sites`. As we retrieve more information for `sites` objects, using `get_datasets()` or `get_downloads()`, we are able to use more of these helper functions.
As it is, we can take advantage of functions like `summary()` to get a more complete sense of the types of data we have in `africa_sites`. The following code gives the summary table. We do some R magic here to change the way the data is displayed (turning it into a `DT::datatable()` object), but the main piece is the `summary()` call.
# Give information about the sites themselves, site names &cetera.
neotoma2::summary(africa_sites)
# Give the unique identifiers for sites, collection units and datasets found at those sites.
neotoma2::getids(africa_sites)
In this document we list only the first 10 records (there are more; you can use `length(datasets(africa_sites))` to see how many datasets you’ve got). We can see that there are no chronologies associated with the `site` objects. This is because, at present, we have not pulled in the `dataset` information we need. In Neotoma, a chronology is associated with a collection unit (and that metadata is pulled by `get_datasets()` or `get_downloads()`). All we know from `get_sites()` are the kinds of datasets we have and the location of the sites that contain the datasets.
## `get_datasets()`
Within Neotoma, collection units and datasets are contained within sites. Similarly, a `sites` object contains `collectionunits`, which contain `datasets`. From the table above (Result tab in Section 3.1.3.2) we can see that some of the sites we’ve looked at contain pollen records, some contain geochronologic data and some contain other dataset types. We could write something like `table(summary(africa_sites)$types)` to see the different dataset types and their counts (see the sketch below).
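A minimal sketch of that tabulation (this assumes the `summary()` output carries a `types` column, as in current versions of the package):

```r
# Tabulate the combinations of dataset types found across sites:
table(summary(africa_sites)$types)
```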
With a `sites` object we can directly call `get_datasets()` to pull in more metadata about the datasets. The `get_datasets()` method also supports any of the search terms listed above in the Site Search section. At any time we can use `datasets()` to get more information about any datasets that a `sites` object may contain. Compare the output of `datasets(africa_sites)` to the output of a similar call using the following:
# This may be slow, because there's a lot of sites!
africa_datasets <- neotoma2::get_datasets(loc = africa_sf, all_data = TRUE)
datasets(africa_datasets)
You can see that this provides information only about the specific dataset, not the site! For a more complete record we can join the site information from `summary()` to the dataset information from `datasets()` using the `getids()` function, which links sites to all the collection units and datasets they contain.
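A hedged sketch of that join. The key columns follow the `getids()` output (`siteid`, `collunitid`, `datasetid`), and we assume the `as.data.frame()` method for datasets objects; check both against your own results:

```r
# Keys linking each dataset to its collection unit and site:
ids <- neotoma2::getids(africa_datasets)

# Flatten the dataset metadata and join the keys back in.
# You may need to align id column types first (see the speleothem
# example later in this document).
ds <- as.data.frame(neotoma2::datasets(africa_datasets))
ds_by_site <- dplyr::left_join(ds, ids, by = "datasetid")
```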
## `filter()` Records

If we choose to pull in information about only a single dataset type, or if there is additional filtering we want to do before we download the data, we can use the `filter()` function. For example, if we only want sedimentary pollen records (as opposed to pollen surface samples), and want records with known chronologies, we can filter by `datasettype` and by the presence of an `age_range_young`, which would indicate that there is a chronology that defines bounds for ages within the record.
Note: At present filtering on some of the speleothem entity parameters isn’t built into the `neotoma2` R package. We would love to work on this with folks!
africa_speleothem <- africa_datasets %>%
neotoma2::filter(datasettype == 'speleothem')
neotoma2::summary(africa_speleothem)
# We've removed records, so the new object should be shorter than the original.
length(africa_speleothem) < length(africa_datasets)
We can see now that the data table looks different (compared to the table above), and there are fewer total sites. Again, there is still no explicit chronology for these records; we need to pull down the complete download for that, but we begin to get a sense of what kind of data we have.
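As noted above, we could also require the presence of a chronology. A hedged sketch, assuming `age_range_young` is `NA` for records without a chronology:

```r
# Keep only speleothem datasets that report a defined (young) age bound:
africa_spel_dated <- africa_datasets %>%
  neotoma2::filter(datasettype == "speleothem" & !is.na(age_range_young))
```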
## `sample()` data

Because sample data adds a lot of overhead (for pollen data, for example, the object that includes the dataset with samples can be 20 times larger than the `dataset` alone), we try to call `get_downloads()` only after we’ve done our preliminary filtering. After `get_datasets()` you have enough information to filter based on location, time bounds and dataset type. When we move to `get_downloads()` we can do more fine-tuned filtering at the analysis unit or taxon level.
The following call can take some time, but we’ve frozen the object as an RDS data file. You can run this command on your own, and let it run for a bit, or you can just load the object in.
## This line is commented out because we've already run it for you.
## africa_dl <- africa_datasets %>% get_downloads(all_data = TRUE)
## saveRDS(africa_dl, "data/africaDownload.RDS")
africa_dl <- readRDS("data/africaDownload.RDS")
Once we’ve downloaded, we now have information for each site about all the associated collection units, the datasets, and, for each dataset, all the samples associated with the datasets. To extract samples from all downloads we can call:
allSamp <- samples(africa_dl)
When we’ve done this, we get a `data.frame` that is 380641 rows long and 38 columns wide. The reason the table is so wide is that we are returning data in a long format. Each row contains all the information you should need to properly interpret it:
## [1] "age" "agetype" "ageolder" "ageyounger"
## [5] "chronologyid" "chronologyname" "units" "value"
## [9] "context" "element" "taxonid" "symmetry"
## [13] "taxongroup" "elementtype" "variablename" "ecologicalgroup"
## [17] "analysisunitid" "sampleanalyst" "sampleid" "depth"
## [21] "thickness" "samplename" "datasetid" "database"
## [25] "datasettype" "age_range_old" "age_range_young" "age_units"
## [29] "datasetnotes" "siteid" "sitename" "lat"
## [33] "long" "area" "sitenotes" "description"
## [37] "elev" "collunitid"
For some dataset types or analyses, some of these columns may not be needed; however, for other dataset types they may be critically important. To allow the `neotoma2` package to be as useful as possible for the community we’ve included as many as we can.
What is critically important to note as you work is that any variable is explicitly defined as the combination of the `variablename`, the `units`, and then `context` and `symmetry`. This is because Neotoma contains not only instrument-measured data, but also faunal remains and plant elements, so we might describe δ18O in per mille units, but also the left half of an oxen skull measured using MNI (minimum number of individuals).
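A short sketch of what this means in practice, using the `allSamp` table from above:

```r
# Each distinct combination of these four columns is a distinct "variable":
allSamp %>%
  dplyr::distinct(variablename, units, context, symmetry) %>%
  nrow()
```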
It’s also important to note that the date of a single sample is identified using the `age` and the `agetype`, along with a pointer to the `chronology` used for the record, and the uncertainty (as `ageolder` and `ageyounger`) for the sample.
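For example, a quick look at just those chronology-related columns:

```r
# Peek at the age-related fields for the first few samples:
allSamp %>%
  dplyr::select(age, agetype, ageolder, ageyounger,
                chronologyid, chronologyname) %>%
  head()
```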
There are two tables associated with SISALv3 entities if we want more information about them. In the future we will be adding more support for speleothems in the R package, but for now we can connect the individual entity attributes to the samples table using some SQL-style joins in R:
entities <- get_table('speleothems', limit = 5000)
en_join <- get_table('speleothemcollectionunit', limit = 5000)
allSamp <- allSamp %>% mutate(collunitid = as.integer(collunitid))
samp_w_entity <- allSamp %>%
left_join(en_join, by = join_by(collunitid == collectionunitid)) %>%
left_join(entities, by = join_by(entityid == entityid))
As we move forward we will improve the integration of entity data within Neotoma.
If you want to know what has been measured in the records you can use the helper function `taxa()` on the sites object. The `taxa()` function gives us not only the unique taxa, but two additional columns – `sites` and `samples` – that tell us how many sites the taxa appear in, and how many samples the taxa appear in, to help us better understand how common individual taxa are.
neotomatx <- neotoma2::taxa(africa_dl)
# OR just the 'taxa' in the speleothems:
sisal_tx <- africa_dl %>% neotoma2::filter(datasettype == 'speleothem') %>% taxa()
Taxonomies in Neotoma are not as straightforward as we might expect. Taxonomic identification in paleoecology can be complex, impacted by the morphology of the object we are trying to identify, the condition of the palynomorph, the expertise of the analyst, and many other conditions. You can read more about concepts of taxonomy within Neotoma in the Neotoma Manual’s section on Taxonomic concepts.
We use the unique identifiers (e.g., `taxonid`, `siteid`, `analysisunitid`) throughout the package, since they help us to link between records. The `taxonid` values returned by the `taxa()` call can be linked to the `taxonid` column in the `samples()` table. This allows us to build taxon harmonization tables if we choose to. You may also note that the taxon name is in the field `variablename`.
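A minimal harmonization sketch (the pattern and replacement name here are purely illustrative):

```r
# Collapse all Pinus taxa (e.g. "Pinus undiff.") into a single "Pinus" name:
harmonized <- neotoma2::samples(africa_dl) %>%
  dplyr::mutate(variablename = dplyr::if_else(
    stringr::str_detect(variablename, "^Pinus"),
    "Pinus",
    variablename))
```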
Individual sample counts are reported in Neotoma as `variables`. A “variable” may be a species, a laboratory measurement, or a non-organic proxy such as charcoal or XRF measurements, and includes the units of measurement and the value.
To plot a stratigraphic diagram we are only interested in one site and one dataset. By looking at the summary of downloads we can see which records remain once we filter to a single dataset type. Let’s look at the second download in the filtered set; we can get the samples from just that one collection unit by specifying that we want only the samples from the second download.
We can use packages like `rioja` to do stratigraphic plotting for a single record, but first we need to do some different data management. Although we could harmonize taxa first, we’re going to simply take the measurements at a single site and plot them in a stratigraphic diagram. However, if you would like to plot multiple sites with harmonized taxa, we have provided examples of how to do both.
# Get a particular record; here we subset the `africa_dl` object to
# the speleothem records:
plottingSite <- africa_dl %>%
  neotoma2::filter(datasettype == 'speleothem')
# Select only isotope measurements (ecological group "ITOP") in per mille
# units and convert to a "wide" table, grouped by age.
# This turns our "long" table into a "wide" table:
values <- plottingSite[[2]] %>%
samples() %>%
toWide(ecologicalgroup = c("ITOP"),
unit = c("per mille"),
groupby = "age",
elementtype = NA,
operation = "sum") %>%
arrange(age) %>% na.omit()
Hopefully the code is pretty straightforward. The `toWide()` function provides you with significant control over the taxa, units and other elements of your data before you get them into the wide matrix (depth or age by variable) that most statistical tools, such as the `vegan` package or `rioja`, use.
To plot the data we can use `rioja`’s `strat.plot()` (taxa can also be sorted using weighted averaging scores with the `wa.order` argument). We’ve also added a CONISS plot to the edge of the plot, to show how the new wide data frame works with distance metric functions.
# Perform constrained clustering:
clust <- rioja::chclust(dist(values),
method = "coniss")
# Plot the stratigraphic plot, converting proportions to percentages:
plot <- rioja::strat.plot(values[,-1], yvar = values$age,
title = plottingSite[[2]]$sitename,
ylabel = "Years BP",
xlabel = "Isotope Measurements",
srt.xlabel = 70,
y.rev = TRUE,
clust = clust,
scale.percent = FALSE)
rioja::addClustZone(plot, clust, 4, col = "red")
# Pull the isotope (ITOP) and tree/shrub (TRSH) samples, in per mille or
# NISP units, for plotting across sites:
spSamp <- allSamp %>%
  filter(ecologicalgroup %in% c('TRSH', 'ITOP')) %>%
  filter(units %in% c('per mille', 'NISP'))
spPct <- spSamp %>%
group_by(sitename, depth) %>%
mutate(val = ifelse(units == 'NISP', value / sum(value), value)) %>%
ungroup() %>%
filter(variablename %in% c('δ13C', 'δ18O', 'Pinus')) %>%
filter(!agetype == 'Calendar years AD/BC')
ggplot(spPct, aes(x = val, y = age, color = sitename)) +
theme(legend.position="none") +
geom_path() +
facet_wrap(~variablename, scale = 'free_x') +
coord_cartesian(ylim = c(0, 10000))
# A second version of the same figure, this time without the agetype
# filter and without coloring by site:
spSamp <- allSamp %>%
  filter(ecologicalgroup %in% c('TRSH', 'ITOP')) %>%
  filter(units %in% c('per mille', 'NISP'))
spPct <- spSamp %>%
group_by(sitename, depth) %>%
mutate(val = ifelse(units == 'NISP', value / sum(value), value)) %>%
ungroup() %>%
filter(variablename %in% c('δ13C', 'δ18O', 'Pinus'))
ggplot(spPct, aes(x = val, y = age)) +
theme(legend.position="none") +
geom_path() +
facet_wrap(~variablename, scale = 'free_x') +
ylim(c(0, 10000))
So, we’ve done a lot in this example. We’ve (1) searched for sites using site names and geographic parameters, (2) filtered results using temporal and spatial parameters, (3) obtained sample information for the selected datasets and (4) performed basic analysis, including stratigraphic plotting. Hopefully you can use these examples as templates for your own future work, or as a building block for something new and cool!
## Installing packages on your own

We use several packages in this document, including `leaflet`, `sf` and others. We load the packages using the `pacman` package, which will automatically install the packages if they do not currently exist in your set of packages.
options(warn = -1)
pacman::p_load(neotoma2, dplyr, ggplot2, sf, geojsonsf, leaflet, terra, DT, readr, stringr, rioja)
Note that R is sensitive to the order in which packages are loaded. Using `neotoma2::` tells R explicitly that you want to use the `neotoma2` package to run a particular function. So, for a function like `filter()`, which exists in other packages such as `dplyr`, you may see an error that looks like:
Error in UseMethod("filter") :
no applicable method for 'filter' applied to an object of class "sites"
In that case it’s likely that the wrong package is trying to run `filter()`, and so explicitly adding `dplyr::` or `neotoma2::` in front of the function name (i.e., `neotoma2::filter()`) is good practice.
## Piping in R

Piping is a technique that simplifies the process of chaining multiple operations on a data object. It involves using either of these operators: `|>` or `%>%`. `|>` is a base R operator, while `%>%` comes from the `tidyverse` ecosystem in R. In `neotoma2` we use `%>%`.
The pipe operator works as a real-life pipe, which carries water from one location to another. In programming, the output of the function on the left-hand side of the pipe is taken as the initial argument for the function on the right-hand side of the pipe. It helps by making code easier to write and read. Additionally, it reduces the number of intermediate objects created during data processing, which can make code more memory-efficient and faster.
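A tiny illustration of the equivalence, using base R objects only (the `%>%` operator is available once `dplyr` or `magrittr` is loaded):

```r
x <- c(1, 2, 3)
mean(x)       # direct call
x %>% mean()  # piped: x becomes the first argument to mean()
```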
Without using pipes you can use the `neotoma2` R package to retrieve a site and then plot it by doing:
# Retrieve the site
plot_site <- neotoma2::get_sites(sitename = "%ø%")
# Plot the site
neotoma2::plotLeaflet(object = plot_site)
This would create a variable `plot_site` that we will not need any more, but it was necessary so that we could pass it to the `plotLeaflet()` function.
With the pipe (`%>%`) we do not need to create the variable; we can just rewrite our code. Notice that `plotLeaflet()` doesn’t need the `object` argument because the response of `get_sites(sitename = "%ø%")` gets passed directly into the function.
# get_sites and pipe. The `object` parameter for plotLeaflet will be the
# result of the `get_sites()` function.
get_sites(sitename = "%ø%") %>%
  plotLeaflet()