1 Data Summary

1.1 Overall Database Summary

This report details changes to the Neotoma Paleoecology database since 2022-10-13, and is current up to 2023-10-13. Full documentation of the database can be found in the Neotoma Database Manual. Recent snapshots of the database can be obtained from the Neotoma Snapshot website. This report is generated automatically from an RMarkdown document hosted on GitHub.

Neotoma contains data from 48,245 datasets and 21,762 unique sites. This represents a considerable contribution from members of the scientific community, including 3,842 primary investigators, 1,281 analysts, and stewards for all 37 constituent databases. There are also invaluable and incalculable contributions from the members of the Neotoma Paleoecology Database Community.

1.1.1 Recent Data Updates

1.1.1.1 Site Additions

Figure 1. Locations of sites newly added to Neotoma during the past year. The map is interactive and supports zoom and pan operations. Selecting an individual site provides a link to that site in the Neotoma Explorer.

Of the 1,180 sites added, 241 were entered as polygons, while 939 were entered as single coordinate points. In general, polygons provide more complete information about the site, often representing the particular shape of the depositional environment (for example, a lake or an archaeological site).

1.1.2 Dataset Additions

The 2,893 datasets added to Neotoma over the past year include contributions to 18 constituent databases, with the majority from the Academy of Natural Sciences of Drexel University. This pattern of contribution is reflected in the dataset types, with contributions to 21 dataset types.

Dataset contributions to Neotoma over the previous 18 months. The large number of Neotoma dataset types makes color coding difficult; however, the results are detailed in Table 1.

1.1.3 Constituent Databases

Neotoma consists of 37 constituent databases. At any one time some databases may be more active than others.

1.1.4 Contributors

Neotoma relies on the significant efforts of a volunteer group of data stewards and data contributors. Over the last 12 months, 26 stewards have contributed data to Neotoma across a range of constituent databases.

1.1.5 API Calls

Since the API was implemented, there have been a total of 4,549,634 calls to the Neotoma API. These include calls to the core API (api.neotomadb.org), calls to support the Neotoma Landing Pages (data.neotomadb.org), and calls to support Neotoma Explorer (apps.neotomadb.org/explorer).

The main APIs delivered a total of 3 GB of data to users over the last year, with the most significant payload beginning in approximately October 2020.

Average response time for the web services was 296 ms, with a maximum response time of 71.04 s. Approximately 5.03% of all responses took more than one second to return data.
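The three figures above (mean, maximum, and share of slow responses) can be derived from a list of response times. The following is a minimal sketch only: the values in `response_ms` are toy numbers, not Neotoma log data, and the function name `summarize` is hypothetical.

```python
from statistics import mean

# Toy response times in milliseconds (illustrative values, not real logs).
response_ms = [120, 250, 310, 95, 1800, 400, 71040, 210]


def summarize(times_ms, slow_threshold_ms=1000):
    """Return the mean, the maximum, and the percentage of slow responses."""
    slow = [t for t in times_ms if t > slow_threshold_ms]
    return {
        "mean_ms": mean(times_ms),
        "max_ms": max(times_ms),
        "pct_slow": 100 * len(slow) / len(times_ms),
    }


summary = summarize(response_ms)
```

A single very slow call dominates the mean in this toy data, which is one reason the report also ranks endpoints by median response time.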

1.1.5.1 Specific API Calls

Several API endpoints are called thousands of times, but these are not necessarily the fastest or the slowest queries. There is no apparent relationship between response speed and the number of times an API endpoint is used. The most frequent API calls are:

The slowest API calls, ranked by median response time, are shown only for calls with more than 100 instances:
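The ranking described here (median response time, restricted to endpoints with more than 100 calls) could be sketched as follows. This is an illustration under assumptions: the endpoint paths and timings are invented, and `slowest_endpoints` is a hypothetical helper rather than the code used to build this report.

```python
from collections import defaultdict
from statistics import median


def slowest_endpoints(log, min_calls=100, top=3):
    """Rank endpoints by median response time, skipping rarely used ones."""
    by_endpoint = defaultdict(list)
    for endpoint, ms in log:
        by_endpoint[endpoint].append(ms)
    ranked = [
        (endpoint, median(times), len(times))
        for endpoint, times in by_endpoint.items()
        if len(times) > min_calls  # "more than 100 instances"
    ]
    return sorted(ranked, key=lambda row: row[1], reverse=True)[:top]


# Toy log of (endpoint, response time in ms); paths are hypothetical.
log = (
    [("/example/sites", 200)] * 150
    + [("/example/datasets", 900)] * 120
    + [("/example/rare", 5000)] * 5
)
result = slowest_endpoints(log)
```

The rarely called endpoint is excluded even though it is the slowest, so a handful of outlier calls cannot dominate the table.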

1.2 Data Overview

1.2.1 Site Spatial Types

Sites can be added as either points or polygons. Of the 1,180 sites added, 241 were entered as site polygons, while 939 were entered as single coordinate points. In general, polygons provide more complete information about the site, often representing the particular shape of the depositional environment (for example, a lake or an archaeological site).

1.2.2 Site Metadata

Among the 1,248 sites added to the Neotoma Paleoecology Database in the past year, not all were entered with complete metadata. Complete metadata is critical for understanding data context, particularly when site notes and descriptions are needed to interpret the data.

1.2.2.1 Dataset Metadata

1.2.3 Taxon Overview

There are 46,259 taxa recorded in the Neotoma Taxonomy table. These are not exclusively taxonomic records, but include other variables, such as laboratory measurements and other detected features within samples.

1.2.3.1 Taxon Hierarchy

Taxonomic records are structured hierarchically, with highertaxonid pointing to the next highest taxonid in the database. These hierarchies do not necessarily reflect taxonomic hierarchy. Issues with taxon hierarchy may be the result of improper identification of high-level taxa, failure to identify high-level taxa, or duplicate records where multiple higher-level taxa are identified.
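To illustrate how the highertaxonid links can be followed, the sketch below walks upward from a taxon until it reaches a highest-level taxon (taxonid equal to highertaxonid), a taxon with no defined higher taxon, or a loop. The IDs are toy values, and the dictionary `higher` and the function `lineage` are hypothetical stand-ins for a query against the taxa table.

```python
# Toy mapping of taxonid -> highertaxonid (IDs are invented).
higher = {1: 1, 2: 1, 3: 2, 4: None, 5: 4, 6: 7, 7: 6}


def lineage(taxonid, higher):
    """Follow highertaxonid links upward; report how the walk ended."""
    path = [taxonid]
    seen = {taxonid}
    while True:
        parent = higher.get(path[-1])
        if parent is None:
            return path, "undefined"  # no higher taxon recorded
        if parent == path[-1]:
            return path, "root"  # taxonid == highertaxonid
        if parent in seen:
            return path, "cycle"  # guard against circular references
        path.append(parent)
        seen.add(parent)


example_root = lineage(3, higher)
example_gap = lineage(5, higher)
example_cycle = lineage(6, higher)
```

Each of the three end states corresponds to a different data-quality question: valid roots, missing higher taxa, and circular references.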

1.2.3.1.1 Highest-Level Taxa

The highest-level taxa can be identified because they have taxonid==highertaxonid. Within the database there are 29 highest level taxa:

This table is provided largely for information, to help identify records that are marked as “highest level” but should instead be grouped under another taxon.

1.2.3.1.2 Taxa with no relationships

There are 37,530 taxa that represent “leaves” in the Neotoma taxon tree. Of these, 9,566 have no recorded counts (the taxonid does not appear in the ndb.variables table). These are taxa that are not part of a morphotaxonomic hierarchy (so there are no dependent taxa), and also have no associated sample records:
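The two conditions above (no dependent taxa, and no appearance in ndb.variables) can be expressed as two set operations. This is a sketch with toy IDs; treating a self-referencing root with no other children as a leaf is an assumption made here, and `unused_leaves` is a hypothetical name.

```python
# Toy taxa records; the IDs are invented for illustration.
taxa = [
    {"taxonid": 1, "highertaxonid": 1},
    {"taxonid": 2, "highertaxonid": 1},
    {"taxonid": 3, "highertaxonid": 2},
    {"taxonid": 4, "highertaxonid": 2},
    {"taxonid": 5, "highertaxonid": 5},
]
# Taxon IDs that appear in the variables table (toy values).
variable_taxonids = {3}


def unused_leaves(taxa, variable_taxonids):
    """Return leaf taxa (no dependent taxa) that never appear in variables."""
    parents = {
        t["highertaxonid"]
        for t in taxa
        if t["highertaxonid"] is not None
        and t["highertaxonid"] != t["taxonid"]  # ignore self-references
    }
    leaves = [t["taxonid"] for t in taxa if t["taxonid"] not in parents]
    return [tid for tid in leaves if tid not in variable_taxonids]


unused = unused_leaves(taxa, variable_taxonids)
```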

1.2.3.1.3 Taxa with Undefined Higher Taxa

Some taxa do not have defined highertaxonid values. Currently there are 2,823 taxa without a defined higher taxon ID. It is unclear why these taxa do not have related higher taxonomic elements.

1.2.3.2 Duplicated Taxa

Taxa are identified by taxonname and taxagroupid. There are instances of duplicate taxonname, but these should be represented by distinct taxagroupid values. There are 65 taxa where the taxonname is duplicated (and the taxon is valid).
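A check for this kind of duplication could count each (taxonname, taxagroupid) pair and report pairs seen more than once. The sketch below uses invented names and group codes, and `duplicates_within_group` is a hypothetical helper.

```python
from collections import Counter


def duplicates_within_group(rows, field, group="taxagroupid"):
    """Return (value, group) pairs that occur more than once."""
    counts = Counter((row[field], row[group]) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Toy records; names and group codes are illustrative only.
rows = [
    {"taxonname": "Taxon A", "taxagroupid": "G1"},
    {"taxonname": "Taxon A", "taxagroupid": "G1"},  # true duplicate
    {"taxonname": "Taxon B", "taxagroupid": "G1"},
    {"taxonname": "Taxon B", "taxagroupid": "G2"},  # same name, other group
]
dupes = duplicates_within_group(rows, "taxonname")
```

Only the first pair is flagged, because a repeated name in a different taxa group is permitted.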

1.2.3.2.1 Duplicated Taxon Codes

It is possible to have duplicate taxon codes in the database provided the taxa are within different taxon group IDs. However, there may be instances where a taxon code is repeated within the same group. The following taxon identifiers are repeated multiple times within an ecological group:

1.2.3.3 Taxon Synonymies

Although taxonomies are continually updated, Neotoma lets users enter the original taxonomic information and then reference particular synonymies, either associated with particular publications or attributed to specific Neotoma stewards or contacts. This relies on several interacting tables, in particular ndb.synonyms and ndb.synonymy. ndb.synonyms records the links between taxa (in this case validtaxonid and invalidtaxonid).

Critically, there is no direct PK/FK link between these tables. Thus, it is possible for a synonymy at the dataset level to have no attribution. While ndb.synonyms provides the opportunity to define a synonymtype, ndb.synonymy does not, except by relating the validtaxonid and invalidtaxonid in ndb.synonyms to the taxonid and reftaxonid of ndb.synonymy.

The database currently contains 10,295 datasets with synonymy records, and a total of 2,821 attributed synonyms. Of the synonymy records with associated datasetids, 853 have no links in the synonyms table. There is 1 synonym with no attributed contactid or publication.

There are 214 synonymy records where multiple different publications are used to attribute the synonymy. There are also 520 where multiple different individuals are identified as assigning the synonym. There are 1,439 synonyms without any associated publication.
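Because the two tables share no key, finding synonymy records without a link in the synonyms table requires a join on the taxon ID pairs. The sketch below uses an in-memory SQLite database with toy rows. It assumes that validtaxonid pairs with taxonid and invalidtaxonid pairs with reftaxonid, and that a synonymyid column exists; the real schema and pairing should be checked before use.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE synonyms (validtaxonid INTEGER, invalidtaxonid INTEGER);
CREATE TABLE synonymy (
    synonymyid INTEGER,  -- assumed column name
    datasetid  INTEGER,
    taxonid    INTEGER,
    reftaxonid INTEGER
);
INSERT INTO synonyms VALUES (10, 11), (20, 21);
INSERT INTO synonymy VALUES (1, 100, 10, 11), (2, 100, 30, 31), (3, 101, 20, 21);
""")

# Anti-join: keep synonymy rows with no matching pair in synonyms.
unlinked = con.execute("""
SELECT sy.synonymyid
FROM synonymy AS sy
LEFT JOIN synonyms AS s
  ON s.validtaxonid = sy.taxonid
 AND s.invalidtaxonid = sy.reftaxonid
WHERE s.validtaxonid IS NULL
ORDER BY sy.synonymyid
""").fetchall()
```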

1.2.3.4 Duplicated Variables

We use variable IDs (PK: ndb.variables.variableid) to link a taxon with its element, context and units. In general, we do not expect a given combination to be duplicated, since the same variable ID can be reused for that combination. Nevertheless, we do see replication, and it is not clear why.

For 52 variables, the same combination of keys appears under more than one variableid. This issue seems to affect primarily the mammal records:

The ground sloth Paramylodon harlani seems to have the most duplicated records. One possible cause is the way “specimens” are added to the database, which may create a conflict. This issue could be addressed by adding a composite primary key to the table.
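The duplication check and the proposed constraint can both be sketched in SQLite. The column names other than variableid and taxonid are assumptions about the table layout. The sketch uses a composite UNIQUE constraint in place of a composite primary key so that variableid can remain the key, which is a substitution made here for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE variables (
    variableid        INTEGER PRIMARY KEY,
    taxonid           INTEGER,
    variableelementid INTEGER,  -- assumed column name
    variableunitsid   INTEGER,  -- assumed column name
    variablecontextid INTEGER   -- assumed column name
);
INSERT INTO variables VALUES (1, 500, 7, 2, 1), (2, 500, 7, 2, 1), (3, 501, 7, 2, 1);

CREATE TABLE variables_strict (
    variableid        INTEGER PRIMARY KEY,
    taxonid           INTEGER NOT NULL,
    variableelementid INTEGER NOT NULL,
    variableunitsid   INTEGER NOT NULL,
    variablecontextid INTEGER NOT NULL,
    UNIQUE (taxonid, variableelementid, variableunitsid, variablecontextid)
);
""")

# Key combinations that occur under more than one variableid.
dupes = con.execute("""
SELECT taxonid, variableelementid, variableunitsid, variablecontextid, COUNT(*)
FROM variables
GROUP BY taxonid, variableelementid, variableunitsid, variablecontextid
HAVING COUNT(*) > 1
""").fetchall()

# With the constraint in place, a second identical combination is rejected.
con.execute("INSERT INTO variables_strict VALUES (1, 500, 7, 2, 1)")
try:
    con.execute("INSERT INTO variables_strict VALUES (2, 500, 7, 2, 1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The NOT NULL declarations matter in this sketch: SQLite treats NULL values as distinct in a UNIQUE constraint, so nullable key columns would let duplicates through.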

1.2.4 Sites and Datasets

Issues with sites include sites with no associated datasets, duplicated sites and, potentially, sites with missing data.

When we examine sites, we find 646 sites with exactly duplicated site geometries. These sites are distributed globally and across constituent databases.
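One way to detect exactly duplicated geometries is to group site IDs by their geometry. The sketch below compares geometries as text strings, which is an assumption; the site IDs and coordinates are toy values, and `duplicate_geometries` is a hypothetical helper.

```python
from collections import defaultdict

# Toy (siteid, geometry as text) pairs; values are invented.
sites = [
    (1, "POINT(-89.4 43.07)"),
    (2, "POINT(-89.4 43.07)"),
    (3, "POINT(10.0 50.0)"),
]


def duplicate_geometries(sites):
    """Group site IDs by identical geometry text; keep groups of 2 or more."""
    groups = defaultdict(list)
    for siteid, geometry in sites:
        groups[geometry].append(siteid)
    return {geom: ids for geom, ids in groups.items() if len(ids) > 1}


shared = duplicate_geometries(sites)
```

Text comparison only finds exact matches, so sites that differ by a rounding digit would not be grouped by this approach.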

1.2.4.1 Sites without CollectionUnits or Datasets

Some sites appear to have been submitted, but have no associated collectionunit or dataset data:

This is likely the result of failed uploads during the Tilia upload process (see the Tilia upload reference). To ensure that these records are properly cleaned, we need to validate that collection units and datasets are associated with the records. Ultimately, we need to parse the records from Tilia so that the entire file is processed at once, rather than committing individual steps within the upload process.
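Processing a whole file at once, rather than committing each step, is the pattern a database transaction provides. The following is a minimal sketch using SQLite, not the Tilia upload code: the table layouts, the `handle` column, and the `upload` function are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sites (siteid INTEGER PRIMARY KEY, sitename TEXT NOT NULL);
CREATE TABLE collectionunits (
    collectionunitid INTEGER PRIMARY KEY,
    siteid INTEGER NOT NULL,
    handle TEXT NOT NULL
);
""")


def upload(con, sitename, handles):
    """Insert a site and all of its collection units, or nothing at all."""
    try:
        with con:  # commits on success, rolls back if an exception escapes
            cur = con.execute(
                "INSERT INTO sites (sitename) VALUES (?)", (sitename,)
            )
            for handle in handles:
                con.execute(
                    "INSERT INTO collectionunits (siteid, handle) VALUES (?, ?)",
                    (cur.lastrowid, handle),
                )
        return True
    except sqlite3.IntegrityError:
        return False


ok = upload(con, "Toy Lake", ["TOY1"])
failed = upload(con, "Broken Bog", [None])  # violates NOT NULL on handle
site_names = [r[0] for r in con.execute("SELECT sitename FROM sites ORDER BY siteid")]
```

The second upload fails after its site row has been inserted, yet no orphaned site remains because the whole transaction is rolled back.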

1.2.4.2 Sites without analysis units

Some sites appear to have been submitted, with collection units but no registered analysis units associated with them:

This is also likely the result of failed uploads during the Tilia upload process. To ensure that these records are properly cleaned, we need to validate that collection units and datasets are associated with the records. Ultimately, we need to parse the records from Tilia so that the entire file is processed at once, rather than committing individual steps within the upload process.
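Both validation checks (sites without collection units, and collection units without analysis units) are anti-joins between a parent table and its child table. The sketch below uses SQLite with toy rows; the table and column names are assumptions modelled on the terms used in this report.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sites (siteid INTEGER PRIMARY KEY);
CREATE TABLE collectionunits (
    collectionunitid INTEGER PRIMARY KEY, siteid INTEGER
);
CREATE TABLE analysisunits (
    analysisunitid INTEGER PRIMARY KEY, collectionunitid INTEGER
);
INSERT INTO sites VALUES (1), (2), (3);
INSERT INTO collectionunits VALUES (10, 1), (20, 2);
INSERT INTO analysisunits VALUES (100, 10);
""")

# Sites with no collection unit at all.
no_collunits = [r[0] for r in con.execute("""
SELECT s.siteid
FROM sites AS s
LEFT JOIN collectionunits AS cu ON cu.siteid = s.siteid
WHERE cu.collectionunitid IS NULL
ORDER BY s.siteid
""")]

# Sites with a collection unit that has no analysis units.
no_analysisunits = [r[0] for r in con.execute("""
SELECT DISTINCT cu.siteid
FROM collectionunits AS cu
LEFT JOIN analysisunits AS au ON au.collectionunitid = cu.collectionunitid
WHERE au.analysisunitid IS NULL
ORDER BY cu.siteid
""")]
```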

1.2.5 Stewards & Tilia Usage

A total of 20,324 calls to the Tilia API were made over the last year (for which tracking remains available). This represents the transfer of 694MB of data.

Over the past year stewards accessed the database to modify data 141 times. This represents access by 21 distinct stewards.