This report details changes to the Neotoma Paleoecology database since 2022-10-13, and is current up to 2023-10-13. Full documentation of the database can be found in the Neotoma Database Manual. Recent snapshots of the database can be obtained from the Neotoma Snapshot website. This report is generated automatically from an RMarkdown document hosted on GitHub.
Neotoma contains data from 48,245 datasets and 21,762 unique sites. This represents a considerable contribution from members of the scientific community, including 3,842 primary investigators, 1,281 analysts, and stewards for all 37 constituent databases. There are also invaluable and incalculable contributions from the members of the Neotoma Paleoecology Database Community.
Of the 1,180 sites added, 241 have been entered as polygons, while 939 are entered as single coordinate points. In general polygons provide more complete information about the site, often representing the particular shape of the depositional environment (lake, archaeological site).
Of the 2,893 datasets added to Neotoma over the past year, there have been contributions to 18 constituent databases, with the majority from the Academy of Natural Sciences of Drexel University. This pattern of contribution is reflected in contributions to dataset types, where we see contributions to 21 dataset types.
Neotoma consists of 37 constituent databases. At any one time some databases may be more active than others.
Neotoma relies on the significant efforts of a volunteer group of data stewards and data contributors. Over the last 12 months 26 stewards have contributed data to Neotoma, across a range of constituent databases.
Since the API was implemented there have been a total of 4,549,634 calls to the Neotoma API. These include calls to the core API (api.neotomadb.org), calls to support the Neotoma Landing Pages (data.neotomadb.org), and calls to support Neotoma Explorer (apps.neotomadb.org/explorer).
The main APIs delivered a total of 3 GB of data to users over the last year, with the most significant payload beginning in approximately October 2020.
Average response time for the web services was 296 ms, with a maximum response time of 71.04 s. Approximately 5.03% of all responses took more than one second to return data.
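The three figures above (mean, maximum, and the share of responses slower than one second) can be derived from a list of response times. The following is a minimal sketch, assuming the times are available as plain numbers in milliseconds; the function name and the sample values are hypothetical and are not taken from the Neotoma logs.

```python
from statistics import mean


def summarize_response_times(times_ms):
    """Return the mean, the maximum, and the percentage of responses over 1 s."""
    if not times_ms:
        raise ValueError("times_ms must not be empty")
    slow = sum(1 for t in times_ms if t > 1000)
    return {
        "mean_ms": mean(times_ms),
        "max_ms": max(times_ms),
        "pct_over_1s": 100 * slow / len(times_ms),
    }


# Hypothetical sample in milliseconds, for illustration only.
sample = [120, 250, 90, 1500, 300]
summary = summarize_response_times(sample)
print(summary)
```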
Several API endpoints are called thousands of times, but these are not necessarily the fastest or the slowest queries. There is no relationship between speed and the number of times an API endpoint is used. The most frequent API calls are:
The slowest API calls (with the slowest median response time) are only shown for calls with more than 100 instances:
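A ranking of this kind can be sketched as a group-and-filter step: collect response times per endpoint, drop endpoints at or below the call threshold, and sort by median. This is only an illustration; the function name, the endpoint paths, and the log layout of (endpoint, milliseconds) pairs are assumptions, not the report's actual pipeline.

```python
from collections import defaultdict
from statistics import median


def slowest_endpoints(calls, min_calls=100, top=5):
    """Rank endpoints by median response time.

    calls: iterable of (endpoint, response_ms) pairs.
    Only endpoints with more than min_calls calls are ranked.
    Returns a list of (endpoint, median_ms, n_calls), slowest first.
    """
    by_endpoint = defaultdict(list)
    for endpoint, ms in calls:
        by_endpoint[endpoint].append(ms)
    ranked = [
        (endpoint, median(times), len(times))
        for endpoint, times in by_endpoint.items()
        if len(times) > min_calls
    ]
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked[:top]


# Hypothetical log entries; the threshold is lowered so the sample is small.
calls = [("/sites", 100)] * 3 + [("/taxa", 900)] * 3 + [("/rare", 5000)]
ranking = slowest_endpoints(calls, min_calls=2)
print(ranking)
```

Filtering on call count before ranking keeps a single very slow, rarely used call (such as the hypothetical `/rare` above) from dominating the list.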
Sites can be added as either points or polygons. Of the 1,180 sites added, 241 of those are entered as site polygons, while 939 are entered as single coordinate points. In general polygons provide more complete information about the site, often representing the particular shape of the depositional environment (lake, archaeological site).
Among the 1,248 sites added to the Neotoma Paleoecology Database in the past year, not all sites were entered with complete metadata. Complete metadata is critical for better understanding data context, particularly when site notes & descriptions are required to better understand data.
There are 46,259 taxa recorded in the Neotoma Taxonomy table. These are not exclusively taxonomic records, but include other variables, such as laboratory measurements and other detected features within samples.
Taxonomic records are structured hierarchically, with highertaxonid pointing to the next highest taxonid in the database. These hierarchies do not necessarily reflect taxonomic hierarchy. Issues with taxon hierarchy may be the result of improper identification of high level taxa, failure to identify high level taxa, or duplicate records where multiple higher level taxa are identified.
The highest-level taxa can be identified because they have taxonid==highertaxonid. Within the database there are 29 highest-level taxa:
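The self-reference rule for highest-level taxa can be illustrated with a small sketch. It assumes the taxonomy has been loaded into a mapping from taxonid to highertaxonid; the function names and the sample mapping are hypothetical.

```python
def find_root_taxa(parent_of):
    """Return taxon IDs whose highertaxonid points back to themselves."""
    return sorted(t for t, parent in parent_of.items() if parent == t)


def lineage(taxonid, parent_of):
    """Walk from a taxon up to its root, guarding against cycles."""
    path = [taxonid]
    while True:
        parent = parent_of.get(path[-1])
        # Stop at a self-referencing root or at a missing highertaxonid.
        if parent is None or parent == path[-1]:
            return path
        if parent in path:
            raise ValueError(f"cycle detected at taxonid {parent}")
        path.append(parent)


# Hypothetical taxonid -> highertaxonid mapping, not real Neotoma rows.
parent_of = {1: 1, 2: 1, 3: 2, 4: None}
roots = find_root_taxa(parent_of)
path = lineage(3, parent_of)
print(roots, path)
```

Walking each taxon to its root in this way is one possible method for spotting records that end at an unexpected highest-level taxon.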
This table is provided largely for information, to help identify records that are flagged as “highest level” but should instead be grouped under another taxon.
There are 37,530 taxa that represent “leaves” in the Neotoma taxon tree. Of these, 9,566 have no recorded counts (the taxonid does not appear in the ndb.variables table). These are taxa that are not part of a morphotaxonomic hierarchy (so there are no dependent taxa), and also have no associated sample records:
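The definition of an unused leaf can be sketched with set operations: a leaf is a taxon that no other taxon names as its highertaxonid, and it is unused if its taxonid never appears among the variables. The function name and the sample data are hypothetical, and the inputs are assumed to have been extracted from the taxa and ndb.variables tables beforehand.

```python
def unused_leaf_taxa(parent_of, variable_taxonids):
    """Return leaf taxon IDs that never appear in the variables table.

    parent_of: mapping of taxonid -> highertaxonid (None when undefined).
    variable_taxonids: iterable of taxonid values found in the variables.
    """
    # A taxon counts as a parent only when a different taxon points to it.
    parents = {p for t, p in parent_of.items() if p is not None and p != t}
    leaves = set(parent_of) - parents
    return sorted(leaves - set(variable_taxonids))


# Hypothetical data: taxa 3, 4 and 5 are leaves; only taxon 3 has counts.
parent_of = {1: 1, 2: 1, 3: 2, 4: 2, 5: None}
unused = unused_leaf_taxa(parent_of, variable_taxonids=[3])
print(unused)
```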
Some taxa do not have defined highertaxonid values. Currently there are 2,823 taxa without defined higher taxon IDs. It is unclear why these taxa do not have related higher taxonomic elements.
Taxa are identified by taxonname and taxagroupid. There are instances of duplicate taxonname, but these should be represented by distinct taxagroupid values. There are 65 taxa where the taxonname is duplicated (and the taxon is valid).
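The duplicate check described here can be sketched as a grouping step over valid taxa, followed by a test of whether the repeated names fall in distinct groups. The function names and the sample rows are hypothetical; only the field names taxonname, taxagroupid and valid come from the text.

```python
from collections import defaultdict


def duplicated_names(taxa):
    """Map each duplicated taxonname (valid taxa only) to its taxagroupid values."""
    groups = defaultdict(list)
    for taxon in taxa:
        if taxon["valid"]:
            groups[taxon["taxonname"]].append(taxon["taxagroupid"])
    return {name: ids for name, ids in groups.items() if len(ids) > 1}


def same_group_conflicts(taxa):
    """Duplicated names where at least two records share a taxagroupid."""
    return {
        name: ids
        for name, ids in duplicated_names(taxa).items()
        if len(set(ids)) < len(ids)
    }


# Hypothetical rows for illustration; names and group codes are made up.
taxa = [
    {"taxonname": "Taxon A", "taxagroupid": "G1", "valid": True},
    {"taxonname": "Taxon A", "taxagroupid": "G2", "valid": True},
    {"taxonname": "Taxon B", "taxagroupid": "G1", "valid": True},
    {"taxonname": "Taxon B", "taxagroupid": "G1", "valid": True},
    {"taxonname": "Taxon C", "taxagroupid": "G1", "valid": False},
    {"taxonname": "Taxon C", "taxagroupid": "G1", "valid": True},
]
dupes = duplicated_names(taxa)
conflicts = same_group_conflicts(taxa)
print(dupes, conflicts)
```

In this sample "Taxon A" is an acceptable duplicate (distinct groups), while "Taxon B" is the kind of record that would need review.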
It is possible to have duplicate taxon codes in the database provided the taxa are within different taxon group IDs. However, there may be instances where a taxon code is repeated within the same group. The following taxon identifiers are repeated multiple times within an ecological group:
Although taxonomies are continually updated, Neotoma allows users to enter the original taxonomic information and then reference particular synonymies, associated with particular publications or attributed to specific Neotoma stewards or contacts. This relies on several interacting tables, in particular ndb.synonyms and ndb.synonymy.
ndb.synonyms indicates the links between taxa (in this case validtaxonid and invalidtaxonid). Critically, there is no direct PK/FK link between these tables. Thus, it is possible for a synonymy at the dataset level to have no attribution for the synonymy. While ndb.synonyms also provides the opportunity to define a synonymtype, the synonymy does not, except by relating the validtaxonid and invalidtaxonid in ndb.synonyms to the taxonid and reftaxonid of ndb.synonymy.
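Because the two tables are related only through their taxon ID pairs, finding synonymy records without a matching synonym amounts to an anti-join on those pairs. The sketch below assumes that validtaxonid pairs with taxonid and invalidtaxonid with reftaxonid, following the order in which the columns are mentioned above; that pairing, the function name, and the sample rows are all assumptions.

```python
def unlinked_synonymies(synonymy_rows, synonym_rows):
    """Return synonymy rows with no matching pair in the synonyms rows.

    Assumed pairing (not confirmed by the schema):
    synonyms.validtaxonid <-> synonymy.taxonid
    synonyms.invalidtaxonid <-> synonymy.reftaxonid
    """
    linked = {(s["validtaxonid"], s["invalidtaxonid"]) for s in synonym_rows}
    return [
        row
        for row in synonymy_rows
        if (row["taxonid"], row["reftaxonid"]) not in linked
    ]


# Hypothetical rows for illustration only.
synonyms = [{"validtaxonid": 10, "invalidtaxonid": 11, "synonymtype": "x"}]
synonymy = [
    {"datasetid": 1, "taxonid": 10, "reftaxonid": 11},
    {"datasetid": 2, "taxonid": 20, "reftaxonid": 21},
]
missing = unlinked_synonymies(synonymy, synonyms)
print(missing)
```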
The database currently contains 10,295 datasets with synonymies, and a total of 2,821 attributed synonyms. Of the synonyms with associated datasetids, there are 853 synonymies without links in the synonyms table. There is 1 synonym where there is no attributed contactid or publication.
There are 214 synonymies where multiple different publications are used to attribute the synonymy. There are also 520 where multiple different individuals are identified as assigning the synonym. There are 1,439 synonyms without any associated publication.
We use variable IDs (PK: ndb.variables.variableid) to link a taxon, the element, context and units. In general, we don’t expect that these should ever be duplicated, since we can use the same variable ID over and over again for a given combination. Having said that, we do see replication, and it’s not clear why.
In 52 variables we see that there is duplication of the keys in the variableids. Interestingly, it seems that this is an issue that primarily affects the mammal records:
The ground sloth Paramylodon harlani seems to have the biggest issues. Some possible reasons for this larger issue may be associated with the ways “specimens” are added to the database, potentially causing a conflict. This issue should possibly be flagged as a situation where we could add a composite primary key to the table.
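Detecting the duplication described above comes down to counting how often each combination of taxon, element, context and units occurs. The sketch below uses simplified, hypothetical field names for the four key columns (the real column names in ndb.variables may differ) and made-up rows.

```python
from collections import Counter

# Hypothetical, simplified names for the four columns forming the natural key.
KEY_FIELDS = ("taxonid", "elementid", "contextid", "unitsid")


def duplicated_variable_keys(variables):
    """Return {key_tuple: count} for key combinations used by more than one row."""
    counts = Counter(tuple(v[field] for field in KEY_FIELDS) for v in variables)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical rows: variableids 1 and 2 describe the same combination.
variables = [
    {"variableid": 1, "taxonid": 7, "elementid": 3, "contextid": None, "unitsid": 19},
    {"variableid": 2, "taxonid": 7, "elementid": 3, "contextid": None, "unitsid": 19},
    {"variableid": 3, "taxonid": 8, "elementid": 3, "contextid": None, "unitsid": 19},
]
dupe_keys = duplicated_variable_keys(variables)
print(dupe_keys)
```

A composite key or uniqueness constraint over the same four columns would reject the second row at insert time, although how missing element or context values should be treated under such a constraint would need to be decided first.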
Issues with sites include sites with no associated datasets, duplicated sites and, potentially, sites with missing data.
When we examine sites, we find that there are 646 sites with exactly duplicated site geometries. These sites are distributed globally, and distributed across constituent databases.
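Exactly duplicated geometries can be found by grouping site identifiers on the geometry itself. This sketch assumes each geometry is available as a text serialization (for example WKT) written in a consistent way, so that identical shapes produce identical strings; the function name and the sample sites are hypothetical.

```python
from collections import defaultdict


def duplicate_geometries(sites):
    """Group site IDs by geometry text and keep groups with more than one site.

    sites: iterable of (siteid, geometry_text) pairs.
    """
    by_geometry = defaultdict(list)
    for siteid, geometry in sites:
        by_geometry[geometry].append(siteid)
    return {geom: ids for geom, ids in by_geometry.items() if len(ids) > 1}


# Hypothetical sites for illustration only.
sites = [
    (1, "POINT(-89.4 43.1)"),
    (2, "POINT(-89.4 43.1)"),
    (3, "POINT(10.0 50.0)"),
]
shared = duplicate_geometries(sites)
print(shared)
```

Matching on exact text will miss near-duplicates such as points that differ only in the last decimal place, so this is a lower bound on overlapping sites.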
Some sites appear to have been submitted, but have no associated collectionunit or dataset data:
This is likely the result of failed uploads during the Tilia upload process (see Tilia upload reference here). To ensure that these records are properly cleaned we need to validate that there are collection units and datasets associated with the records, and, ultimately, we need to parse the records from Tilia in such a way that we are not committing individual steps within the upload process, but rather, processing the entire file at once.
Some sites appear to have been submitted, with collection units but no registered analysis units associated with them:
This is also likely the result of failed uploads during the Tilia upload process. To ensure that these records are properly cleaned we need to validate that there are collection units and datasets associated with the records, and, ultimately, we need to parse the records from Tilia in such a way that we are not committing individual steps within the upload process, but rather, processing the entire file at once.
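The idea of processing an entire upload at once, rather than committing individual steps, can be illustrated with a single database transaction. The sketch below uses Python's standard sqlite3 module with two simplified stand-in tables; the table layout, the function name and the sample records are hypothetical and do not describe the actual Tilia upload code or the Neotoma schema.

```python
import sqlite3


def upload_record(conn, site_name, handles):
    """Insert a site and its collection units as one all-or-nothing step."""
    with conn:  # commits on success, rolls back if any statement raises
        cur = conn.execute("INSERT INTO sites (sitename) VALUES (?)", (site_name,))
        siteid = cur.lastrowid
        conn.executemany(
            "INSERT INTO collectionunits (siteid, handle) VALUES (?, ?)",
            [(siteid, handle) for handle in handles],
        )
    return siteid


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites (siteid INTEGER PRIMARY KEY, sitename TEXT)")
conn.execute(
    "CREATE TABLE collectionunits ("
    "collectionunitid INTEGER PRIMARY KEY, "
    "siteid INTEGER NOT NULL, handle TEXT NOT NULL)"
)
conn.commit()

upload_record(conn, "Hypothetical Lake", ["CORE1"])
try:
    # The missing handle violates NOT NULL, so the whole upload is rolled back.
    upload_record(conn, "Broken Upload", ["CORE2", None])
except sqlite3.IntegrityError:
    pass

site_count = conn.execute("SELECT COUNT(*) FROM sites").fetchone()[0]
unit_count = conn.execute("SELECT COUNT(*) FROM collectionunits").fetchone()[0]
# Sites that were committed without any collection unit.
orphans = conn.execute(
    "SELECT s.siteid FROM sites s "
    "LEFT JOIN collectionunits c ON c.siteid = s.siteid "
    "WHERE c.siteid IS NULL"
).fetchall()
print(site_count, unit_count, orphans)
```

Because the failed upload is rolled back as a whole, no site is left behind without its collection units, which is the situation the orphan query at the end is meant to detect.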
A total of 20,324 calls to the Tilia API were made over the last year (for which tracking remains available). This represents the transfer of 694 MB of data.
Over the past year stewards accessed the database to modify data 141 times. This represents access by 21 distinct stewards.