This report details changes to the Neotoma Paleoecology database since 2023-09-04, and is current up to 2024-09-04. Full documentation of the database can be found in the Neotoma Database Manual. Recent snapshots of the database can be obtained from the Neotoma Snapshot website. This report is generated automatically from an RMarkdown document hosted on GitHub.
Neotoma contains data from 50,245 datasets and 23,086 unique sites. This represents a considerable contribution from members of the scientific community, including 4,171 primary investigators, 1,334 analysts, and stewards for all 39 constituent databases. There are also invaluable and incalculable contributions from the members of the Neotoma Paleoecology Database Community.
Of the 1,366 sites added, 65 were entered as polygons, while 1,301 were entered as single coordinate points. In general, polygons provide more complete information about the site, often representing the particular shape of the depositional environment (lake, archaeological site).
The 2,055 datasets added to Neotoma over the past year include contributions to 17 constituent databases, with the majority from the North American Pollen Database. The same pattern is reflected in dataset types, where we see contributions to 22 dataset types.
Neotoma consists of 39 constituent databases. At any one time some databases may be more active than others.
Neotoma relies on the significant efforts of a volunteer group of data stewards and data contributors. Over the last 12 months, 24 stewards have contributed data to Neotoma, across a range of constituent databases.
Since the API has been implemented there have been a total of 1,851,766 calls to the Neotoma API. These include calls to the core API (api.neotomadb.org), calls to support the Neotoma Landing Pages (data.neotomadb.org) and calls to support Neotoma Explorer (apps.neotomadb.org/explorer).
The main APIs delivered a total of 15 GB of data to users since 2024-03-11.
Several API endpoints are called thousands of times, but these are not necessarily the fastest or the slowest queries. There is no relationship between speed and the number of times an API endpoint is used. The most frequent API calls are:
Sites can be added as either points or polygons. Of the 1,366 sites added, 65 were entered as site polygons, while 1,301 were entered as single coordinate points. In general, polygons provide more complete information about the site, often representing the particular shape of the depositional environment (lake, archaeological site).
Among the 1,366 sites added to the Neotoma Paleoecology Database in the past year, not all were entered with complete metadata. Complete metadata is critical for understanding data context, particularly when site notes and descriptions are needed to interpret the data.
There are 47,357 taxa recorded in the Neotoma Taxonomy table. These are not exclusively taxonomic records, but include other variables, such as laboratory measurements and other detected features within samples.
Taxonomic records are structured hierarchically, with highertaxonid pointing to the next highest taxonid in the database. These hierarchies do not necessarily reflect taxonomic hierarchy. Issues with taxon hierarchy may be the result of improper identification of high-level taxa, failure to identify high-level taxa, or duplicate records where multiple higher-level taxa are identified.
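The hierarchy described above can be traversed by repeatedly following highertaxonid. The following is a minimal sketch, assuming the taxon table has been loaded into a plain dictionary mapping taxonid to highertaxonid; the function name `lineage` and the toy data are hypothetical, and a guard is included for circular references, which are one way a hierarchy can be malformed.

```python
from typing import Dict, List, Optional


def lineage(taxonid: int, parent_of: Dict[int, Optional[int]]) -> List[int]:
    """Return the chain of taxonids from `taxonid` up to the top of its hierarchy."""
    chain = [taxonid]
    seen = {taxonid}
    current = taxonid
    while True:
        parent = parent_of.get(current)
        # Stop at a self-referencing record or at a missing/undefined parent.
        if parent is None or parent == current:
            return chain
        # A parent we have already visited means the hierarchy loops back on itself.
        if parent in seen:
            raise ValueError(f"cycle detected at taxonid {parent}")
        chain.append(parent)
        seen.add(parent)
        current = parent


# Hypothetical toy table (taxonid -> highertaxonid), not real Neotoma records.
toy_taxa = {1: 1, 2: 1, 3: 2, 4: None, 5: 6, 6: 5}

print(lineage(3, toy_taxa))  # [3, 2, 1]
```

In this toy table, taxon 4 has no parent and taxa 5 and 6 point at each other, so `lineage(5, toy_taxa)` raises a `ValueError` rather than looping forever.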
The highest-level taxa can be identified because they have taxonid==highertaxonid. Within the database there are 29 highest-level taxa:
This table is provided largely for information, to help identify records that are flagged as “highest level” but that should otherwise be grouped under another taxon.
There are 38,530 taxa that represent “leaves” in the Neotoma taxon tree. Of these, 9,499 have no recorded counts (the taxonid does not appear in the ndb.variables table). These are taxa that are not part of a morphotaxonomic hierarchy (so there are no dependent taxa), and also have no associated sample records:
Some taxa do not have defined highertaxonid values. Currently there is a count of 2,834 taxa without defined higher taxon IDs. It is unclear why these taxa do not have related higher taxonomic elements.
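The checks described in this section (highest-level taxa, leaf taxa, leaves with no recorded counts, and taxa with no highertaxonid) reduce to a few set operations. The following is a minimal sketch under stated assumptions: the taxon table is represented as a dictionary of taxonid to highertaxonid, and `counted` stands in for the set of taxonid values that appear in ndb.variables. The function name and toy data are hypothetical.

```python
from typing import Dict, Optional, Set


def classify_taxa(
    parent_of: Dict[int, Optional[int]], counted: Set[int]
) -> Dict[str, Set[int]]:
    """Group taxonids into the categories discussed in the report."""
    # A taxon is a parent if some *other* taxon points to it.
    parents = {p for t, p in parent_of.items() if p is not None and p != t}
    # Leaves are taxa that no other taxon points to.
    leaves = set(parent_of) - parents
    return {
        "roots": {t for t, p in parent_of.items() if p == t},
        "leaves": leaves,
        "uncounted_leaves": leaves - counted,
        "missing_parent": {t for t, p in parent_of.items() if p is None},
    }


# Hypothetical toy data, not real Neotoma records.
toy_parent_of = {1: 1, 2: 1, 3: 2, 4: 2, 5: None}
toy_counted = {3}  # taxonids assumed to appear in ndb.variables

toy_result = classify_taxa(toy_parent_of, toy_counted)
print(toy_result["uncounted_leaves"])  # {4, 5}
```

Note that under this definition a taxon with no defined highertaxonid and no dependent taxa (taxon 5 here) is counted both as a leaf and as a record with a missing parent.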
Taxa are identified by taxonname and taxagroupid. There are instances of duplicate taxonname, but these should be represented by distinct taxagroupid values. There are 70 taxa where the taxonname is duplicated (and the taxon is valid).
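The duplicate check described above can be sketched as a grouping step followed by a test for repeated groups. This is a minimal illustration, assuming each taxon is available as a (taxonid, taxonname, taxagroupid) tuple; the function names, taxon names and group codes below are hypothetical placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Taxon = Tuple[int, str, str]  # (taxonid, taxonname, taxagroupid)


def duplicated_names(taxa: Iterable[Taxon]) -> Dict[str, List[Tuple[int, str]]]:
    """Return every taxonname that occurs more than once, with its (taxonid, taxagroupid) rows."""
    by_name = defaultdict(list)
    for taxonid, taxonname, taxagroupid in taxa:
        by_name[taxonname].append((taxonid, taxagroupid))
    return {name: rows for name, rows in by_name.items() if len(rows) > 1}


def same_group_conflicts(
    dupes: Dict[str, List[Tuple[int, str]]]
) -> Dict[str, List[Tuple[int, str]]]:
    """Keep only duplicated names where a taxagroupid is repeated for the same name."""
    conflicts = {}
    for name, rows in dupes.items():
        groups = [group for _, group in rows]
        if len(set(groups)) < len(groups):
            conflicts[name] = rows
    return conflicts


# Hypothetical toy records, not real Neotoma taxa.
toy_taxa_rows = [
    (1, "Taxon A", "GRP1"),
    (2, "Taxon A", "GRP2"),  # same name, different group: acceptable
    (3, "Taxon B", "GRP1"),
    (4, "Taxon B", "GRP1"),  # same name, same group: a conflict
    (5, "Taxon C", "GRP1"),
]

toy_dupes = duplicated_names(toy_taxa_rows)
toy_conflicts = same_group_conflicts(toy_dupes)
print(sorted(toy_dupes))      # ['Taxon A', 'Taxon B']
print(sorted(toy_conflicts))  # ['Taxon B']
```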
It is possible to have duplicate taxon codes in the database provided the taxa are within different taxon group IDs. However, there may be instances where a taxon code is repeated within the same group. The following taxon identifiers are repeated multiple times within an ecological group:
Although taxonomies are continually updated, Neotoma provides the ability to have users enter the original taxonomic information, and then reference particular synonymies, associated with particular publications, or attributed to specific Neotoma stewards or contacts. This relies on several interacting tables, in particular ndb.synonyms and ndb.synonymy.
ndb.synonyms indicates the links between taxa (in this case validtaxonid and invalidtaxonid).
Critically, there is no direct PK/FK link between these tables. Thus, it is possible for a synonymy at the dataset level to have no attribution for the synonymy. While ndb.synonyms also provides the opportunity to define a synonymtype, the synonymy does not, except by relating the validtaxonid and invalidtaxonid in ndb.synonyms to the taxonid and reftaxonid of ndb.synonymy.
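Because the two tables are related only by matching taxon identifiers, finding dataset-level synonymies with no counterpart in ndb.synonyms amounts to an anti-join. The following is a minimal sketch, assuming that validtaxonid corresponds to taxonid and invalidtaxonid corresponds to reftaxonid (the order in which the columns are listed above; this pairing is an assumption, not something confirmed here). The `Synonymy` record type, the function name and the toy data are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Synonymy(NamedTuple):
    """A simplified stand-in for one row of ndb.synonymy."""
    datasetid: int
    taxonid: int
    reftaxonid: int


def unlinked_synonymies(
    synonymy_rows: Iterable[Synonymy],
    synonym_pairs: Set[Tuple[int, int]],
) -> List[Synonymy]:
    """Return synonymy rows whose (taxonid, reftaxonid) has no matching
    (validtaxonid, invalidtaxonid) pair in the synonyms table."""
    return [
        row
        for row in synonymy_rows
        if (row.taxonid, row.reftaxonid) not in synonym_pairs
    ]


# Hypothetical toy data, not real Neotoma records.
toy_synonym_pairs = {(10, 11), (20, 21)}  # (validtaxonid, invalidtaxonid)
toy_synonymy = [
    Synonymy(datasetid=1, taxonid=10, reftaxonid=11),
    Synonymy(datasetid=2, taxonid=30, reftaxonid=31),  # no match in synonyms
    Synonymy(datasetid=3, taxonid=20, reftaxonid=21),
]

toy_unlinked = unlinked_synonymies(toy_synonymy, toy_synonym_pairs)
print([row.datasetid for row in toy_unlinked])  # [2]
```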
The database currently contains 10,457 datasets with synonymies, and a total of 2,822 attributed synonyms. Of the synonyms with associated datasetids, there are 853 synonymies without links in the synonyms table. There is 1 synonym with no attributed contactid or publication.
There are 214 synonymies where multiple different publications are used to attribute the synonymy. There are also 528 where multiple different individuals are identified as assigning the synonym. There are 1,464 synonyms without any associated publication.
We use variable IDs (PK: ndb.variables.variableid) to link a taxon with its element, context and units. In general, we don’t expect these combinations to be duplicated, since the same variable ID can be reused for a given combination. Having said that, we do see replication, and it’s not clear why.
In 52 variables we see that there is duplication of the keys in the variableids. Interestingly, it seems that this is an issue that primarily affects the mammal records:
The ground sloth Paramylodon harlani seems to have the biggest issues. One possible reason for this larger issue is the way “specimens” are added to the database, potentially causing a conflict. This issue should possibly be flagged as a situation where we could add a composite primary key to the table.
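The duplication described above can be detected by grouping variables on the combination of taxon, element, units and context, and reporting any combination that maps to more than one variableid. The following is a minimal sketch; the column names `taxonid`, `variableelementid`, `variableunitsid` and `variablecontextid` are assumptions about the layout of ndb.variables, and the function name and toy rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Tuple

# Assumed column names for the combination that should be unique.
KEY_COLUMNS = ("taxonid", "variableelementid", "variableunitsid", "variablecontextid")


def duplicated_variable_keys(
    variables: Iterable[Mapping[str, object]]
) -> Dict[Tuple[object, ...], List[object]]:
    """Map each repeated (taxon, element, units, context) key to its variableids."""
    ids_by_key = defaultdict(list)
    for row in variables:
        key = tuple(row[column] for column in KEY_COLUMNS)
        ids_by_key[key].append(row["variableid"])
    return {key: ids for key, ids in ids_by_key.items() if len(ids) > 1}


# Hypothetical toy rows, not real Neotoma records.
toy_variables = [
    {"variableid": 1, "taxonid": 100, "variableelementid": 5,
     "variableunitsid": 2, "variablecontextid": None},
    {"variableid": 2, "taxonid": 100, "variableelementid": 5,
     "variableunitsid": 2, "variablecontextid": None},  # repeats the key of variableid 1
    {"variableid": 3, "taxonid": 100, "variableelementid": 6,
     "variableunitsid": 2, "variablecontextid": None},
]

toy_duplicates = duplicated_variable_keys(toy_variables)
print(toy_duplicates)  # {(100, 5, 2, None): [1, 2]}
```

A uniqueness rule on this combination in the database itself, whether expressed as a composite key or in some other form, would prevent such rows from being inserted in the first place; the sketch only reports them after the fact.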
Issues with sites include sites with no associated datasets, duplicated sites and, potentially, sites with missing data.
When we examine sites, we find that there are 704 sites with exactly duplicated site geometries. These sites are distributed globally, and distributed across constituent databases.
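Finding exactly duplicated geometries is a grouping problem: sites are bucketed by geometry, and any bucket with more than one site is reported. The following is a minimal sketch, assuming each geometry has been exported as a text string (for example, well-known text) so that exact duplicates compare equal as strings; the function name, site IDs and coordinates are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def duplicate_geometries(sites: Iterable[Tuple[int, str]]) -> Dict[str, List[int]]:
    """Map each geometry string shared by two or more sites to the siteids that use it."""
    sites_by_geometry = defaultdict(list)
    for siteid, geometry_text in sites:
        # Surrounding whitespace is ignored; otherwise the match must be exact.
        sites_by_geometry[geometry_text.strip()].append(siteid)
    return {geom: ids for geom, ids in sites_by_geometry.items() if len(ids) > 1}


# Hypothetical toy sites, not real Neotoma records.
toy_sites = [
    (1, "POINT(-89.5 43.1)"),
    (2, "POINT(-89.5 43.1)"),
    (3, "POINT(10.0 50.0)"),
]

toy_shared = duplicate_geometries(toy_sites)
print(toy_shared)  # {'POINT(-89.5 43.1)': [1, 2]}
```

String comparison only catches geometries that are written identically; two geometries that describe the same shape with different vertex order or precision would need a spatial comparison instead.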
Some sites appear to have been submitted, but have no associated collectionunit or dataset data:
This is likely the result of failed uploads during the Tilia upload process (see Tilia upload reference here). To ensure that these records are properly cleaned, we need to validate that there are collection units and datasets associated with the records. Ultimately, we need to parse the records from Tilia so that the entire file is processed at once, rather than committing individual steps within the upload process.
Some sites appear to have been submitted, with collection units but no registered analysis units associated with them:
This is also likely the result of failed uploads during the Tilia upload process. To ensure that these records are properly cleaned, we need to validate that there are collection units and datasets associated with the records. Ultimately, we need to parse the records from Tilia so that the entire file is processed at once, rather than committing individual steps within the upload process.
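The difference between committing individual steps and processing a whole upload at once can be illustrated with a database transaction. The following is a minimal sketch using Python's built-in `sqlite3` module and two simplified, hypothetical tables (`sites` and `collectionunits`); it is not the Neotoma schema or the Tilia upload code. A missing `handle` simulates a failure partway through an upload.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with two simplified, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sites (siteid INTEGER PRIMARY KEY, sitename TEXT NOT NULL);
        CREATE TABLE collectionunits (
            collectionunitid INTEGER PRIMARY KEY,
            siteid INTEGER NOT NULL REFERENCES sites(siteid),
            handle TEXT NOT NULL
        );
        """
    )
    return conn


def upload(conn: sqlite3.Connection, sitename: str, handle, atomic: bool) -> bool:
    """Insert a site, then its collection unit. Passing handle=None simulates a failed step."""
    try:
        cur = conn.execute("INSERT INTO sites (sitename) VALUES (?)", (sitename,))
        if not atomic:
            # Step-by-step commit: the site persists even if the next step fails.
            conn.commit()
        conn.execute(
            "INSERT INTO collectionunits (siteid, handle) VALUES (?, ?)",
            (cur.lastrowid, handle),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # Undo everything that has not yet been committed.
        conn.rollback()
        return False


def orphan_sites(conn: sqlite3.Connection) -> list:
    """Return siteids that have no associated collection unit."""
    rows = conn.execute(
        """
        SELECT s.siteid FROM sites AS s
        LEFT JOIN collectionunits AS c ON c.siteid = s.siteid
        WHERE c.collectionunitid IS NULL
        """
    ).fetchall()
    return [row[0] for row in rows]


stepwise = make_db()
upload(stepwise, "Toy Site", None, atomic=False)
print(orphan_sites(stepwise))  # [1]

all_at_once = make_db()
upload(all_at_once, "Toy Site", None, atomic=True)
print(orphan_sites(all_at_once))  # []
```

With step-by-step commits the failed upload leaves a site with no collection unit behind; when both inserts share one transaction, the rollback removes the site as well, so no orphan record remains.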
A total of 5,206 calls to the Tilia API were made since 2024-09-01.