14 Site Related Tables
These tables are broadly associated with defining the geospatial position of sampling sites and the properties of these sites. Sites are the central organizing object within Neotoma, and are used to represent the spatial context of any single dataset, or the spatial extent of aggregate datasets. As described in the Database Design Concept, a site contains one or more collection units, each with one or more datasets. Thus, properties such as lakeparameters
, or geopoliticalunits
are associated with the site
, and not directly with the dataset
.
14.1 geopoliticalunits
Lookup table of GeoPoliticalUnits. Table is referenced by the SiteGeoPolitical
table. These are countries and various subdivisions. Countries and subdivisions were acquired from the U.S. Central Intelligence Agency World Factbook6 and the ISO 3166-1 and ISO 3166-2 databases7.
Each GeoPolitical Unit has a rank. GeoPolitical Units with Rank 1 are generally countries. There are a few exceptions, including Antarctica and island territories, such as , which although a Danish territory, is geographically separate and distinct. Rank 2 units are generally secondary political divisions with various designations: e.g. states in the , provinces in , and regions in . For some countries, the secondary divisions are not political but rather distinct geographic entities, such as islands. The secondary divisions of some island nations include either groups of islands or sections of more highly populated islands; however, the actual island on which a site is located is more important information. Some countries also have Rank 3 units, e.g. counties in the and metropolitan departments in . In addition to purely political units, various other administrative regions and geographic entities can be contained in this table. Examples of administrative regions are National Parks and Forests. It might be quite useful, for example, to have a record of all the sites in . These additional units are Rank 4, and they can be added to the database as warranted.
geopoliticalid
(primary key): An arbitrary GeoPolitical identification number.geopoliticalname
: Name of the GeoPolitical Unit, e.g. , .geopoliticalunit
: The name of the unit, e.g. country, state, county, island, governorate, oblast.rank
: The rank of the unit.highergeopoliticalid
: The GeoPoliticalUnit with higher rank, e.g. the country in which a state lies.
14.1.1 SQL Example
This query lists geopoliticalunits
for Canada, along with their political unit name (e.g., “province”), and the rank of that unit. The query uses a recursive CTE, which allows us to move through the table’s internal hierarchy. The recursive query first pulls all rows in which the geopoliticalname
is ‘Canada’, and then accumulates rows as it queries repeatedly to accumulate rows in which the highergeopoliticalid
is contained within the set of geopoliticalid
values in the accumulating table.
WITH RECURSIVE gpuset AS (
SELECT gpu.geopoliticalid,
gpu.geopoliticalname,
gpu.geopoliticalunit,
gpu.rank,
gpu.highergeopoliticalid
FROM ndb.geopoliticalunits AS gpu
WHERE geopoliticalname = 'Canada'
UNION ALL
SELECT gpu.geopoliticalid,
gpu.geopoliticalname,
gpu.geopoliticalunit,
gpu.rank,
gpu.highergeopoliticalid
FROM ndb.geopoliticalunits AS gpu
INNER JOIN gpuset ON gpu.highergeopoliticalid = gpuset.geopoliticalid
)
SELECT geopoliticalid,
geopoliticalname,
geopoliticalunit,
rank
FROM gpuset
ORDER BY rank;
geopoliticalid | geopoliticalname | geopoliticalunit | rank |
---|---|---|---|
756 | Canada | country | 1 |
757 | Alberta | province | 2 |
758 | British Columbia | province | 2 |
759 | Manitoba | province | 2 |
760 | New Brunswick | province | 2 |
761 | Newfoundland and Labrador | province | 2 |
762 | Northwest Territories | territory | 2 |
763 | Nova Scotia | province | 2 |
764 | Nunavut | territory | 2 |
765 | Ontario | province | 2 |
14.1.2 SQL Example
The above query does not sort or structure the data into any sort of hierarchy or internally ordered list. We can use array operators to aggregate the geopoliticalid
s as we move through the hierarchy, so that we add a column gpuhier
that is an integer array, containing each of the levels of the hierarchy. This way we can see that geopoliticalid
10018, “Baffin Island” is an island that is contained within Nunavut, which is a territory within Canada.
WITH RECURSIVE gpuset AS (
SELECT gpu.geopoliticalid,
ARRAY[gpu.geopoliticalid] AS gpuhier,
gpu.geopoliticalname,
gpu.geopoliticalunit,
gpu.rank,
gpu.highergeopoliticalid
FROM ndb.geopoliticalunits AS gpu
WHERE geopoliticalname = 'Canada'
UNION ALL
SELECT gpu.geopoliticalid,
gpuset.gpuhier || gpu.geopoliticalid,
gpu.geopoliticalname,
gpu.geopoliticalunit,
gpu.rank,
gpu.highergeopoliticalid
FROM ndb.geopoliticalunits AS gpu
INNER JOIN gpuset ON gpu.highergeopoliticalid = gpuset.geopoliticalid
)
SELECT geopoliticalid,
gpuhier,
geopoliticalname,
geopoliticalunit,
rank
FROM gpuset
ORDER BY rank;
geopoliticalid | gpuhier | geopoliticalname | geopoliticalunit | rank |
---|---|---|---|---|
756 | {756} | Canada | country | 1 |
757 | {756,757} | Alberta | province | 2 |
758 | {756,758} | British Columbia | province | 2 |
759 | {756,759} | Manitoba | province | 2 |
760 | {756,760} | New Brunswick | province | 2 |
761 | {756,761} | Newfoundland and Labrador | province | 2 |
762 | {756,762} | Northwest Territories | territory | 2 |
763 | {756,763} | Nova Scotia | province | 2 |
764 | {756,764} | Nunavut | territory | 2 |
765 | {756,765} | Ontario | province | 2 |
14.2 lakeparameters
Many sites within Neotoma are lacustrine sites (see depositionalenvironments
). These sites often have associated parameters that can define taphonomic properties for particular dataset types. For example, the literature strongly links basin area to the propoertion of regional and local pollen within a site. Thus lakeparameters
are critical to understanding the depositional environment and context for many datasets. The lakeparameters
table links specific lakeparametertypes
to siteid
, and associated the parameter type with a value.
siteid
(foreign key): THesiteid
to which the parameter is applied.lakeparameterid
(foreign key): The parameter associated with the value.value
: The value for the parameter, in units defined by thelakeparameterid
.
14.2.1 SQL Example
We want to get all lake areas for sites with diatom records where the lake size is over 1km2. Because some lakes have more than one dataset, we aggregate by siteid
, and then aggregate the datasetid
s into an array so that each site is in its own row.
SELECT st.siteid,
st.sitename,
lpt.lakeparameter,
ROUND(lp.value),
ARRAY_AGG(ds.datasetid)
FROM ndb.sites AS st
INNER JOIN ndb.collectionunits AS cu ON cu.siteid = st.siteid
INNER JOIN ndb.datasets AS ds ON ds.collectionunitid = cu.collectionunitid
INNER JOIN ndb.datasettypes AS dst ON dst.datasettypeid = ds.datasettypeid
INNER JOIN ndb.lakeparameters AS lp ON lp.siteid = st.siteid
INNER JOIN ndb.lakeparametertypes AS lpt ON lpt.lakeparameterid = lp.lakeparameterid
WHERE dst.datasettype LIKE 'diatom%'
AND lpt.lakeparametershortname = 'Lake Area'
AND lp.value > 100
GROUP BY st.siteid, st.sitename, lpt.lakeparameter, lp.value
ORDER BY lp.value DESC
LIMIT 12;
siteid | sitename | lakeparameter | round | array_agg |
---|---|---|---|---|
24535 | Lake Okeechobee | Surface area (ha) | 167490 | {54846,50409} |
23000 | Lake Oahe | Surface area (ha) | 134134 | {39735} |
24505 | Lake Champlain | Surface area (ha) | 107360 | {54815,50378} |
13903 | Garças Reservoir | Surface area (ha) | 88156 | {21325} |
22931 | Kentucky Lake | Surface area (ha) | 57598 | {39662} |
24546 | Lake Winnebago | Surface area (ha) | 53989 | {51439,54894} |
9961 | Devils Lake | Surface area (ha) | 44824 | {38861} |
23042 | Livingston Reservoir | Surface area (ha) | 33757 | {39780} |
24630 | Oneida Lake | Surface area (ha) | 20700 | {54985,51530} |
924 | Lake of the Ozarks | Surface area (ha) | 20696 | {39736} |
14.3 lakeparametertypes
The set of distinct lake parameters currently supported by Neotoma as data objects. These include direct measurements of lake basin parameters (area, perimeter, depth) as well as watershed parameters and proportional land use around watersheds. This list can be modified to include new parameters as needed, however these must be added at the database level (rather than through Tilia or the R package).
lakeparameterid
(primary key): Primary key for the lake parameter types.lakeparametercode
: A text string representing a contraction of the full lake parameter name.lakeparametershortname
: A shortened form of the lake parameter name.lakeparameter
: The full lake parameter name with units.
14.3.1 SQL Example
We want to see the coverage of various lake parameters for all records identified as lacustrine within the database. Because depositional environments are stored as a hierarchical table we need to use a recursive CTE to resolve the hierarchy. The CTE dephierarchy
resolves to a table with a row for each of the types of lacustrine depositional environment, along with an array of the “path” required to travel from the depositional environment Lacustrine
to the particular environment name (e.g., Nonspecific Floodplain Lake).
We use the CTE dephierarchy
to generate a total count of all sites that are part of the Lacustrine hierarchy and return that as part of the CTE sitecount
. Finally, we select each of the different lake parameter types, count the number of sites with those records, and divide that value by the total number of lacustrine-type sites in Neotoma.
WITH RECURSIVE dephierarchy AS (
SELECT de.depenvtid,
de.depenvt,
ARRAY[de.depenvtid] AS deagg
FROM ndb.depenvttypes AS de
WHERE de.depenvt = 'Lacustrine'
UNION ALL
SELECT de.depenvtid,
de.depenvt,
dephierarchy.deagg || de.depenvtid
FROM ndb.depenvttypes AS de
INNER JOIN dephierarchy ON de.depenvthigherid = dephierarchy.depenvtid
WHERE NOT de.depenvtid = dephierarchy.depenvtid
),
sitecount AS (
SELECT COUNT(DISTINCT st.siteid) AS total
FROM ndb.sites AS st
INNER JOIN ndb.collectionunits AS cu ON cu.siteid = st.siteid
INNER JOIN dephierarchy AS dh ON dh.depenvtid = cu.depenvtid
)
SELECT lpt.lakeparameter,
ROUND((COUNT(DISTINCT st.siteid)::decimal / (SELECT total FROM sitecount)::decimal) * 100) AS pct
FROM ndb.sites AS st
INNER JOIN ndb.collectionunits AS cu ON cu.siteid = st.siteid
INNER JOIN ndb.depenvttypes AS dev ON dev.depenvtid = cu.depenvtid
INNER JOIN ndb.lakeparameters AS lp ON lp.siteid = st.siteid
INNER JOIN ndb.lakeparametertypes AS lpt ON lpt.lakeparameterid = lp.lakeparameterid
INNER JOIN dephierarchy AS dh ON dh.depenvtid = cu.depenvtid
GROUP BY lpt.lakeparameterid
ORDER BY pct DESC;
lakeparameter | pct |
---|---|
Surface area (ha) | 26 |
Maximum depth (m) | 25 |
Watershed area (ha) | 17 |
Shoreline length (km) | 16 |
Watershed forest (%) | 13 |
Watershed water coverage excluding sample lake (%) | 12 |
Watershed barren land (%) | 12 |
Watershed agricultural/range (%) | 12 |
Watershed urban (non-residential and residential) (%) | 12 |
Watershed wetlands (%) | 11 |
14.4 sitegeopolitical
This table lists the GeoPolitical units in which sites occur. A site may appear in the table multiple times, for example, some sites may be listed as being in both “Alsace” and “France”, so be sure to use terms such as DISTINCT
or provide some post-processing to your data returns a single instance of each site if you are doing querying across multiple geopolitical ranks.
sitegeopoliticalid
(primary key): An arbitrary Site GeoPolitical identification number.siteid
(foreign key): Site identification number. Field links to theSites
table.geopoliticalid
(foreign key): GeoPolitical identification number. Field links to theGeoPoliticalUnits
lookup table.
14.4.1 SQL Example
The following query provides a short list of sites at each geopolitical level, for sites in France, using the sitegeopolitical
and geopoliticalunits
. We make use of the array_agg()
aggregator to produce an array of siteids for each geopolitical unit, and we limit the return to a maximum of five elements per unit.
WITH RECURSIVE gpuset AS (
SELECT gpu.geopoliticalid,
gpu.highergeopoliticalid
FROM ndb.geopoliticalunits AS gpu
WHERE geopoliticalname = 'France'
UNION ALL
SELECT gpu.geopoliticalid,
gpu.highergeopoliticalid
FROM ndb.geopoliticalunits AS gpu
INNER JOIN gpuset ON gpu.highergeopoliticalid = gpuset.geopoliticalid
)
SELECT (array_agg(sgp.siteid))[1:5],
gpu.geopoliticalname
FROM gpuset
INNER JOIN ndb.sitegeopolitical AS sgp ON gpuset.geopoliticalid = sgp.geopoliticalid
INNER JOIN ndb.geopoliticalunits AS gpu ON gpu.geopoliticalid = sgp.geopoliticalid
GROUP BY gpu.geopoliticalname
LIMIT 10;
array_agg | geopoliticalname |
---|---|
{3091,16128,16143} | Ain |
{363,15897} | Aisne |
{2969,3046,3235,3416,27388} | Alpes-de-Haute-Provence |
{3016,3244,3301,3394,16179} | Alpes-Maritimes |
{2988,16147} | Alsace |
{3427,3428,16165} | Aquitaine |
{28277} | Aude |
{2972,3048,3049,3050,3122} | Auvergne |
{15855,15857} | Aveyron |
{3110,3111,3113,16180,16181} | Basse-Normandie |
14.5 siteimages
This table stores hyperlinks to jpeg images of sites. At present there are very few records with stored site images (9 records, as of June 2022), and the links appear to be largely dead. The table is retained since it may provide utility in the future.
siteimageid
(primary key): An arbitrary Site Image identification number.siteid
(foreign key): Site identification number. Field links to theSites
table.contactid
(foreign key): Contact identification number for image attribution from theContacts
table.caption
: Caption for the image.credit
: Credit for the image. If null, the credit is formed from the ContactID.date
: Date of photograph or image.siteimage
: Hyperlink to a URL for the image.
14.6 sites
The sites
table stores information about the geographic information specific to a sampling location, including the name, geographic bounding box, spatial polygon, and site description. Sites generally have an areal extent and can be circumscribed by a latitude-longitude box (using the latitudenorth
and other numeric bounding variables), but the geospatial object is also stored as a PostGIS geog
object using a WGS84 projection. Site data ingested from legacy databases may have included only point locations, and additionally, may have been recorded using DMS notation, meaning the decimal expression may appear overly accurate.
The lat-long box can be used either to circumscribe the areal extent of a site or to provide purposeful imprecision to the site location. Site location may be imprecise because of the original description was vague, e.g. «a gravel bar 5 miles east of town», or because the investigators, land owner, or land management agency may not want the exact location made public, perhaps to prevent looting and vandalism. In the first case, the lat-long box can be made sufficiently large to encompass the true location and in the second case to prevent exact location.
siteid
(Primary Key): An arbitrary Site identification number.sitename
: Name of the site.- Alternative names, including archaeological site numbers, are placed in square brackets, for example: “New #4 [Lloyd’s Rock Hole]” and “Modoc Rock Shelter [11RA501]”.
- A search of the
sitename
field for any of the alternative names or for the archaeological site number will find the site. Some archaeological sites are known only by their site number. Modifiers to site names are placed in parentheses. Authors are added for generic sites names, especially for surface samples, that are duplicated in the database, for example: “Site 1 (Heusser 1978)”, “Site 1 (Delcourt et al. 1983)”, “Site 1 (Elliot-Fisk et al. 1982)”, “Site 1 (Whitehead and Jackson 1990)” - Historically, when sitenames were duplicated in the database the name was followed by the 2-letter country code and state or province, for example: “Big Lake (US:Maine)”, and “Big Lake (US:Arkansas)”. This past practice has declined as the number of database stewards has increased and the capacity to provide improved geospatial searching has improved.
longitudeeast
: East bounding longitude for a site. Currently generated from an internal trigger through PostGIS to extract the bounding box based on the object spatial polygon.latitudenorth
: North bounding latitude for a site.longitudewest
: West bounding longitude for a site.latitudesouth
: South bounding latitude for a site.altitude
: Altitude of a site in meters. NOTE Altitude may be positive or negative, but should not reflect the depth of the sample site within a basin (ocean or lake). This is described by thecollectionunits
waterdepth
field.area
: Area of a site in hectares.sitedescription
: Free form description of a site, including such information as physiography and vegetation around the site.notes
: Free form notes or comments about the site.
14.6.1 SQL Example
Site names are often repeated, because geographic names are not required to be unique. Here we examine sites with common names and use PostGIS to calculate the minimum distance between sites with similar names.
First we use a CTE to clean the names of the sites. Given the sitename
protocol described above, we remove the terminal bracketed term from all site names using regexp_replace()
. We place this into a table called cleannames
that contains siteid
, the cleaned name and the geog
field, the geographic object representing the spatial location of the site. From here, we join the table on itself (as name_one
and name_two
), linking the two tables by common cleanname
. We also apply a WHERE
statement that applies the condition name_one.siteid < name_two.siteid
that makes sure we’re not calculating self-distance, or calculating distances twice (e.g., for site (1 -> 2) and site (2 -> 1)).
The actual inter-site distance is calculated using the PostGIS function ST_Distance
, which returns distance in meters. We convert to km using a combination of rounding and multiplication.
WITH cleannames AS (
SELECT st.siteid, regexp_replace(st.sitename, ' \(.*$', '') AS cleanname, st.geog
FROM ndb.sites AS st
WHERE st.sitename LIKE '%\(%' ESCAPE '\'
)
SELECT name_one.cleanname,
ROUND(MIN(ST_distance(name_one.geog, name_two.geog)) / 100) / 10 AS closest,
COUNT(*) + 1 AS sites
FROM cleannames AS name_one
INNER JOIN cleannames AS name_two ON name_one.cleanname = name_two.cleanname
WHERE name_one.siteid < name_two.siteid
GROUP BY name_one.cleanname
ORDER BY sites DESC, cleanname
LIMIT 15;
cleanname | closest | sites |
---|---|---|
Site 1 | 33.6 | 211 |
Site 2 | 77.6 | 106 |
Site 3 | 39.5 | 106 |
Old Crow Flats | 0.0 | 79 |
Site 4 | 131.7 | 79 |
Site 5 | 305.2 | 79 |
Site 6 | 221.6 | 67 |
St. Catherines Island | 0.0 | 67 |
Site 11 | 230.0 | 56 |
Site 7 | 102.6 | 56 |