14 Site Related Tables

These tables are broadly associated with defining the geospatial position of sampling sites and the properties of these sites. Sites are the central organizing object within Neotoma, and are used to represent the spatial context of any single dataset, or the spatial extent of aggregate datasets. As described in the Database Design Concept, a site contains one or more collection units, each with one or more datasets. Thus, properties such as lakeparameters, or geopoliticalunits are associated with the site, and not directly with the dataset.

14.1 `geopoliticalunits`

Lookup table of GeoPoliticalUnits. Table is referenced by the SiteGeoPolitical table. These are countries and various subdivisions. Countries and subdivisions were acquired from the U.S. Central Intelligence Agency World Factbook⁶ and the ISO 3166-1 and ISO 3166-2 databases⁷.

Each GeoPolitical Unit has a rank. GeoPolitical Units with Rank 1 are generally countries. There are a few exceptions, including Antarctica and island territories, such as , which although a Danish territory, is geographically separate and distinct. Rank 2 units are generally secondary political divisions with various designations: e.g. states in the , provinces in , and regions in . For some countries, the secondary divisions are not political but rather distinct geographic entities, such as islands. The secondary divisions of some island nations include either groups of islands or sections of more highly populated islands; however, the actual island on which a site is located is more important information. Some countries also have Rank 3 units, e.g. counties in the and metropolitan departments in . In addition to purely political units, various other administrative regions and geographic entities can be contained in this table. Examples of administrative regions are National Parks and Forests. It might be quite useful, for example, to have a record of all the sites in . These additional units are Rank 4, and they can be added to the database as warranted.

geopoliticalid (primary key): An arbitrary GeoPolitical identification number.
geopoliticalname: Name of the GeoPolitical Unit, e.g. , .
geopoliticalunit: The name of the unit, e.g. country, state, county, island, governorate, oblast.
rank: The rank of the unit.
highergeopoliticalid: The GeoPoliticalUnit with higher rank, e.g. the country in which a state lies.

14.1.1 SQL Example

This query lists geopoliticalunits for Canada, along with their political unit name (e.g., “province”), and the rank of that unit. The query uses a recursive CTE, which allows us to move through the table’s internal hierarchy. The recursive query first pulls all rows in which the geopoliticalname is ‘Canada’, and then accumulates rows as it queries repeatedly to accumulate rows in which the highergeopoliticalid is contained within the set of geopoliticalid values in the accumulating table.

WITH RECURSIVE gpuset AS (
  SELECT gpu.geopoliticalid,
         gpu.geopoliticalname,
         gpu.geopoliticalunit,
         gpu.rank,
         gpu.highergeopoliticalid
  FROM ndb.geopoliticalunits AS gpu
  WHERE geopoliticalname = 'Canada'
  UNION ALL
  SELECT gpu.geopoliticalid, 
         gpu.geopoliticalname, 
         gpu.geopoliticalunit,
         gpu.rank,
         gpu.highergeopoliticalid 
  FROM ndb.geopoliticalunits AS gpu
  INNER JOIN gpuset ON gpu.highergeopoliticalid = gpuset.geopoliticalid
)
SELECT geopoliticalid, 
       geopoliticalname,
       geopoliticalunit,
       rank
FROM gpuset
ORDER BY rank;

Table 14.1: Displaying records 1 - 10
geopoliticalid	geopoliticalname	geopoliticalunit	rank
756	Canada	country	1
757	Alberta	province	2
758	British Columbia	province	2
759	Manitoba	province	2
760	New Brunswick	province	2
761	Newfoundland and Labrador	province	2
762	Northwest Territories	territory	2
763	Nova Scotia	province	2
764	Nunavut	territory	2
765	Ontario	province	2

14.1.2 SQL Example

The above query does not sort or structure the data into any sort of hierarchy or internally ordered list. We can use array operators to aggregate the geopoliticalids as we move through the hierarchy, so that we add a column gpuhier that is an integer array, containing each of the levels of the hierarchy. This way we can see that geopoliticalid 10018, “Baffin Island” is an island that is contained within Nunavut, which is a territory within Canada.

WITH RECURSIVE gpuset AS (
  SELECT gpu.geopoliticalid,
         ARRAY[gpu.geopoliticalid] AS gpuhier,
         gpu.geopoliticalname,
         gpu.geopoliticalunit,
         gpu.rank,
         gpu.highergeopoliticalid
  FROM ndb.geopoliticalunits AS gpu
  WHERE geopoliticalname = 'Canada'
  UNION ALL
  SELECT gpu.geopoliticalid, 
         gpuset.gpuhier || gpu.geopoliticalid,
         gpu.geopoliticalname, 
         gpu.geopoliticalunit,
         gpu.rank,
         gpu.highergeopoliticalid 
  FROM ndb.geopoliticalunits AS gpu
  INNER JOIN gpuset ON gpu.highergeopoliticalid = gpuset.geopoliticalid
)
SELECT geopoliticalid, 
       gpuhier,
       geopoliticalname,
       geopoliticalunit,
       rank
FROM gpuset
ORDER BY rank;

Table 14.2: Displaying records 1 - 10
geopoliticalid	gpuhier	geopoliticalname	geopoliticalunit	rank
756	{756}	Canada	country	1
757	{756,757}	Alberta	province	2
758	{756,758}	British Columbia	province	2
759	{756,759}	Manitoba	province	2
760	{756,760}	New Brunswick	province	2
761	{756,761}	Newfoundland and Labrador	province	2
762	{756,762}	Northwest Territories	territory	2
763	{756,763}	Nova Scotia	province	2
764	{756,764}	Nunavut	territory	2
765	{756,765}	Ontario	province	2

14.2 `lakeparameters`

Many sites within Neotoma are lacustrine sites (see depositionalenvironments). These sites often have associated parameters that can define taphonomic properties for particular dataset types. For example, the literature strongly links basin area to the propoertion of regional and local pollen within a site. Thus lakeparameters are critical to understanding the depositional environment and context for many datasets. The lakeparameters table links specific lakeparametertypes to siteid, and associated the parameter type with a value.

siteid (foreign key): THe siteid to which the parameter is applied.
lakeparameterid (foreign key): The parameter associated with the value.
value: The value for the parameter, in units defined by the lakeparameterid.

14.2.1 SQL Example

We want to get all lake areas for sites with diatom records where the lake size is over 1km². Because some lakes have more than one dataset, we aggregate by siteid, and then aggregate the datasetids into an array so that each site is in its own row.

SELECT st.siteid, 
       st.sitename,
       lpt.lakeparameter,
       ROUND(lp.value),
       ARRAY_AGG(ds.datasetid)
FROM ndb.sites AS st 
INNER JOIN ndb.collectionunits    AS  cu ON           cu.siteid = st.siteid
INNER JOIN ndb.datasets           AS  ds ON ds.collectionunitid = cu.collectionunitid
INNER JOIN ndb.datasettypes       AS dst ON   dst.datasettypeid = ds.datasettypeid 
INNER JOIN ndb.lakeparameters     AS  lp ON           lp.siteid = st.siteid 
INNER JOIN ndb.lakeparametertypes AS lpt ON lpt.lakeparameterid = lp.lakeparameterid
WHERE dst.datasettype LIKE 'diatom%'
AND lpt.lakeparametershortname = 'Lake Area'
AND lp.value > 100
GROUP BY st.siteid, st.sitename, lpt.lakeparameter, lp.value
ORDER BY lp.value DESC
LIMIT 12;

Table 14.3: Displaying records 1 - 10
siteid	sitename	lakeparameter	round	array_agg
24535	Lake Okeechobee	Surface area (ha)	167490	{54846,50409}
23000	Lake Oahe	Surface area (ha)	134134	{39735}
24505	Lake Champlain	Surface area (ha)	107360	{54815,50378}
13903	Garças Reservoir	Surface area (ha)	88156	{21325}
22931	Kentucky Lake	Surface area (ha)	57598	{39662}
24546	Lake Winnebago	Surface area (ha)	53989	{51439,54894}
9961	Devils Lake	Surface area (ha)	44824	{38861}
23042	Livingston Reservoir	Surface area (ha)	33757	{39780}
24630	Oneida Lake	Surface area (ha)	20700	{54985,51530}
924	Lake of the Ozarks	Surface area (ha)	20696	{39736}

14.3 `lakeparametertypes`

The set of distinct lake parameters currently supported by Neotoma as data objects. These include direct measurements of lake basin parameters (area, perimeter, depth) as well as watershed parameters and proportional land use around watersheds. This list can be modified to include new parameters as needed, however these must be added at the database level (rather than through Tilia or the R package).

lakeparameterid (primary key): Primary key for the lake parameter types.
lakeparametercode: A text string representing a contraction of the full lake parameter name.
lakeparametershortname: A shortened form of the lake parameter name.
lakeparameter: The full lake parameter name with units.

14.3.1 SQL Example

We want to see the coverage of various lake parameters for all records identified as lacustrine within the database. Because depositional environments are stored as a hierarchical table we need to use a recursive CTE to resolve the hierarchy. The CTE dephierarchy resolves to a table with a row for each of the types of lacustrine depositional environment, along with an array of the “path” required to travel from the depositional environment Lacustrine to the particular environment name (e.g., Nonspecific Floodplain Lake).

We use the CTE dephierarchy to generate a total count of all sites that are part of the Lacustrine hierarchy and return that as part of the CTE sitecount. Finally, we select each of the different lake parameter types, count the number of sites with those records, and divide that value by the total number of lacustrine-type sites in Neotoma.

WITH RECURSIVE dephierarchy AS (
  SELECT de.depenvtid,
         de.depenvt,
         ARRAY[de.depenvtid] AS deagg
  FROM ndb.depenvttypes AS de
  WHERE de.depenvt = 'Lacustrine'
  UNION ALL 
  SELECT de.depenvtid,
         de.depenvt,
         dephierarchy.deagg || de.depenvtid
  FROM ndb.depenvttypes AS de 
  INNER JOIN dephierarchy ON de.depenvthigherid = dephierarchy.depenvtid
  WHERE NOT de.depenvtid = dephierarchy.depenvtid
),
sitecount AS (
       SELECT COUNT(DISTINCT st.siteid) AS total
       FROM ndb.sites AS st
       INNER JOIN ndb.collectionunits AS cu ON    cu.siteid = st.siteid
       INNER JOIN        dephierarchy AS dh ON dh.depenvtid = cu.depenvtid   
)
SELECT lpt.lakeparameter, 
       ROUND((COUNT(DISTINCT st.siteid)::decimal / (SELECT total FROM sitecount)::decimal) * 100) AS pct
FROM ndb.sites AS st
INNER JOIN     ndb.collectionunits AS  cu ON           cu.siteid = st.siteid
INNER JOIN        ndb.depenvttypes AS dev ON       dev.depenvtid = cu.depenvtid
INNER JOIN      ndb.lakeparameters AS  lp ON           lp.siteid = st.siteid
INNER JOIN  ndb.lakeparametertypes AS lpt ON lpt.lakeparameterid = lp.lakeparameterid
INNER JOIN            dephierarchy AS  dh ON        dh.depenvtid = cu.depenvtid
GROUP BY lpt.lakeparameterid
ORDER BY pct DESC;

Table 14.4: Displaying records 1 - 10
lakeparameter	pct
Surface area (ha)	26
Maximum depth (m)	25
Watershed area (ha)	17
Shoreline length (km)	16
Watershed forest (%)	13
Watershed water coverage excluding sample lake (%)	12
Watershed barren land (%)	12
Watershed agricultural/range (%)	12
Watershed urban (non-residential and residential) (%)	12
Watershed wetlands (%)	11

14.4 `sitegeopolitical`

This table lists the GeoPolitical units in which sites occur. A site may appear in the table multiple times, for example, some sites may be listed as being in both “Alsace” and “France”, so be sure to use terms such as DISTINCT or provide some post-processing to your data returns a single instance of each site if you are doing querying across multiple geopolitical ranks.

sitegeopoliticalid (primary key): An arbitrary Site GeoPolitical identification number.
siteid (foreign key): Site identification number. Field links to the Sites table.
geopoliticalid (foreign key): GeoPolitical identification number. Field links to the GeoPoliticalUnits lookup table.

14.4.1 SQL Example

The following query provides a short list of sites at each geopolitical level, for sites in France, using the sitegeopolitical and geopoliticalunits. We make use of the array_agg() aggregator to produce an array of siteids for each geopolitical unit, and we limit the return to a maximum of five elements per unit.

WITH RECURSIVE gpuset AS (
  SELECT gpu.geopoliticalid,
         gpu.highergeopoliticalid
  FROM ndb.geopoliticalunits AS gpu
  WHERE geopoliticalname = 'France'
  UNION ALL
  SELECT gpu.geopoliticalid, 
         gpu.highergeopoliticalid 
  FROM ndb.geopoliticalunits AS gpu
  INNER JOIN gpuset ON gpu.highergeopoliticalid = gpuset.geopoliticalid
)
SELECT (array_agg(sgp.siteid))[1:5],
       gpu.geopoliticalname
FROM gpuset
INNER JOIN  ndb.sitegeopolitical AS sgp ON gpuset.geopoliticalid = sgp.geopoliticalid
INNER JOIN ndb.geopoliticalunits AS gpu ON    gpu.geopoliticalid = sgp.geopoliticalid
GROUP BY gpu.geopoliticalname
LIMIT 10;

Table 14.5: Displaying records 1 - 10
array_agg	geopoliticalname
{3091,16128,16143}	Ain
{363,15897}	Aisne
{2969,3046,3235,3416,27388}	Alpes-de-Haute-Provence
{3016,3244,3301,3394,16179}	Alpes-Maritimes
{2988,16147}	Alsace
{3427,3428,16165}	Aquitaine
{28277}	Aude
{2972,3048,3049,3050,3122}	Auvergne
{15855,15857}	Aveyron
{3110,3111,3113,16180,16181}	Basse-Normandie

14.5 `siteimages`

This table stores hyperlinks to jpeg images of sites. At present there are very few records with stored site images (9 records, as of June 2022), and the links appear to be largely dead. The table is retained since it may provide utility in the future.

siteimageid (primary key): An arbitrary Site Image identification number.
siteid (foreign key): Site identification number. Field links to the Sites table.
contactid (foreign key): Contact identification number for image attribution from the Contacts table.
caption: Caption for the image.
credit: Credit for the image. If null, the credit is formed from the ContactID.
date: Date of photograph or image.
siteimage: Hyperlink to a URL for the image.

14.6 `sites`

The sites table stores information about the geographic information specific to a sampling location, including the name, geographic bounding box, spatial polygon, and site description. Sites generally have an areal extent and can be circumscribed by a latitude-longitude box (using the latitudenorth and other numeric bounding variables), but the geospatial object is also stored as a PostGIS geog object using a WGS84 projection. Site data ingested from legacy databases may have included only point locations, and additionally, may have been recorded using DMS notation, meaning the decimal expression may appear overly accurate.

The lat-long box can be used either to circumscribe the areal extent of a site or to provide purposeful imprecision to the site location. Site location may be imprecise because of the original description was vague, e.g. «a gravel bar 5 miles east of town», or because the investigators, land owner, or land management agency may not want the exact location made public, perhaps to prevent looting and vandalism. In the first case, the lat-long box can be made sufficiently large to encompass the true location and in the second case to prevent exact location.

siteid (Primary Key): An arbitrary Site identification number.
sitename: Name of the site.
- Alternative names, including archaeological site numbers, are placed in square brackets, for example: “New #4 [Lloyd’s Rock Hole]” and “Modoc Rock Shelter [11RA501]”.
- A search of the sitename field for any of the alternative names or for the archaeological site number will find the site. Some archaeological sites are known only by their site number. Modifiers to site names are placed in parentheses. Authors are added for generic sites names, especially for surface samples, that are duplicated in the database, for example: “Site 1 (Heusser 1978)”, “Site 1 (Delcourt et al. 1983)”, “Site 1 (Elliot-Fisk et al. 1982)”, “Site 1 (Whitehead and Jackson 1990)”
- Historically, when sitenames were duplicated in the database the name was followed by the 2-letter country code and state or province, for example: “Big Lake (US:Maine)”, and “Big Lake (US:Arkansas)”. This past practice has declined as the number of database stewards has increased and the capacity to provide improved geospatial searching has improved.
longitudeeast: East bounding longitude for a site. Currently generated from an internal trigger through PostGIS to extract the bounding box based on the object spatial polygon.
latitudenorth: North bounding latitude for a site.
longitudewest: West bounding longitude for a site.
latitudesouth: South bounding latitude for a site.
altitude: Altitude of a site in meters. NOTE Altitude may be positive or negative, but should not reflect the depth of the sample site within a basin (ocean or lake). This is described by the collectionunits waterdepth field.
area: Area of a site in hectares.
sitedescription: Free form description of a site, including such information as physiography and vegetation around the site.
notes: Free form notes or comments about the site.

14.6.1 SQL Example

Site names are often repeated, because geographic names are not required to be unique. Here we examine sites with common names and use PostGIS to calculate the minimum distance between sites with similar names.

First we use a CTE to clean the names of the sites. Given the sitename protocol described above, we remove the terminal bracketed term from all site names using regexp_replace(). We place this into a table called cleannames that contains siteid, the cleaned name and the geog field, the geographic object representing the spatial location of the site. From here, we join the table on itself (as name_one and name_two), linking the two tables by common cleanname. We also apply a WHERE statement that applies the condition name_one.siteid < name_two.siteid that makes sure we’re not calculating self-distance, or calculating distances twice (e.g., for site (1 -> 2) and site (2 -> 1)).

The actual inter-site distance is calculated using the PostGIS function ST_Distance, which returns distance in meters. We convert to km using a combination of rounding and multiplication.

WITH cleannames AS (
       SELECT st.siteid, regexp_replace(st.sitename, ' \(.*$', '') AS cleanname, st.geog
       FROM ndb.sites AS st
       WHERE st.sitename LIKE '%\(%' ESCAPE '\'
)
SELECT name_one.cleanname,
       ROUND(MIN(ST_distance(name_one.geog, name_two.geog)) / 100) / 10 AS closest,
       COUNT(*) + 1 AS sites
FROM cleannames AS name_one
INNER JOIN cleannames AS name_two ON name_one.cleanname = name_two.cleanname
WHERE name_one.siteid < name_two.siteid
GROUP BY name_one.cleanname
ORDER BY sites DESC, cleanname
LIMIT 15;

Table 14.6: Displaying records 1 - 10
cleanname	closest	sites
Site 1	33.6	211
Site 2	77.6	106
Site 3	39.5	106
Old Crow Flats	0.0	79
Site 4	131.7	79
Site 5	305.2	79
Site 6	221.6	67
St. Catherines Island	0.0	67
Site 11	230.0	56
Site 7	102.6	56

14 Site Related Tables

14.1 geopoliticalunits