Table: `ndb.datasets`

Description

This table stores the data for Datasets. A Dataset is the set of samples for a particular data type from a Collection Unit. A Collection Unit may have multiple Datasets for different data types, for example one dataset for pollen and another for plant macrofossils. Every Sample is assigned to a Dataset, and every Dataset is assigned to a Collection Unit. Samples from different Collection Units cannot be assigned to the same Dataset (although they may be assigned to Aggregate Datasets).
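To illustrate the point that one Collection Unit can carry several Datasets of different types, the sketch below lists collection units that have more than one distinct `datasettypeid`. It uses only columns documented on this page. The `ndb.` schema qualification assumes that `ndb` is not already on your `search_path`.

```sql
-- Collection units that hold datasets of more than one type
-- (for example, one pollen dataset and one plant macrofossil dataset).
SELECT
    collectionunitid,
    COUNT(*) AS dataset_count,
    COUNT(DISTINCT datasettypeid) AS dataset_type_count
FROM ndb.datasets
GROUP BY collectionunitid
HAVING COUNT(DISTINCT datasettypeid) > 1
ORDER BY dataset_type_count DESC, collectionunitid
LIMIT 20;
```

The `LIMIT` only keeps the output short; remove it to see every collection unit that matches.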

TODO: Expand this description with:
- What data does this table store?
- What is the business/research purpose?
- How is this data collected or generated?
- Are there any important caveats or data quality issues?

Table Structure

Schema: ndb | Table Comment: identical to the text under Description above.

Statistics

Metric	Value
Row Count	55,439
Total Size	25 MB
Table Size	7912 kB
Indexes Size	17 MB

Relationships

Primary Key: datasetid

Foreign Keys:

collectionunitid → collectionunits.collectionunitid
datasettypeid → datasettypes.datasettypeid
embargoid → embargo.embargoid
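The foreign keys listed above can be followed in one query. The sketch below resolves each dataset's type and collection unit. The `datasettype` column on `ndb.datasettypes` is an assumption, since that table is not documented on this page and may use a different column name.

```sql
-- Follow the datasettypeid and collectionunitid foreign keys for a few datasets.
-- ASSUMPTION: ndb.datasettypes exposes a readable name in a column called datasettype.
SELECT
    d.datasetid,
    d.datasetname,
    dt.datasettype,                             -- assumed column name
    cu.collectionunitid,
    (d.embargoid IS NOT NULL) AS has_embargoid  -- embargoid is nullable
FROM ndb.datasets AS d
INNER JOIN ndb.datasettypes AS dt
    ON d.datasettypeid = dt.datasettypeid
INNER JOIN ndb.collectionunits AS cu
    ON d.collectionunitid = cu.collectionunitid
ORDER BY d.datasetid
LIMIT 20;
```

Because `embargoid` is nullable, any join to `ndb.embargo` should be a `LEFT JOIN`. An `INNER JOIN` would silently drop every dataset that has no embargo row.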

Referenced By:

TODO: Document which tables reference this table (will be auto-detected in validation).

Data Dictionary

Column	Type	Nullable	Default	Constraints	Description
`datasetid`	integer	✗	`nextval('ndb.seq_datasets_d...`	PRIMARY KEY	An arbitrary Dataset identification number.
`collectionunitid`	integer	✗	`-`	FOREIGN KEY	Collection Unit identification number. Field links to the CollectionUnits table.
`datasettypeid`	integer	✗	`-`	FOREIGN KEY	Dataset Type identification number. Field links to the DatasetTypes lookup table.
`datasetname`	character varying(80)	✓	`-`	-	Optional name for the Dataset.
`notes`	text	✓	`-`	-	Free form notes or comments about the Dataset.
`recdatecreated`	timestamp without time zone	✗	`timezone('UTC'::text, now())`	-	-
`recdatemodified`	timestamp without time zone	✗	`-`	-	-
`embargoid`	integer	✓	`-`	FOREIGN KEY	-

TODO: Review column descriptions and add comments where missing.

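Usage Examples

Example 1: Basic Selection

-- Get the most recently added datasets. datasetid is assigned from a
-- sequence, so higher values were generally inserted later.
SELECT *
FROM ndb.datasets
ORDER BY datasetid DESC
LIMIT 10;

Purpose: Retrieve the 10 most recently added records from ndb.datasets, using datasetid as a proxy for insertion order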

Example 2: Count Records

-- Count total records
SELECT COUNT(*) AS total_records
FROM ndb.datasets;

Purpose: Get the total number of records in ndb.datasets

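Example 3: Filter by Date Range

-- Get records created during 2024. The range is half-open (start inclusive,
-- end exclusive); recdatecreated defaults to the current UTC time.
SELECT *
FROM ndb.datasets
WHERE recdatecreated >= '2024-01-01'
  AND recdatecreated < '2025-01-01'
ORDER BY recdatecreated DESC;

Purpose: Retrieve records from ndb.datasets created within a specific date range, newest first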

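Example 4: Join with collectionunits

-- Join each dataset to its collection unit. Dataset columns are listed
-- explicitly so that collectionunitid is not returned twice.
SELECT
    d.datasetid,
    d.datasetname,
    d.datasettypeid,
    cu.*
FROM ndb.datasets AS d
INNER JOIN ndb.collectionunits AS cu
    ON d.collectionunitid = cu.collectionunitid
ORDER BY d.datasetid
LIMIT 100;

Purpose: Retrieve dataset records together with the columns of their collection unit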

Example 5: Aggregate Data

-- Count datasets per collection unit and list the 10 largest
SELECT
    collectionunitid,
    COUNT(*) AS dataset_count
FROM ndb.datasets
GROUP BY collectionunitid
ORDER BY dataset_count DESC
LIMIT 10;

Purpose: Count datasets per collectionunitid and return the 10 collection units with the most datasets

TODO: Add more specific examples relevant to common research questions or operational tasks.

Data Quality Notes

Automated Data Quality Tests

This table is subject to the following automated quality checks:

❌ ref_001: datasets_referenced_by_samples

Severity: ERROR
Status: FAILED
Description: All datasets should be referenced by at least one sample
Suggested Remediation:
- Check if samples were never entered for this dataset
- Verify if dataset should be archived/deleted
- Contact data owner for clarification
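To see which rows trigger ref_001, a query along the following lines can list the datasets that no sample references. It is a sketch, not the test's actual implementation. It assumes that samples are stored in a table named `ndb.samples` with a `datasetid` column, neither of which is documented on this page.

```sql
-- Datasets that are not referenced by any sample (candidates for ref_001).
-- ASSUMPTION: samples live in ndb.samples and carry a datasetid column.
SELECT
    d.datasetid,
    d.datasetname,
    d.collectionunitid
FROM ndb.datasets AS d
WHERE NOT EXISTS (
    SELECT 1
    FROM ndb.samples AS s
    WHERE s.datasetid = d.datasetid
)
ORDER BY d.datasetid;
```

`NOT EXISTS` is used rather than `NOT IN` so that the result stays correct even if the assumed `datasetid` column contains NULLs.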

✅ ref_002: samples_have_valid_datasets

Severity: ERROR
Status: PASSED
Description: All samples must reference a valid dataset

❌ comp_001: datasets_have_investigators

Severity: WARNING
Status: FAILED
Description: Datasets should have at least one principal investigator
Suggested Remediation:
- Research and add PI information
- Contact data owner to identify responsible investigator
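One way to find the datasets behind the comp_001 warning is an anti-join against the table that links datasets to investigators. The table name `ndb.datasetpis` and its `datasetid` column are assumptions made for illustration. This page does not document where principal investigators are stored, so adjust the names to the actual schema.

```sql
-- Datasets with no principal investigator on record (candidates for comp_001).
-- ASSUMPTION: a link table ndb.datasetpis(datasetid, ...) holds one row per dataset PI.
SELECT
    d.datasetid,
    d.datasetname
FROM ndb.datasets AS d
LEFT JOIN ndb.datasetpis AS p
    ON p.datasetid = d.datasetid
WHERE p.datasetid IS NULL   -- no matching PI row
ORDER BY d.datasetid;
```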

❌ valid_006: samples_per_analysisunit

Severity: WARNING
Status: FAILED
Description: Although a dataset may legitimately contain multiple samples for a single analysis unit, most analysis units are expected to have only one sample per dataset.
Suggested Remediation:
- Check the dataset to see whether the multiple samples are legitimate within a single dataset.
- Check the original publication, or consult the data steward who uploaded the data.
- Potentially remove duplicate or empty samples if they exist.
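The dataset and analysis unit pairs flagged by valid_006 can be approximated with a grouped count. This sketch assumes that `ndb.samples` exists and has both `datasetid` and `analysisunitid` columns. It is not taken from the test definition itself.

```sql
-- Analysis units that have more than one sample within the same dataset.
-- ASSUMPTION: ndb.samples has datasetid and analysisunitid columns.
SELECT
    s.datasetid,
    s.analysisunitid,
    COUNT(*) AS sample_count
FROM ndb.samples AS s
GROUP BY s.datasetid, s.analysisunitid
HAVING COUNT(*) > 1
ORDER BY sample_count DESC, s.datasetid
LIMIT 50;
```

Pairs with a high `sample_count` are the most useful starting point when checking for duplicate or empty samples.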

See the Data Quality Report for details.

Maintenance

Data Owner: TODO: Assign owner
Update Frequency: TODO: Document frequency
Last Major Schema Change: TODO: Document when schema last changed
