Table: ndb.datasets

Description

This table stores the data for Datasets. A Dataset is the set of samples for a particular data type from a Collection Unit. A Collection Unit may have multiple Datasets for different data types, for example one dataset for pollen and another for plant macrofossils. Every Sample is assigned to a Dataset, and every Dataset is assigned to a Collection Unit. Samples from different Collection Units cannot be assigned to the same Dataset (although they may be assigned to Aggregate Datasets).
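The one-to-many relationship between a Collection Unit and its Datasets can be inspected with a query along the following lines. This is only a sketch: it assumes the lookup table is named ndb.datasettypes and has a datasettype name column, neither of which is documented on this page, and 1234 is a placeholder collection unit ID.

```sql
-- List every dataset attached to one collection unit, with its data type.
-- Assumptions (not confirmed by this page): the lookup table is named
-- ndb.datasettypes and exposes a "datasettype" name column.
SELECT
    d.datasetid,
    d.datasetname,
    dt.datasettype
FROM ndb.datasets AS d
INNER JOIN ndb.datasettypes AS dt
    ON d.datasettypeid = dt.datasettypeid
WHERE d.collectionunitid = 1234  -- placeholder ID, replace with a real one
ORDER BY dt.datasettype, d.datasetid;
```

A collection unit with both a pollen and a plant macrofossil dataset would return one row per dataset.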

TODO: Expand this description with:

  • What data does this table store?
  • What is the business/research purpose?
  • How is this data collected or generated?
  • Are there any important caveats or data quality issues?

Table Structure

Visual Schema

Schema: ndb | Table Comment: identical to the Description above.

Statistics

| Metric | Value |
|---|---|
| Row Count | 55,439 |
| Total Size | 25 MB |
| Table Size | 7912 kB |
| Indexes Size | 17 MB |

Relationships

Primary Key: datasetid

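Foreign Keys: collectionunitid (links to the CollectionUnits table), datasettypeid (links to the DatasetTypes lookup table), embargoid (referenced table not documented here)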

Referenced By:

TODO: Document which tables reference this table (will be auto-detected in validation).

Data Dictionary

| Column | Type | Nullable | Default | Constraints | Description |
|---|---|---|---|---|---|
| datasetid | integer | | nextval('ndb.seq_datasets_d... | PRIMARY KEY | An arbitrary Dataset identification number. |
| collectionunitid | integer | | - | FOREIGN KEY | Collection Unit identification number. Field links to the CollectionUnits table. |
| datasettypeid | integer | | - | FOREIGN KEY | Dataset Type identification number. Field links to the DatasetTypes lookup table. |
| datasetname | character varying(80) | | - | - | Optional name for the Dataset. |
| notes | text | | - | - | Free form notes or comments about the Dataset. |
| recdatecreated | timestamp without time zone | | timezone('UTC'::text, now()) | - | |
| recdatemodified | timestamp without time zone | | - | - | |
| embargoid | integer | | - | FOREIGN KEY | |

TODO: Review column descriptions and add comments where missing.

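Usage Examples

Example 1: Basic Selection

-- Get the 10 most recently added datasets (highest datasetid first)
SELECT *
FROM ndb.datasets
ORDER BY datasetid DESC
LIMIT 10;

Purpose: Retrieve the 10 rows with the highest datasetid values. Because datasetid is drawn from a sequence, these are typically the most recently inserted datasets.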

Example 2: Count Records

-- Count all rows in the table
SELECT COUNT(*) AS total_records
FROM ndb.datasets;

Purpose: Get the total number of datasets in ndb.datasets

Example 3: Filter by Date Range

-- Get datasets created in 2024 (recdatecreated defaults to the UTC insert time)
SELECT *
FROM ndb.datasets
WHERE recdatecreated >= '2024-01-01'
  AND recdatecreated < '2025-01-01'
ORDER BY recdatecreated DESC;

Purpose: Retrieve datasets whose records were created from 2024-01-01 (inclusive) to 2025-01-01 (exclusive), newest first

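Example 4: Join with collectionunits

-- Join each dataset to its collection unit
-- (assumes collectionunits is in the same ndb schema as datasets)
SELECT
    d.*,
    cu.*
FROM ndb.datasets AS d
INNER JOIN ndb.collectionunits AS cu
    ON d.collectionunitid = cu.collectionunitid
ORDER BY d.datasetid
LIMIT 100;

Purpose: Retrieve up to 100 datasets together with all columns of their collection units. Columns present in both tables (such as collectionunitid) appear twice in the result; list columns explicitly to avoid this.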

Example 5: Aggregate Data

-- Count datasets per collection unit and show the 10 largest counts
SELECT
    collectionunitid,
    COUNT(*) AS dataset_count
FROM ndb.datasets
GROUP BY collectionunitid
ORDER BY dataset_count DESC
LIMIT 10;

Purpose: Find the 10 collection units with the most datasets

TODO: Add more specific examples relevant to common research questions or operational tasks.

Data Quality Notes

Automated Data Quality Tests

This table is subject to the following automated quality checks:

❌ ref_001: datasets_referenced_by_samples

  • Severity: ERROR
  • Status: FAILED
  • Description: All datasets should be referenced by at least one sample
  • Suggested Remediation:
      - Check if samples were never entered for this dataset
      - Verify if dataset should be archived/deleted
      - Contact data owner for clarification
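To find the datasets behind a ref_001 failure, an anti-join along these lines may help. It is a sketch, not the test's actual implementation, and it assumes that samples are stored in ndb.samples with a datasetid column (the Description says every Sample is assigned to a Dataset, but the table and column names are not given on this page).

```sql
-- Datasets that no sample references (candidates for the ref_001 failure).
-- Assumption: samples live in ndb.samples and carry a datasetid column.
SELECT
    d.datasetid,
    d.datasetname,
    d.collectionunitid
FROM ndb.datasets AS d
WHERE NOT EXISTS (
    SELECT 1
    FROM ndb.samples AS s
    WHERE s.datasetid = d.datasetid
)
ORDER BY d.datasetid;
```

NOT EXISTS is used rather than NOT IN so that NULL datasetid values in the samples table, if any, do not suppress results.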

✅ ref_002: samples_have_valid_datasets

  • Severity: ERROR
  • Status: PASSED
  • Description: All samples must reference a valid dataset

❌ comp_001: datasets_have_investigators

  • Severity: WARNING
  • Status: FAILED
  • Description: Datasets should have at least one principal investigator
  • Suggested Remediation:
      - Research and add PI information
      - Contact data owner to identify responsible investigator
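Datasets missing a principal investigator could be listed with a query like the one below. This is a hypothetical sketch: it assumes a link table named ndb.datasetpis with a datasetid column, which this page does not document, so adjust the names to match the actual schema.

```sql
-- Datasets with no principal investigator on record (comp_001 candidates).
-- Assumption: PIs are linked through a table named ndb.datasetpis that
-- has a datasetid column; this name is not confirmed by this page.
SELECT
    d.datasetid,
    d.datasetname
FROM ndb.datasets AS d
LEFT JOIN ndb.datasetpis AS pi
    ON pi.datasetid = d.datasetid
WHERE pi.datasetid IS NULL  -- no matching PI row
ORDER BY d.datasetid;
```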

❌ valid_006: samples_per_analysisunit

  • Severity: WARNING
  • Status: FAILED
  • Description: Although a dataset may legitimately have multiple samples for one analysis unit, most analysis units are expected to have only one set of samples per dataset.
  • Suggested Remediation:
      - Check the dataset to see if the samples are legitimately multiple samples within a single dataset.
      - Check the original publication, or consult the data steward who uploaded the data.
      - Potentially remove duplicate or empty samples if they exist.
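The analysis units flagged by valid_006 can be located by counting samples per dataset and analysis unit. The query below is a sketch under the assumption that ndb.samples has both datasetid and analysisunitid columns; neither is documented on this page.

```sql
-- Dataset / analysis unit pairs that have more than one sample.
-- Assumption: ndb.samples has datasetid and analysisunitid columns.
SELECT
    s.datasetid,
    s.analysisunitid,
    COUNT(*) AS sample_count
FROM ndb.samples AS s
GROUP BY s.datasetid, s.analysisunitid
HAVING COUNT(*) > 1
ORDER BY sample_count DESC, s.datasetid;
```

Rows returned here are not necessarily errors; each pair still needs to be checked against the remediation steps above.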

See the Data Quality Report for details.

Maintenance

  • Data Owner: TODO: Assign owner
  • Update Frequency: TODO: Document frequency
  • Last Major Schema Change: TODO: Document when schema last changed

TODO: Link to:

  • Related API endpoints
  • Data collection procedures
  • Analysis notebooks or reports that use this table
  • External ontologies or standards