Table: ndb.datasets#
Description#
This table stores the data for Datasets. A Dataset is the set of samples for a particular data type from a Collection Unit. A Collection Unit may have multiple Datasets for different data types, for example one dataset for pollen and another for plant macrofossils. Every Sample is assigned to a Dataset, and every Dataset is assigned to a Collection Unit. Samples from different Collection Units cannot be assigned to the same Dataset (although they may be assigned to Aggregate Datasets).
TODO: Expand this description with:
- What data does this table store?
- What is the business/research purpose?
- How is this data collected or generated?
- Are there any important caveats or data quality issues?
Table Structure#
Schema: ndb | Table Comment: identical to the text under Description above.
Statistics#
| Metric | Value |
|---|---|
| Row Count | 55,439 |
| Total Size | 25 MB |
| Table Size | 7912 kB |
| Indexes Size | 17 MB |
Relationships#
Primary Key: datasetid
Foreign Keys:
- collectionunitid → collectionunits.collectionunitid
- datasettypeid → datasettypes.datasettypeid
- embargoid → embargo.embargoid
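The foreign keys above can be followed with ordinary joins. The query below is a minimal sketch, not a canonical query. It assumes that `datasettypes` and `embargo` live in the `ndb` schema alongside this table, and that `datasettypes` has a text column named `datasettype`; neither is confirmed on this page. Because `embargoid` is nullable, the embargo table is attached with a `LEFT JOIN` so that datasets without an embargo are kept.

```sql
-- Sketch: resolve the dataset type and flag datasets that carry an embargo.
-- Assumptions: ndb.datasettypes.datasettype exists; both lookup tables are in schema ndb.
SELECT
  d.datasetid,
  d.datasetname,
  d.collectionunitid,
  dt.datasettype,                          -- assumed column name
  (e.embargoid IS NOT NULL) AS has_embargo
FROM ndb.datasets AS d
INNER JOIN ndb.datasettypes AS dt
  ON d.datasettypeid = dt.datasettypeid
LEFT JOIN ndb.embargo AS e
  ON d.embargoid = e.embargoid
LIMIT 100;
```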
Referenced By:
TODO: Document which tables reference this table (will be auto-detected in validation).
Data Dictionary#
| Column | Type | Nullable | Default | Constraints | Description |
|---|---|---|---|---|---|
| datasetid | integer | ✗ | nextval('ndb.seq_datasets_d... | PRIMARY KEY | An arbitrary Dataset identification number. |
| collectionunitid | integer | ✗ | - | FOREIGN KEY | Collection Unit identification number. Field links to the CollectionUnits table. |
| datasettypeid | integer | ✗ | - | FOREIGN KEY | Dataset Type identification number. Field links to the DatasetTypes lookup table. |
| datasetname | character varying(80) | ✓ | - | - | Optional name for the Dataset. |
| notes | text | ✓ | - | - | Free form notes or comments about the Dataset. |
| recdatecreated | timestamp without time zone | ✗ | timezone('UTC'::text, now()) | - | |
| recdatemodified | timestamp without time zone | ✗ | - | - | |
| embargoid | integer | ✓ | - | FOREIGN KEY | |
TODO: Review column descriptions and add comments where missing.
Usage Examples#
Example 1: Basic Selection#
Purpose: Retrieve the 10 most recent records from datasets
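No query accompanies this example on the page, so the following is a minimal sketch of one way to do it. It assumes that "most recent" means the latest `recdatecreated` value; ordering by `recdatemodified` or `datasetid` would be equally plausible readings.

```sql
-- Sketch: the 10 most recently created dataset records.
-- Assumption: recency is measured by recdatecreated.
SELECT *
FROM ndb.datasets
ORDER BY recdatecreated DESC
LIMIT 10;
```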
Example 2: Count Records#
Purpose: Get the total number of records in datasets
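No query accompanies this example on the page either. A minimal sketch follows; the alias `dataset_count` is illustrative only.

```sql
-- Sketch: total number of rows in the table.
SELECT COUNT(*) AS dataset_count
FROM ndb.datasets;
```

The result is an exact count at query time and may differ from the Row Count under Statistics, which reflects the moment this page was generated.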
Example 3: Filter by Date Range#
-- Get records created within a date range (start inclusive, end exclusive)
SELECT *
FROM ndb.datasets
WHERE recdatecreated >= '2024-01-01'
  AND recdatecreated < '2025-01-01'
ORDER BY recdatecreated DESC;
Purpose: Retrieve records from datasets within a specific date range
Example 4: Join with collectionunits#
-- Join with the related collectionunits table
SELECT
  t1.*,
  t2.*
FROM ndb.datasets t1
INNER JOIN ndb.collectionunits t2
  ON t1.collectionunitid = t2.collectionunitid
LIMIT 100;
Purpose: Retrieve datasets records with related data from collectionunits
Example 5: Aggregate Data#
-- Count datasets per collection unit and show the 10 largest
SELECT
  collectionunitid,
  COUNT(*) AS dataset_count
FROM ndb.datasets
GROUP BY collectionunitid
ORDER BY dataset_count DESC
LIMIT 10;
Purpose: Count records grouped by collectionunitid
TODO: Add more specific examples relevant to common research questions or operational tasks.
Data Quality Notes#
Automated Data Quality Tests#
This table is subject to the following automated quality checks:
❌ ref_001: datasets_referenced_by_samples
- Severity: ERROR
- Status: FAILED
- Description: All datasets should be referenced by at least one sample
- Suggested Remediation:
  - Check if samples were never entered for this dataset
  - Verify if dataset should be archived/deleted
  - Contact data owner for clarification
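To investigate a ref_001 failure, the offending rows can be listed with an anti-join. This is a sketch of one possible check, not the definition of the automated test. It assumes a table `ndb.samples` with a `datasetid` column, which this page implies ("Every Sample is assigned to a Dataset") but does not document.

```sql
-- Sketch: datasets that no sample references.
-- Assumption: ndb.samples.datasetid links samples to datasets.
SELECT
  d.datasetid,
  d.datasetname,
  d.collectionunitid
FROM ndb.datasets AS d
WHERE NOT EXISTS (
  SELECT 1
  FROM ndb.samples AS s
  WHERE s.datasetid = d.datasetid
)
ORDER BY d.datasetid;
```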
✅ ref_002: samples_have_valid_datasets
- Severity: ERROR
- Status: PASSED
- Description: All samples must reference a valid dataset
❌ comp_001: datasets_have_investigators
- Severity: WARNING
- Status: FAILED
- Description: Datasets should have at least one principal investigator
- Suggested Remediation:
  - Research and add PI information
  - Contact data owner to identify responsible investigator
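A comp_001 failure can be explored the same way. The sketch below assumes that principal investigators are recorded in a link table named `ndb.datasetpis` keyed by `datasetid`; that table is not described on this page, so treat both names as assumptions to verify against the schema.

```sql
-- Sketch: datasets with no principal investigator on record.
-- Assumption: ndb.datasetpis(datasetid, ...) holds one row per dataset and investigator.
SELECT
  d.datasetid,
  d.datasetname
FROM ndb.datasets AS d
LEFT JOIN ndb.datasetpis AS dp
  ON dp.datasetid = d.datasetid
WHERE dp.datasetid IS NULL
ORDER BY d.datasetid;
```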
❌ valid_006: samples_per_analysisunit
- Severity: WARNING
- Status: FAILED
- Description: Although a dataset may legitimately contain multiple samples for one analysis unit, most analysis units are expected to have only one sample per dataset.
- Suggested Remediation:
  - Check the dataset to see whether the multiple samples legitimately belong to a single dataset.
  - Check the original publication, or consult the data steward who uploaded the data.
  - Remove duplicate or empty samples if they exist.
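To find the cases flagged by valid_006, samples can be grouped by dataset and analysis unit. This is a hedged sketch rather than the test's actual query. It assumes `ndb.samples` carries both `datasetid` and `analysisunitid`, which is not documented on this page.

```sql
-- Sketch: analysis units that have more than one sample within the same dataset.
-- Assumption: ndb.samples has datasetid and analysisunitid columns.
SELECT
  s.datasetid,
  s.analysisunitid,
  COUNT(*) AS sample_count
FROM ndb.samples AS s
GROUP BY s.datasetid, s.analysisunitid
HAVING COUNT(*) > 1
ORDER BY sample_count DESC, s.datasetid;
```

Rows returned here are candidates for review, not confirmed errors, since some datasets legitimately hold several samples per analysis unit.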
See the Data Quality Report for details.
Maintenance#
- Data Owner: TODO: Assign owner
- Update Frequency: TODO: Document frequency
- Last Major Schema Change: TODO: Document when schema last changed
Related Documentation#
TODO: Link to:
- Related API endpoints
- Data collection procedures
- Analysis notebooks or reports that use this table
- External ontologies or standards