Data Quality Report#
Generated: 2025-12-01 16:46:52
Summary#
- Total Tests: 16
- Passed: 6 ✓
- Failed: 10 ✗
- Pass Rate: 37.5%
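The summary figures can be cross-checked against the per-category pass rates reported in each section of this report (2/4, 1/4, 2/5 and 1/3). The following is a minimal sketch of that arithmetic; the dictionary keys reuse the category labels from the test output, and the variable names are illustrative only.

```python
# Per-category (passed, total) counts copied from the section pass rates.
category_results = {
    "referential_integrity": (2, 4),
    "data_completeness": (1, 4),
    "data_validity": (2, 5),
    "business_rules": (1, 3),
}

total_passed = sum(passed for passed, _ in category_results.values())
total_tests = sum(total for _, total in category_results.values())
total_failed = total_tests - total_passed
pass_rate = 100 * total_passed / total_tests

print(f"{total_passed}/{total_tests} passed, {total_failed} failed ({pass_rate:.1f}%)")
```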
Referential Integrity#
Pass Rate: 2/4
❌ Failed Tests#
ref_001: datasets_referenced_by_samples#
Severity: ERROR
Description: All datasets should be referenced by at least one sample
Affected Tables: datasets, samples
Error: AssertionError
Test Failed: datasets_referenced_by_samples (ref_001)
Category: referential_integrity
Rationale: Every dataset should have associated samples. A dataset without samples indicates either incomplete data entry or orphaned records.
Expected: No violations
Found: 99 violations
Sample violations:
RealDictRow({'datasetid': 16232, 'datasetname': 'Rio Dell Assemblage'})
RealDictRow({'datasetid': 18052, 'datasetname': 'Halifax Lakes'})
RealDictRow({'datasetid': 18053, 'datasetname': 'Halifax Lakes'})
RealDictRow({'datasetid': 21425, 'datasetname': 'gravity corer'})
RealDictRow({'datasetid': 21673, 'datasetname': 'Rusk County ANSP'})
RealDictRow({'datasetid': 22700, 'datasetname': None})
RealDictRow({'datasetid': 22702, 'datasetname': None})
RealDictRow({'datasetid': 24307, 'datasetname': 'EPD E# 1080'})
RealDictRow({'datasetid': 24309, 'datasetname': 'EPD E# 1082'})
RealDictRow({'datasetid': 24310, 'datasetname': 'EPD E# 1082'})
Assertion: assert 99 == 0, where 99 is the length of the returned violation list.
Remediation:
- Check if samples were never entered for this dataset
- Verify if dataset should be archived/deleted
- Contact data owner for clarification
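Every failure in this report follows the same pattern: a check query returns the violating rows, and the test asserts that the list is empty. The sketch below shows one way such an assertion could produce the "Expected / Found / Sample violations" text. The helper name `assert_no_violations`, its signature and the 10-row sample cap are assumptions modelled on the output above, not the test suite's actual code.

```python
def assert_no_violations(rows, test_name, test_id, max_samples=10):
    """Fail with a report-style message if a check query returned any rows.

    Hypothetical helper: the name, signature and message layout are
    assumptions modelled on the error text in this report.
    """
    if not rows:
        return
    samples = "\n".join(repr(row) for row in rows[:max_samples])
    message = (
        f"Test Failed: {test_name} ({test_id})\n"
        f"Expected: No violations\n"
        f"Found: {len(rows)} violations\n"
        f"Sample violations:\n{samples}"
    )
    raise AssertionError(message)


# Toy rows standing in for a query result (not real database records).
toy_rows = [{"datasetid": i, "datasetname": None} for i in range(1, 13)]
try:
    assert_no_violations(toy_rows, "datasets_referenced_by_samples", "ref_001")
except AssertionError as err:
    failure_text = str(err)
```

Capping the sample list keeps the message readable when a check returns thousands of rows, while the full count is still reported.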
ref_003: sites_have_collection_units#
Severity: ERROR
Description: All sites should have at least one collection unit.
Affected Tables: sites, collectionunits
Error: AssertionError
Test Failed: sites_have_collection_units (ref_003)
Category: referential_integrity
Rationale: All sites should have at least one collection unit from which samples are obtained.
Expected: No violations
Found: 81 violations
Sample violations:
RealDictRow({'siteid': 30939, 'collectionunitid': None})
RealDictRow({'siteid': 30940, 'collectionunitid': None})
RealDictRow({'siteid': 30942, 'collectionunitid': None})
RealDictRow({'siteid': 30953, 'collectionunitid': None})
RealDictRow({'siteid': 30999, 'collectionunitid': None})
RealDictRow({'siteid': 31292, 'collectionunitid': None})
RealDictRow({'siteid': 31408, 'collectionunitid': None})
RealDictRow({'siteid': 31409, 'collectionunitid': None})
RealDictRow({'siteid': 31410, 'collectionunitid': None})
RealDictRow({'siteid': 31411, 'collectionunitid': None})
Assertion: assert 81 == 0, where 81 is the length of the returned violation list.
Remediation:
- Remove "floating" sites.
- Ensure that the collection units have not been accidentally deleted.
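The violation rows pair each `siteid` with a `collectionunitid` of `None`, which is the shape a left join with a NULL filter would produce. The sketch below reproduces that shape on an in-memory SQLite database with toy data. The two-column table definitions and the `siteid` foreign key on `collectionunits` are assumptions for illustration; the report does not show the production schema or the actual check query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript(
    """
    CREATE TABLE sites (siteid INTEGER PRIMARY KEY);
    CREATE TABLE collectionunits (
        collectionunitid INTEGER PRIMARY KEY,
        siteid INTEGER REFERENCES sites (siteid)
    );
    -- Toy data: site 3 has no collection unit.
    INSERT INTO sites (siteid) VALUES (1), (2), (3);
    INSERT INTO collectionunits (collectionunitid, siteid) VALUES (10, 1), (11, 2);
    """
)

# Sites with no matching collection unit come back with a NULL collectionunitid.
orphan_sites = [
    dict(row)
    for row in conn.execute(
        """
        SELECT s.siteid, cu.collectionunitid
        FROM sites AS s
        LEFT JOIN collectionunits AS cu ON cu.siteid = s.siteid
        WHERE cu.collectionunitid IS NULL
        ORDER BY s.siteid
        """
    )
]
conn.close()
print(orphan_sites)
```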
✅ Passed Tests#
- ref_002: samples_have_valid_datasets
- ref_004: valid_taxa_need_highertaxonids
Data Completeness#
Pass Rate: 1/4
❌ Failed Tests#
comp_001: datasets_have_investigators#
Severity: WARNING
Description: Datasets should have at least one principal investigator
Affected Tables: datasets, datasetpis
Error: AssertionError
Test Failed: datasets_have_investigators (comp_001)
Category: data_completeness
Rationale: Every dataset should have an associated principal investigator for data attribution and contact purposes.
Expected: No violations
Found: 6434 violations
Sample violations:
RealDictRow({'datasetid': 6122, 'datasetname': None})
RealDictRow({'datasetid': 8533, 'datasetname': None})
RealDictRow({'datasetid': 66609, 'datasetname': None})
RealDictRow({'datasetid': 10627, 'datasetname': None})
RealDictRow({'datasetid': 5161, 'datasetname': None})
RealDictRow({'datasetid': 10876, 'datasetname': None})
RealDictRow({'datasetid': 6680, 'datasetname': None})
RealDictRow({'datasetid': 8716, 'datasetname': None})
RealDictRow({'datasetid': 10293, 'datasetname': None})
RealDictRow({'datasetid': 9408, 'datasetname': None})
Assertion: assert 6434 == 0, where 6434 is the length of the returned violation list.
Remediation:
- Research and add PI information
- Contact data owner to identify responsible investigator
comp_002: collectionunits_have_dates#
Severity: WARNING
Description: Collection units should have collection dates
Affected Tables: collectionunits
Error: AssertionError
Test Failed: collectionunits_have_dates (comp_002)
Category: data_completeness
Rationale: Collection dates are critical for temporal analysis and data quality.
Expected: No violations
Found: 13637 violations
Sample violations:
RealDictRow({'collectionunitid': 1, 'handle': '15-1'})
RealDictRow({'collectionunitid': 2, 'handle': '15-2'})
RealDictRow({'collectionunitid': 3, 'handle': '16-1'})
RealDictRow({'collectionunitid': 4, 'handle': '17-1'})
RealDictRow({'collectionunitid': 5, 'handle': '17-2'})
RealDictRow({'collectionunitid': 6, 'handle': '17-3'})
RealDictRow({'collectionunitid': 7, 'handle': '3PINES'})
RealDictRow({'collectionunitid': 8, 'handle': 'ABALONE'})
RealDictRow({'collectionunitid': 10, 'handle': 'ADC001'})
RealDictRow({'collectionunitid': 11, 'handle': 'ADYCHA'})
Assertion: assert 13637 == 0, where 13637 is the length of the returned violation list.
Remediation:
- Review original data sources for date information.
- Derive dates from publications where available.
- Record the decision-making processes at a Constituent Database level.
comp_003: taxa_have_been_added_by_stewards#
Severity: WARNING
Description: When a taxon is submitted to Neotoma there should be a person associated with that submission
Affected Tables: taxa
Error: AssertionError
Test Failed: taxa_have_been_added_by_stewards (comp_003)
Category: data_completeness
Rationale: We should know who placed a taxon into the hierarchy so we have some background on any decision making or choices that defined that placement.
Expected: No violations
Found: 5189 violations
Sample violations:
RealDictRow({'taxonid': 30})
RealDictRow({'taxonid': 63})
RealDictRow({'taxonid': 6261})
RealDictRow({'taxonid': 96})
RealDictRow({'taxonid': 97})
RealDictRow({'taxonid': 184})
RealDictRow({'taxonid': 2140})
RealDictRow({'taxonid': 276})
RealDictRow({'taxonid': 296})
RealDictRow({'taxonid': 305})
Assertion: assert 5189 == 0, where 5189 is the length of the returned violation list.
Remediation:
- Confirm placement with stewards and identify those stewards as the validators.
✅ Passed Tests#
- comp_004: sample_ages_for_samples
Data Validity#
Pass Rate: 2/5
❌ Failed Tests#
valid_003: valid_terminal_taxa_have_values#
Severity: WARNING
Description: Taxa that are identified as 'leaves' in the database should be associated with values in the database.
Affected Tables: taxa, variables
Error: AssertionError
Test Failed: valid_terminal_taxa_have_values (valid_003)
Category: data_validity
Rationale: If a taxon was entered into the database as a terminal leaf in the hierarchy, it ought to be associated with some data entry. If it is absent from the variables table, then it was entered and never used.
Expected: No violations
Found: 17761 violations
Sample violations:
RealDictRow({'taxonid': 183, 'taxoncode': '[Mimdae]', 'taxonname': 'Mimosoideae', 'author': 'de Candolle, 1825', 'valid': False, 'highertaxonid': None, 'extinct': False, 'taxagroupid': 'VPL', 'publicationid': 9757, 'validatorid': 44, 'validatedate': datetime.date(2020, 7, 27), 'notes': None, 'recdatecreated': datetime.datetime(2012, 3, 21, 0, 0), 'recdatemodified': datetime.datetime(2020, 7, 27, 2, 4, 41), 'count': 0})
RealDictRow({'taxonid': 187, 'taxoncode': 'Bryida.ud', 'taxonname': 'Bryopsida undiff.', 'author': 'Pax, 1900', 'valid': True, 'highertaxonid': 659, 'extinct': False, 'taxagroupid': 'BRY', 'publicationid': 311, 'validatorid': 44, 'validatedate': datetime.date(2015, 1, 3), 'notes': None, 'recdatecreated': datetime.datetime(2013, 1, 1, 0, 0), 'recdatemodified': datetime.datetime(2015, 1, 3, 16, 55, 26), 'count': 0})
RealDictRow({'taxonid': 196, 'taxoncode': '[Osu.ud]', 'taxonname': 'Osmunda undiff.', 'author': 'Linnaeus, 1753', 'valid': False, 'highertaxonid': None, 'extinct': False, 'taxagroupid': 'VPL', 'publicationid': 9777, 'validatorid': 44, 'validatedate': datetime.date(2017, 4, 5), 'notes': None, 'recdatecreated': datetime.datetime(2012, 3, 21, 0, 0), 'recdatemodified': datetime.datetime(2017, 4, 5, 18, 44, 51), 'count': 0})
RealDictRow({'taxonid': 227, 'taxoncode': '[Pll.fi/my]', 'taxonname': 'Polygonella fimbriata/P. myriophylla', 'author': '(Elliott) Horton, 1963|(Small) Horton, 1963', 'valid': False, 'highertaxonid': None, 'extinct': False, 'taxagroupid': 'VPL', 'publicationid': 10135, 'validatorid': 44, 'validatedate': datetime.date(2017, 10, 17), 'notes': None, 'recdatecreated': datetime.datetime(2012, 3, 21, 0, 0), 'recdatemodified': datetime.datetime(2017, 10, 17, 15, 42, 41), 'count': 0})
RealDictRow({'taxonid': 229, 'taxoncode': '[Pol.ud]', 'taxonname': 'Polygonum undiff.', 'author': 'Linnaeus, 1753', 'valid': False, 'highertaxonid': None, 'extinct': False, 'taxagroupid': 'VPL', 'publicationid': 3702, 'validatorid': 44, 'validatedate': datetime.date(2017, 10, 17), 'notes': None, 'recdatecreated': datetime.datetime(2012, 3, 21, 0, 0), 'recdatemodified': datetime.datetime(2017, 10, 17, 3, 6, 4), 'count': 0})
RealDictRow({'taxonid': 273, 'taxoncode': 'Slx.ve', 'taxonname': 'Salix vestita', 'author': 'Pursh, 1814[1813]', 'valid': True, 'highertaxonid': 271, 'extinct': False, 'taxagroupid': 'VPL', 'publicationid': 3715, 'validatorid': 44, 'validatedate': datetime.date(2015, 1, 3), 'notes': None, 'recdatecreated': datetime.datetime(2012, 3, 21, 0, 0), 'recdatemodified': datetime.datetime(2015, 1, 3, 16, 45, 57), 'count': 0})
RealDictRow({'taxonid': 291, 'taxoncode': '[shrb.ud]', 'taxonname': 'Shrubs undiff.', 'author': None, 'valid': False, 'highertaxonid': None, 'extinct': False, 'taxagroupid': 'VPL', 'publicationid': None, 'validatorid': 44, 'validatedate': datetime.date(2015, 1, 10), 'notes': None, 'recdatecreated': datetime.datetime(2013, 1, 1, 0, 0), 'recdatemodified': datetime.datetime(2015, 1, 10, 18, 40, 32), 'count': 0})
RealDictRow({'taxonid': 303, 'taxoncode': '[tree.ud]', 'taxonname': 'Trees undiff.', 'author': None, 'valid': False, 'highertaxonid': None, 'extinct': False, 'taxagroupid': 'VPL', 'publicationid': None, 'validatorid': 44, 'validatedate': datetime.date(2015, 1, 10), 'notes': None, 'recdatecreated': datetime.datetime(2013, 1, 1, 0, 0), 'recdatemodified': datetime.datetime(2015, 1, 10, 18, 48, 53), 'count': 0})
RealDictRow({'taxonid': 333, 'taxoncode': 'Shr', 'taxonname': 'Schrankia', 'author': 'Willdenow, 1806', 'valid': True, 'highertaxonid': 29012, 'extinct': False, 'taxagroupid': 'VPL', 'publicationid': 314, 'validatorid': 44, 'validatedate': datetime.date(2017, 3, 23), 'notes': None, 'recdatecreated': datetime.datetime(2012, 3, 21, 0, 0), 'recdatemodified': datetime.datetime(2017, 3, 23, 14, 49, 36), 'count': 0})
RealDictRow({'taxonid': 401, 'taxoncode': 'Ama.re-t', 'taxonname': 'Amaranthus retroflexus-type', 'author': 'Linnaeus, 1753', 'valid': True, 'highertaxonid': 5431, 'extinct': False, 'taxagroupid': 'VPL', 'publicationid': 3658, 'validatorid': 44, 'validatedate': datetime.date(2017, 2, 7), 'notes': None, 'recdatecreated': datetime.datetime(2012, 3, 21, 0, 0), 'recdatemodified': datetime.datetime(2017, 2, 7, 20, 10, 13), 'count': 0})
Assertion: assert 17761 == 0, where 17761 is the length of the returned violation list.
Remediation:
- Check with data stewards for the particular data type.
- Ensure that the taxa are valid.
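The rationale for this check can be sketched on in-memory rows. The sketch assumes that a "leaf" is a taxon that no other taxon names as its `highertaxonid`, and that usage is counted by `taxonid` in the variables table; both are assumptions, since the report does not show the actual query. The function name and the toy rows are hypothetical.

```python
from collections import Counter


def unused_leaf_taxa(taxa, variables):
    """Return leaf taxa that never appear in the variables rows."""
    # A taxon is treated as a parent if another taxon points to it.
    parent_ids = {
        t["highertaxonid"]
        for t in taxa
        if t["highertaxonid"] is not None and t["highertaxonid"] != t["taxonid"]
    }
    usage = Counter(v["taxonid"] for v in variables)
    return [
        {"taxonid": t["taxonid"], "count": usage[t["taxonid"]]}
        for t in taxa
        if t["taxonid"] not in parent_ids and usage[t["taxonid"]] == 0
    ]


# Toy hierarchy: taxon 1 is the parent of taxa 2 and 3; only taxon 2 is used.
toy_taxa = [
    {"taxonid": 1, "highertaxonid": None},
    {"taxonid": 2, "highertaxonid": 1},
    {"taxonid": 3, "highertaxonid": 1},
]
toy_variables = [{"variableid": 1, "taxonid": 2}]
unused = unused_leaf_taxa(toy_taxa, toy_variables)
print(unused)
```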
valid_005: sample_ages_scaled_properly#
Severity: ERROR
Description: Sample ages and chronologies should have correctly ordered age ranges. Younger ages should always be more recent than older ages.
Affected Tables: sampleages
Error: AssertionError
Test Failed: sample_ages_scaled_properly (valid_005)
Category: data_validity
Rationale: The older and younger dates need to be ordered properly.
Expected: No violations
Found: 2 violations
Sample violations:
RealDictRow({'sampleageid': 138101, 'sampleid': 148850, 'chronologyid': 8389, 'age': None, 'ageyounger': 1970.0, 'ageolder': 1995.0, 'recdatecreated': datetime.datetime(2015, 5, 1, 1, 19, 22), 'recdatemodified': datetime.datetime(2015, 5, 1, 1, 19, 22)})
RealDictRow({'sampleageid': 138102, 'sampleid': 148849, 'chronologyid': 8389, 'age': None, 'ageyounger': 1970.0, 'ageolder': 1995.0, 'recdatecreated': datetime.datetime(2015, 5, 1, 1, 19, 22), 'recdatemodified': datetime.datetime(2015, 5, 1, 1, 19, 22)})
Assertion: assert 2 == 0, where 2 is the length of the returned violation list.
Remediation:
- Likely we just need to flip the ages around from younger to older.
- The chronology may also need some examination.
valid_006: samples_per_analysisunit#
Severity: WARNING
Description: Although some datasets may have multiple samples per analysis unit per dataset, we should generally expect that most analysis units have only one set of samples.
Affected Tables: samples, datasets
Error: AssertionError
Test Failed: samples_per_analysisunit (valid_006)
Category: data_validity
Rationale: We definitely see duplicate samples within an analysis unit (and within a dataset), but we want to make sure that we're not seeing errors here. This is set as a warning, and possibly we can work to improve the query a bit.
Expected: No violations
Found: 2996 violations
Sample violations:
RealDictRow({'array_agg': [1351, 1352], 'datasetid': 202, 'databasename': 'North American Pollen Database'})
RealDictRow({'array_agg': [1346, 1347], 'datasetid': 202, 'databasename': 'North American Pollen Database'})
RealDictRow({'array_agg': [1354, 1355], 'datasetid': 202, 'databasename': 'North American Pollen Database'})
RealDictRow({'array_agg': [675985, 676028, 675974, 676031, 676027, 676035, 676033, 676036, 676030, 676029], 'datasetid': 61117, 'databasename': 'Neotoma'})
RealDictRow({'array_agg': [116071, 116072], 'datasetid': 9566, 'databasename': 'FAUNMAP'})
RealDictRow({'array_agg': [500963, 500966], 'datasetid': 50240, 'databasename': 'European Pollen Database'})
RealDictRow({'array_agg': [676037, 676042, 676038, 676041, 676039, 676045, 676043, 676046, 676044, 676040], 'datasetid': 61117, 'databasename': 'Neotoma'})
RealDictRow({'array_agg': [704752, 704753], 'datasetid': 66097, 'databasename': 'FAUNMAP'})
RealDictRow({'array_agg': [117735, 117736, 117737, 117738], 'datasetid': 10006, 'databasename': 'FAUNMAP'})
RealDictRow({'array_agg': [116633, 116634, 116635, 116636, 116637, 116638, 116639], 'datasetid': 9736, 'databasename': 'FAUNMAP'})
Assertion: assert 2996 == 0, where 2996 is the length of the returned violation list.
Remediation:
- Check the dataset to see if the samples are legitimately multiple samples within a single dataset.
- Check with the original publication, or upload data steward.
- Potentially remove duplicate or empty samples if they exist.
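Each violation row carries an `array_agg` list of sample IDs that share an analysis unit within one dataset. The grouping can be sketched as follows; the `analysisunitid` and `sampleid` keys, the function name and the toy rows are assumptions for illustration, as the report shows only the aggregated output and not the query.

```python
from collections import defaultdict


def multi_sample_units(samples):
    """Group sample IDs by (analysis unit, dataset) and keep groups larger than one."""
    grouped = defaultdict(list)
    for sample in samples:
        key = (sample["analysisunitid"], sample["datasetid"])
        grouped[key].append(sample["sampleid"])
    return [
        {"array_agg": sample_ids, "datasetid": datasetid}
        for (_, datasetid), sample_ids in grouped.items()
        if len(sample_ids) > 1
    ]


# Toy rows: analysis unit 100 has two samples in dataset 7; unit 101 has one.
toy_samples = [
    {"sampleid": 1, "analysisunitid": 100, "datasetid": 7},
    {"sampleid": 2, "analysisunitid": 100, "datasetid": 7},
    {"sampleid": 3, "analysisunitid": 101, "datasetid": 7},
]
flagged_units = multi_sample_units(toy_samples)
print(flagged_units)
```

Because legitimate replicate samples would also be flagged by this kind of grouping, a WARNING severity fits better than ERROR.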
✅ Passed Tests#
- valid_001: coordinates_in_valid_range
- valid_004: sites_not_on_equator
Business Rules#
Pass Rate: 1/3
❌ Failed Tests#
bix_002: taxonnames_are_not_duplicated_within_groups#
Severity: ERROR
Description: Although different ecological groups may have similar taxon names (e.g., Abronia in reptiles, plants, protists and fungi), within groups the taxonomic name should be unique.
Affected Tables: taxa
Error: AssertionError
Test Failed: taxonnames_are_not_duplicated_within_groups (bix_002)
Category: business_rules
Rationale: If the same name is entered multiple times within the taxonomic table for a particular taxonomic group, we can expect that there is likely some issue with conflicting hierarchies that needs to be resolved by the data stewardship team.
Expected: No violations
Found: 140 violations
Sample violations:
RealDictRow({'taxonname': 'Parasida mckittricki', 'taxagroupid': 'INS', 'count': 2})
RealDictRow({'taxonname': 'Rhadine howdeni', 'taxagroupid': 'INS', 'count': 2})
RealDictRow({'taxonname': 'Jussiaea', 'taxagroupid': 'VPL', 'count': 2})
RealDictRow({'taxonname': 'Anchicera', 'taxagroupid': 'INS', 'count': 2})
RealDictRow({'taxonname': 'Acacia seyal', 'taxagroupid': 'VPL', 'count': 2})
RealDictRow({'taxonname': 'Epistrophe', 'taxagroupid': 'INS', 'count': 2})
RealDictRow({'taxonname': 'Dictamnus', 'taxagroupid': 'VPL', 'count': 2})
RealDictRow({'taxonname': 'Heliotropium-type', 'taxagroupid': 'VPL', 'count': 2})
RealDictRow({'taxonname': 'Pilea-type', 'taxagroupid': 'VPL', 'count': 2})
RealDictRow({'taxonname': 'Phyllanthus reticulatus-type', 'taxagroupid': 'VPL', 'count': 2})
Assertion: assert 140 == 0, where 140 is the length of the returned violation list.
Remediation:
- Identify the correct entry, remove duplicate entries.
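The uniqueness rule amounts to counting (`taxonname`, `taxagroupid`) pairs and flagging any pair that occurs more than once. A minimal sketch on toy rows follows; the function name and the placeholder taxon names are hypothetical, and the real check presumably runs as a grouped SQL query against the taxa table.

```python
from collections import Counter


def duplicated_names(taxa):
    """Return name/group pairs that occur more than once."""
    counts = Counter((t["taxonname"], t["taxagroupid"]) for t in taxa)
    return [
        {"taxonname": name, "taxagroupid": group, "count": n}
        for (name, group), n in counts.items()
        if n > 1
    ]


# Toy rows: "Taxon A" is repeated within VPL; the same name in INS is allowed.
toy_taxa = [
    {"taxonid": 1, "taxonname": "Taxon A", "taxagroupid": "VPL"},
    {"taxonid": 2, "taxonname": "Taxon A", "taxagroupid": "VPL"},
    {"taxonid": 3, "taxonname": "Taxon A", "taxagroupid": "INS"},
    {"taxonid": 4, "taxonname": "Taxon B", "taxagroupid": "INS"},
]
duplicates = duplicated_names(toy_taxa)
print(duplicates)
```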
bix_003: variable_elements_in_use#
Severity: WARNING
Description: Over time a number of variable contexts, units and elements have been created but not necessarily used. In some cases this may have resulted from improperly entered data in the Tilia spreadsheet.
Affected Tables: variablecontexts, variableelements, variableunits
Error: AssertionError
Test Failed: variable_elements_in_use (bix_003)
Category: business_rules
Rationale: We want to ensure that the variables defined within the controlled vocabularies match with accepted external values, and that the variables that a user may select for data analysis are reflected in reality.
Expected: No violations
Found: 91 violations
Sample violations:
RealDictRow({'table': 'variablecontexts', 'identifier': 125, 'value': 'Pre-Quaternary'})
RealDictRow({'table': 'variablecontexts', 'identifier': 133, 'value': 'corroded'})
RealDictRow({'table': 'variableunits', 'identifier': 209, 'value': 'g/cm2/yr'})
RealDictRow({'table': 'variableunits', 'identifier': 218, 'value': 'count of PCR replicates'})
RealDictRow({'table': 'variableunits', 'identifier': 27, 'value': 'units not specified'})
RealDictRow({'table': 'variableunits', 'identifier': 241, 'value': 'mmol/mol'})
RealDictRow({'table': 'variableunits', 'identifier': 211, 'value': 'kg/m2/yr'})
RealDictRow({'table': 'variableunits', 'identifier': 46, 'value': 'elemental ratio'})
RealDictRow({'table': 'variableunits', 'identifier': 32, 'value': 'meq/L'})
RealDictRow({'table': 'variableunits', 'identifier': 78, 'value': '1-2 scale'})
Assertion: assert 91 == 0, where 91 is the length of the returned violation list.
Remediation:
- Where possible, remove unused units/elements/contexts
- Ensure any near-duplicates of existing units/elements/contexts are using best-practice or accepted notations.
✅ Passed Tests#
- biz_001: modern_samples_have_recent_dates