Neotoma Database Backup and Restoration Guide#

Overview#

This guide documents the backup and restoration procedures for the Neotoma database system. The Neotoma project maintains multiple backup strategies to ensure data safety and support different operational needs. Understanding when and how to use each method is critical for maintaining data integrity during major operations such as bulk uploads or system maintenance.

Backup Strategies#

The Neotoma database system implements three distinct backup approaches, each designed for specific use cases:

1. AWS Console Server Snapshots#

Full server-level backups created through the AWS RDS console. These snapshots capture the entire database server instance, including all databases (Neotoma proper, the Neotoma holding tank, and system databases).

Best used for:

  • Before major operations (bulk uploads, major schema changes)
  • Complete disaster recovery scenarios
  • Creating point-in-time recovery points

2. SQL-Based Holding Tank Restoration#

A targeted restoration process that rebuilds the holding-tank version of Neotoma from the production Neotoma database using SQL commands.

Best used for:

  • Refreshing the staging/holding tank environment
  • Testing workflows with current production data
  • Resetting the holding tank to a known state

3. Clean Database Repository (Automated Public Snapshots)#

An automated monthly process that creates sanitized database snapshots for public distribution. This process removes sensitive information (passwords, personal contact details) while maintaining data structure and scientific content.

Best used for:

  • Public data distribution
  • Setting up external development environments
  • Providing snapshots for research collaborations

Database Architecture Context#

The Neotoma database server hosts three primary databases:

  • neotoma - The production database containing validated, published data
  • neotomaholdingtank - The staging database for workflow development, data validation, and processing
  • postgres - The system database (required for certain administrative operations)

Understanding this architecture is important because different backup methods interact with these databases in different ways.
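This inventory can be confirmed directly from any psql session on the server (assuming the database names listed above):

```sql
-- List the non-template databases on the instance;
-- expect neotoma, neotomaholdingtank, and postgres
SELECT datname
FROM pg_database
WHERE datistemplate = false
ORDER BY datname;
```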


Method 1: AWS Console Server Snapshots#

Purpose#

Server snapshots create a complete backup of the entire RDS database instance. This is the most comprehensive backup method and includes all databases, configurations, and system tables.

Prerequisites#

  • Access to the AWS Console
  • Appropriate IAM permissions for RDS operations
  • Connection to the correct AWS account/region (US East 2)

Creating a Server Snapshot#

Step 1: Access RDS Console#

  1. Log into the AWS Console
  2. Navigate to RDS (you can search for "RDS" or find it under "Database" services)
  3. Select Aurora and RDS from the services menu
  4. If you've starred this service, it will appear in your favorites bar

Step 2: Select the Database Instance#

  1. Click on Databases in the left navigation menu
  2. Select the neotoma-private instance (or your specific instance name)
  3. You should see the instance status, configuration details, and monitoring information

Step 3: Navigate to Snapshot Options#

  1. Click on the Maintenance & backups tab
  2. Scroll down to the Snapshots section
  3. You will see a list of existing manual and automated snapshots
  4. On the right side, you'll see action buttons: Restore, Delete, and Take snapshot

Step 4: Create the Snapshot#

  1. Click Take snapshot
  2. Configure snapshot settings:
    • Snapshot type: DB instance
    • DB instance identifier: neotoma-private (should auto-populate)
    • Snapshot name: Use a descriptive name following the naming convention
      • Examples: before-node-upload, before-sisol, pre-bulk-upload-2024-01
  3. Click Take snapshot

Step 5: Monitor Snapshot Progress#

  • The snapshot will appear in the snapshots list with a status indicator
  • You can monitor progress in the RDS console under Snapshots > Manual snapshots
  • Snapshot creation time varies based on database size

⚠️ Important: Database performance may be slightly impacted during snapshot creation due to increased I/O operations. Plan snapshots during lower-traffic periods when possible.

Viewing Existing Snapshots#

  1. From the RDS main menu, select Snapshots
  2. Choose Manual snapshots to see user-created backups
  3. Each snapshot displays:
    • Snapshot name
    • Creation date/time
    • Database instance
    • Size
    • Status

Restoring from a Server Snapshot#

⚠️ CRITICAL WARNING: Restoring from a snapshot is a significant operation. The restore process creates a NEW database instance - it does not overwrite the existing one. Ensure you understand the implications before proceeding.

Restoration Process#

  1. Navigate to RDS > Snapshots > Manual snapshots
  2. Select the snapshot you want to restore from
  3. Click Actions > Restore snapshot

📝 Note: The specific restoration steps and configuration options require additional documentation. DO NOT click through the restoration process without clear instructions, as this will immediately begin provisioning a new database instance.

What Gets Restored#

A server snapshot restoration includes:

  • ✅ All databases (neotoma, neotomaholdingtank, postgres)
  • ✅ All user accounts and permissions
  • ✅ All data as of the snapshot creation time
  • ✅ Database configuration settings
  • ✅ Security groups and network settings

Best Practices#

Naming Conventions:

  • Use descriptive, date-stamped names: before-[operation]-YYYY-MM-DD
  • Include the reason for the snapshot: pre-bulk-upload, before-schema-change
  • Keep names concise but meaningful

When to Create Snapshots:

  • Before any bulk data upload operation
  • Prior to schema modifications
  • Before major system updates or maintenance
  • After successful completion of major operations (for rollback points)

Snapshot Management:

  • Review and delete old snapshots periodically to manage storage costs
  • Keep critical milestone snapshots for longer periods
  • Document snapshot purposes for team reference

Additional Options#

Export to S3: Snapshots can also be exported to S3 for long-term archival or cross-region backup. This option is available through the Actions menu when viewing a snapshot.


Method 2: SQL-Based Holding Tank Restoration#

Purpose#

This method provides a quick way to restore the neotomaholdingtank database using the production neotoma database as a template. This is useful for refreshing the staging environment with current production data.

Prerequisites#

  • Access to the PostgreSQL command line (psql) or a database client
  • Database connection credentials
  • IMPORTANT: Connection to the postgres system database (not neotoma or neotomaholdingtank)

Process Overview#

The restoration process involves:

  1. Disconnecting all active connections to the holding tank database
  2. Dropping the existing holding tank database
  3. Creating a new holding tank database using the production database as a template
  4. Verifying the restoration

Understanding the Connection Requirement#

Before running the restoration, you must connect to the postgres system database. This is because:

  • You cannot drop a database while connected to it
  • The template operation requires a neutral connection point
  • Administrative operations need to be run from the system database context
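For example, attempting the drop from a session while other users remain connected fails immediately (illustrative error, assuming the standard database name):

```sql
DROP DATABASE neotomaholdingtank;
-- ERROR:  database "neotomaholdingtank" is being accessed by other users
```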

The Restoration SQL Script#

📝 Note: The complete SQL script file is located in the Neotoma SQL folder repository. The file contains the restoration function and should be reviewed before execution.

Key Steps in the Script#

Step 1: Terminate Active Connections

The script queries pg_stat_activity to identify and terminate all connections to the holding tank database. This is necessary because PostgreSQL cannot drop a database with active connections.
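A minimal sketch of that termination query, assuming the standard holding tank name (the repository script may differ in detail):

```sql
-- Disconnect every session attached to the holding tank,
-- excluding the current session
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'neotomaholdingtank'
  AND pid <> pg_backend_pid();
```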


Step 2: Drop and Recreate

Once connections are terminated, the script:

  • Drops the existing neotomaholdingtank database
  • Creates a new neotomaholdingtank database using neotoma as the template
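These two steps can be sketched as follows, run from the postgres system database (details in the repository script may differ):

```sql
-- Must be executed while connected to the postgres system database
DROP DATABASE IF EXISTS neotomaholdingtank;
CREATE DATABASE neotomaholdingtank WITH TEMPLATE neotoma;
```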

Step 3: Template Copy

Using PostgreSQL's CREATE DATABASE ... WITH TEMPLATE command, the script efficiently copies:

  • Complete schema structure
  • All table data
  • Indexes and constraints
  • User permissions
  • Sequences and other database objects

Execution Instructions#

⚠️ Information Gap: Detailed execution instructions require the actual SQL script file. The following outlines the general approach:

  1. Connect to the postgres database
psql -h [hostname] -U [username] -d postgres
  2. Run the restoration script
    • Option A: Execute the SQL file directly
    • Option B: Run the stored restoration function (if created)
  3. Monitor execution
    • Watch for error messages
    • Confirm successful completion
  4. Verify restoration
    • Check row counts in key tables
    • Verify recent data is present
    • Test application connectivity
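The verification step can start with a cheap sanity check that needs no specific table names (run against the restored neotomaholdingtank; reltuples holds planner estimates, which survive a template copy):

```sql
-- Approximate row counts for the largest ordinary tables
SELECT relname, reltuples::bigint AS approx_rows
FROM pg_class
WHERE relkind = 'r'
ORDER BY reltuples DESC
LIMIT 10;
```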

When to Use This Method#

Appropriate scenarios:

  • Refreshing test/staging environment with production data
  • Setting up the holding tank after a server restore
  • Resetting the holding tank to a known good state
  • Preparing for data validation workflows

Not appropriate for:

  • Production database recovery (use server snapshots instead)
  • Preserving specific holding tank states (consider a holding tank-specific backup)

Important Considerations#

Active Connections:

The script must terminate active connections to succeed. This means:

  • Any users connected to the holding tank will be disconnected
  • Running API processes connecting to the holding tank will fail
  • Coordinate with team members before running

Execution Time:

  • Duration depends on database size
  • Typically faster than a dump/restore operation
  • Still requires time for the template copy operation

Database Locking:

During the operation, the source database (neotoma) cannot have any other active connections: PostgreSQL refuses to copy a template database that is being accessed by other users, so the operation will fail rather than wait. Plan accordingly for production systems.


Method 3: Clean Database Repository (Automated Public Snapshots)#

Purpose#

The Clean Database Repository system creates monthly sanitized snapshots of the Neotoma database for public distribution. This automated process removes sensitive personal information while preserving all scientific data and structure, making it safe for public download and external development use.

System Architecture#

This backup method uses AWS infrastructure to automate the entire process:

  • AWS Batch: Executes the backup job with extended runtime support (beyond Lambda's 15-minute limit)
  • AWS Fargate: Schedules and manages job execution
  • Docker Container: Provides isolated environment with PostgreSQL tools
  • S3 Storage: Hosts the resulting backup files for public access

Automated Schedule#

The clean backup process runs automatically:

  • Schedule: 2:00 AM on the 1st day of each month
  • Trigger: AWS Fargate scheduled task (cron-based)
  • Duration: Varies based on database size (typically 15-45 minutes)

What Gets Sanitized#

The cleaning process protects sensitive information by overwriting:

In the stewards table:

  • ✅ Usernames → Random 10-character strings
  • ✅ Passwords → Random 10-character strings
  • ✅ Email addresses → Anonymized
  • ✅ Physical addresses → Anonymized
  • ✅ Phone numbers → Anonymized
  • ✅ Fax numbers → Anonymized

What is preserved:

  • ✅ All scientific data tables
  • ✅ Complete schema structure
  • ✅ Relationships and foreign keys
  • ✅ All non-personal metadata

What is excluded (for performance):

  • ts.hydralakes - Large spatial dataset
  • ts.icesheets - Large spatial dataset
  • ts.globalmammals - Large spatial dataset

📝 Note: The excluded tables are large spatial datasets primarily used for mapping. Documentation for importing these tables is maintained separately.
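After restoring a clean snapshot, you can confirm that these tables came through schema-only (a hedged check, assuming the ts schema names above; --exclude-table-data keeps the table definitions but drops their rows):

```sql
-- Each of these should exist but contain zero rows in a clean snapshot
SELECT count(*) AS hydralakes_rows    FROM ts.hydralakes;
SELECT count(*) AS icesheets_rows     FROM ts.icesheets;
SELECT count(*) AS globalmammals_rows FROM ts.globalmammals;
```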

Accessing Clean Backups#

S3 Bucket Location:

  • Bucket name: neotoma-remote-store
  • Path: neotoma-remote-store/[files]

Available Files:

  1. Latest Snapshot
    • Filename: neotoma-clean-latest.tar.gz
    • Always points to the most recent backup
    • Recommended for most users
  2. Dated Archives
    • Format: neotoma-clean-MMDD.tar.gz
    • Examples: neotoma-clean-0105.tar.gz, neotoma-clean-0205.tar.gz
    • Retained for historical access
    • One file per month

Using a Clean Backup#

Downloading#

📝 Information Gap: Specific download URLs or AWS CLI commands needed here.

# Example download command (requires AWS CLI)
aws s3 cp s3://neotoma-remote-store/neotoma-clean-latest.tar.gz .

Restoring Locally#

  1. Extract the archive
tar -xzf neotoma-clean-latest.tar.gz
  2. Restore to PostgreSQL
pg_restore -h localhost -U postgres -d neotoma [extracted-file]
  3. Install required extensions
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS postgis_topology;
-- Additional extensions as needed

Technical Implementation Details#

🔧 For System Administrators: The following section describes the internal workings of the automated backup system.

Docker Container Workflow#

The backup process runs in a custom Docker container that:

  1. Installs dependencies
    • PostgreSQL client tools
    • PostGIS extensions
    • OpenSSH, curl, netcat utilities
  2. Sets up directory structure

/app/        - Application scripts
/archives/   - Temporary storage for dumps
/assets/     - Supporting files
/logs/       - Execution logs

  3. Executes backup sequence
    • Starts PostgreSQL service
    • Waits for database readiness (with timeout logic)
    • Runs the scrubbing script
    • Creates compressed archive
    • Uploads to S3
    • Cleans up temporary files

Scrubbing Process Details#

The scrub_database.sh script performs the following operations:

Phase 1: Initial Dump

# Dumps the database, excluding data from the large spatial tables and the cron schema
pg_dump -Fc -h [host] -U [user] -d [database] \
  --exclude-table-data=ts.hydralakes \
  --exclude-table-data=ts.icesheets \
  --exclude-table-data=ts.globalmammals \
  --exclude-schema=cron \
  -f temp_dump.backup

Phase 2: Local Restoration

  • Drops and recreates local database
  • Installs PostGIS extensions
  • Restores from temporary dump

Phase 3: Data Sanitization

-- Overwrites sensitive steward information
UPDATE stewards 
SET username = [random_string],
    password = [random_string],
    email = [anonymized],
    address = [anonymized],
    phone = [anonymized],
    fax = [anonymized];
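One way to realize those placeholders in plain SQL (an illustrative sketch; the actual scrub_database.sh may generate its values differently):

```sql
-- Random 10-character strings for credentials; fixed dummy
-- values for contact fields (all values illustrative).
-- random() is volatile, so each row gets a different string.
UPDATE stewards
SET username = substr(md5(random()::text), 1, 10),
    password = substr(md5(random()::text), 1, 10),
    email    = 'anonymized',
    address  = 'anonymized',
    phone    = 'anonymized',
    fax      = 'anonymized';
```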

Phase 4: Final Export and Upload

  • Creates final compressed dump
  • Uploads to S3 with two names:
    • neotoma-clean-latest.tar.gz (overwrites previous)
    • neotoma-clean-MMDD.tar.gz (date-stamped archive)

Monitoring and Logging#

Log Files:

The container generates detailed logs at each step, viewable through:

  • AWS Batch job logs
  • CloudWatch Logs (if configured)

Log Content:

  • 🔍 User context and permissions
  • 🔍 PostgreSQL data directory location
  • 🔍 Database connection status
  • 🔍 Dump progress indicators
  • 🔍 Sanitization confirmation
  • 🔍 Upload status
  • 🔍 Cleanup completion

Monitoring the Job:

  1. Access AWS Batch console
  2. Navigate to Jobs
  3. Find the scheduled backup job
  4. View logs for execution details

Build and Deployment#

📝 For Developers: Updates to the backup system

The backup system is maintained in the clean_database GitHub repository.

Making Changes:

  1. Modify scripts in the repository:
    • Dockerfile - Container definition
    • scrub_database.sh - Main backup logic
    • batch_entrypoint.sh - Container entry point
    • connect_database.sh - Database connection utilities
  2. Build and push to ECR:

# Run the build script
./build_and_push.sh

This script:

  • Builds the Docker image
  • Tags with appropriate version
  • Pushes to AWS Elastic Container Registry (ECR)
  • Updates the Batch job definition

  3. Test the job:
    • Manually trigger through AWS Batch console
    • Monitor logs for errors
    • Verify S3 upload
    • Test restoration of generated backup

GitHub Actions:

📝 Information Gap: Clarification needed on whether GitHub Actions are used for automated deployment of changes to the backup system.


Choosing the Right Backup Method#

| Scenario | Recommended Method | Rationale |
| --- | --- | --- |
| Before bulk data upload | AWS Server Snapshot | Full recovery point including all databases |
| Before schema changes | AWS Server Snapshot | Complete rollback capability |
| Refresh staging environment | SQL Holding Tank Restore | Quick, targeted update of test data |
| External developer setup | Clean Database Repository | Safe, sanitized data with no sensitive info |
| Monthly public release | Clean Database Repository | Automated, consistent public distribution |
| Disaster recovery | AWS Server Snapshot | Complete system restoration |
| Local development setup | Clean Database Repository | No sensitive data, manageable size |

Quick Reference Commands#

Check Database Connections#

-- View active connections to a specific database
SELECT * FROM pg_stat_activity 
WHERE datname = 'neotomaholdingtank';

Connect to System Database#

# Using psql
psql -h [hostname] -U [username] -d postgres

Download Latest Clean Backup#

# Using AWS CLI
aws s3 cp s3://neotoma-remote-store/neotoma-clean-latest.tar.gz .

Troubleshooting#

Common Issues#

Snapshot Creation Fails:

  • Verify IAM permissions include rds:CreateDBSnapshot
  • Check that instance is in "available" state
  • Ensure snapshot name doesn't already exist

Cannot Drop Holding Tank:

  • Verify connection to postgres database (not neotoma or neotomaholdingtank)
  • Check for active connections using pg_stat_activity
  • Ensure user has DROP DATABASE privileges
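To check the last point, note that DROP DATABASE requires the database owner or a superuser; a quick ownership query:

```sql
-- Who owns the holding tank, and who am I connected as?
SELECT datname, pg_get_userbyid(datdba) AS owner, current_user
FROM pg_database
WHERE datname = 'neotomaholdingtank';
```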

Clean Backup Job Fails:

  • Check AWS Batch job logs in CloudWatch
  • Verify ECR image is accessible
  • Confirm database connection credentials are current
  • Check S3 bucket permissions

Restore Takes Too Long:

  • Expected for large databases
  • Monitor with pg_stat_activity
  • Check disk I/O and available storage

Additional Resources#

📝 Information Gaps: The following sections need additional documentation:

  • Local Database Replication: Script for maintaining local development copies
  • Large Spatial Tables: Import procedures for hydralakes, icesheets, and globalmammals
  • Detailed Restoration Procedures: Step-by-step AWS snapshot restoration
  • Permission Requirements: Specific IAM policies and database roles needed
  • Monitoring Dashboard: AWS CloudWatch metrics for backup health

Document Maintenance#

Last Updated: [Date to be added]
Document Owner: [To be specified]
Review Schedule: Quarterly or after major infrastructure changes


This documentation is maintained as part of the Neotoma database project. For questions or clarifications, please contact the database administration team.