Neotoma Database Backup and Restoration Guide#

Overview#

This guide documents the backup and restoration procedures for the Neotoma database system. The Neotoma project maintains multiple backup strategies to ensure data safety and support different operational needs. Understanding when and how to use each method is critical for maintaining data integrity during major operations such as bulk uploads or system maintenance.

Backup Strategies#

The Neotoma database system implements three distinct backup approaches, each designed for specific use cases:

1. AWS Console Server Snapshots#

Full server-level backups created through the AWS RDS console. These snapshots capture the entire database server instance, including all databases (Neotoma proper, the Neotoma holding tank, and system databases).

Best used for:

  • Before major operations (bulk uploads, major schema changes)
  • Complete disaster recovery scenarios
  • Creating point-in-time recovery points

2. SQL-Based Holding Tank Restoration#

A targeted restoration process that rebuilds the holding-tank version of Neotoma from the production Neotoma database using SQL commands.

Best used for:

  • Refreshing the staging/holding tank environment
  • Testing workflows with current production data
  • Resetting the holding tank to a known state

3. Clean Database Repository (Automated Public Snapshots)#

An automated monthly process that creates sanitized database snapshots for public distribution. This process removes sensitive information (passwords, personal contact details) while maintaining data structure and scientific content.

Best used for:

  • Public data distribution
  • Setting up external development environments
  • Providing snapshots for research collaborations

Database Architecture Context#

The Neotoma database server hosts three primary databases:

  • neotoma - The production database containing validated, published data
  • neotomaholdingtank - The staging database for workflow development, data validation, and processing
  • postgres - The system database (required for certain administrative operations)

Understanding this architecture is important because different backup methods interact with these databases in different ways.
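This inventory can be confirmed directly from any psql session on the server (assuming the database names listed above):

```sql
-- List the non-template databases on the instance;
-- expect neotoma, neotomaholdingtank, and postgres
SELECT datname
FROM pg_database
WHERE datistemplate = false
ORDER BY datname;
```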


Method 1: AWS Console Server Snapshots#

Purpose#

Server snapshots create a complete backup of the entire RDS database instance. This is the most comprehensive backup method and includes all databases, configurations, and system tables.

Prerequisites#

  • Access to the AWS Console
  • Appropriate IAM permissions for RDS operations
  • Connection to the correct AWS account/region (US East 2)

Creating a Server Snapshot#

Step 1: Access RDS Console#

  1. Log into the AWS Console
  2. Navigate to RDS (you can search for "RDS" or find it under "Database" services)
  3. Select Aurora and RDS from the services menu
  4. If you've starred this service, it will appear in your favorites bar

Step 2: Select the Database Instance#

  1. Click on Databases in the left navigation menu
  2. Select the neotoma-private instance (or your specific instance name)
  3. You should see the instance status, configuration details, and monitoring information

Step 3: Navigate to Snapshot Options#

  1. Click on the Maintenance & backups tab
  2. Scroll down to the Snapshots section
  3. You will see a list of existing manual and automated snapshots
  4. On the right side, you'll see action buttons: Restore, Delete, and Take snapshot

Step 4: Create the Snapshot#

  1. Click Take snapshot
  2. Configure snapshot settings:
    • Snapshot type: DB instance
    • DB instance identifier: neotoma-private (should auto-populate)
    • Snapshot name: Use a descriptive name following the naming convention
      • Examples: before-node-upload, before-sisol, pre-bulk-upload-2024-01
  3. Click Take snapshot

Step 5: Monitor Snapshot Progress#

  • The snapshot will appear in the snapshots list with a status indicator
  • You can monitor progress in the RDS console under Snapshots > Manual snapshots
  • Snapshot creation time varies based on database size

⚠️ Important: Database performance may be slightly impacted during snapshot creation due to increased I/O operations. Plan snapshots during lower-traffic periods when possible.

Viewing Existing Snapshots#

  1. From the RDS main menu, select Snapshots
  2. Choose Manual snapshots to see user-created backups
  3. Each snapshot displays:
    • Snapshot name
    • Creation date/time
    • Database instance
    • Size
    • Status

Restoring from a Server Snapshot#

⚠️ CRITICAL WARNING: Restoring from a snapshot is a significant operation. The restore process creates a NEW database instance - it does not overwrite the existing one. Ensure you understand the implications before proceeding.

Restoration Process#

  1. Navigate to RDS > Snapshots > Manual snapshots
  2. Select the snapshot you want to restore from
  3. Click Actions > Restore snapshot

📝 Note: The specific restoration steps and configuration options require additional documentation. DO NOT click through the restoration process without clear instructions, as this will immediately begin provisioning a new database instance.

What Gets Restored#

A server snapshot restoration includes:

  • ✅ All databases (neotoma, neotomaholdingtank, postgres)
  • ✅ All user accounts and permissions
  • ✅ All data as of the snapshot creation time
  • ✅ Database configuration settings
  • ✅ Security groups and network settings

Best Practices#

Naming Conventions:

  • Use descriptive, date-stamped names: before-[operation]-YYYY-MM-DD
  • Include the reason for the snapshot: pre-bulk-upload, before-schema-change
  • Keep names concise but meaningful

When to Create Snapshots:

  • Before any bulk data upload operation
  • Prior to schema modifications
  • Before major system updates or maintenance
  • After successful completion of major operations (for rollback points)

Snapshot Management:

  • Review and delete old snapshots periodically to manage storage costs
  • Keep critical milestone snapshots for longer periods
  • Document snapshot purposes for team reference

Additional Options#

Export to S3: Snapshots can also be exported to S3 for long-term archival or cross-region backup. This option is available through the Actions menu when viewing a snapshot.


Method 2: SQL-Based Holding Tank Restoration#

Purpose#

This method provides a quick way to restore the neotomaholdingtank database using the production neotoma database as a template. This is useful for refreshing the staging environment with current production data.

Prerequisites#

  • Access to the PostgreSQL command line (psql) or a database client
  • Database connection credentials
  • IMPORTANT: Connection to the postgres system database (not neotoma or neotomaholdingtank)

Process Overview#

The restoration process involves:

  1. Disconnecting all active connections to the holding tank database
  2. Dropping the existing holding tank database
  3. Creating a new holding tank database using the production database as a template
  4. Verifying the restoration

Understanding the Connection Requirement#

Before running the restoration, you must connect to the postgres system database. This is because:

  • You cannot drop a database while connected to it
  • The template operation requires a neutral connection point
  • Administrative operations need to be run from the system database context
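For example, attempting the drop from a session while other users remain connected fails immediately (illustrative error, assuming the standard database name):

```sql
DROP DATABASE neotomaholdingtank;
-- ERROR:  database "neotomaholdingtank" is being accessed by other users
```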

The Restoration SQL Script#

📝 Note: The complete SQL script file is located in the Neotoma SQL folder repository. The file contains the restoration function and should be reviewed before execution.

Key Steps in the Script#

Step 1: Terminate Active Connections

The script queries pg_stat_activity to identify and terminate all connections to the holding tank database. This is necessary because PostgreSQL cannot drop a database with active connections.
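A minimal sketch of that termination query, assuming the standard holding tank name (the repository script may differ in detail):

```sql
-- Disconnect every session attached to the holding tank,
-- excluding the current session
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'neotomaholdingtank'
  AND pid <> pg_backend_pid();
```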


Step 2: Drop and Recreate

Once connections are terminated, the script:

  • Drops the existing neotomaholdingtank database
  • Creates a new neotomaholdingtank database using neotoma as the template
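These two steps can be sketched as follows, run from the postgres system database (details in the repository script may differ):

```sql
-- Must be executed while connected to the postgres system database
DROP DATABASE IF EXISTS neotomaholdingtank;
CREATE DATABASE neotomaholdingtank WITH TEMPLATE neotoma;
```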

Step 3: Template Copy

Using PostgreSQL's CREATE DATABASE ... WITH TEMPLATE command, the script efficiently copies:

  • Complete schema structure
  • All table data
  • Indexes and constraints
  • User permissions
  • Sequences and other database objects

Execution Instructions#

⚠️ Information Gap: Detailed execution instructions require the actual SQL script file. The following outlines the general approach:

  1. Connect to the postgres database
psql -h [hostname] -U [username] -d postgres
  2. Run the restoration script
    • Option A: Execute the SQL file directly
    • Option B: Run the stored restoration function (if created)
  3. Monitor execution
    • Watch for error messages
    • Confirm successful completion
  4. Verify restoration
    • Check row counts in key tables
    • Verify recent data is present
    • Test application connectivity
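The verification step can start with a cheap sanity check that needs no specific table names (run against the restored neotomaholdingtank; reltuples holds planner estimates, which survive a template copy):

```sql
-- Approximate row counts for the largest ordinary tables
SELECT relname, reltuples::bigint AS approx_rows
FROM pg_class
WHERE relkind = 'r'
ORDER BY reltuples DESC
LIMIT 10;
```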

When to Use This Method#

Appropriate scenarios:

  • Refreshing test/staging environment with production data
  • Setting up the holding tank after a server restore
  • Resetting the holding tank to a known good state
  • Preparing for data validation workflows

Not appropriate for:

  • Production database recovery (use server snapshots instead)
  • Preserving specific holding tank states (consider a holding tank-specific backup)

Important Considerations#

Active Connections:

The script must terminate active connections to succeed. This means:

  • Any users connected to the holding tank will be disconnected
  • Running API processes connecting to the holding tank will fail
  • Coordinate with team members before running

Execution Time:

  • Duration depends on database size
  • Typically faster than a dump/restore operation
  • Still requires time for the template copy operation

Database Locking:

During the operation, the source database (neotoma) cannot have any other active connections: PostgreSQL refuses to copy a template database that is being accessed by other users, so the operation will fail rather than wait. Plan accordingly for production systems.


Method 3: Clean Database Repository (Automated Public Snapshots)#

Purpose#

The Clean Database Repository system creates monthly sanitized snapshots of the Neotoma database for public distribution. This automated process removes sensitive personal information while preserving all scientific data and structure, making it safe for public download and external development use.

System Architecture#

This backup method uses AWS infrastructure to automate the entire process:

  • AWS Batch: Executes the backup job with extended runtime support (beyond Lambda's 15-minute limit)
  • AWS Fargate: Schedules and manages job execution
  • Docker Container: Provides isolated environment with PostgreSQL tools
  • S3 Storage: Hosts the resulting backup files for public access

Automated Schedule#

The clean backup process runs automatically:

  • Schedule: 2:00 AM on the 1st day of each month
  • Trigger: AWS Fargate scheduled task (cron-based)
  • Duration: Varies based on database size (typically 15-45 minutes)

What Gets Sanitized#

The cleaning process protects sensitive information by overwriting:

In the stewards table:

  • ✅ Usernames → Random 10-character strings
  • ✅ Passwords → Random 10-character strings
  • ✅ Email addresses → Anonymized
  • ✅ Physical addresses → Anonymized
  • ✅ Phone numbers → Anonymized
  • ✅ Fax numbers → Anonymized

What is preserved:

  • ✅ All scientific data tables
  • ✅ Complete schema structure
  • ✅ Relationships and foreign keys
  • ✅ All non-personal metadata

What is excluded (for performance):

  • ts.hydralakes - Large spatial dataset
  • ts.icesheets - Large spatial dataset
  • ts.globalmammals - Large spatial dataset

📝 Note: The excluded tables are large spatial datasets primarily used for mapping. Documentation for importing these tables is maintained separately.
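After restoring a clean snapshot, you can confirm that these tables came through schema-only (a hedged check, assuming the ts schema names above; --exclude-table-data keeps the table definitions but drops their rows):

```sql
-- Each of these should exist but contain zero rows in a clean snapshot
SELECT count(*) AS hydralakes_rows    FROM ts.hydralakes;
SELECT count(*) AS icesheets_rows     FROM ts.icesheets;
SELECT count(*) AS globalmammals_rows FROM ts.globalmammals;
```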

Accessing Clean Backups#

S3 Bucket Location:

  • Bucket name: neotoma-remote-store
  • Path: neotoma-remote-store/[files]

Available Files:

  1. Latest Snapshot
    • Filename: neotoma-clean-latest.tar.gz
    • Always points to the most recent backup
    • Recommended for most users
  2. Dated Archives
    • Format: neotoma-clean-MMDD.tar.gz
    • Examples: neotoma-clean-0105.tar.gz, neotoma-clean-0205.tar.gz
    • Retained for historical access
    • One file per month

Using a Clean Backup#

Downloading#

📝 Information Gap: Specific download URLs or AWS CLI commands needed here.

# Example download command (requires AWS CLI)
aws s3 cp s3://neotoma-remote-store/neotoma-clean-latest.tar.gz .

Restoring Locally#

  1. Extract the archive
tar -xzf neotoma-clean-latest.tar.gz
  2. Restore to PostgreSQL
pg_restore -h localhost -U postgres -d neotoma [extracted-file]
  3. Install required extensions
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE EXTENSION IF NOT EXISTS postgis_topology;
-- Additional extensions as needed

Technical Implementation Details#

🔧 For System Administrators: The following section describes the internal workings of the automated backup system.

Docker Container Workflow#

The backup process runs in a custom Docker container that:

  1. Installs dependencies
    • PostgreSQL client tools
    • PostGIS extensions
    • OpenSSH, curl, netcat utilities
  2. Sets up directory structure

/app/        - Application scripts
/archives/   - Temporary storage for dumps
/assets/     - Supporting files
/logs/       - Execution logs

  3. Executes backup sequence
    • Starts PostgreSQL service
    • Waits for database readiness (with timeout logic)
    • Runs the scrubbing script
    • Creates compressed archive
    • Uploads to S3
    • Cleans up temporary files

Scrubbing Process Details#

The scrub_database.sh script performs the following operations:

Phase 1: Initial Dump

# Dumps the database, excluding data from the large spatial tables and the cron schema
pg_dump -Fc -h [host] -U [user] -d [database] \
  --exclude-table-data=ts.hydralakes \
  --exclude-table-data=ts.icesheets \
  --exclude-table-data=ts.globalmammals \
  --exclude-schema=cron \
  -f temp_dump.backup

Phase 2: Local Restoration

  • Drops and recreates local database
  • Installs PostGIS extensions
  • Restores from temporary dump

Phase 3: Data Sanitization

-- Overwrites sensitive steward information
UPDATE stewards 
SET username = [random_string],
    password = [random_string],
    email = [anonymized],
    address = [anonymized],
    phone = [anonymized],
    fax = [anonymized];
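One way to realize those placeholders in plain SQL (an illustrative sketch; the actual scrub_database.sh may generate its values differently):

```sql
-- Random 10-character strings for credentials; fixed dummy
-- values for contact fields (all values illustrative).
-- random() is volatile, so each row gets a different string.
UPDATE stewards
SET username = substr(md5(random()::text), 1, 10),
    password = substr(md5(random()::text), 1, 10),
    email    = 'anonymized',
    address  = 'anonymized',
    phone    = 'anonymized',
    fax      = 'anonymized';
```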

Phase 4: Final Export and Upload

  • Creates final compressed dump
  • Uploads to S3 with two names:
    • neotoma-clean-latest.tar.gz (overwrites previous)
    • neotoma-clean-MMDD.tar.gz (date-stamped archive)

Monitoring and Logging#

Log Files:

The container generates detailed logs at each step, viewable through:

  • AWS Batch job logs
  • CloudWatch Logs (if configured)

Log Content:

  • 🔍 User context and permissions
  • 🔍 PostgreSQL data directory location
  • 🔍 Database connection status
  • 🔍 Dump progress indicators
  • 🔍 Sanitization confirmation
  • 🔍 Upload status
  • 🔍 Cleanup completion

Monitoring the Job:

  1. Access AWS Batch console
  2. Navigate to Jobs
  3. Find the scheduled backup job
  4. View logs for execution details

Build and Deployment#

📝 For Developers: Updates to the backup system

The backup system is maintained in the clean_database GitHub repository.

Making Changes:

  1. Modify scripts in the repository:
    • Dockerfile - Container definition
    • scrub_database.sh - Main backup logic
    • batch_entrypoint.sh - Container entry point
    • connect_database.sh - Database connection utilities
  2. Build and push to ECR:

# Run the build script
./build_and_push.sh

This script:

  • Builds the Docker image
  • Tags with appropriate version
  • Pushes to AWS Elastic Container Registry (ECR)
  • Updates the Batch job definition

  3. Test the job:
    • Manually trigger through AWS Batch console
    • Monitor logs for errors
    • Verify S3 upload
    • Test restoration of generated backup

GitHub Actions:

📝 Information Gap: Clarification needed on whether GitHub Actions are used for automated deployment of changes to the backup system.


Choosing the Right Backup Method#

| Scenario | Recommended Method | Rationale |
| --- | --- | --- |
| Before bulk data upload | AWS Server Snapshot | Full recovery point including all databases |
| Before schema changes | AWS Server Snapshot | Complete rollback capability |
| Refresh staging environment | SQL Holding Tank Restore | Quick, targeted update of test data |
| External developer setup | Clean Database Repository | Safe, sanitized data with no sensitive info |
| Monthly public release | Clean Database Repository | Automated, consistent public distribution |
| Disaster recovery | AWS Server Snapshot | Complete system restoration |
| Local development setup | Clean Database Repository | No sensitive data, manageable size |

Quick Reference Commands#

Check Database Connections#

-- View active connections to a specific database
SELECT * FROM pg_stat_activity 
WHERE datname = 'neotomaholdingtank';

Connect to System Database#

# Using psql
psql -h [hostname] -U [username] -d postgres

Download Latest Clean Backup#

# Using AWS CLI
aws s3 cp s3://neotoma-remote-store/neotoma-clean-latest.tar.gz .

Troubleshooting#

Common Issues#

Snapshot Creation Fails:

  • Verify IAM permissions include rds:CreateDBSnapshot
  • Check that instance is in "available" state
  • Ensure snapshot name doesn't already exist

Cannot Drop Holding Tank:

  • Verify connection to postgres database (not neotoma or neotomaholdingtank)
  • Check for active connections using pg_stat_activity
  • Ensure user has DROP DATABASE privileges
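To check the last point, note that DROP DATABASE requires the database owner or a superuser; a quick ownership query:

```sql
-- Who owns the holding tank, and who am I connected as?
SELECT datname, pg_get_userbyid(datdba) AS owner, current_user
FROM pg_database
WHERE datname = 'neotomaholdingtank';
```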

Clean Backup Job Fails:

  • Check AWS Batch job logs in CloudWatch
  • Verify ECR image is accessible
  • Confirm database connection credentials are current
  • Check S3 bucket permissions

Restore Takes Too Long:

  • Expected for large databases
  • Monitor with pg_stat_activity
  • Check disk I/O and available storage

Additional Resources#

📝 Information Gaps: The following sections need additional documentation:

  • Local Database Replication: Script for maintaining local development copies
  • Large Spatial Tables: Import procedures for hydralakes, icesheets, and globalmammals
  • Detailed Restoration Procedures: Step-by-step AWS snapshot restoration
  • Permission Requirements: Specific IAM policies and database roles needed
  • Monitoring Dashboard: AWS CloudWatch metrics for backup health

Document Maintenance#

Last Updated: [Date to be added]
Document Owner: [To be specified]
Review Schedule: Quarterly or after major infrastructure changes


This documentation is maintained as part of the Neotoma database project. For questions or clarifications, please contact the database administration team.