System Recovery and Emergency Procedures
Danger
EMERGENCY GUIDE: This document provides critical recovery procedures for system failures. Bookmark this page for quick access during emergencies.
This guide covers emergency recovery procedures, system restart protocols, backup and restore operations, and troubleshooting for critical failures.
—
Quick Recovery Steps
When the System is Down
Fast Recovery Checklist (5 minutes):
# 1. Check system health and capture logs first
# ("down" removes the containers, and their logs with them)
docker ps -a # See all containers
docker logs lustores_app # Check app logs
docker logs lustores_db # Check database logs
docker logs lustores_nginx # Check nginx logs
# 2. Stop all services gracefully
docker-compose -f docker-compose.prod.yml down
# 3. Restart services in proper order
# DATABASE FIRST (wait 30 seconds for init)
docker-compose -f docker-compose.prod.yml up -d db
sleep 30
# APP AND REDIS (wait 15 seconds)
docker-compose -f docker-compose.prod.yml up -d app redis
sleep 15
# NGINX LAST
docker-compose -f docker-compose.prod.yml up -d nginx
# 4. Verify health
curl http://localhost/health
curl https://yourdomain.com/health
If this doesn’t work: Proceed to specific failure scenarios below.
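The fixed `sleep` delays in the checklist are estimates of how long each service needs to start. A readiness poll is one alternative. The sketch below is a minimal, hypothetical helper (`wait_for` is not part of the LUStores tooling) that retries any command until it succeeds or the attempts run out. The commented usage assumes the database image ships `pg_isready`, as the official PostgreSQL image does.

```shell
#!/bin/bash
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it exits 0 or ATTEMPTS is exhausted.
# Returns 0 on success, 1 if the command never succeeded.
wait_for() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (commented out so that sourcing this file has no side effects):
# docker-compose -f docker-compose.prod.yml up -d db
# wait_for 30 2 docker exec lustores_db pg_isready -U postgres || echo "db not ready"
```

With 30 attempts and a 2-second delay, the poll gives up after roughly a minute, but returns as soon as the database accepts connections.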
Health Check Commands
Quick Status Check:
# Check which services are running
docker-compose -f docker-compose.prod.yml ps
# Check Docker daemon
sudo systemctl status docker
# Check disk space (common issue)
df -h
# Check memory usage
free -h
# Check logs for errors
docker-compose -f docker-compose.prod.yml logs --tail=50
Expected Healthy Output:
NAME STATUS PORTS
lustores_db Up 5 minutes 0.0.0.0:5432->5432/tcp
lustores_app Up 4 minutes 0.0.0.0:5000->5000/tcp
lustores_redis Up 4 minutes 0.0.0.0:6379->6379/tcp
lustores_nginx Up 4 minutes 0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp
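Comparing the running containers against the expected list by eye is easy to get wrong under pressure. As a rough sketch, the hypothetical function below (`missing_containers` is an illustrative name, not an existing tool) reads container names from standard input and prints any expected name that is absent. It assumes the container names shown in the healthy output above.

```shell
#!/bin/bash
# missing_containers NAME...
# Reads running container names on stdin, one per line, and prints each
# expected NAME that is absent. Prints nothing when all are present.
missing_containers() {
  local running name
  running="$(cat)"
  for name in "$@"; do
    # -x matches whole lines only, -F treats the name as a literal string
    if ! printf '%s\n' "$running" | grep -qxF -- "$name"; then
      printf '%s\n' "$name"
    fi
  done
}

# Example (commented out so that sourcing this file has no side effects):
# docker ps --format '{{.Names}}' | \
#   missing_containers lustores_db lustores_app lustores_redis lustores_nginx
```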
—
Common Failure Scenarios
Scenario 1: Database Won’t Start
- Symptoms:
lustores_db container exits immediately
Logs show: “database files are incompatible” or “could not open file”
App can’t connect to database
Diagnosis:
# Check database logs
docker logs lustores_db
# Common error messages:
# - "FATAL: database files are incompatible with server"
# - "FATAL: could not create shared memory segment"
# - "data directory has wrong ownership"
Solutions:
Solution A: Volume Corruption (if logs show incompatibility):
# DANGER: This deletes ALL data. Restore from backup after.
docker-compose -f docker-compose.prod.yml down
docker volume rm lustores_postgres_data
# Restore from backup (see Database Backup section below)
docker-compose -f docker-compose.prod.yml up -d db
sleep 30
cat backup-latest.sql | docker exec -i lustores_db psql -U postgres inventory
Solution B: Permissions Issue:
# Fix volume permissions (999 is the postgres UID in the Debian-based official image; Alpine-based images use 70)
docker-compose -f docker-compose.prod.yml down
sudo chown -R 999:999 /var/lib/docker/volumes/lustores_postgres_data
docker-compose -f docker-compose.prod.yml up -d db
Solution C: PostgreSQL Version Mismatch:
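# Check the running server version (only works while the container is up)
docker exec lustores_db psql -U postgres -c "SELECT version();"
# If the container will not start, read the version that created the data directory
# (assumes the volume is mounted at the image's default data path, so PG_VERSION sits at the volume root)
docker run --rm -v lustores_postgres_data:/data alpine cat /data/PG_VERSION
# If the versions differ, upgrade using pg_upgrade
# See PostgreSQL upgrade documentation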
- Prevention:
Daily backups: Automated backup script (see Backup section)
Monitor disk space: Ensure adequate space for database growth
Version pinning: Lock PostgreSQL version in docker-compose.prod.yml
Scenario 2: App Container Crash-Looping
- Symptoms:
lustores_app container restarts repeatedly
Logs show connection errors or startup failures
HTTP 502 Bad Gateway from nginx
Diagnosis:
# Watch app logs in real-time
docker logs -f lustores_app
# Common error patterns:
# - "ECONNREFUSED" → Can't connect to database/redis
# - "MODULE_NOT_FOUND" → Missing dependencies
# - "EADDRINUSE" → Port already in use
# - "Segmentation fault" → Node.js crash (serious)
Solutions:
Solution A: Database Connection Failure:
# Verify database is running and healthy
docker exec lustores_db psql -U postgres -c "SELECT 1;"
# Check DATABASE_URL environment variable
docker-compose -f docker-compose.prod.yml config | grep DATABASE_URL
# Ensure correct format:
# DATABASE_URL=postgresql://postgres:PASSWORD@db:5432/inventory
Solution B: Missing Environment Variables:
# Check .env.prod file exists and is complete
cat .env.prod
# Required variables:
# - DATABASE_URL
# - SESSION_SECRET
# - JWT_SECRET
# - DB_PASSWORD
# - DOMAIN
# - EMAIL
# Restart app after fixing .env.prod
docker-compose -f docker-compose.prod.yml up -d app
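Printing `.env.prod` with `cat` also prints every secret to the terminal. A possible alternative is to check only that each required variable is present and non-empty. The function below is a hypothetical sketch (`missing_env_vars` is an illustrative name); it assumes plain `NAME=value` lines without an `export` prefix.

```shell
#!/bin/bash
# missing_env_vars ENV_FILE NAME...
# Prints each NAME that has no non-empty "NAME=value" line in ENV_FILE.
# Values are never printed.
missing_env_vars() {
  local env_file="$1" name
  shift
  for name in "$@"; do
    if ! grep -Eq "^${name}=.+" "$env_file"; then
      printf '%s\n' "$name"
    fi
  done
}

# Example (commented out so that sourcing this file has no side effects):
# missing_env_vars .env.prod DATABASE_URL SESSION_SECRET JWT_SECRET DB_PASSWORD DOMAIN EMAIL
```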
Solution C: Dependency Issue:
# Rebuild app image with fresh dependencies
docker-compose -f docker-compose.prod.yml build --no-cache app
docker-compose -f docker-compose.prod.yml up -d app
Solution D: Port Conflict:
# Check if port 5000 is in use
sudo lsof -i :5000
# Kill conflicting process or change app port
- Prevention:
Health checks: Monitor /health endpoint
Structured logging: Review logs regularly for warnings
Test deployments: Staging environment before production
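The error patterns listed under Diagnosis for this scenario map directly onto its solutions, so a first pass over the logs can be automated. The function below is only a sketch of that mapping (`suggest_app_fix` is a hypothetical name); it matches the literal strings from the Diagnosis list and knows nothing about other failure modes.

```shell
#!/bin/bash
# suggest_app_fix
# Reads app logs on stdin and prints one hint for every known error
# pattern from Scenario 2 that appears anywhere in the input.
suggest_app_fix() {
  local logs
  logs="$(cat)"
  case "$logs" in *ECONNREFUSED*) echo "Solution A: check the database/redis connection" ;; esac
  case "$logs" in *MODULE_NOT_FOUND*) echo "Solution C: rebuild the image with fresh dependencies" ;; esac
  case "$logs" in *EADDRINUSE*) echo "Solution D: resolve the port conflict" ;; esac
  case "$logs" in *"Segmentation fault"*) echo "Escalate: Node.js crash (serious)" ;; esac
  return 0
}

# Example (commented out so that sourcing this file has no side effects):
# docker logs --tail=200 lustores_app 2>&1 | suggest_app_fix
```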
Scenario 3: Nginx 502 Bad Gateway
- Symptoms:
Website returns “502 Bad Gateway”
Nginx is running but can’t reach app
App container is healthy but unreachable
Diagnosis:
# Check nginx logs
docker logs lustores_nginx
# Common errors:
# - "connect() failed (111: Connection refused)"
# - "no resolver defined to resolve app"
# - "upstream timed out"
Solutions:
Solution A: DNS Resolution Issue (most common with Watchtower):
# Nginx can't resolve "app" hostname after Watchtower update
# FIX: Restart nginx to refresh DNS cache
docker-compose -f docker-compose.prod.yml restart nginx
Solution B: App Not Ready:
# App still starting up
# Wait 30 seconds and retry
sleep 30
curl http://localhost/health
Solution C: Nginx Configuration Error:
# Test nginx configuration
docker exec lustores_nginx nginx -t
# If config invalid, check nginx.conf
docker exec lustores_nginx cat /etc/nginx/nginx.conf
# Fix configuration and reload
docker-compose -f docker-compose.prod.yml restart nginx
Solution D: Network Issue:
# Check if app and nginx on same Docker network
docker network inspect lustores_network
# Both should be listed in "Containers"
- Prevention:
Dynamic DNS: nginx.conf already configured with resolver 127.0.0.11 for Docker DNS
Health checks: Nginx waits for app to be healthy before routing
Monitoring: Regular health endpoint checks
Scenario 4: SSL Certificate Expired
- Symptoms:
Browser shows “Your connection is not private”
Certificate expired warning
HTTPS doesn’t work, HTTP does
Diagnosis:
# Check certificate expiry
docker exec lustores_certbot certbot certificates
# Output shows:
# Expiry Date: 2025-01-01 (EXPIRED)
Solutions:
Solution A: Manual Renewal:
# Force certificate renewal
docker-compose -f docker-compose.prod.yml run --rm certbot certonly \
--webroot \
--webroot-path=/var/www/certbot \
--email your-email@university.edu \
--agree-tos \
--no-eff-email \
--force-renewal \
-d yourdomain.com
# Reload nginx to use new certificate
docker-compose -f docker-compose.prod.yml exec nginx nginx -s reload
Solution B: Fix Auto-Renewal:
# Certbot auto-renewal runs every 12 hours via certbot service
# Check certbot service is running
docker-compose -f docker-compose.prod.yml ps certbot
# Check certbot logs
docker logs lustores_certbot
# Ensure certbot service is configured for renewal
docker-compose -f docker-compose.prod.yml restart certbot
Solution C: Domain Verification Failed:
# Let's Encrypt needs to verify domain ownership via HTTP
# Ensure /.well-known/acme-challenge/ accessible
# Test HTTP access (nginx must allow this path)
curl http://yourdomain.com/.well-known/acme-challenge/test
# Should return 404 (not 502 or connection refused)
- Prevention:
Certbot auto-renewal: Already configured in docker-compose.prod.yml
Monitoring: Check certificate expiry monthly
Alerts: Set up reminder 30 days before expiry
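A 30-day reminder needs the number of days left on the certificate. The sketch below computes that from an expiry date string. `days_until` is a hypothetical helper, it relies on GNU `date -d` (so it assumes a Linux host), and the certificate path in the commented usage is an assumption based on Certbot's usual layout.

```shell
#!/bin/bash
# days_until EXPIRY_DATE [NOW_EPOCH]
# Prints the whole number of days between NOW_EPOCH (default: current time)
# and EXPIRY_DATE. The result is negative once the date has passed.
# Requires GNU date for the -d option.
days_until() {
  local expiry_epoch now_epoch
  expiry_epoch="$(date -u -d "$1" +%s)" || return 2
  now_epoch="${2:-$(date -u +%s)}"
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Example (commented out; the certificate path is an assumption):
# expiry=$(openssl x509 -enddate -noout \
#   -in /etc/letsencrypt/live/yourdomain.com/fullchain.pem | cut -d= -f2)
# [ "$(days_until "$expiry")" -lt 30 ] && echo "Certificate expires in under 30 days"
```

The optional second argument exists mainly so the calculation can be checked against a fixed point in time.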
Scenario 5: Watchtower Updated and Broke Something
- Symptoms:
System was working, suddenly broken after Watchtower update
New Docker image deployed with bugs
Need to roll back to previous version
Diagnosis:
# Check Watchtower logs
docker logs lustores_watchtower
# Find recent update timestamp
# Check app logs for errors after that time
docker logs lustores_app --since="2025-01-29T10:00:00"
Solutions:
Solution A: Roll Back to Previous Image:
# 1. Stop current containers
docker-compose -f docker-compose.prod.yml down
# 2. List recent image versions
docker images lustores/app --format "{{.ID}}\t{{.CreatedAt}}\t{{.Tag}}"
# Output (once a newer "latest" is pulled, the previous image shows the tag <none>):
# abc123def456 2025-01-29 10:00:00 latest ← Current (broken)
# xyz789ghi012 2025-01-28 14:30:00 <none> ← Previous (working)
# 3. Tag previous image as latest
docker tag xyz789ghi012 lustores/app:latest
# 4. Restart services
docker-compose -f docker-compose.prod.yml up -d
Solution B: Disable Watchtower Temporarily:
# Stop Watchtower to prevent further updates
docker-compose -f docker-compose.prod.yml stop watchtower
# Fix the issue manually
# Re-enable Watchtower when ready
docker-compose -f docker-compose.prod.yml start watchtower
Solution C: Pin Specific Image Version:
# Edit docker-compose.prod.yml
# Change:
# image: lustores/app:latest
# To:
# image: lustores/app:v1.0.0 # A specific version tag, or lustores/app@sha256:<digest>
# Restart
docker-compose -f docker-compose.prod.yml up -d app
- Prevention:
Staging environment: Test updates before production
Manual updates: Disable Watchtower, update manually after testing
Image tagging: Use semantic versioning (v1.0.0, v1.0.1) instead of latest
Rollback plan: Always keep previous 3 images
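Picking the previous image ID out of the listing in Solution A can be scripted. The function below is a sketch under two assumptions: the input is tab-separated `ID` and `CreatedAt` columns, and all timestamps share one time zone so that sorting them as text matches chronological order. `previous_image_id` is a hypothetical name.

```shell
#!/bin/bash
# previous_image_id
# Reads "ID<TAB>CreatedAt" lines on stdin, sorts them newest first by the
# timestamp column, and prints the ID of the second-newest image.
# Prints nothing if fewer than two images are listed.
previous_image_id() {
  sort -t "$(printf '\t')" -k2,2r | awk -F '\t' 'NR == 2 { print $1 }'
}

# Example (commented out so that sourcing this file has no side effects):
# prev=$(docker images lustores/app --format '{{.ID}}\t{{.CreatedAt}}' | previous_image_id)
# [ -n "$prev" ] && docker tag "$prev" lustores/app:latest
```

If Watchtower is configured to clean up old images, there may be no previous image left to roll back to, in which case the function prints nothing.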
Scenario 6: Disk Space Full
- Symptoms:
Services failing randomly
Database can’t write
Logs show “No space left on device”
Diagnosis:
# Check disk usage
df -h
# Output shows:
# /dev/sda1 50G 49G 0G 100% /
# Check Docker disk usage
docker system df
Solutions:
Solution A: Clean Old Docker Resources:
# Remove stopped containers
docker container prune -f
# Remove unused images (keep recent ones)
docker image prune -a --filter "until=168h" # Older than 7 days
# Remove unused volumes (CAREFUL - may delete data)
docker volume prune -f
# Remove unused networks
docker network prune -f
# Full cleanup (DANGEROUS - removes ALL unused resources)
docker system prune -a --volumes -f
Solution B: Clean Application Logs:
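# Truncate Docker container logs
# (run the glob inside a root shell; a non-root shell cannot read /var/lib/docker to expand it)
sudo sh -c 'truncate -s 0 /var/lib/docker/containers/*/*-json.log'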
# Or limit log size in docker-compose.prod.yml:
# logging:
# options:
# max-size: "10m"
# max-file: "3"
Solution C: Expand Disk:
# For cloud VMs: Expand disk via provider console, then:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
- Prevention:
Monitoring: Set up disk space alerts at 80% usage
Log rotation: Configure Docker log limits
Regular cleanup: Weekly docker system prune cron job
—
Database Backup and Restore
Creating Backups
Manual Backup (run before major changes):
# Create backup with timestamp
docker exec lustores_db pg_dump -U postgres inventory > backup-$(date +%Y%m%d-%H%M%S).sql
# With compression (recommended for large databases)
docker exec lustores_db pg_dump -U postgres inventory | gzip > backup-$(date +%Y%m%d-%H%M%S).sql.gz
# Verify backup created
ls -lh backup-*.sql.gz
Automated Daily Backups:
Create /root/scripts/backup-database.sh:
#!/bin/bash
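# Daily database backup script
# Exit on any error, including a failed pg_dump inside the pipeline,
# so a failed backup is never logged as completed
set -euo pipefail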
BACKUP_DIR="/backups/lustores"
DATE=$(date +%Y%m%d)
KEEP_DAYS=7
# Create backup directory
mkdir -p "$BACKUP_DIR"
# Create backup
docker exec lustores_db pg_dump -U postgres inventory | \
gzip > "$BACKUP_DIR/backup-$DATE.sql.gz"
# Delete backups older than KEEP_DAYS
find "$BACKUP_DIR" -name "backup-*.sql.gz" -mtime +$KEEP_DAYS -delete
# Log result
echo "$(date): Backup completed - backup-$DATE.sql.gz" >> /var/log/lustores-backup.log
Schedule with Cron:
# Edit crontab
sudo crontab -e
# Add daily backup at 2 AM
0 2 * * * /root/scripts/backup-database.sh
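A scheduled backup is only useful if the file it produces is readable. As a minimal sketch, the hypothetical function below (`verify_backup` is not part of the script above) checks that a compressed backup exists, is non-empty, and passes gzip's integrity test. It does not prove that the SQL inside restores cleanly; only a test restore does that.

```shell
#!/bin/bash
# verify_backup FILE
# Returns 0 if FILE exists, is non-empty, and is a valid gzip archive.
# Returns 1 and prints the reason to stderr otherwise.
verify_backup() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  if ! gzip -t "$file" 2>/dev/null; then
    echo "corrupt gzip archive: $file" >&2
    return 1
  fi
  return 0
}

# Example (commented out so that sourcing this file has no side effects):
# verify_backup "/backups/lustores/backup-$(date +%Y%m%d).sql.gz" || echo "Backup check failed"
```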
Restoring from Backup
Full Database Restore:
# 1. Stop application (prevents new writes)
docker-compose -f docker-compose.prod.yml stop app
# 2. Drop existing database (DANGER!)
docker exec lustores_db psql -U postgres -c "DROP DATABASE IF EXISTS inventory;"
# 3. Create fresh database
docker exec lustores_db psql -U postgres -c "CREATE DATABASE inventory;"
# 4. Restore from backup
gunzip < backup-20250129.sql.gz | docker exec -i lustores_db psql -U postgres inventory
# OR without compression:
cat backup-20250129.sql | docker exec -i lustores_db psql -U postgres inventory
# 5. Restart application
docker-compose -f docker-compose.prod.yml start app
Verify Restore:
# List tables with their sizes
docker exec lustores_db psql -U postgres inventory -c "\dt+"
# Check recent data
docker exec lustores_db psql -U postgres inventory -c "SELECT COUNT(*) FROM items;"
# Test application
curl http://localhost/health
Partial Restore (specific table):
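# Restore a single table from a backup
# pg_restore only reads custom-format dumps (created with pg_dump -Fc),
# not the plain .sql backups created above; the dump is streamed in via stdin
docker exec -i lustores_db pg_restore -U postgres -d inventory -t items < backup.dump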
Backup Best Practices
3-2-1 Rule:
- 3 copies of data (original + 2 backups)
- 2 different storage media (local disk + cloud storage)
- 1 off-site backup (cloud or remote server)
Test Restores Monthly:
- Verify backups are not corrupted
- Practice restore procedure
- Time how long restore takes
Retention Policy:
- Daily backups: Keep 7 days
- Weekly backups: Keep 4 weeks
- Monthly backups: Keep 12 months
Encryption (for sensitive data):
# Encrypt backup
docker exec lustores_db pg_dump -U postgres inventory | \
gzip | \
openssl enc -aes-256-cbc -salt -out backup-encrypted.sql.gz.enc
# Decrypt for restore
openssl enc -d -aes-256-cbc -in backup-encrypted.sql.gz.enc | \
gunzip | \
docker exec -i lustores_db psql -U postgres inventory
—
Complete System Rebuild
When All Else Fails
Nuclear Option: Full system rebuild from backup:
# 1. Backup current state (just in case)
docker exec lustores_db pg_dump -U postgres inventory > emergency-backup.sql
# 2. Stop and remove all containers
docker-compose -f docker-compose.prod.yml down
# 3. Remove all volumes (DELETES ALL DATA)
docker volume rm lustores_postgres_data lustores_redis_data
# 4. Pull fresh images
docker-compose -f docker-compose.prod.yml pull
# 5. Start services
docker-compose -f docker-compose.prod.yml up -d
# 6. Wait for database initialization
sleep 60
# 7. Restore from backup
cat backup-latest.sql | docker exec -i lustores_db psql -U postgres inventory
# 8. Verify system health
curl https://yourdomain.com/health
# 9. Test login and basic functionality
—
Emergency Contacts and Escalation
Contact Tree
- Level 1 - First Response (0-15 minutes):
Check this document for solutions
Attempt quick recovery steps
Review recent logs
- Level 2 - System Administrator (15-30 minutes):
Contact: IT Admin (admin@university.edu)
Escalate if: Unable to restore service, data corruption suspected
Provide: Logs, error messages, steps attempted
- Level 3 - Infrastructure Team (30-60 minutes):
Contact: Infrastructure Team (infrastructure@university.edu)
Escalate if: Hardware failure, network issues, disk failure
Provide: Full system diagnostics
- Level 4 - Vendor Support (1+ hours):
Contact: Cloud provider support (if cloud-hosted)
Escalate if: Platform-level issues, need vendor intervention
Critical Information to Collect
Before contacting support, gather:
Timeline:
- When did the issue start?
- What changed before the issue started?
- What error messages appeared?
Logs (last 100 lines; 2>&1 also captures the container's stderr stream):
docker logs --tail=100 lustores_app > app-logs.txt 2>&1
docker logs --tail=100 lustores_db > db-logs.txt 2>&1
docker logs --tail=100 lustores_nginx > nginx-logs.txt 2>&1
System State:
docker-compose -f docker-compose.prod.yml ps > containers-status.txt
df -h > disk-usage.txt
free -h > memory-usage.txt
Configuration:
- .env.prod file (REDACT SECRETS!)
- docker-compose.prod.yml version
- Recent changes (from git log or deployment records)
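Redacting `.env.prod` by hand is easy to get wrong when time is short. The filter below is a sketch of one way to do it: it blanks the value of any variable whose name contains `SECRET`, `PASSWORD`, `TOKEN`, `KEY`, or `DATABASE_URL`. The keyword list is an assumption drawn from the required variables in Scenario 2, so review the output before sending it to anyone.

```shell
#!/bin/bash
# redact_env
# Reads NAME=value lines on stdin and replaces the value with REDACTED
# for every NAME containing a sensitive keyword. Other lines pass through.
redact_env() {
  sed -E 's/^([A-Za-z0-9_]*(SECRET|PASSWORD|TOKEN|KEY|DATABASE_URL)[A-Za-z0-9_]*)=.*/\1=REDACTED/'
}

# Example (commented out so that sourcing this file has no side effects):
# redact_env < .env.prod > env-redacted.txt
```

`DATABASE_URL` is included because the connection string embeds the database password.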
Post-Incident Review
After resolving major incidents:
Document What Happened:
- Root cause analysis
- Timeline of events
- Resolution steps
Update Procedures:
- Add new failure scenario to this document
- Update runbooks
- Create preventive measures
Improve Monitoring:
- Add alerts for this failure mode
- Enhance health checks
- Set up dashboards
Team Review:
- Share lessons learned
- Update training materials
- Improve response procedures
—
Monitoring and Prevention
Health Monitoring Setup
Automated Health Checks (recommended):
Create /root/scripts/health-check.sh:
#!/bin/bash
HEALTH_URL="https://yourdomain.com/health"
WEBHOOK="https://your-alerting-webhook.com"
# Check health endpoint (--max-time keeps a hung server from stalling the cron job)
HTTP_CODE=$(curl -s -o /dev/null --max-time 10 -w "%{http_code}" "$HEALTH_URL")
if [ "$HTTP_CODE" != "200" ]; then
# Send alert
curl -X POST "$WEBHOOK" \
-H "Content-Type: application/json" \
-d "{\"text\":\"LUStores health check failed: HTTP $HTTP_CODE\"}"
# Log failure
echo "$(date): Health check failed - HTTP $HTTP_CODE" >> /var/log/lustores-health.log
fi
Schedule Health Checks:
# Run every 5 minutes
*/5 * * * * /root/scripts/health-check.sh
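A check that runs every 5 minutes and alerts on every failure can flood the webhook during a long outage, and a single slow response can page someone needlessly. One possible refinement, sketched below, is to alert only when failures become consecutive. `should_alert` and its state file are hypothetical additions, not part of the script above.

```shell
#!/bin/bash
# should_alert STATE_FILE HTTP_CODE THRESHOLD
# Tracks consecutive non-200 results in STATE_FILE.
# Returns 0 (alert now) only on the check where the failure count first
# reaches THRESHOLD; returns 1 otherwise. A 200 resets the count to 0.
should_alert() {
  local state_file="$1" http_code="$2" threshold="$3" count=0
  if [ "$http_code" = "200" ]; then
    echo 0 > "$state_file"
    return 1
  fi
  if [ -f "$state_file" ]; then
    count="$(cat "$state_file")"
  fi
  # Treat an empty or non-numeric state file as zero failures
  case "$count" in
    ''|*[!0-9]*) count=0 ;;
  esac
  count=$((count + 1))
  echo "$count" > "$state_file"
  [ "$count" -eq "$threshold" ]
}

# Example (commented out; the state file path is an assumption):
# if should_alert /var/run/lustores-health.count "$HTTP_CODE" 3; then
#   curl -X POST "$WEBHOOK" -H "Content-Type: application/json" \
#     -d "{\"text\":\"LUStores health check failed 3 times in a row\"}"
# fi
```

With a threshold of 3 and the 5-minute schedule, the alert fires about 10 to 15 minutes into an outage and is not repeated until the service has recovered and failed again.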
Disk Space Monitoring
Create /root/scripts/disk-check.sh:
#!/bin/bash
THRESHOLD=80
USAGE=$(df -h / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$USAGE" -gt "$THRESHOLD" ]; then
echo "$(date): Disk usage at ${USAGE}% - threshold ${THRESHOLD}%"
# Send alert
fi
Scheduled disk checks:
# Every hour
0 * * * * /root/scripts/disk-check.sh
—
Quick Reference Card
Emergency Commands Cheat Sheet
# RESTART EVERYTHING
docker-compose -f docker-compose.prod.yml restart
# STOP EVERYTHING
docker-compose -f docker-compose.prod.yml down
# VIEW LOGS (REAL-TIME)
docker-compose -f docker-compose.prod.yml logs -f
# CHECK HEALTH
curl http://localhost/health
# BACKUP DATABASE NOW
docker exec lustores_db pg_dump -U postgres inventory > emergency-backup.sql
# RESTORE DATABASE
cat backup.sql | docker exec -i lustores_db psql -U postgres inventory
# FREE UP DISK SPACE
docker system prune -a -f
# REBUILD APP (FRESH START)
docker-compose -f docker-compose.prod.yml up -d --build --force-recreate app
Common Error Messages
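| Error Message | Quick Fix |
|---|---|
| “502 Bad Gateway” | Restart nginx (see Scenario 3) |
| “Connection refused” | Check that the database is running (see Scenario 2, Solution A) |
| “No space left” | Free up disk space (see Scenario 6) |
| “Certificate expired” | Renew the certificate (see Scenario 4) |
| “Port already in use” | Find the conflicting process (see Scenario 2, Solution D) |
| “Database incompatible” | Restore from backup (see section above) |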
—
Important
Keep This Document Updated: When you resolve a new type of incident, add it to this guide to help future responders.
Tip
Print This Page: Keep a printed copy near your server for offline reference during network outages.