Cybersecurity

Backup & Disaster Recovery: Complete Business Continuity Guide

Cesar Adames

Implement comprehensive backup and disaster recovery strategies to ensure business continuity, minimize downtime, and protect critical data from loss.

#disaster-recovery #backup-strategy #business-continuity #data-protection #resilience

Organizations experience an average of 2 hours of downtime per month, costing $300K per hour. Proper backup and disaster recovery reduces downtime by 75% and ensures business continuity during incidents, supporting 99.9%+ uptime.

DR Fundamentals

Key Concepts

Recovery Point Objective (RPO): Maximum acceptable data loss

RPO = Time between last backup and disaster
- 24 hours = Daily backups acceptable
- 1 hour = Hourly backups or continuous replication
- 0 minutes = Synchronous replication required

Recovery Time Objective (RTO): Maximum acceptable downtime

RTO = Time to restore service
- 72 hours = Basic recovery
- 4 hours = Standard business
- 1 hour = Critical systems
- Minutes = Mission-critical (HA required)

Disaster Recovery Tiers:

Tier 0: No offsite backup (0% recovery probability)
Tier 1: Data backup to tape (Days to restore)
Tier 2: Data backup to disk (Hours to restore)
Tier 3: Electronic vaulting (1-24 hours)
Tier 4: Active secondary site (Minutes to hours)
Tier 5: Zero data loss (Real-time, <1 minute)

Backup Strategy

3-2-1 Rule:

3 - Three copies of data
2 - Two different media types
1 - One copy offsite

Modern: 3-2-1-1-0
3 copies
2 media types
1 offsite
1 offline (air-gapped)
0 errors (verified)
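
The "0 errors" leg implies automated verification rather than trust. A minimal sketch that records checksums when a backup completes and verifies them afterwards (the dated path is illustrative):

# At backup time: record checksums alongside the snapshot
find /backup/20251007 -type f -exec sha256sum {} + > /backup/20251007.sha256

# At verification time: fail loudly on any mismatch
sha256sum -c --quiet /backup/20251007.sha256 || echo "ALERT: backup verification failed"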

Backup Types:

Full Backup:

  • Complete copy of all data
  • Longest backup time
  • Fastest restore
  • Most storage required
  • Weekly/monthly cadence

Incremental Backup:

  • Only data changed since the last backup
  • Fastest backup
  • Slower restore (needs the last full backup plus all increments)
  • Least storage
  • Daily cadence

Differential Backup:

  • Data changed since the last full backup
  • Moderate backup time
  • Moderate restore (needs the last full backup plus the latest differential)
  • Moderate storage
  • Daily cadence

Example Schedule:

Sunday: Full backup
Monday-Saturday: Incremental backups
Retention: 30 days online, 7 years archive
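
On Linux this schedule maps naturally onto cron. A minimal sketch, where full_backup.sh and incremental_backup.sh are placeholders for your actual backup commands:

# /etc/cron.d/backups
# Full backup every Sunday at 01:00
0 1 * * 0 root /usr/local/bin/full_backup.sh >> /var/log/backup.log 2>&1
# Incremental backups Monday through Saturday at 01:00
0 1 * * 1-6 root /usr/local/bin/incremental_backup.sh >> /var/log/backup.log 2>&1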

Backup Implementation

File System Backups

Linux (rsync):

#!/bin/bash
# Automated incremental backup using rsync with hard links

SOURCE="/data"
DEST="/backup/$(date +%Y%m%d)"
LOG="/var/log/backup.log"

# Create incremental backup; unchanged files are hard-linked
# against the previous snapshot to save space
rsync -avz --delete \
  --link-dest=/backup/latest \
  "$SOURCE" "$DEST" >> "$LOG" 2>&1

# Update latest symlink
ln -nfs "$DEST" /backup/latest

# Retention (keep 30 days of snapshots)
find /backup -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +

Windows (robocopy):

# Backup script
$Source = "C:\Data"
$Destination = "D:\Backup\$(Get-Date -Format 'yyyyMMdd')"
$Log = "C:\Logs\backup.log"

# /MIR mirrors the source (including deletions); /R and /W set retry count and wait time
robocopy $Source $Destination /MIR /R:3 /W:10 /LOG:$Log /TEE

Database Backups

MySQL:

#!/bin/bash
# MySQL backup script: per-database compressed logical dumps

BACKUP_DIR="/backup/mysql"
DATE=$(date +%Y%m%d_%H%M%S)
DATABASES="production staging"

for DB in $DATABASES; do
  # --single-transaction takes a consistent InnoDB snapshot without locking tables
  mysqldump --single-transaction \
    --quick \
    --lock-tables=false \
    --routines \
    --triggers \
    --events \
    "$DB" | gzip > "$BACKUP_DIR/${DB}_${DATE}.sql.gz"
done

# Point-in-time recovery: replay binary logs up to just before the incident
mysqlbinlog --stop-datetime="2025-10-07 14:30:00" \
  /var/log/mysql/mysql-bin.000001 > restore.sql
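
Restoring a dump is the reverse pipeline. A minimal sketch, assuming the target database already exists (the timestamped filename is illustrative):

gunzip < /backup/mysql/production_20251007_020000.sql.gz | mysql production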

PostgreSQL:

# Full database backup (custom format, enables selective restore)
pg_dump -Fc database_name > backup.dump

# All databases
pg_dumpall > all_databases.sql

# Continuous archiving for PITR (set in postgresql.conf)
archive_command = 'cp %p /backup/wal/%f'
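
Custom-format dumps are restored with pg_restore. A minimal sketch, assuming the target database already exists:

pg_restore -d database_name backup.dump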

MongoDB:

# Backup
mongodump --uri="mongodb://localhost:27017/mydb" --out=/backup

# Restore
mongorestore --uri="mongodb://localhost:27017" /backup

Cloud Backups

AWS Backup:

import boto3

backup = boto3.client('backup')

# Create backup plan
backup.create_backup_plan(
    BackupPlan={
        'BackupPlanName': 'DailyBackup',
        'Rules': [{
            'RuleName': 'DailyRule',
            'TargetBackupVaultName': 'Default',
            'ScheduleExpression': 'cron(0 2 * * ? *)',
            'StartWindowMinutes': 60,
            'CompletionWindowMinutes': 120,
            'Lifecycle': {
                'DeleteAfterDays': 30
            }
        }]
    }
)
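
A backup plan protects nothing until resources are assigned to it via a backup selection. A minimal sketch using the AWS CLI, with the plan ID, IAM role, and resource ARN as placeholders:

aws backup create-backup-selection \
  --backup-plan-id <plan-id> \
  --backup-selection '{
    "SelectionName": "DailySelection",
    "IamRoleArn": "arn:aws:iam::123456789012:role/service-role/AWSBackupDefaultServiceRole",
    "Resources": ["arn:aws:ec2:us-east-1:123456789012:volume/vol-0123456789abcdef0"]
  }'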

Disaster Recovery Planning

DR Site Strategies

Cold Site:

  • Empty facility with power/cooling
  • No equipment pre-installed
  • Lowest cost
  • Longest recovery time (weeks)
  • RTO: Days to weeks

Warm Site:

  • Partial infrastructure
  • Some equipment ready
  • Moderate cost
  • Medium recovery time
  • RTO: Hours to days

Hot Site:

  • Fully equipped and operational
  • Real-time replication
  • Highest cost
  • Fastest recovery
  • RTO: Minutes to hours

Failover Architectures

Active-Passive:

Primary Site (Active) ──replication──> DR Site (Passive)

Normal: All traffic to primary
Disaster: Manual/automatic failover to DR

Benefits:
- Cost effective
- Simple design
- Clear failover process

Challenges:
- DR resources idle
- Failover testing complex
- Data synchronization

Active-Active:

Site A (Active) <──sync──> Site B (Active)

Normal: Traffic distributed across both
Disaster: Remaining site handles all load

Benefits:
- Resources always utilized
- Automatic failover
- Better performance

Challenges:
- Higher complexity
- Data consistency
- More expensive

Replication Technologies

Storage Replication:

Synchronous:
- Zero data loss (RPO = 0)
- Performance impact
- Distance limited (<100 km)
- Mission-critical systems

Asynchronous:
- Some data loss possible
- Minimal performance impact
- Unlimited distance
- Standard systems

Database Replication:

# MySQL Master-Slave Replication
# Master configuration (my.cnf)
[mysqld]
server-id=1
log-bin=mysql-bin
binlog-do-db=production

# Slave configuration (my.cnf)
[mysqld]
server-id=2
relay-log=mysql-relay-bin
read-only=1
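
With both servers configured, the slave is pointed at the master and replication is started. A minimal sketch, with the host, credentials, and binlog coordinates as placeholders (newer MySQL releases use CHANGE REPLICATION SOURCE TO instead):

# Run on the slave
mysql -e "CHANGE MASTER TO
  MASTER_HOST='master.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='<password>',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=4;
START SLAVE;"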

Application-Level Replication:

# Dual-write pattern
def save_user(user_data):
    try:
        # Write to primary
        primary_db.save(user_data)
        
        # Write to DR
        dr_db.save(user_data)
    except Exception as e:
        # Rollback and alert
        handle_replication_failure(e)

Testing & Validation

Backup Testing

Restore Testing:

#!/bin/bash
# Automated restore test

# Test environment (assumes the backup volume is mounted at /backup on the test server)
TEST_SERVER="test-restore-01"

# Restore latest backup
LATEST_BACKUP=$(ls -t /backup/*.tar.gz | head -1)
ssh "$TEST_SERVER" "tar -xzf $LATEST_BACKUP -C /restore"

# Verify data integrity
ssh "$TEST_SERVER" "md5sum -c /restore/checksums.md5"

# Application smoke test
ssh "$TEST_SERVER" "/restore/scripts/smoke_test.sh"

# Report results (site-specific notification helper)
send_test_report "Restore test completed successfully"

Backup Verification:

Automated checks:
- Backup job completion
- File integrity (checksums)
- Data consistency
- Backup size validation
- Encryption verification

Schedule:
- Daily: Backup completion check
- Weekly: Sample restore test
- Monthly: Full restore test
- Quarterly: DR drill

Disaster Recovery Drills

Tabletop Exercise:

  • Scenario walkthrough
  • Team discussion
  • Identify gaps
  • Update procedures
  • No actual failover

Simulated Disaster:

  • Announce test scenario
  • Execute DR procedures
  • Monitor recovery process
  • Document issues
  • Measure against RTO/RPO

Unannounced Drill:

  • Real-world simulation
  • Test true readiness
  • Identify surprises
  • Validate procedures
  • Build muscle memory

Business Continuity

BCP Components

Business Impact Analysis (BIA):

Identify critical functions:
1. Revenue-generating systems
2. Customer-facing services
3. Compliance requirements
4. Dependencies and priorities

Assess impact:
- Financial loss per hour
- Reputation damage
- Regulatory penalties
- Customer attrition
- Market share loss

Recovery Strategies:

Technology:
- Redundant systems
- Cloud failover
- Data replication
- Backup power

People:
- Emergency contacts
- Succession planning
- Remote work capability
- Cross-training

Processes:
- Documented procedures
- Communication plans
- Vendor management
- Regular testing

Incident Response Integration

DR Activation Criteria:

  • Critical system failure
  • Data center outage
  • Cyber attack (ransomware)
  • Natural disaster
  • Extended outage expected

Escalation Path:

Level 1: IT Operations (assess situation)
Level 2: IT Management (declare incident)
Level 3: Crisis Management Team (activate DR)
Level 4: Executive Team (business decisions)

Cloud DR Solutions

AWS Disaster Recovery:

# Pilot Light architecture
# Minimal DR environment always running

# Normal state: Core AMIs and databases replicated

# Disaster activation:
import boto3

ec2 = boto3.client('ec2')
autoscaling = boto3.client('autoscaling')

# Scale up DR environment
autoscaling.set_desired_capacity(
    AutoScalingGroupName='dr-web-asg',
    DesiredCapacity=10  # Scale from 2 to 10
)

# Update Route 53: promote the DR endpoint as the primary failover record
route53 = boto3.client('route53')
route53.change_resource_record_sets(
    HostedZoneId='Z123',
    ChangeBatch={
        'Changes': [{
            'Action': 'UPSERT',
            'ResourceRecordSet': {
                'Name': 'app.example.com',
                'Type': 'A',
                'SetIdentifier': 'DR-Site',
                'Failover': 'PRIMARY',
                'AliasTarget': {
                    'HostedZoneId': 'Z456',
                    'DNSName': 'dr-elb.amazonaws.com',
                    'EvaluateTargetHealth': True
                }
            }
        }]
    }
)
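
Failover routing only works if Route 53 can detect the primary going unhealthy, which requires a health check attached to the primary record. A minimal sketch, with the endpoint and path as placeholders:

aws route53 create-health-check \
  --caller-reference "primary-app-$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "app.example.com",
    "Port": 443,
    "ResourcePath": "/health",
    "RequestInterval": 30,
    "FailureThreshold": 3
  }'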

Monitoring & Alerting

Backup Monitoring:

Monitor:
- Backup job status
- Backup duration trends
- Storage capacity
- Backup size anomalies
- Failed backups

Alert on:
- Backup failures
- Backup time exceeded SLA
- Storage threshold (80%)
- Integrity check failures
- Replication lag
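
A simple cron-driven freshness check covers the most important alert, a backup job that silently stopped running. A minimal sketch, with the backup path and notification command as placeholders:

#!/bin/bash
# Alert if no backup newer than 24 hours (1440 minutes) exists
BACKUP_DIR="/backup"

if ! find "$BACKUP_DIR" -name '*.tar.gz' -mmin -1440 | grep -q .; then
  # Replace with your alerting integration (email, PagerDuty, Slack, ...)
  echo "ALERT: no backup completed in the last 24 hours" | mail -s "Backup failure" oncall@example.com
fi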

DR Readiness:

Track:
- Last successful DR test
- RTO/RPO compliance
- Replication status
- Site availability
- Recovery procedure currency (date of last review)

Dashboard metrics:
- Days since last DR drill
- Backup success rate (target: 99%+)
- Average restore time
- Data loss (RPO adherence)
- Test success rate

Compliance Requirements

Regulatory Standards:

SOC 2:

  • Documented backup procedures
  • Regular testing
  • Incident response
  • Change management
  • Availability commitments

HIPAA:

  • Data backup plan
  • Disaster recovery plan
  • Emergency mode operations
  • Testing and revision
  • Business continuity

PCI DSS:

  • Backup data encrypted
  • Annual testing
  • Secure storage
  • Retention policy
  • Incident response

Getting Started

Month 1: Assessment

  • Identify critical systems
  • Define RTO/RPO requirements
  • Document dependencies
  • Assess current capabilities
  • Calculate costs

Month 2: Implementation

  • Deploy backup solution
  • Configure replication
  • Establish DR site
  • Document procedures
  • Train team

Month 3: Validation

  • Test backup restores
  • Conduct DR drill
  • Measure RTO/RPO
  • Refine procedures
  • Executive review

Conclusion

Backup and disaster recovery are essential for business continuity and resilience. Proper implementation minimizes downtime, protects data, and ensures regulatory compliance.

Success requires clear RTO/RPO targets, regular testing, documented procedures, and continuous improvement. Invest in automation, monitoring, and team preparedness.

Next Steps:

  1. Conduct business impact analysis
  2. Define RTO/RPO requirements
  3. Implement backup strategy
  4. Design DR architecture
  5. Test and refine continuously
