Defines how persistent data is stored, backed up, and restored in production environments. Proper backup and disaster recovery procedures are essential for business continuity and data protection.

Key Objectives:
  • Protect against data loss from hardware failures, human errors, or disasters
  • Enable point-in-time recovery for compliance and operational requirements
  • Minimize Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
  • Validate backup integrity through regular restore testing

Volume layout

Docker volumes provide persistent storage for stateful services. Each service stores data at specific mount points within containers:
Service   Default path          Data stored
TiKV      /data                 Distributed key-value data, Raft logs
PD        /data                 Cluster metadata, timestamp oracle
Kafka     /var/lib/kafka/data   Message logs, topic partitions
Redis     /data                 Cache data, session state
MongoDB   /data/db              Document collections, indexes
Storage requirements:
  • All persistent volumes should be backed by SSD or NVMe storage for production deployments
  • Provision adequate IOPS for database workloads (minimum 3000 IOPS for TiKV)
  • Monitor disk space usage and set alerts at 75% capacity
  • Plan for 30-50% growth buffer beyond current usage
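
The 75% alert rule above can be scripted as a minimal sketch; the threshold comes from this runbook, while the sample usage figure and alert format are placeholders to wire into your monitoring system:

```shell
#!/bin/sh
# Sketch of the 75% capacity alert rule. THRESHOLD comes from this runbook;
# the sample mount point and usage value below are illustrative only.
THRESHOLD=75

check_volume() {
  # $1 = mount point, $2 = used capacity as an integer percentage
  if [ "$2" -ge "$THRESHOLD" ]; then
    echo "ALERT: $1 at $2% capacity"
  else
    echo "OK: $1 at $2% capacity"
  fi
}

# In production, feed real usage from df, e.g.:
#   df --output=target,pcent /data
check_volume /data 82
```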

Backup strategy

Implement automated, regular backups for all stateful services with appropriate retention policies:

TiDB backups

Frequency: Daily full backups, hourly incremental backups (for critical deployments)
Method: Use TiDB BR (Backup & Restore) tool for consistent cluster snapshots
# Full backup
tiup br backup full \
  --pd <pd-address>:2379 \
  --storage "local:///backup/$(date +%Y%m%d)" \
  --ratelimit 120 \
  --log-file backup.log

# Incremental backup: BR takes incrementals via `backup full` plus
# --lastbackupts, backing up only data written after that timestamp
tiup br backup full \
  --pd <pd-address>:2379 \
  --storage "local:///backup/incremental/$(date +%Y%m%d_%H%M)" \
  --lastbackupts <last-backup-timestamp>
Storage: Secure, off-cluster storage (S3, NFS, or dedicated backup server)
Retention: 30 days for daily backups, 90 days for monthly backups (adjust based on compliance requirements)
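
The retention policy can be enforced with a scheduled sweep. A sketch assuming one date-stamped directory per daily backup under /backup, as in the commands above; it prints candidates only, so you can verify before enabling deletion:

```shell
#!/bin/sh
# Sketch: enforce the 30-day retention policy, assuming one date-stamped
# directory per daily backup under /backup. Print-only by default;
# swap -print for -exec rm -rf {} + once the output looks right.
BACKUP_ROOT=${BACKUP_ROOT:-/backup}
RETENTION_DAYS=30

find "$BACKUP_ROOT" -mindepth 1 -maxdepth 1 -type d \
  -mtime +"$RETENTION_DAYS" -print
```

Run it from cron after the nightly backup completes, never concurrently with one.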

Kafka backups

Frequency: Weekly segment-level backups
Method: Copy Kafka data directories or use MirrorMaker for replication to backup cluster
# Stop Kafka broker (if taking offline backup)
docker service scale kafka=0

# Backup Kafka data directory
tar -czf kafka-backup-$(date +%Y%m%d).tar.gz /var/lib/kafka/data

# Restart Kafka broker
docker service scale kafka=1
⚠️ Warning: Stopping Kafka brokers will interrupt message delivery. Perform offline backups only during maintenance windows.
Retention: 4 weeks (Kafka data is typically transient with configurable retention)
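
For deployments that cannot tolerate broker downtime, the MirrorMaker route mentioned above avoids it: MirrorMaker 2 replicates topics continuously to a backup cluster. A minimal sketch; the cluster aliases and bootstrap addresses are placeholders:

```shell
#!/bin/sh
# Sketch: generate a MirrorMaker 2 config that mirrors all topics from the
# primary cluster to a backup cluster. Aliases and addresses are placeholders.
cat > mm2.properties <<'EOF'
clusters = primary, backup
primary.bootstrap.servers = kafka:9092
backup.bootstrap.servers = kafka-backup:9092
primary->backup.enabled = true
primary->backup.topics = .*
replication.factor = 3
EOF

# Launch from the Kafka distribution (runs until stopped):
#   connect-mirror-maker.sh mm2.properties
```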

MongoDB backups

Frequency: Daily backups
Method: Use mongodump for logical backups or filesystem snapshots for physical backups
# Logical backup with mongodump
# Ensure mongodump uses a read-only backup user with minimal privileges
docker exec mongodb mongodump \
  --out /backup/mongodb-$(date +%Y%m%d) \
  --gzip

# Copy backup to secure storage
docker cp mongodb:/backup/mongodb-$(date +%Y%m%d) /secure/backup/location/
Retention: 30 days for daily backups, 1 year for monthly backups
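
The restore side is symmetric. A sketch that picks the newest dump by its date-stamped name, relying on the /backup/mongodb-YYYYMMDD layout produced by the mongodump command above (the container name is a placeholder):

```shell
#!/bin/sh
# Sketch: find the newest mongodump directory by its YYYYMMDD suffix,
# relying on date-stamped names sorting lexically.
latest_backup() {
  # $1 = backup root directory
  ls -d "$1"/mongodb-* 2>/dev/null | sort | tail -n 1
}

# Restore it (verify against staging before touching production);
# --drop replaces existing collections with the backup copy:
#   docker exec mongodb mongorestore --gzip --drop "$(latest_backup /backup)"
```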

Redis backups

Frequency: RDB snapshots every 6 hours
Method: Redis automatically creates RDB snapshots based on configuration
# Trigger manual snapshot
docker exec redis redis-cli BGSAVE

# Copy RDB file to backup location
docker cp redis:/data/dump.rdb /backup/redis-$(date +%Y%m%d_%H%M).rdb
Note: Redis data is non-authoritative and can be safely rebuilt from TiDB and Kafka in most scenarios. Backups are primarily for faster recovery rather than data preservation.
Retention: 7 days (cache data has short-term value)
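
Before trusting a copied snapshot, a quick sanity check is cheap. A sketch using the RDB magic header; for a full structural pass, redis-check-rdb ships with the Redis server package:

```shell
#!/bin/sh
# Sketch: cheap sanity check on a copied RDB file. Every RDB file begins
# with the 5-byte magic string "REDIS" followed by a version number.
verify_rdb() {
  [ "$(head -c 5 "$1" 2>/dev/null)" = "REDIS" ]
}

# Deeper structural check with the bundled tool:
#   redis-check-rdb /backup/redis-20240119_0600.rdb
```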

Backup validation

Monthly restore tests: Perform full restore to staging environment to verify backup integrity
# Example: Restore TiDB backup
tiup br restore full \
  --pd <staging-pd-address>:2379 \
  --storage "local:///backup/20240119" \
  --log-file restore.log

# Verify data integrity
# Run application smoke tests against restored data
Validation checklist:
  • Backup files are complete and not corrupted
  • Restore process completes without errors
  • Data integrity checks pass (row counts, checksums)
  • Application can connect and query restored data
  • Restore time meets RTO requirements
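
The row-count check in the list above can be scripted. A sketch in which the table names and counts are placeholders; in practice the numbers come from the mysql client, since TiDB speaks the MySQL protocol:

```shell
#!/bin/sh
# Sketch: compare per-table row counts between production and the restored
# staging cluster. Table names and counts below are illustrative.
compare_counts() {
  # $1 = table, $2 = production count, $3 = restored count
  if [ "$2" = "$3" ]; then
    echo "OK: $1 ($2 rows)"
  else
    echo "MISMATCH: $1 (prod=$2 restored=$3)"
  fi
}

# In production, fetch real counts over the MySQL protocol, e.g.:
#   mysql -h <tidb-host> -P 4000 -N -e "SELECT COUNT(*) FROM app.users"
compare_counts users 1042 1042
compare_counts messages 99310 99275
```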

Disaster recovery

Establish comprehensive disaster recovery procedures to ensure business continuity in the event of catastrophic failures.

Recovery objectives

Define clear recovery targets based on business requirements:
  • RTO (Recovery Time Objective): Maximum acceptable downtime (typically 1-4 hours for production systems)
  • RPO (Recovery Point Objective): Maximum acceptable data loss (typically 1-24 hours depending on backup frequency)
Note: Actual RTO/RPO depends on backup size, network bandwidth, and restore automation maturity. Test and validate your specific recovery times.
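
One way to make the RPO measurable is to alert on the age of the newest successful backup. A sketch assuming the 24-hour end of the target range above and GNU date:

```shell
#!/bin/sh
# Sketch: treat the age of the newest successful backup as the achieved RPO
# and compare it against a 24-hour target. Requires GNU date for -d.
RPO_SECONDS=$((24 * 3600))

rpo_status() {
  # $1 = epoch timestamp of the last successful backup
  age=$(( $(date +%s) - $1 ))
  if [ "$age" -le "$RPO_SECONDS" ]; then
    echo "within RPO (backup ${age}s old)"
  else
    echo "RPO breached (backup ${age}s old)"
  fi
}

rpo_status "$(date +%s -d '2 hours ago')"
```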

Disaster recovery procedures

Full cluster restoration from backups:
  1. Provision new infrastructure matching production specifications
  2. Restore data stores in dependency order:
    • TiDB/TiKV (primary data store) - Restore TiDB first as it is the authoritative source of user, conversation, and message metadata
    • MongoDB (metadata and configuration)
    • Kafka (if message history is critical)
    • Redis (optional, can rebuild from primary data)
  3. Restore application services and verify connectivity
  4. Validate data integrity through application smoke tests
  5. Update DNS to point to new cluster
  6. Monitor closely for 24-48 hours post-recovery
Example TiDB restore:
# Restore TiDB cluster from backup
tiup br restore full \
  --pd <new-pd-address>:2379 \
  --storage "s3://backup-bucket/20240119" \
  --log-file restore.log

# Verify cluster health
tiup cluster display <cluster-name>

# Run data integrity checks
# Check row counts, run application queries

Geographic redundancy

Backup storage locations:
  • Maintain a minimum of three geographically isolated backup copies
  • Primary backup: Same region as production (fast recovery)
  • Secondary backup: Different region (regional disaster protection)
  • Tertiary backup: Different cloud provider or on-premises (provider-level disaster protection)
Important: Ensure backups are completed and verified before object storage lifecycle rules expire older snapshots.
Replication strategies:
  • Use cloud storage replication (S3 cross-region replication, GCS multi-region)
  • Implement backup verification at each location
  • Test restore from each backup location quarterly
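
Per-location verification can be scripted against object store listings. A sketch in which the location labels, bucket names, and listing text are placeholders:

```shell
#!/bin/sh
# Sketch: count a replica location as healthy only if today's backup
# appears in its listing. Labels and listing text are placeholders.
check_location() {
  # $1 = location label, $2 = listing text for that location's backup prefix
  case "$2" in
    *"$(date +%Y%m%d)"*) echo "OK: $1 has today's backup" ;;
    *) echo "MISSING: $1 lacks today's backup" ;;
  esac
}

# In production, feed real listings, e.g.:
#   check_location primary "$(aws s3 ls s3://backup-primary/tidb/)"
check_location secondary "tidb/$(date +%Y%m%d)/"
```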

Disaster recovery testing

Quarterly DR simulations: Run staged disaster recovery exercises to validate procedures and train teams:
  1. Warm-standby restoration: Restore to standby environment, validate without cutting over
  2. Full cluster rehydration: Complete restore from backups in isolated environment
  3. Failover testing: Practice DNS cutover and traffic migration procedures
  4. Rollback testing: Validate ability to roll back to previous state if needed
DR drill checklist:
  • Backup files are accessible from all locations
  • Restore procedures are documented and up-to-date
  • Team members know their roles and responsibilities
  • Communication channels are established
  • Restore time meets RTO requirements
  • Data integrity is validated post-restore
  • Application functionality is verified
  • Lessons learned are documented and procedures updated

Backup security

Encryption:
  • Encrypt backups at rest using AES-256 or equivalent
  • Encrypt backups in transit using TLS
  • Store encryption keys separately from backup data (use key management service)
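
A sketch of AES-256 encryption at rest before upload, assuming the passphrase is delivered by a key management service (stood in for here by a file path); the -pbkdf2 flag requires OpenSSL 1.1.1 or newer:

```shell
#!/bin/sh
# Sketch: encrypt a backup archive with AES-256 before it leaves the host.
# KEY_FILE stands in for a secret fetched from a KMS, never stored with data.
KEY_FILE=${KEY_FILE:-/run/secrets/backup_key}

encrypt_backup() {
  # Produces "$1.enc"; -salt and -pbkdf2 harden the key derivation
  openssl enc -aes-256-cbc -pbkdf2 -salt \
    -in "$1" -out "$1.enc" -pass "file:$KEY_FILE"
}

decrypt_backup() {
  # Reverses encrypt_backup, writing the original filename without .enc
  openssl enc -d -aes-256-cbc -pbkdf2 \
    -in "$1" -out "${1%.enc}" -pass "file:$KEY_FILE"
}
```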
Access control:
  • Restrict backup access to authorized personnel only
  • Use separate credentials for backup operations
  • Audit all backup access and modifications
  • Implement multi-factor authentication for backup system access