- Protect against data loss from hardware failures, human errors, or disasters
- Enable point-in-time recovery for compliance and operational requirements
- Minimize Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
- Validate backup integrity through regular restore testing
Volume layout
Docker volumes provide persistent storage for stateful services. Each service stores data at specific mount points within containers:

| Service | Default path | Data stored |
|---|---|---|
| TiKV | /data | Distributed key-value data, Raft logs |
| PD | /data | Cluster metadata, timestamp oracle |
| Kafka | /var/lib/kafka/data | Message logs, topic partitions |
| Redis | /data | Cache data, session state |
| MongoDB | /data/db | Document collections, indexes |
- All persistent volumes should be backed by SSD or NVMe storage for production deployments
- Provision adequate IOPS for database workloads (minimum 3000 IOPS for TiKV)
- Monitor disk space usage and set alerts at 75% capacity
- Plan for 30-50% growth buffer beyond current usage
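The capacity guidance above can be checked mechanically. The following is a minimal sketch, assuming the volume is mounted on the host where the script runs. The function names and the choice of the 50% end of the growth buffer are illustrative, not part of any existing tooling.

```python
import shutil

ALERT_PERCENT = 75          # alert threshold from the guidance above
GROWTH_BUFFER_PERCENT = 50  # assumption: upper end of the 30-50% buffer


def used_percent(total_bytes: int, used_bytes: int) -> float:
    """Return disk usage as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def needs_alert(total_bytes: int, used_bytes: int,
                threshold_percent: int = ALERT_PERCENT) -> bool:
    """True when usage has reached the alert threshold."""
    return used_percent(total_bytes, used_bytes) >= threshold_percent


def recommended_capacity(used_bytes: int,
                         buffer_percent: int = GROWTH_BUFFER_PERCENT) -> int:
    """Capacity to provision: current usage plus the growth buffer."""
    return used_bytes * (100 + buffer_percent) // 100


def mount_needs_alert(path: str) -> bool:
    """Check a mounted volume path on the local host."""
    usage = shutil.disk_usage(path)
    return needs_alert(usage.total, usage.used)
```

Calling `mount_needs_alert` with the host-side path of a volume returns `True` once usage reaches 75%. The pure functions can be reused with numbers taken from a monitoring system instead.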
Backup strategy
Implement automated, regular backups for all stateful services with appropriate retention policies:

TiDB backups
Frequency: Daily full backups, hourly incremental backups (for critical deployments)
Method: Use the TiDB BR (Backup & Restore) tool for consistent cluster snapshots

Kafka backups
Frequency: Weekly segment-level backups
Method: Copy Kafka data directories, or use MirrorMaker for replication to a backup cluster

MongoDB backups
Frequency: Daily backups
Method: Use mongodump for logical backups, or filesystem snapshots for physical backups

Redis backups
Frequency: RDB snapshots every 6 hours
Method: Redis automatically creates RDB snapshots based on its configuration

Backup validation
Monthly restore tests: Perform a full restore to a staging environment to verify backup integrity:
- Backup files are complete and not corrupted
- Restore process completes without errors
- Data integrity checks pass (row counts, checksums)
- Application can connect and query restored data
- Restore time meets RTO requirements
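One way to check that backup files are complete and not corrupted is to record a checksum manifest when the backup is taken and compare against it before restoring. The sketch below assumes backups are plain files in a directory. `build_manifest` and `verify_manifest` are hypothetical helper names, and SHA-256 is an illustrative choice of checksum.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backup files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(backup_dir: Path) -> Dict[str, str]:
    """Map each file's path (relative to backup_dir) to its checksum."""
    return {
        str(path.relative_to(backup_dir)): sha256_of(path)
        for path in sorted(backup_dir.rglob("*"))
        if path.is_file()
    }


def verify_manifest(backup_dir: Path, manifest: Dict[str, str]) -> List[str]:
    """Return relative paths that are missing or whose checksum changed."""
    problems = []
    for rel_path, expected in manifest.items():
        candidate = backup_dir / rel_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            problems.append(rel_path)
    return problems
```

An empty list from `verify_manifest` covers only file-level integrity. Row counts and application-level queries still need to run against the restored data.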
Disaster recovery
Establish comprehensive disaster recovery procedures to ensure business continuity in the event of catastrophic failures.

Recovery objectives
Define clear recovery targets based on business requirements:
- RTO (Recovery Time Objective): Maximum acceptable downtime (typically 1-4 hours for production systems)
- RPO (Recovery Point Objective): Maximum acceptable data loss (typically 1-24 hours depending on backup frequency)
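The link between backup frequency and RPO can be made concrete: if a failure occurs just before the next backup, the data at risk is everything written since the last successful one. The sketch below is illustrative only. It assumes the timestamp of the last successful backup is available from the backup tooling, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def rpo_exposure(last_backup_at: datetime, now: datetime) -> timedelta:
    """Data-loss window if a failure happened at `now`."""
    return now - last_backup_at


def meets_rpo(last_backup_at: datetime, now: datetime, rpo: timedelta) -> bool:
    """True when the current exposure is within the RPO target."""
    return rpo_exposure(last_backup_at, now) <= rpo


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last_backup = now - timedelta(minutes=90)  # example value
    # A 90-minute-old backup misses a 1-hour RPO but meets a 24-hour one.
    print(meets_rpo(last_backup, now, timedelta(hours=1)))
    print(meets_rpo(last_backup, now, timedelta(hours=24)))
```

Under this model, hourly incremental backups bound the exposure at roughly one hour plus the backup's own run time, and daily backups at roughly 24 hours.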
Disaster recovery procedures
Full cluster restoration from backups:
- Provision new infrastructure matching production specifications
- Restore data stores in dependency order:
  - TiDB/TiKV (primary data store): restore TiDB first, as it is the authoritative source of user, conversation, and message metadata
  - MongoDB (metadata and configuration)
  - Kafka (if message history is critical)
  - Redis (optional, can rebuild from primary data)
- Restore application services and verify connectivity
- Validate data integrity through application smoke tests
- Update DNS to point to new cluster
- Monitor closely for 24-48 hours post-recovery
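The dependency order in the procedure above can be enforced in a small runbook script. This is a sketch under stated assumptions: each data store's restore is wrapped in a caller-supplied function, and the store keys and `restore_in_order` are hypothetical names. The actual restore commands are deliberately left to the caller.

```python
from typing import Callable, Dict, FrozenSet, List

# Dependency order taken from the procedure above.
RESTORE_ORDER = ("tidb", "mongodb", "kafka", "redis")
# Stores the procedure treats as conditional or rebuildable.
OPTIONAL_STORES = frozenset({"kafka", "redis"})


def restore_in_order(steps: Dict[str, Callable[[], None]],
                     skip: FrozenSet[str] = frozenset()) -> List[str]:
    """Run restore steps in dependency order; return the stores restored."""
    not_skippable = set(skip) - OPTIONAL_STORES
    if not_skippable:
        raise ValueError(f"cannot skip required stores: {sorted(not_skippable)}")
    planned = [store for store in RESTORE_ORDER if store not in skip]
    missing = [store for store in planned if store not in steps]
    if missing:
        # Fail before touching anything rather than halfway through.
        raise KeyError(f"no restore step registered for: {missing}")
    restored = []
    for store in planned:
        steps[store]()  # an exception here stops the run before later stores
        restored.append(store)
    return restored
```

Checking for missing steps before starting avoids a partially restored cluster caused by a configuration gap. A failure inside a step still halts the run, so dependent stores are not restored on top of an incomplete primary.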
Geographic redundancy
Backup storage locations:
- Maintain a minimum of three geographically isolated backup copies
- Primary backup: Same region as production (fast recovery)
- Secondary backup: Different region (regional disaster protection)
- Tertiary backup: Different cloud provider or on-premises (provider-level disaster protection)
- Use cloud storage replication (S3 cross-region replication, GCS multi-region)
- Implement backup verification at each location
- Test restore from each backup location quarterly
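The three-copy policy above can be expressed as a check over an inventory of backup locations. The sketch below is one reading of that policy, not a definitive rule set. `BackupCopy`, `redundancy_gaps`, and the provider and region strings in the usage example are all hypothetical, and the secondary copy is assumed to be on the production provider in another region.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class BackupCopy:
    name: str
    provider: str  # cloud provider name, or "on-prem"
    region: str


def redundancy_gaps(copies: Sequence[BackupCopy],
                    production_provider: str,
                    production_region: str) -> List[str]:
    """Return a description of every policy requirement that is not met."""
    gaps = []
    if len(copies) < 3:
        gaps.append("fewer than three backup copies")
    same_provider = [c for c in copies if c.provider == production_provider]
    if not any(c.region == production_region for c in same_provider):
        gaps.append("no copy in the production region")
    if not any(c.region != production_region for c in same_provider):
        gaps.append("no copy in a different region")
    if not any(c.provider != production_provider for c in copies):
        gaps.append("no copy outside the production provider")
    return gaps
```

For example, an inventory holding only a same-region copy and a cross-region copy on one provider reports two gaps: fewer than three copies, and no copy outside the production provider.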
Disaster recovery testing
Quarterly DR simulations: Run staged disaster recovery exercises to validate procedures and train teams:
- Warm-standby restoration: Restore to standby environment, validate without cutting over
- Full cluster rehydration: Complete restore from backups in isolated environment
- Failover testing: Practice DNS cutover and traffic migration procedures
- Rollback testing: Validate ability to roll back to previous state if needed
- Backup files are accessible from all locations
- Restore procedures are documented and up-to-date
- Team members know their roles and responsibilities
- Communication channels are established
- Restore time meets RTO requirements
- Data integrity is validated post-restore
- Application functionality is verified
- Lessons learned are documented and procedures updated
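To confirm that restore time meets the RTO during a drill, the restore can be wrapped in a timer and the result recorded with the exercise notes. This is a minimal sketch, assuming the restore is started from a single callable. `timed_drill` is a hypothetical helper, and the injectable clock exists only to make the function testable.

```python
import time
from datetime import timedelta
from typing import Callable, Dict


def timed_drill(restore: Callable[[], None],
                rto: timedelta,
                clock: Callable[[], float] = time.monotonic) -> Dict[str, object]:
    """Run a restore callable and compare its duration with the RTO."""
    started = clock()
    restore()
    elapsed = timedelta(seconds=clock() - started)
    return {"elapsed": elapsed, "rto": rto, "within_rto": elapsed <= rto}


if __name__ == "__main__":
    # Placeholder restore; a real drill would invoke the restore runbook here.
    result = timed_drill(lambda: time.sleep(0.1), rto=timedelta(hours=4))
    print(result["elapsed"], result["within_rto"])
```

A monotonic clock is used so that system clock adjustments during a long restore do not distort the measurement.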
Backup security
Encryption:
- Encrypt backups at rest using AES-256 or equivalent
- Encrypt backups in transit using TLS
- Store encryption keys separately from backup data (use key management service)
- Restrict backup access to authorized personnel only
- Use separate credentials for backup operations
- Audit all backup access and modifications
- Implement multi-factor authentication for backup system access