# Persistence & Backup

> Plan CometChat on-premise persistence, backups, restore testing, storage requirements, retention, and disaster recovery.

This page defines how persistent data is stored, backed up, and restored in production environments. Proper backup and disaster recovery procedures are essential for business continuity and data protection.

**Key Objectives:**

* Protect against data loss from hardware failures, human errors, or disasters
* Enable point-in-time recovery for compliance and operational requirements
* Minimize Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
* Validate backup integrity through regular restore testing

## Volume layout

Docker volumes provide persistent storage for stateful services. Each service stores data at specific mount points within containers:

| Service | Default path          | Data stored                           |
| ------- | --------------------- | ------------------------------------- |
| TiKV    | `/data`               | Distributed key-value data, Raft logs |
| PD      | `/data`               | Cluster metadata, timestamp oracle    |
| Kafka   | `/var/lib/kafka/data` | Message logs, topic partitions        |
| Redis   | `/data`               | Cache data, session state             |
| MongoDB | `/data/db`            | Document collections, indexes         |

**Storage requirements:**

* All persistent volumes should be backed by SSD or NVMe storage for production deployments
* Provision adequate IOPS for database workloads (minimum 3000 IOPS for TiKV)
* Monitor disk space usage and set alerts at 75% capacity
* Plan for 30-50% growth buffer beyond current usage
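
The 75% alert threshold and the growth buffer can be checked with a few lines of shell. This is a minimal sketch, not part of the product: the function names are hypothetical, and the path you pass must be the host path where the volume is mounted.

```bash
#!/usr/bin/env bash
# Sketch only: function names are illustrative, not CometChat tooling.

# Print the used-space percentage (integer) of the filesystem holding path $1.
disk_usage_percent() {
  df -P "$1" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage $1 is at or above threshold $2 (default 75).
over_threshold() {
  [ "$1" -ge "${2:-75}" ]
}

# Size to provision (GB) for current usage $1 (GB) plus buffer $2 (percent),
# rounding the buffer up to a whole GB.
provision_gb() {
  echo $(( $1 + ($1 * $2 + 99) / 100 ))
}

# Check one path; returns 1 when the alert should fire, 2 on a read error.
check_volume() {
  local usage
  usage=$(disk_usage_percent "$1")
  [[ "$usage" =~ ^[0-9]+$ ]] || return 2
  if over_threshold "$usage"; then
    echo "WARN: $1 is at ${usage}% (alert threshold 75%)"
    return 1
  fi
  echo "OK: $1 is at ${usage}%"
}
```

For example, `provision_gb 200 30` prints `260`, and `check_volume <host-volume-path>` can be run from cron or a monitoring agent.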

## Backup strategy

Implement automated, regular backups for all stateful services with appropriate retention policies:

### TiDB backups

**Frequency**: Daily full backups, hourly incremental backups (for critical deployments)

**Method**: Use TiDB BR (Backup & Restore) tool for consistent cluster snapshots

```bash theme={null}
# Full backup
tiup br backup full \
  --pd <pd-address>:2379 \
  --storage "local:///backup/$(date +%Y%m%d)" \
  --ratelimit 120 \
  --log-file backup.log

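# Incremental backup (BR has no separate incremental subcommand;
# pass --lastbackupts to "backup full" to back up only changes since that timestamp)
tiup br backup full \
  --pd <pd-address>:2379 \
  --storage "local:///backup/incremental/$(date +%Y%m%d_%H%M)" \
  --lastbackupts <last-backup-timestamp>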
```
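
BR takes an incremental backup when `backup full` is given `--lastbackupts`, which needs the end timestamp of the previous backup. The TiDB documentation describes reading that value from the previous backup's metadata with `br validate decode --field="end-version"`. The sketch below wraps both steps. The function names and the storage URLs you pass in are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# Print the end timestamp (TSO) recorded in the backup stored at $1.
last_backup_ts() {
  tiup br validate decode --field="end-version" --storage "$1" | tail -n1
}

# Incremental backup of everything changed since the backup at $2.
#   $1 = PD address, $2 = previous backup storage, $3 = new backup storage
incremental_backup() {
  local pd=$1 base=$2 dest=$3 ts
  ts=$(last_backup_ts "$base")
  # Refuse to continue unless the timestamp is a plain integer.
  if ! [[ "$ts" =~ ^[0-9]+$ ]]; then
    echo "could not read end-version from $base" >&2
    return 1
  fi
  tiup br backup full --pd "$pd" --storage "$dest" --lastbackupts "$ts"
}
```

A call might look like `incremental_backup "<pd-address>:2379" "local:///backup/20240119" "local:///backup/incremental/20240119_1300"`.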

**Storage**: Secure, off-cluster storage (S3, NFS, or dedicated backup server)

**Retention**: 30 days for daily backups, 90 days for monthly backups (adjust based on compliance requirements)
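
Retention can be enforced by listing daily backup directories older than a cutoff date. This is a minimal sketch that assumes backups sit in directories named `YYYYMMDD`, as in the `--storage` paths above. The function name is hypothetical, and it only lists candidates so that deletion remains a separate, reviewed step.

```bash
#!/usr/bin/env bash
# Sketch only: assumes one directory per daily backup, named YYYYMMDD.

# Print backup directories under $1 whose name sorts before cutoff $2 (YYYYMMDD).
list_expired_backups() {
  local root=$1 cutoff=$2 dir name
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    name=$(basename "$dir")
    # Skip anything that is not an eight-digit date (e.g. "incremental").
    [[ "$name" =~ ^[0-9]{8}$ ]] || continue
    # Fixed-width dates compare correctly as strings.
    if [[ "$name" < "$cutoff" ]]; then
      echo "${dir%/}"
    fi
  done
}
```

With GNU `date`, a 30-day cutoff is `list_expired_backups /backup "$(date -d '30 days ago' +%Y%m%d)"`. Monthly backups kept for 90 days would need their own location or an exclusion rule, which this sketch does not cover.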

### Kafka backups

**Frequency**: Weekly segment-level backups

**Method**: Copy Kafka data directories or use MirrorMaker for replication to backup cluster

```bash theme={null}
# Stop Kafka broker (if taking offline backup)
docker service scale kafka=0

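# Backup Kafka data directory
# (/var/lib/kafka/data is the in-container path; when archiving from the host,
# use the host path of the mounted Kafka volume instead)
tar -czf kafka-backup-$(date +%Y%m%d).tar.gz /var/lib/kafka/data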

# Restart Kafka broker
docker service scale kafka=1
```

**⚠️ Warning**: Stopping Kafka brokers will interrupt message delivery. Perform offline backups only during maintenance windows.
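
An offline archive is only useful if it can be read back. The following sketch archives a data directory and checks the result before reporting success. The function name and arguments are hypothetical, and the source path is assumed to be the host path of the stopped broker's volume.

```bash
#!/usr/bin/env bash
# Sketch only: helper name and arguments are illustrative.

# Archive directory $1 into $2/<$3>-YYYYMMDD.tar.gz, verify it, print its path.
archive_dir() {
  local src=$1 dest=$2 prefix=$3 archive
  archive="$dest/$prefix-$(date +%Y%m%d).tar.gz"
  # -C stores paths relative to the data directory instead of absolute paths.
  tar -czf "$archive" -C "$src" . || return 1
  # Check the gzip stream, then make sure tar can list every entry.
  gzip -t "$archive" || return 1
  tar -tzf "$archive" > /dev/null || return 1
  echo "$archive"
}
```

Because the archive uses relative paths, it can be extracted into any target directory with `tar -xzf <archive> -C <target>`.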

**Retention**: 4 weeks (Kafka data is typically transient with configurable retention)

### MongoDB backups

**Frequency**: Daily backups

**Method**: Use mongodump for logical backups or filesystem snapshots for physical backups

```bash theme={null}
# Logical backup with mongodump
# Ensure mongodump uses a read-only backup user with minimal privileges
# Compute the date once so both commands use the same directory name
BACKUP_DATE=$(date +%Y%m%d)
docker exec mongodb mongodump \
  --out "/backup/mongodb-${BACKUP_DATE}" \
  --gzip

# Copy backup to secure storage
docker cp "mongodb:/backup/mongodb-${BACKUP_DATE}" /secure/backup/location/
```

**Retention**: 30 days for daily backups, 1 year for monthly backups
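
As an alternative to writing inside the container and copying out, `mongodump --archive` with no filename streams the dump to stdout, so it can be written directly to a host file. This is a sketch under assumptions: the function name is hypothetical, and authentication options are omitted and must be added to match your deployment.

```bash
#!/usr/bin/env bash
# Sketch only: add the connection and auth options your deployment requires.

# Stream a gzip-compressed mongodump archive from container $1 to host file $2.
mongo_backup() {
  local container=$1 out=$2
  # Write to a temporary name so a failed dump never looks like a good backup.
  if ! docker exec "$container" mongodump --archive --gzip > "$out.partial"; then
    rm -f "$out.partial"
    return 1
  fi
  # Reject empty output.
  if [ ! -s "$out.partial" ]; then
    rm -f "$out.partial"
    return 1
  fi
  mv "$out.partial" "$out"
}
```

For example, `mongo_backup mongodb "/secure/backup/location/mongodb-$(date +%Y%m%d).archive.gz"`. The matching restore reads the archive from stdin: `docker exec -i mongodb mongorestore --archive --gzip < <file>`.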

### Redis backups

**Frequency**: RDB snapshots every 6 hours

**Method**: Redis automatically creates RDB snapshots based on configuration

```bash theme={null}
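# Trigger manual snapshot (BGSAVE returns immediately and saves in the background)
docker exec redis redis-cli BGSAVE

# Wait until the background save has finished before copying the file
while docker exec redis redis-cli INFO persistence | grep -q 'rdb_bgsave_in_progress:1'; do
  sleep 1
done

# Copy RDB file to backup location
docker cp redis:/data/dump.rdb /backup/redis-$(date +%Y%m%d_%H%M).rdb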
```

**Note**: Redis data is non-authoritative and can be safely rebuilt from TiDB and Kafka in most scenarios. Backups are primarily for faster recovery rather than data preservation.

**Retention**: 7 days (cache data has short-term value)
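
To confirm that snapshots really are being taken on the 6-hour schedule, the age of the last successful save can be checked with `LASTSAVE`, which returns a Unix timestamp. A minimal sketch follows. The function names are hypothetical and the container name `redis` is taken from the commands above.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative.

# Succeed when last save $1 (epoch seconds) is at most $3 seconds before now $2.
# The default maximum age is 21600 seconds (6 hours).
snapshot_is_fresh() {
  local last_save=$1 now=$2 max_age=${3:-21600}
  [ $(( now - last_save )) -le "$max_age" ]
}

# Query LASTSAVE in container $1 (default "redis") and check its age.
check_redis_snapshot() {
  local container=${1:-redis} last_save
  # Keep digits only, so "(integer) 1700000000" and "1700000000" both work.
  last_save=$(docker exec "$container" redis-cli LASTSAVE | tr -dc '0-9')
  [ -n "$last_save" ] || return 2
  snapshot_is_fresh "$last_save" "$(date +%s)"
}
```

`check_redis_snapshot` returns a non-zero status when the snapshot is stale, which makes it suitable as a monitoring probe.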

### Backup validation

**Monthly restore tests**: Perform full restore to staging environment to verify backup integrity

```bash theme={null}
# Example: Restore TiDB backup
tiup br restore full \
  --pd <staging-pd-address>:2379 \
  --storage "local:///backup/20240119" \
  --log-file restore.log

# Verify data integrity
# Run application smoke tests against restored data
```

**Validation checklist**:

* Backup files are complete and not corrupted
* Restore process completes without errors
* Data integrity checks pass (row counts, checksums)
* Application can connect and query restored data
* Restore time meets RTO requirements
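
The first checklist item (complete, uncorrupted files) can be automated with a checksum manifest written at backup time and verified before every restore. This is a minimal sketch using GNU `sha256sum`. The function names and the `SHA256SUMS` file name are assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: requires GNU coreutils (sha256sum) and findutils.

# Write SHA256SUMS in directory $1 covering every file below it.
write_manifest() {
  (
    cd "$1" || exit 1
    find . -type f ! -name SHA256SUMS -print0 \
      | sort -z \
      | xargs -0 -r sha256sum > SHA256SUMS
  )
}

# Verify directory $1 against its SHA256SUMS; non-zero exit on any mismatch,
# missing file, or empty manifest.
verify_manifest() {
  (
    cd "$1" || exit 1
    sha256sum --quiet -c SHA256SUMS
  )
}
```

Run `write_manifest` right after the backup completes and `verify_manifest` on the copy you are about to restore from, including copies in other locations.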

## Disaster recovery

Establish comprehensive disaster recovery procedures to ensure business continuity in the event of catastrophic failures.

### Recovery objectives

Define clear recovery targets based on business requirements:

* **RTO (Recovery Time Objective)**: Maximum acceptable downtime (typically 1-4 hours for production systems)
* **RPO (Recovery Point Objective)**: Maximum acceptable data loss (typically 1-24 hours depending on backup frequency)

**Note**: Actual RTO/RPO depends on backup size, network bandwidth, and restore automation maturity. Test and validate your specific recovery times.
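
A rough lower bound on restore time follows from backup size and sustained throughput. The sketch below does that arithmetic. The function names are hypothetical, and the estimate covers data transfer only, not provisioning, index rebuilds, or validation.

```bash
#!/usr/bin/env bash
# Sketch only: transfer time = size / throughput, rounded up.

# Print the estimated minutes needed to move $1 GB at $2 MB/s.
estimate_restore_minutes() {
  local size_gb=$1 mb_per_s=$2 seconds
  seconds=$(( (size_gb * 1024 + mb_per_s - 1) / mb_per_s ))
  echo $(( (seconds + 59) / 60 ))
}

# Succeed when the estimate for $1 GB at $2 MB/s fits within RTO $3 (minutes).
fits_rto() {
  [ "$(estimate_restore_minutes "$1" "$2")" -le "$3" ]
}
```

For example, `estimate_restore_minutes 500 100` prints `86`: a 500 GB backup at 100 MB/s needs about 86 minutes of transfer alone, which fits a 4-hour RTO but not a 1-hour one. The worst-case RPO is simply the interval between backups, so hourly incrementals bound data loss at roughly one hour.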

### Disaster recovery procedures

**Full cluster restoration from backups:**

1. **Provision new infrastructure** matching production specifications

2. **Restore data stores** in dependency order:
   * TiDB/TiKV (primary data store): restore first, because it is the authoritative source of user, conversation, and message metadata
   * MongoDB (metadata and configuration)
   * Kafka (if message history is critical)
   * Redis (optional, can rebuild from primary data)

3. **Restore application services** and verify connectivity

4. **Validate data integrity** through application smoke tests

5. **Update DNS** to point to new cluster

6. **Monitor closely** for 24-48 hours post-recovery
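
The dependency order in step 2 can be encoded in a small runner that stops at the first failure, so a later store is never restored on top of a broken earlier one. This is a sketch only: the `restore_*` functions are hypothetical placeholders that you would replace with your real restore commands.

```bash
#!/usr/bin/env bash
# Sketch only: restore_* bodies are placeholders, not real restore logic.

# Run the named step functions in order; stop at the first failure.
run_in_order() {
  local step
  for step in "$@"; do
    echo "==> $step"
    if ! "$step"; then
      echo "FAILED: $step (later steps skipped)" >&2
      return 1
    fi
  done
}

# Hypothetical placeholders: replace each body with your restore command.
restore_tidb()    { echo "restore TiDB/TiKV from BR backup"; }
restore_mongodb() { echo "restore MongoDB"; }
restore_kafka()   { echo "restore Kafka (only if message history is critical)"; }
restore_redis()   { echo "restore Redis (optional)"; }
```

The order from the procedure above is then `run_in_order restore_tidb restore_mongodb restore_kafka restore_redis`.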

**Example TiDB restore:**

```bash theme={null}
# Restore TiDB cluster from backup
tiup br restore full \
  --pd <new-pd-address>:2379 \
  --storage "s3://backup-bucket/20240119" \
  --log-file restore.log

# Verify cluster health
tiup cluster display <cluster-name>

# Run data integrity checks
# Check row counts, run application queries
```
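
The row-count check can be made repeatable by exporting counts to a tab-separated file (`table<TAB>count`) before the backup and again after the restore, then comparing the two. How the files are produced is left to you, for example `SELECT COUNT(*)` per table through any MySQL-compatible client. The comparison function below is a hypothetical sketch.

```bash
#!/usr/bin/env bash
# Sketch only: both input files hold one "table<TAB>count" pair per line.

# Compare expected counts ($1) with actual counts ($2).
# Prints each difference and exits non-zero if any are found.
compare_row_counts() {
  local expected=$1 actual=$2
  awk -F'\t' -v expected_file="$expected" '
    BEGIN {
      # Load the expected counts into an array keyed by table name.
      while ((getline line < expected_file) > 0) {
        split(line, f, "\t")
        want[f[1]] = f[2]
      }
    }
    {
      seen[$1] = 1
      if (!($1 in want)) {
        print "unexpected table: " $1; bad = 1
      } else if (want[$1] != $2) {
        print "mismatch: " $1 " expected=" want[$1] " actual=" $2; bad = 1
      }
    }
    END {
      for (t in want) if (!(t in seen)) { print "missing table: " t; bad = 1 }
      exit bad + 0
    }
  ' "$actual"
}
```

A clean restore produces no output and exit status 0, so the function can gate the DNS cutover step.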

### Geographic redundancy

**Backup storage locations:**

* Maintain a minimum of three backup copies in separate locations
* Primary backup: Same region as production (fast recovery)
* Secondary backup: Different region (regional disaster protection)
* Tertiary backup: Different cloud provider or on-premises (provider-level disaster protection)

**Important**: Ensure backups are completed and verified before object storage lifecycle rules expire older snapshots.

**Replication strategies:**

* Use cloud storage replication (S3 cross-region replication, GCS multi-region)
* Implement backup verification at each location
* Test restore from each backup location quarterly

### Disaster recovery testing

**Quarterly DR simulations:**

Run staged disaster recovery exercises to validate procedures and train teams:

1. **Warm-standby restoration**: Restore to standby environment, validate without cutting over
2. **Full cluster rehydration**: Complete restore from backups in isolated environment
3. **Failover testing**: Practice DNS cutover and traffic migration procedures
4. **Rollback testing**: Validate ability to roll back to previous state if needed

**DR drill checklist:**

* [ ] Backup files are accessible from all locations
* [ ] Restore procedures are documented and up-to-date
* [ ] Team members know their roles and responsibilities
* [ ] Communication channels are established
* [ ] Restore time meets RTO requirements
* [ ] Data integrity is validated post-restore
* [ ] Application functionality is verified
* [ ] Lessons learned are documented and procedures updated
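
During a drill, the "restore time meets RTO" item is easier to judge when the restore is timed automatically. The following sketch wraps any command and compares its wall-clock duration with an RTO given in seconds. The function name is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: uses the bash SECONDS counter (one-second resolution).

# Run a command and compare its duration with RTO $1 (seconds).
# Returns 0 within RTO, 1 when RTO is exceeded, 2 when the command fails.
run_within_rto() {
  local rto=$1 start elapsed
  shift
  start=$SECONDS
  if ! "$@"; then
    echo "command failed: $*" >&2
    return 2
  fi
  elapsed=$(( SECONDS - start ))
  echo "elapsed: ${elapsed}s (RTO: ${rto}s)"
  [ "$elapsed" -le "$rto" ]
}
```

For a 4-hour RTO, a drill could call `run_within_rto 14400 <your-restore-script>` and record the printed duration in the drill report.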

### Backup security

**Encryption:**

* Encrypt backups at rest using AES-256 or equivalent
* Encrypt backups in transit using TLS
* Store encryption keys separately from backup data (use key management service)
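
One way to encrypt an archive before it leaves the host is `openssl enc` with AES-256 and a passphrase read from a key file that is stored apart from the backups. This is a sketch under assumptions: the function names are hypothetical, `-pbkdf2` requires OpenSSL 1.1.1 or later, and `openssl enc` provides confidentiality only, so pair it with a checksum or signature for integrity. A key management service or storage-side encryption may suit your environment better.

```bash
#!/usr/bin/env bash
# Sketch only: the key file's first line is used as the passphrase.

# Encrypt file $1 to $2 using the passphrase in key file $3.
encrypt_backup() {
  local in=$1 out=$2 keyfile=$3
  openssl enc -aes-256-cbc -pbkdf2 -salt \
    -in "$in" -out "$out" -pass "file:$keyfile"
}

# Decrypt file $1 to $2 using the passphrase in key file $3.
decrypt_backup() {
  local in=$1 out=$2 keyfile=$3
  openssl enc -d -aes-256-cbc -pbkdf2 \
    -in "$in" -out "$out" -pass "file:$keyfile"
}
```

A random passphrase can be generated with `openssl rand -hex 32 > <keyfile>`. Keep that file out of the backup location and restrict its permissions.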

**Access control:**

* Restrict backup access to authorized personnel only
* Use separate credentials for backup operations
* Audit all backup access and modifications
* Implement multi-factor authentication for backup system access