Monitoring stack
The following open-source tools form the monitoring and observability stack for CometChat on-premise deployments:
- Prometheus: Collects and stores metrics from all services
- Grafana: Visualizes metrics with dashboards and alerts
- Loki: Stores and queries logs from all containers
- Promtail: Tails logs from Docker containers and pushes them to Loki
- Node Exporter: Collects host-level metrics (CPU, memory, disk, network)
- cAdvisor: Collects container-level resource usage metrics
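A quick way to confirm that every component in the list above is up is to probe its HTTP endpoint. The following is a minimal sketch, not a supported tool: it assumes each component listens on its commonly used default port on a single host, and the names `PROBES`, `probe_urls`, `http_ok`, and `check_stack` are illustrative. Adjust hosts, ports, and paths to match your deployment.

```python
from typing import Callable, Dict, Tuple
from urllib.request import urlopen

# Assumed default ports and readiness/health paths for each component.
# Your deployment may expose these on different ports or behind a proxy.
PROBES: Dict[str, Tuple[int, str]] = {
    "prometheus": (9090, "/-/healthy"),
    "grafana": (3000, "/api/health"),
    "loki": (3100, "/ready"),
    "promtail": (9080, "/ready"),
    "node-exporter": (9100, "/metrics"),
    "cadvisor": (8080, "/healthz"),
}


def probe_urls(host: str = "localhost") -> Dict[str, str]:
    """Build one probe URL per component for the given host."""
    return {
        name: f"http://{host}:{port}{path}"
        for name, (port, path) in PROBES.items()
    }


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False


def check_stack(
    host: str = "localhost",
    fetch: Callable[[str], bool] = http_ok,
) -> Dict[str, bool]:
    """Probe every component and report which ones responded."""
    return {name: fetch(url) for name, url in probe_urls(host).items()}
```

Calling `check_stack("monitor.example.internal")` (a placeholder hostname) returns a dictionary such as `{"prometheus": True, "loki": False, ...}`, which can be printed or fed into a deployment smoke test.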
Architecture
Key metrics to monitor
Infrastructure
- CPU usage per node
- Memory usage per node
- Disk space and I/O
- Network traffic
- Container resource usage
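The infrastructure metrics above map onto PromQL expressions over the series that Node Exporter and cAdvisor export. The sketch below collects a few such expressions and builds a URL for the Prometheus instant-query HTTP API. The 5-minute rate window, the filesystem and device filters, and the names `INFRA_QUERIES` and `instant_query_url` are assumptions chosen for illustration; tune them for your environment.

```python
from urllib.parse import urlencode

# PromQL expressions over standard Node Exporter and cAdvisor series.
# Windows and label filters are illustrative assumptions.
INFRA_QUERIES = {
    "cpu_usage_percent": (
        '100 * (1 - avg by (instance) '
        '(rate(node_cpu_seconds_total{mode="idle"}[5m])))'
    ),
    "memory_usage_percent": (
        "100 * (1 - node_memory_MemAvailable_bytes "
        "/ node_memory_MemTotal_bytes)"
    ),
    "disk_free_percent": (
        '100 * node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} '
        '/ node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}'
    ),
    "network_rx_bytes_per_second": (
        'rate(node_network_receive_bytes_total{device!="lo"}[5m])'
    ),
    "container_cpu_cores": (
        'sum by (name) '
        '(rate(container_cpu_usage_seconds_total{name!=""}[5m]))'
    ),
}


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus /api/v1/query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"
```

The same expressions can be pasted directly into a Grafana panel that uses Prometheus as its data source.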
Application services
- WebSocket active connections
- Chat API request rate and latency
- API error rates (4xx, 5xx)
- Service uptime
Data stores
- Kafka: Consumer lag, message throughput
- Redis: Memory usage, cache hit ratio, connected clients
- MongoDB: Operation latency, connections, replication lag
- TiDB: Query duration, region health, storage capacity
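Two of the data store metrics above are derived values rather than raw counters. As a rough sketch, the Redis cache hit ratio can be computed from the `keyspace_hits` and `keyspace_misses` counters reported by `INFO stats`, and Kafka consumer lag is the gap between each partition's latest offset and the consumer group's committed offset. The function names and the treatment of partitions without a committed offset are simplifying assumptions.

```python
from typing import Dict


def cache_hit_ratio(keyspace_hits: int, keyspace_misses: int) -> float:
    """Fraction of key lookups served from the cache (0.0 when idle)."""
    total = keyspace_hits + keyspace_misses
    return keyspace_hits / total if total else 0.0


def consumer_lag(
    end_offsets: Dict[int, int],
    committed_offsets: Dict[int, int],
) -> int:
    """Total lag across partitions: latest offset minus committed offset.

    Simplification: a partition with no committed offset is counted
    from offset 0, which overstates lag if old messages were deleted.
    """
    return sum(
        max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    )
```

For example, `cache_hit_ratio(900, 100)` gives `0.9`, and `consumer_lag({0: 1500, 1: 2000}, {0: 1000, 1: 2000})` gives `500`.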
Load balancer
- NGINX request rate
- Response status codes
- Active connections
Alerting
Alerts should focus on user impact, capacity risks, and data integrity rather than raw metric noise. Set up alerts for these critical conditions:
- CPU usage > 80% for 5 minutes
- Memory usage > 85% for 5 minutes
- Disk space < 15%
- Service down for 2 minutes
- Database query latency > 100ms
- Kafka consumer lag > 10,000 messages
- Redis memory > 90%
- WebSocket connection errors > 10/second
- API error rate > 5%
- Container restarts
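Several of the conditions above only fire after a threshold has been exceeded continuously for some time, for example CPU usage > 80% for 5 minutes. The sketch below illustrates that "sustained breach" logic in isolation. It is a simplified model of the idea, not how Prometheus or Grafana implement alert evaluation, and the class name `SustainedAlert` is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedAlert:
    """Fires once a value has stayed above a threshold for hold_seconds."""

    threshold: float
    hold_seconds: float
    breach_started: Optional[float] = None  # timestamp of first breach

    def observe(self, value: float, now: float) -> bool:
        """Record a sample taken at time `now` (seconds); return True if firing."""
        if value <= self.threshold:
            # Condition cleared: reset so a later breach starts a new window.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.hold_seconds
```

With `cpu = SustainedAlert(threshold=80.0, hold_seconds=300.0)`, a short spike that drops back below 80% resets the timer, so only an uninterrupted 5-minute breach causes `observe` to return `True`. This is what keeps transient spikes from paging anyone.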
Grafana dashboards
Create dashboards to visualize:
- Overview: System health, active users, request rates, error rates
- Infrastructure: CPU, memory, disk, network per node
- WebSocket: Active connections, message throughput, errors
- API: Request rate, latency, error rates by endpoint
- Databases: Query performance, connections, replication status
- Kafka: Consumer lag, throughput, partition health
- Logs & Error Analysis: Error aggregation, log volume, search, and correlation with metrics
Logs & Error Analysis Dashboard
This dashboard provides centralized visibility into application errors, log patterns, and system anomalies for rapid troubleshooting and incident investigation.
Key Visualizations:
- Error Volume by Service: Time-series graph showing error log count per service, helping identify which components are experiencing issues
- Top Error Messages: Table displaying the most frequent error messages with occurrence counts, enabling quick identification of recurring problems
- Log Volume Trends: Track total log volume over time to detect unusual spikes that may indicate issues or attacks
- Error Rate by Severity: Breakdown of errors by severity level (CRITICAL, ERROR, WARNING) for prioritization
- Service Health Correlation: Side-by-side view of error logs and service metrics (CPU, memory, latency) to correlate errors with resource constraints
- Search & Filter: Interactive LogQL query panel for ad-hoc log searches and pattern matching
- Recent Critical Errors: Live feed of the latest critical errors across all services for immediate awareness
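Several of these panels can be driven by LogQL queries against Loki. The sketch below builds three such queries as strings. It assumes Promtail attaches a `container` label to each log stream and that error lines contain the word "error" or "critical" in some letter case; both are assumptions about your Promtail configuration and log format, and the function names are illustrative.

```python
def error_volume_by_service(label: str = "container", window: str = "5m") -> str:
    """Error line count per service over a sliding window."""
    return (
        f'sum by ({label}) '
        f'(count_over_time({{{label}=~".+"}} |~ "(?i)error" [{window}]))'
    )


def log_volume(label: str = "container", window: str = "5m") -> str:
    """Total log line count across all services over a sliding window."""
    return f'sum(count_over_time({{{label}=~".+"}}[{window}]))'


def recent_critical_errors(label: str = "container") -> str:
    """Log stream selector for a live feed of critical lines."""
    return f'{{{label}=~".+"}} |~ "(?i)critical"'
```

The returned strings can be pasted into a Grafana panel with Loki as the data source, or used in the Explore view for ad-hoc searches. If your services emit structured JSON logs, filtering on a parsed severity field is usually more reliable than matching text.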