System Health Overview

Monitor your cluster's overall health status

Overview

The System Health page provides a comprehensive view of your Scalelite cluster's operational status. It monitors services, tracks errors, and calculates an overall health score to help you quickly assess system stability.

Health Score

The health score is a percentage (0-100%) that reflects your cluster's overall operational health.

Example: Production Cluster (last updated: 2 minutes ago)

  • Health: 90%
  • Status: All Systems Operational
  • Servers Online: 5/5
  • Errors (1h): 0
  • Uptime (24h): 99.8%
  • Avg Response: 45 ms

Health Score Calculation

The health score is calculated based on multiple factors:

  • Server Availability (40%): Percentage of servers online and responsive
  • Error Rate (25%): Number of errors in the last hour
  • Response Time (20%): Average API response latency
  • Resource Utilization (15%): CPU, memory, and disk usage levels
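As an illustration, the weighted combination can be sketched in Python. The weights come from the list above. How each factor is normalized to a 0-100 sub-score is not documented here, so the `FactorScores` fields, the `health_score` function, and the example sub-scores are hypothetical.

```python
from dataclasses import dataclass

# Weights taken from the factor list above; they sum to 1.0.
WEIGHTS = {
    "server_availability": 0.40,
    "error_rate": 0.25,
    "response_time": 0.20,
    "resource_utilization": 0.15,
}


@dataclass
class FactorScores:
    """Hypothetical per-factor sub-scores, each already normalized to 0-100."""
    server_availability: float
    error_rate: float
    response_time: float
    resource_utilization: float


def health_score(scores: FactorScores) -> float:
    """Return the weighted average of the sub-scores, clamped to 0-100."""
    total = sum(getattr(scores, name) * weight for name, weight in WEIGHTS.items())
    return max(0.0, min(100.0, total))


example = FactorScores(
    server_availability=100.0,   # e.g. 5 of 5 servers online
    error_rate=100.0,            # assumed: zero errors maps to a full sub-score
    response_time=80.0,          # assumed sub-score
    resource_utilization=60.0,   # assumed sub-score
)
print(f"{health_score(example):.1f}%")
```

Because server availability carries the largest weight, losing servers lowers the score more than an equal drop in any other sub-score.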

Health Score Thresholds

90-100%: Healthy - All systems operating normally
70-89%: Warning - Some issues detected, monitoring recommended
Below 70%: Critical - Immediate attention required
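If you read the score programmatically, the thresholds above map to a simple lookup. This is a minimal sketch: `health_status` is a hypothetical helper, and treating fractional scores between 89 and 90 as Warning is an assumption, since the bands above are given only in whole percentages.

```python
def health_status(score: float) -> str:
    """Map a 0-100 health score to the status bands listed above."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 90:
        return "Healthy"
    if score >= 70:
        return "Warning"
    return "Critical"


for value in (95, 82, 64):
    print(value, health_status(value))
```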

Service Status

Monitor individual services that make up your Scalelite infrastructure:

  • Scalelite API: Response: 32 ms
  • Redis Cache: Memory: 45%
  • PostgreSQL: Connections: 12/100
  • Nginx Proxy: Requests: 1.2k/min
  • Recording Import: Queue: 5 pending
  • Poller Service: Last poll: 30s ago

Service Status Indicators

● Online: Service is running normally
● Degraded: Service is running with reduced performance or queued work
● Offline: Service is not responding or has failed

Error Tracking

The system health page also tracks recent errors:

  • Errors Last Hour: Count of errors in the past 60 minutes
  • Error Trend: Whether errors are increasing or decreasing
  • Most Common Error: The most frequently occurring error type
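The three metrics above can be derived from a list of timestamped error events. The sketch below is illustrative only: `summarize_errors`, the event format, and the error type names are hypothetical, and defining the trend as a comparison against the preceding hour is an assumption, since the page does not state how the trend is computed.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def summarize_errors(events, now):
    """Summarize (timestamp, error_type) pairs for the hour ending at `now`."""
    hour_ago = now - timedelta(hours=1)
    two_hours_ago = now - timedelta(hours=2)

    # Error types seen in the past 60 minutes.
    last_hour = [kind for ts, kind in events if hour_ago < ts <= now]
    # Count for the hour before that, used only as the trend baseline (assumption).
    previous_count = sum(1 for ts, _ in events if two_hours_ago < ts <= hour_ago)

    if len(last_hour) > previous_count:
        trend = "increasing"
    elif len(last_hour) < previous_count:
        trend = "decreasing"
    else:
        trend = "stable"

    most_common = Counter(last_hour).most_common(1)
    return {
        "errors_last_hour": len(last_hour),
        "error_trend": trend,
        "most_common_error": most_common[0][0] if most_common else None,
    }


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [
    (now - timedelta(minutes=10), "timeout"),
    (now - timedelta(minutes=20), "timeout"),
    (now - timedelta(minutes=30), "auth_failure"),
    (now - timedelta(minutes=90), "timeout"),
]
print(summarize_errors(events, now))
```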

Best Practices

  • Check the System Health page daily or set up alerts for score drops
  • Investigate any service showing "Degraded" status
  • Monitor the error count trend; a sudden spike often indicates a problem
  • Use the 24-hour uptime metric to track historical reliability

Related: Setting Up Custom Alert Rules
