How do you measure the availability of cloud infrastructure services?
Measuring the availability of cloud infrastructure services comes down to quantifying a service's ability to operate continuously and stably over a given period, usually expressed as a percentage of uptime. The most common approach is to judge it against the number of "nines" promised in the service level agreement (SLA); for example, 99.99% availability allows no more than about 52.6 minutes of unavailability per year.
Key indicators and methods for measuring the availability of cloud infrastructure services:
Uptime Percentage
This is the most direct measure, calculated as:

Availability = (Uptime / Total time) × 100%
Cloud service providers typically commit to different availability levels, which correspond to downtime budgets as follows:
Availability level  | Allowed annual downtime | Typical application scenario
99% (two 9s)        | 3.65 days               | Non-critical test environments
99.9% (three 9s)    | 8.76 hours              | Basic production systems
99.99% (four 9s)    | 52.6 minutes            | Core business platforms
99.999% (five 9s)   | 5.26 minutes            | Systems requiring extremely high availability
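The correspondence between "nines" and downtime budget is simple arithmetic. A minimal sketch converting an availability percentage into the maximum allowed downtime per year (the numbers match the table above):

```python
# Convert an availability percentage into a maximum annual downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes (ignoring leap years)

def allowed_annual_downtime_minutes(availability_pct: float) -> float:
    """Maximum unavailable minutes per year for a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {allowed_annual_downtime_minutes(pct):.1f} min/year")
```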
Service Level Agreement (SLA)
An SLA is a formal commitment made by a cloud provider to its customers that clearly sets out availability targets and compensation mechanisms for non-compliance. It is the authoritative basis for measuring availability. For example, mainstream cloud platforms generally promise "four nines" or better for critical services.
Mean Time Between Failures (MTBF) and Mean Time to Repair (MTTR)
MTBF (Mean Time Between Failures): Reflects system stability; a longer MTBF indicates greater reliability.
MTTR (Mean Time to Repair): Measures how quickly faults are recovered, which directly affects the availability users actually experience.
Combining the two gives a more comprehensive picture of actual operational performance.
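These two metrics combine into availability through the standard relationship Availability = MTBF / (MTBF + MTTR). A minimal sketch using hypothetical incident figures:

```python
# Derive availability from MTBF and MTTR:
#   Availability = MTBF / (MTBF + MTTR)
# The incident figures below are illustrative assumptions.

def availability_from_mtbf_mttr(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability implied by failure and repair rates."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Example: a failure every 720 hours (~30 days) on average,
# each taking 0.5 hours to repair.
a = availability_from_mtbf_mttr(720, 0.5)
print(f"Availability: {a:.4%}")  # just under "three nines"
```

Note how strongly MTTR drives the result: halving repair time improves availability as much as doubling the time between failures.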
Request success rate and failure rate
Measure availability from the client's perspective by monitoring the success rate of user requests (e.g., the percentage of HTTP 2xx responses) and the failure rate (e.g., 5xx errors). This method is closer to the actual user experience.
Multi-dimensional monitoring and active probing
Passive monitoring: Collecting and analyzing real user access logs.
Active monitoring: Periodically sending synthetic requests (such as ping or HTTP probes) to detect potential outage risks in advance.
Combining the two enables dynamic, near-real-time evaluation of availability.
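A minimal sketch of an active (synthetic) HTTP probe using only the Python standard library; the URL, timeout, and probing schedule are illustrative assumptions:

```python
# A single synthetic HTTP probe: success means a 2xx response in time.
import urllib.error
import urllib.request

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        # Connection failures, HTTP errors, and timeouts all count as unavailable.
        return False

# In practice, run probes on a schedule (e.g., every 60 s) from several
# locations and record the results; the fraction of successful probes
# over a window approximates availability as users would experience it.
```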
Availability report based on historical data
Cloud providers regularly publish service health reports showing their actual availability over a period of time; these are often a more realistic reference than SLA commitments alone.