What is 99% uptime anyway?
Posted by Ian Holsman
In “Don’t scale: 99.999% uptime is for Wal-Mart”, the author mentioned that 37signals is quite happy with 98% uptime, and that the cost of increasing uptime isn’t worth it.
Here is a brief summary of what an extra ‘9’ gives you in terms of uptime. (As a rule of thumb, each extra nine adds an extra zero to the end of the price it will cost you to get there.)
| Uptime | Time lost in a year |
|---|---|
| 98% | 7.3 days |
| 99.0% | 3.7 days |
| 99.9% | 8.8 hours |
| 99.99% | 53 minutes |
| 99.999% | 5.3 minutes |
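The figures in the table are simple arithmetic: the unavailable fraction multiplied by the length of a year. Here is a minimal sketch, assuming a 365-day year; the name `downtime_per_year` is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def downtime_per_year(uptime_percent: float) -> float:
    """Return the hours of downtime per year that an uptime percentage allows."""
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


# Print the allowance for each row of the table, in hours and in minutes.
for pct in (98.0, 99.0, 99.9, 99.99, 99.999):
    hours = downtime_per_year(pct)
    print(f"{pct}% uptime -> {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Dividing the hours by 24 gives the figures in days for the first two rows.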
Personally, I think uptime is more a measure of reliability and redundancy than scalability, and I would be sceptical when people talk about uptime.
Why? Well, what is uptime? In most cases it means that a service is up and handling requests.
What it doesn’t measure (and hence doesn’t tell you):
- How responsive that service is. People will stop using your service if it is too slow, and uptime does not measure this.
- *When* it was down. Having something go down at 3AM is not the same as it being down at 3PM. While the world is global, most people only care about the USA. Uptime does not know when your core business hours are.
- When something is partially down. Do you define yourself as being ‘up’ when only half your site is functioning?
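To make the *when* point concrete, here is a rough sketch that counts how many minutes of an outage land inside core hours. The 9AM to 9PM window, the dates, and the name `core_minutes_lost` are all assumptions for illustration. Both outages cost the same hour of raw uptime, but only one of them hurts.

```python
from datetime import datetime, timedelta

OPEN_HOUR = 9    # assumed start of core hours (9AM)
CLOSE_HOUR = 21  # assumed end of core hours (9PM)


def core_minutes_lost(start: datetime, end: datetime) -> int:
    """Count how many minutes of an outage fall inside core hours."""
    lost = 0
    minute = start
    while minute < end:
        if OPEN_HOUR <= minute.hour < CLOSE_HOUR:
            lost += 1
        minute += timedelta(minutes=1)
    return lost


# Two hypothetical one-hour outages on the same (made-up) day.
night = core_minutes_lost(datetime(2024, 1, 15, 3, 0), datetime(2024, 1, 15, 4, 0))
afternoon = core_minutes_lost(datetime(2024, 1, 15, 15, 0), datetime(2024, 1, 15, 16, 0))
print(f"3AM outage: {night} core minutes lost; 3PM outage: {afternoon} core minutes lost")
```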
I think companies should define a metric more along the lines of: the time taken to complete XXXX operation between the hours of 9AM and 9PM. Then combine these timings into a weighted average, with the weights being how important each operation is to your core business.
Measure & monitor that. Not uptime.
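One way such a metric could be computed, as a sketch only: average each operation's timings taken between 9AM and 9PM, then combine the per-operation averages using business weights. The operation names, weights, and sample timings below are hypothetical, and so are the names `Sample` and `weighted_response_time`.

```python
from dataclasses import dataclass
from datetime import datetime

BUSINESS_START_HOUR = 9   # 9AM
BUSINESS_END_HOUR = 21    # 9PM


@dataclass
class Sample:
    operation: str      # hypothetical operation name, e.g. "checkout"
    taken_at: datetime  # when the measurement was made (local time)
    seconds: float      # how long the operation took


def weighted_response_time(samples, weights):
    """Weighted average of per-operation mean timings taken during business hours."""
    totals, counts = {}, {}
    for s in samples:
        if not (BUSINESS_START_HOUR <= s.taken_at.hour < BUSINESS_END_HOUR):
            continue  # ignore measurements outside core hours
        if s.operation not in weights:
            continue  # ignore operations the business has not weighted
        totals[s.operation] = totals.get(s.operation, 0.0) + s.seconds
        counts[s.operation] = counts.get(s.operation, 0) + 1
    weight_sum = sum(weights[op] for op in totals)
    if weight_sum == 0:
        raise ValueError("no business-hours samples for any weighted operation")
    return sum(weights[op] * totals[op] / counts[op] for op in totals) / weight_sum


# Made-up data: checkout matters three times as much as search.
weights = {"checkout": 3.0, "search": 1.0}
samples = [
    Sample("checkout", datetime(2024, 1, 15, 10, 0), 2.0),
    Sample("checkout", datetime(2024, 1, 15, 15, 0), 4.0),
    Sample("search", datetime(2024, 1, 15, 12, 0), 1.0),
    Sample("search", datetime(2024, 1, 15, 3, 0), 100.0),  # 3AM, so it is ignored
]
print(f"weighted response time: {weighted_response_time(samples, weights):.2f}s")
```

The weights are a business decision, not a technical one, which is rather the point: someone has to say which operations matter.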
Have a look at Grab perf for an example of this. Stephen measures the response time as well as availability.
Comparing what uptime is, versus what you think a better definition might be, is an apples vs. oranges comparison at best. Companies that know what they’re doing have been measuring transaction throughput of key systems since the dawn of computing … this is nothing new.
Think of uptime as a foundation layer: without uptime, there is zero possibility of having timely and speedy transactions (whatever they may be), right? So organizations (smart ones, anyway) work on uptime first, and then address performance second. There’s a boatload of tools available, from open-source to enterprise-class, for measuring transaction performance metrics, and some can even complete complex, multi-step transactions (think: end-to-end session-based shopping cart transactions on an e-commerce website) and measure the total process time. The tech is there, and has been for years, but it requires looking for savvy providers who know how to provide that level of measurement. Having done it myself for years, I can tell you that it’s not cheap to measure at this level.
Hence most providers you find simply talk about “uptime”. It’s not that there’s no other metric to work with; it’s that going above that level becomes a very costly and complex undertaking.
uptime is useless.
waste of time useless
seriously distracting useless
easily misunderstood and misinterpreted useless
non-tech people look at the number and use it to judge quality or availability, and that is the main problem. if you give your business people a useless metric they will try to manage it. you as a tech-provider are giving them a false sense of security. you need to educate them better.
and I disagree that giving them a response time is expensive.
add
`time curl -s -o /dev/null http://holsman.net`
to a cron job.
plot the results via RRDtool.
done. you get a pretty picture in about 20 minutes.
and THAT will give you a much more realistic idea of your availability than an up/down metric.
uptime is only a realistic measure if you are an ISP and you are responsible for the network/router layer.
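for anyone who would rather not parse the output of `time`, here is a rough Python take on the same probe, as a sketch and not a drop-in replacement. the `probe` name and the 10 second timeout are assumptions. unlike the curl one-liner, an HTTP error status counts as a failure here, because `urllib.request.urlopen` raises on 4xx/5xx responses.

```python
import time
import urllib.request


def probe(url, fetch=urllib.request.urlopen, timeout=10.0):
    """Fetch url once and return (succeeded, elapsed_seconds)."""
    started = time.monotonic()
    try:
        with fetch(url, timeout=timeout) as response:
            response.read()  # download the whole body, like curl -o /dev/null
        ok = True
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses
        ok = False
    return ok, time.monotonic() - started
```

call `probe("http://holsman.net")` from a cron job, append the result with a timestamp to a file, and you have both the response time and the up/down flag to plot.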