Posted by Ian Holsman
Tue, 01 Jul 2008 17:59:00 GMT
so on the RRD mailing list there is a discussion on how to write an RRD server/accelerator to help speed up RRD, which is a great tool, but when you abuse it and try to capture hundreds of thousands of metrics it kinda uses a bit too much disk I/O (read: swamps the system).
So imagine my surprise when I noticed that Orbitz has recently open sourced their monitoring framework
- ERMA: the monitoring API
- Graphite: a graphing component on top of it
- Whisper: a fixed size db that stores the info
and imagine my surprise when I found out it was written in Django, my favorite framework.
and now I find out Theo Schlossnagle has just released reconnoiter (reconnoiter project home)
now.. to find a couple of hours in the day to actually get into them.
Tags django, metric, monitoring, reconnoiter, rrd, whisper | 1 comment
Posted by Ian Holsman
Sun, 09 Jul 2006 15:57:45 GMT
Perfmon is a tool to help you diagnose your performance and QA issues within your Django application.
I’ve decided to charge $20 for it.
People using it for debugging open source projects can get it for free.
Is this a sell out? I hope people don’t see it this way.
Posted in Development | Tags django, monitoring, performance | 4 comments | no trackbacks
Posted by Ian Holsman
Thu, 08 Dec 2005 13:17:00 GMT
In “Don’t scale: 99.999% uptime is for Wal-Mart” the author mentioned that 37signals is quite happy with 98% uptime, and that the cost of increasing uptime isn’t worth it.
Here is a brief summary of what an extra ‘9’ will give you as far as uptime goes. (As a rule of thumb, each extra nine adds an extra zero to the end of the price it will cost you to get there.)
| Uptime | Time lost in a year |
| 98% | 7.3 days |
| 99.0% | 3.7 days |
| 99.9% | 8.8 hours |
| 99.99% | 53 minutes |
| 99.999% | 5 minutes |
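The figures in the table are rounded. A minimal sketch of the arithmetic behind them, assuming a 365-day year and ignoring planned maintenance windows (the function names are made up for this example):

```python
# Downtime per year implied by an availability percentage.
# Assumption: a 365-day year; leap years and planned maintenance are ignored.
HOURS_PER_YEAR = 365 * 24


def downtime_hours(uptime_percent):
    """Return the hours per year a service may be down at the given uptime."""
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


def describe(hours):
    """Format a duration in days, hours or minutes, whichever reads best."""
    if hours >= 24:
        return f"{hours / 24:.1f} days"
    if hours >= 1:
        return f"{hours:.1f} hours"
    return f"{hours * 60:.1f} minutes"


if __name__ == "__main__":
    for pct in (98.0, 99.0, 99.9, 99.99, 99.999):
        print(f"{pct}% -> {describe(downtime_hours(pct))}")
```

Each extra nine divides the allowed downtime by ten, which is why the jump from 99.9% to 99.99% takes you from hours to minutes.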
Personally I think uptime is more a measure of reliability and redundancy than scalability, and I would be sceptical when people talk about uptime.
why? well.. what is uptime? in most cases it means that a service is up and handling requests.
what it doesn’t measure (and hence doesn’t tell you):
- How responsive that service is. people will stop using your service if it is too slow. uptime does not measure this.
- *when* it was down. having something go down at 3AM is not the same as it being down at 3PM. while the world is global, most people only care about the USA. uptime doesn’t know when your core business hours are.
- when something is partially down. Do you define yourselves as being ‘up’ when only half your site is functioning?
I think companies should define a metric more along the lines of:
the time taken to complete XXXX operation, between the hours 9AM and 9PM.
and then combine these timings into a weighted average. The weights being how important that operation is to your core business.
measure & monitor that. not uptime.
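A rough sketch of what such a metric could look like. The operation names, the weights and the sample layout are all invented for illustration; the only parts taken from the text above are the 9AM to 9PM window and the weighted average:

```python
from datetime import datetime

# Hypothetical operations and weights: pick ones that reflect your own business.
WEIGHTS = {"search": 0.5, "checkout": 0.4, "login": 0.1}
BUSINESS_HOURS = range(9, 21)  # 9AM up to, but not including, 9PM


def weighted_response_time(samples, weights=WEIGHTS, hours=BUSINESS_HOURS):
    """Weighted average of per-operation mean response times.

    samples is an iterable of (timestamp, operation, seconds) tuples.
    Samples outside business hours or for unknown operations are ignored.
    Returns None when nothing qualifies.
    """
    totals, counts = {}, {}
    for when, op, seconds in samples:
        if op in weights and when.hour in hours:
            totals[op] = totals.get(op, 0.0) + seconds
            counts[op] = counts.get(op, 0) + 1
    if not totals:
        return None
    # Normalise by the weights of the operations we actually saw.
    weight_sum = sum(weights[op] for op in totals)
    weighted = sum(weights[op] * totals[op] / counts[op] for op in totals)
    return weighted / weight_sum


if __name__ == "__main__":
    samples = [
        (datetime(2005, 12, 8, 10, 0), "search", 1.0),
        (datetime(2005, 12, 8, 15, 0), "checkout", 4.0),
        (datetime(2005, 12, 8, 3, 0), "login", 100.0),  # 3AM: not counted
    ]
    print(weighted_response_time(samples))
```

A partially broken site shows up here as one operation getting slow (or timing out) while the others stay flat, which a single up/down number would hide.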
Have a look at Grab perf for an example of this. Stephen measures the response time as well as availability.
Posted in Business Related | Tags monitoring, performance, startups | 2 comments | 1 trackback
Posted by Ian Holsman
Wed, 21 Sep 2005 18:01:00 GMT
the first thing I did was reduce the number of active threads on the webserver, and that reduced the process creation rate by half. which was a good start.
Now this is where statistics lie, or to put it more precisely, don’t tell you what you think they are telling you.
I first looked at 5 minute averages and assumed that it was constant throughout the 5 minutes..wrong.. as this is a monitoring machine, it has lots of agents pushing data to it at regular intervals.
the main culprit had this embedded into the code.
# Wait for the next INTERVAL
sleep ($INTERVAL-time() % $INTERVAL);
this had the effect of turning 815 machines into a flash crowd every 5 minutes (not even being delayed by the time it took to complete the previous post, which would have had the effect of dispersing the flash crowd over time). every 5 minutes the poor webserver would receive 815 posts… go into swap hell, core dump a bit, and recover in time for the next lot.
the solution to this one was to wait a random interval before actually sending the data, and then hit the above syncing sleep so we still get stats from the same time period, but they are sent at some point during the period instead of the moment we collect them.
If you plan on doing this, remember to record the stats (and their timestamp) before the random sleep, not afterwards, otherwise your counters will get all messed up.
you want
hits_per_sec = (hits_end - hits_start) / (time_end - time_start)
not
hits_per_sec = (hits_end - hits_start) / (time_end + randombit - time_start)
especially when the randombit gets large
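Putting the pieces together, here is a minimal sketch of one reporting cycle. It is written in Python rather than the agent's Perl, and `read_hits`, `send` and `max_jitter` are placeholders invented for this example, not the real agent's code. The jitter is assumed to stay comfortably below the interval so a cycle never spills over into the next period:

```python
import random
import time

INTERVAL = 300  # seconds: the 5 minute reporting period


def seconds_until_next_interval(now, interval=INTERVAL):
    """Same arithmetic as the agent's sleep: time left until the next boundary."""
    return interval - (now % interval)


def hits_per_second(hits_start, hits_end, time_start, time_end):
    """Rate over the sampling window only; the random delay is left out."""
    return (hits_end - hits_start) / (time_end - time_start)


def run_once(read_hits, send, prev_hits, prev_time,
             max_jitter=INTERVAL - 30, sleep=time.sleep, clock=time.time):
    """One cycle: sample, wait a random while, send, re-sync to the boundary."""
    # 1. Sample first, so the timestamp belongs to the counter reading.
    now = clock()
    hits = read_hits()
    rate = hits_per_second(prev_hits, hits, prev_time, now)
    # 2. Random delay so the agents do not all post at the same instant.
    sleep(random.uniform(0, max_jitter))
    send(now, rate)
    # 3. Sleep until the next interval boundary, as the original agent did.
    sleep(seconds_until_next_interval(clock()))
    return hits, now
```

The counter and timestamp returned by one cycle are fed into the next one as `prev_hits` and `prev_time`, so the rate always covers boundary to boundary regardless of how long the random delay was.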
oh.. in other news.. I have switched my RSS feed to use feedburner. current subscriptions are unaffected.. only new ones.
Tags monitoring, performance | no comments | no trackbacks
Posted by Ian Holsman
Tue, 20 Sep 2005 19:45:00 GMT
So.. people at work have been complaining about one of my monitoring servers continually freaking out, and being slow..
so I thought.. why not open up a 2nd port with just mod-perl replacing some dodgy CGIs, freeing up some connections on the original port making it snappier at the same time.. win-win.
This is what I wake up to ;(.

The machine has gone berserk. for some reason a ‘stable’ perl program running via persistent perl had none of these issues; change it to mod-perl .. and wow, less than 24 hours before I need to bounce.
now.. if I didn’t have procallator running on the box I would still be pulling my hair out trying to figure out what was going on.. at least now I have a clue what the problem looks like before the machine hangs.
I’m still not 100% confident that it is directly related to my change.. now that the ‘display’ server is faster, it got hammered harder by other things ;( I’m sure this will provide weeks of hair pulling.
oh.. all the perl code does is write a file to disk, and it’s been running for years..so it is an interaction somewhere.
Posted in monitoring | Tags monitoring, performance | no comments | no trackbacks
Posted by Ian Holsman
Tue, 02 Aug 2005 18:49:00 GMT
I'm trying to find research done about community health, and how to monitor/measure it.
anyone got any pointers?
stuff like netscan.
An open source implementation of something like this (to save me doing it) would be ideal.
Tags community, monitoring | no trackbacks
Posted by Ian Holsman
Tue, 20 Jul 2004 15:49:00 GMT
$50/month for a 15Mb/s link to your home. That's ~10 times faster than a standard DSL link... at the moment it's just a town in Texas somewhere, but in ~1-2 years it will be available for most people (read: Silicon Valley)
What *IS* important is latency.. how do we get the page out FASTER with minimal delays...
when modems were the majority, it didn't really matter much how fast your server was; the major bottleneck was the size of the HTML (and images) being pushed down to the user's end, whatever the server was sitting on (be it a T1 or a 56k modem).. basically 10 seconds for 50k. now.. the pressure has moved back to the server, and speeding up your applications, as that is where the large percentage of the wait will be.
source: http://news.com.com/Verizon's+fiber+race+is+on/2100-1034_3-5275171.html
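To put rough numbers on that shift, here is a back-of-the-envelope sketch. It ignores latency, protocol overhead and real-world modem speeds (which is why the ideal modem figure comes out a little under the 10 seconds quoted above), and the 1.5Mb/s DSL rate is an assumption derived from the "~10 times faster" remark:

```python
def transfer_seconds(size_kilobytes, link_kilobits_per_sec):
    """Ideal time to push a payload down a link, ignoring latency and overhead."""
    return size_kilobytes * 8 / link_kilobits_per_sec


PAGE_KB = 50  # the page size used above

# Link speeds in kilobits per second; the DSL figure is an assumption.
LINKS = {
    "56k modem": 56,
    "DSL (assumed 1.5Mb/s)": 1500,
    "15Mb/s fiber": 15000,
}

if __name__ == "__main__":
    for name, kbps in LINKS.items():
        print(f"{name}: {transfer_seconds(PAGE_KB, kbps):.3f} s for {PAGE_KB}k")
```

Once the wire time for a page drops to a few hundredths of a second, any time the application spends generating that page is most of what the user waits for.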
Posted in monitoring | Tags monitoring, stats | no trackbacks
Posted by Ian Holsman
Thu, 08 Apr 2004 15:16:00 GMT
Jeremy Zawodny compared some timings of bzip2, gzip, and rzip to compress his mail folder (which I'm assuming is 90% text)
I thought I would do it on one of our largish indexes (which we have to copy around to different places several times a day) and see if it would help.
its original size is 1.1G and it is mainly binary data.
Machine stats:
OS: RH ES3.0
2x 2.4 P-IV xeons with hyperthreading
Mem: 6G
| command | compress (real/user/sys) | decompress (real/user/sys) | new size | ratio |
| gzip | 2m3 / 1m58 / 0m5 | 0m33 / 0m18 / 0m7 | 566m | 47% |
| gzip -9 | 3m45 / 3m39 / 0m5 | 0m35 / 0m17 / 0m8 | 563m | 47% |
| bzip2 -9 | 11m16 / 11m9 / 0m5 | 4m38 / 4m24 / 0m13 | 538m | 45% |
| rzip | 9m6 / 8m47 / 0m19 | 5m3 / 3m27 / 1m35 | 474m | 39% |
basic summary:
when you have a 1G card in machines don't even bother compressing it, unless you have to distribute it to >20 machines.. and if so.. any of these will do ;(
If you have to use a slower link, then I still think any of these will do.
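A rough way to sanity-check that rule of thumb. This is only a sketch under some generous assumptions: the link sustains its full line rate, copies go out one after another, the file is compressed once, all receivers decompress in parallel (so decompression is counted once), and 1.1G is treated as 1100 megabytes. The function names are made up for this example:

```python
import math


def seconds_to_send(megabytes, link_megabits_per_sec):
    """Ideal wire time for one copy, ignoring protocol overhead and disk speed."""
    return megabytes * 8 / link_megabits_per_sec


def total_seconds(copies, size_mb, link_mbps, compress_s=0.0, decompress_s=0.0):
    """Compress once, send the copies one after another, decompress in parallel."""
    return compress_s + copies * seconds_to_send(size_mb, link_mbps) + decompress_s


def break_even_copies(raw_mb, packed_mb, link_mbps, compress_s, decompress_s):
    """Smallest number of copies for which compressing first is faster, else None."""
    saved_per_copy = seconds_to_send(raw_mb - packed_mb, link_mbps)
    if saved_per_copy <= 0:
        return None
    return math.floor((compress_s + decompress_s) / saved_per_copy) + 1


if __name__ == "__main__":
    # gzip row from the table: 2m3 to compress, 0m33 to decompress, 566m result.
    for link_mbps in (1000, 100):
        n = break_even_copies(1100, 566, link_mbps, compress_s=123, decompress_s=33)
        print(f"{link_mbps} Mb/s link: compression pays off from {n} copies")
```

On a gigabit link the break-even lands in the tens of machines, in line with the summary above; on a slower link it drops to a handful of copies, and the differences between the compressors matter less than the decision to compress at all.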
on the plus side.. I watched The Adventures of Seinfeld & Superman while waiting for the tests to run.. very funny.. the macromedia streaming via flash is a VERY cool idea and seems to work quite well.
Posted in monitoring | Tags monitoring, stats | no trackbacks
Posted by Ian Holsman
Mon, 29 Mar 2004 15:04:00 GMT
After suffering through 400ms latency times to work (I live in Melbourne, Australia, and work in San Francisco), I installed Smokeping. Something I can look at while waiting for my terminal to respond.
It isn't all that bad, as I can do most of my development type work locally and just push it over.. thank god for portability between my Mac & the Linux boxes.
Posted in monitoring | Tags monitoring, stats | no trackbacks