The first thing I did was reduce the number of active threads on the webserver, which cut the process creation rate in half. That was a good start.
Now this is where statistics lie, or to put it more precisely, don't tell you what you think they are telling you.
I first looked at 5 minute averages and assumed the load was constant throughout the 5 minutes. Wrong. As this is a monitoring machine, it has lots of agents pushing data to it at regular intervals.
The main culprit had this embedded in its code:
# Wait for the next INTERVAL
sleep ($INTERVAL-time() % $INTERVAL);
This had the effect of turning 815 machines into a flash crowd every 5 minutes. The sleep always ends on the next multiple of INTERVAL, so the post is not even delayed by the time it took to complete the previous one, which would have dispersed the flash crowd over time. Every 5 minutes the poor webserver would receive 815 posts, go into swap hell, core dump a bit, and recover in time for the next lot.
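To see why that one-liner lines everybody up, here is a small Python sketch of the same arithmetic. The function name and the three finish times are made up for illustration; the only thing taken from the agent is the `INTERVAL - (now % INTERVAL)` calculation.

```python
def seconds_until_next_boundary(now, interval):
    """Same arithmetic as the Perl one-liner: interval - (now % interval)."""
    return interval - (now % interval)


INTERVAL = 300  # 5 minutes, in seconds

# Hypothetical moments (epoch seconds) at which three agents finished
# their previous post. They are deliberately spread out.
finish_times = [1000, 1130, 1199]

# Each agent sleeps until the next multiple of INTERVAL...
wake_times = [t + seconds_until_next_boundary(t, INTERVAL) for t in finish_times]

# ...so they all wake up at exactly the same second.
print(wake_times)  # prints [1200, 1200, 1200]
```

However long the previous post took, every agent with a reasonably accurate clock wakes on the same boundary.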
The solution to this one was to wait a random interval before actually sending the data, and then hit the above syncing sleep. That way we still get stats from the same time period, but they are sent at some point during the period instead of the moment we get them.
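Roughly, one cycle of the fixed agent could be planned like this. It is only a sketch in Python, not the agent's real code: `plan_cycle`, `MAX_JITTER` and the 240 second cap are my assumptions, chosen so the post has room to finish before the next boundary.

```python
import random

INTERVAL = 300    # 5 minutes, in seconds
MAX_JITTER = 240  # assumed cap on the random wait, leaving headroom for the post


def plan_cycle(boundary, rng):
    """Return (sample_time, send_time, next_boundary) for one agent."""
    sample_time = boundary                              # stats are stamped here
    send_time = boundary + rng.randrange(MAX_JITTER)    # random wait before posting
    # The original syncing sleep, applied after the send.
    next_boundary = send_time + (INTERVAL - send_time % INTERVAL)
    return sample_time, send_time, next_boundary


rng = random.Random(0)  # seeded only so the sketch is repeatable
plans = [plan_cycle(1200, rng) for _ in range(815)]

# Everyone samples at 1200 and resyncs at 1500, but the posts themselves
# are spread over the first 240 seconds of the period.
print(min(p[1] for p in plans), max(p[1] for p in plans))
```

The sample time stays pinned to the boundary, so the data still describes the same period on every machine; only the send is smeared out.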
If you plan on doing this, remember to record the stats as of before the random sleep interval, not afterwards, or your counters will get all messed up.
You want
#hits = (#hits-end - #hits-start) / ($timeend - $timestart)
not
#hits = (#hits-end - #hits-start) / ($timeend + randombit - $timestart)
especially when the randombit gets large.
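A quick worked example in Python shows how much the randombit can skew things. The counter values and the 200 second delay are invented numbers, and `hit_rate` is just a name for the formula above.

```python
def hit_rate(hits_start, hits_end, time_start, time_end):
    """Hits per second over the sampled period."""
    return (hits_end - hits_start) / (time_end - time_start)


# Invented numbers: 3000 hits counted over a 300 second period.
hits_start, hits_end = 1000, 4000
time_start, time_end = 1200, 1500
randombit = 200  # hypothetical random wait before sending

correct = hit_rate(hits_start, hits_end, time_start, time_end)
skewed = hit_rate(hits_start, hits_end, time_start, time_end + randombit)

print(correct)  # prints 10.0
print(skewed)   # prints 6.0
```

Same counters, but stamping the time after the random wait turns 10 hits a second into 6.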
Oh, in other news: I have switched my RSS feed to use FeedBurner. Current subscriptions are unaffected, only new ones.