How does a lack of joins stop web developers from writing slow apps?

Posted by Ian Holsman Sat, 12 Apr 2008 16:15:00 GMT

on a previous post about appengine, noodl asked a good question

 

<b>How does a lack of joins stop web developers from writing slow apps?  Surely having to write more queries or denormalising a schema makes things slower?</b>

 

I was doing to write a long diatribe here (actually I did, and just deleted it).

by removing joins it forces the developer to work, and stop and think about what he is doing. writing that ‘cool’ 3-5-table join using all inner selects and left joins, while easy to type and runs great on a small dataset but usually sucks in performance. those queries can sometimes be rewritten into a set of SQL statements and a for-loop without much work by a developer. but usually these things are found about 1 week prior to launch (or 1 week after depending on if you ditched that performance QA phase when things were getting tight).

the usuall approach to these kind of things is even more ‘coolness’ you get to add a caching layer on top of your app. so instead of doing the 3-5 table join, you just get to do it for a single user every couple of hours when the cache expires, and add another layer of complexity for your operations team to maintain

 

The other approach I’ve seen used to great effect is pre-caching. in this example you can do it at the database layer. you take those 3-5 table join statements and just execute them in batch mode and store those results in a de-normalzed table. the web pages then reads that table instead.

 

so to answer your question directly. lack of joins stops web-developers from writing bad queries that the user have to wait on (and use memcache for) and they can either write the query using multiple selects (which can be faster in some cases, a wash in others, and slower in some as well) or query denormalized tables holding pre-computed results which is the fastest way.

updating the tables may be slower, but if you use async writing mechanisms the user wont be waiting for the write to occur. you may need to be a bit smart when a user has just updated something and your waiting for your writer to finish, but for most requests you can avoid that if you know the results will be there within a second/minute or so. or you can devise some sync/async writer where your update page updates some things so that the user knows it’s been updated, but his friends might see the update a bit later

Tags , , ,  | no comments

primary keys

Posted by Ian Holsman Sat, 12 Nov 2005 18:32:00 GMT

Brad Fitzpatrick has a nice discussion over what people should be using for their ‘id’ keys in a distributed environment. and puts a couple of points forward about why it isn’t such a good idea for his application/architecture. which has a ‘few’ central machines handing out ID’s.

The is a central problem with sequences that i’ve seen in most DB platforms. That and their 32-bit size. (I think mysql is 32-bit, i’m sure the geeks who read this will correct me)

The actual question has (If I am reading this correctly) is not about UUID’s vs local-sequences, but how do I move a ‘tree’ of records from one cluster to another, and not have a ID-clash. as if each cluster was independant, it wouldnt be a concern. you would have a central USER->cluster mapping algorithm/server which might be UUID or 64-bit based and then all future requests could be directed to the correct cluster, where locally-generated sequences would be fine.

Another approach would be to increase the size of the ID to 64-bits, with the top n-bits (say 8) designating the cluster which created the sequence. when a new cluster/allocater comes online it would register itself and it’s cluster-ID# would be incremented. then all sequences generated by this cluster allocater would just start at X. this would give you 56bits of uniqueness per cluster. but it would still require some kind of central co-ordination at cluster creation time.

This would increase the key size from 4 bytes to 8.

But personally, I would see how I can avoid this whole issue complelty, and try to make each cluster fully independant and NOT have to merge/move users between clusters… with the price of today’s hardware about $5k/box.. the issue of excess capacity and non utilization is less important I think.

Posted in  | Tags , ,  | no comments | no trackbacks