Scalability and frameworks

Posted by Ian Holsman Sun, 11 Dec 2005 22:56:00 GMT

Following on from Jeremy’s Web 2.0 companies need to scale article, I’d like to explore another avenue of that question.

How do the modern frameworks help you, the developer, scale your application seamlessly?

In my experience there are two major ways scalability is addressed: caching, and distribution of the data. I’m not calling myself an expert in any of the frameworks above, but from what I can see all of them handle caching pretty damn well, so I’d like to focus on database scalability a bit here.

First, when I talk about ‘scalability’, what I really mean is having the response return in a consistent (low variability) time for a given request. The aim of scalability is to keep this variability from increasing as the number of users increases.
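To make ‘low variability’ a bit more concrete, here is a minimal Python sketch of how you might measure it. The function name, the choice of standard deviation plus a nearest-rank 95th percentile, and the sample numbers are all my own assumptions for illustration, not something any of the frameworks provide.

```python
import math
import statistics


def response_time_summary(samples_ms):
    """Summarise response times; a small spread means consistent responses."""
    ordered = sorted(samples_ms)
    # Nearest-rank 95th percentile: the value at position ceil(0.95 * n).
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "mean": statistics.mean(ordered),
        "stdev": statistics.pstdev(ordered),
        "p95": ordered[rank - 1],
    }


# Made-up timings (milliseconds) for the same request under two load levels.
light_load = [100, 102, 98, 101, 99]
heavy_load = [100, 250, 90, 400, 110]

light = response_time_summary(light_load)
heavy = response_time_summary(heavy_load)
```

In these terms, an application scales well if the spread under heavy load stays close to the spread under light load, even if the mean creeps up a little.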

I have seen the following approaches to handling large (and larger) amounts of data. They are:

  • Buy a larger machine for the DB server. All too often the beefiest box is the DB server, and if it goes down so does everything ;( For the frameworks in question, handling this type of strategy is a no-brainer: you provide a pool of connection handles and add some tuneables for the number of connections in the pool. Easy stuff for the framework and the developer, as there is nothing for them to do ;-) But it starts to fall down when you have large amounts of data and large numbers of connections, which render the DB’s caching strategy inadequate and make response times highly volatile.

  • Federate. Another ‘do nothing’ for the framework and the developer: simply have one logical database which is served by multiple machines. I’m guessing that this would be the easiest ‘step’ for people to move to from the ‘large’ machine approach. I haven’t looked too much into the internals of the MySQL implementation, but hopefully it does some of the query caching on the remote machine, which would make for more consistent results. This is a good solution if your application has lots of loosely related data (lots of little applications) and the query joins can be done remotely, and it can tide you over for a while longer, as you can use 3-4 machines instead of a single one.

  • Replication. This is a godsend for many write-few, read-many situations. Basically you write to a ‘master’ and then connect to one of the slaves for selects. If response times are getting sluggish or too variable, you can stick another slave into the mix. The frameworks I’ve seen don’t seem to handle this at all. Most of the time you can only specify a single connection, so either you need to split your application into two, with the ‘write’ parts in one and the ‘read’ parts in the other, or handle this stuff yourself. YUK. What a large waste of effort and source of errors.

  • Clustering by key-value. This can be seen in Sequoia/C-JDBC. This technology emulates a ‘large’ machine but allows the admin to split tables onto multiple machines, as well as providing federation (and cross-DB-type) functionality. While this seems to be a ‘Java only’ thing, they provide a ‘C’ library, so it could be used in other languages if someone had the energy to port it.

  • Using a search engine as a DB source. Nutch’s NDFS and MapReduce functionality, and Lucene, provide excellent alternatives to an RDBMS and could be used as a fast replacement in some cases, but too often they are overlooked, and again frameworks ignore them ;(
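For reference, the ‘pool of connection handles’ from the first approach can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the ConnectionPool class, its size and timeout tuneables, and the use of an in-memory SQLite database as a stand-in for the real DB server are all hypothetical, and real frameworks ship their own pooling.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool; 'size' and 'timeout' are the tuneables."""

    def __init__(self, connect, size=5, timeout=2.0):
        self._timeout = timeout
        self._idle = queue.Queue(maxsize=size)
        # Open every connection up front so requests never pay the setup cost.
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self):
        # Blocks until a handle is free; raises queue.Empty after 'timeout'.
        conn = self._idle.get(timeout=self._timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the query failed.
            self._idle.put(conn)


# In-memory SQLite stands in for the real DB server in this sketch.
pool = ConnectionPool(
    lambda: sqlite3.connect(":memory:", check_same_thread=False), size=2
)

with pool.connection() as conn:
    value = conn.execute("SELECT 1 + 1").fetchone()[0]
```

Note that every handle in a pool like this points at the same single server, which is exactly why this approach stops helping once that one box is saturated.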

To me this is one of the limiting factors of the frameworks out there. While some of this stuff can be done on top of the DB API they provide, it really should be integrated into it, allowing people to concentrate on the value-add instead of the infrastructure bits.

At the very least they should support replicated DB engines out of the box. They would need some kind of config where you can specify a write connection (DB/machine/port/user/password) and a read pool (DB/list-of-machines/user/password) from which a connection pool could be created, and then have the framework decide which one to use based on the operation.
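Here is a rough Python sketch of what I have in mind. To be clear, none of the frameworks reads a config like this today: the DATABASE layout, the host names, and the ReplicatedRouter class are all hypothetical, and picking the server from the first word of the SQL statement is just the simplest rule I could think of.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    host: str
    port: int = 3306


# Hypothetical settings layout: one write connection plus a read pool.
DATABASE = {
    "name": "blog",
    "user": "app",
    "password": "secret",
    "write": Server("db-master.example.com"),
    "read": [Server("db-slave1.example.com"), Server("db-slave2.example.com")],
}

# Statements starting with one of these verbs must go to the master.
WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP"}


class ReplicatedRouter:
    """Choose the master for writes and rotate through the slaves for reads."""

    def __init__(self, settings):
        self._master = settings["write"]
        self._slaves = itertools.cycle(settings["read"])

    def server_for(self, sql):
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb in WRITE_VERBS:
            return self._master
        # Everything else is treated as a read and spread round-robin.
        return next(self._slaves)


router = ReplicatedRouter(DATABASE)
write_target = router.server_for("INSERT INTO posts (title) VALUES ('hi')")
read_target = router.server_for("SELECT title FROM posts")
```

A real implementation would need more care than this, for example keeping everything inside a transaction on the master, and coping with a slave that has not yet caught up with a write you just made.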


Comments

  1. Radek said 1 day later:

    Hi Ian,

    I don’t think this falls into the scope of a framework such as Django.

    I believe you should state only configuration params for the connection to the “Data store” and the data management system should handle the load, replication, etc. with its own configuration.

    I mean you should act as a sys admin, or configuration manager to handle these tasks. Not a developer.

    I hope I am making myself clear.

  2. Ian said 1 day later:

    Hi Radek, but at the moment there is no way that you can specify it in Django (or in any of the frameworks).

    For example, if I wanted to use a master/slave configuration, I would need a separate ‘read’ connection and a separate ‘write’ connection. All I’m asking for is somewhere to configure this, which there isn’t ;(
