Posted by Ian Holsman
Tue, 24 Jun 2008 19:30:00 GMT
So I have recently been paying a lot of attention to systems with huge amounts of data in them, be it Relegence, which deals with lots of incoming news stories and figures out what they are about in real time, or the data layer that deals with click streams and recommendation engines.
One of the interesting questions is how we make this data available to the publishing systems, as the data sizes mean we can’t fit the entire table onto a single machine.
So in my research I have seen 4-5 ways to do horizontal partitioning (or is it vertical.. I always get confused by the names).
- Consistent hashing. Easy to understand and easy to implement, until you need to add a group of machines.
- A central database holding the location of the records. An oldie but a goodie. You can easily add machines, and reconfigure the distribution to ‘move’ records across machines to compensate. But it has a central directory, which means you need to worry about scaling the directory (a smaller problem, to be sure).
- A Distributed Hash Table approach like the one discussed at onscale.
- CRUSH, a pseudo-random distribution scheme similar to consistent hashing, that can handle adding new machines and load issues.
- Amazon’s Dynamo, which has “a gossip based distributed failure detection and membership protocol.” Looking deeper, it is some mixture of consistent hashing and a DHT/Chord.
- Hypertable/HBase, which leave the partitioning (and replication) to the distributed file store they sit on (hadoop). That isn’t a bad idea, as a lot of work has gone into that, but I still have my doubts about how it will handle an OLTP load (and I should just bite the bullet and run a performance test to remove my doubts).
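To make the first option in the list concrete, here is a minimal consistent-hashing sketch. It is an illustration only, not code from any of the systems named above: the `HashRing` class, the node names and the choice of 100 virtual points per node are all assumptions.

```python
import bisect
import hashlib


def _hash(key):
    # md5 is used only because it spreads values evenly, not for security
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical ring: each node owns many points on a circle of hashes."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual points per node (assumed value)
        self._points = []      # sorted hash positions on the ring
        self._owner = {}       # hash position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def lookup(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # first point clockwise from the key's hash, wrapping around
        i = bisect.bisect_right(self._points, _hash(key))
        return self._owner[self._points[i % len(self._points)]]


ring = HashRing(["db1", "db2", "db3"])
keys = [f"story-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add("db4")
moved = [k for k in keys if ring.lookup(k) != before[k]]
```

Every key in `moved` now belongs to `db4` and the rest stay where they were, which is the selling point. The records behind those keys still have to be physically copied to the new machine, though, and that part is not shown here.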
you also need to deal with replication/load balancing issues. Specifically you need to handle the case of failure (and failure of racks / data centers).
from what I can see you have 2 different choices to make.
- Full Consistency vs Eventual Consistency
- What level to replicate at
most people (including engineers, to be honest) can’t seem to get their heads around this eventual consistency thing. They are used to an update changing the value there and then.
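A toy sketch of what trips people up, assuming a single primary copy and one replica that is brought up to date later. The `EventuallyConsistentStore` class and its method names are made up for illustration and do not come from any real product.

```python
from collections import deque


class EventuallyConsistentStore:
    """Hypothetical store: writes hit the primary now, the replica later."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._pending = deque()   # writes not yet applied to the replica

    def write(self, key, value):
        self.primary[key] = value
        self._pending.append((key, value))

    def read_replica(self, key):
        # may return a stale value (or None) until sync() has run
        return self.replica.get(key)

    def sync(self):
        # stands in for whatever background replication the real system does
        while self._pending:
            key, value = self._pending.popleft()
            self.replica[key] = value


store = EventuallyConsistentStore()
store.write("headline", "v1")
stale = store.read_replica("headline")   # replica has not caught up yet
store.sync()
fresh = store.read_replica("headline")   # now it has
```

Under full consistency the first read would already see the new value; here it only shows up once replication has caught up.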
replication on the other hand is usually handled by letting mysql do it. Personally I prefer a finer-grained approach where you can have some records with more replicas than others, to cater for some things being more equal than others (Britney Spears gets more hits than Johnny Cash, for example), and for dynamic loads where the system automagically adds more replicas based on activity.
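One way the per-record idea could look, as a rough sketch. The function name, the baseline of 2 copies, the cap of 8 and the "one extra replica per 1000 hits an hour" rule are all invented numbers for illustration.

```python
def replica_count(hits_per_hour, base=2, max_replicas=8, hits_per_replica=1000):
    """Hypothetical policy: every record gets `base` copies, and hot
    records get one more per `hits_per_replica` hits, up to a cap."""
    extra = hits_per_hour // hits_per_replica
    return min(max_replicas, base + extra)


quiet = replica_count(50)          # a rarely requested record
busy = replica_count(5000)         # a popular record
viral = replica_count(1_000_000)   # capped at max_replicas
```

Re-running this against recent activity counts would give the "automagic" part; how the extra copies actually get created is a separate problem.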
whatever replication we choose, it has to be battle-hardened.. you don’t want to call up the CEO and explain why 10% of his data has just disappeared into the ether, or why you will be down for 6 hours recovering from a tape.
the other choice is what the data store should be.
- a relational DB like mysql.
- a Key/Value pair, with limited semantics on how to retrieve/store the information
personally I like the key/value pair as it makes life simpler.. but I know a lot of people who like mysql
so.. this is the main thing on my mind at the moment.
your thoughts are welcome naturally
Posted in Development | Tags distribution, hashing, ohmy, partioning, replication | 4 comments
Posted by Ian Holsman
Fri, 11 Apr 2008 05:57:00 GMT
locallucene, a geographical search plugin for lucene and solr, is now powering our yellowpages site.
all the props should go to patrick, locallucene is his brainchild. .. if you use locallucene you can always send him a pizza.. as thanks
Posted in Development | Tags aol, locallucene, solr, yellowpages | 2 comments
Posted by Ian Holsman
Fri, 11 Apr 2008 03:57:00 GMT
so I had some more time to think about appengine, and the biggest problem I can see is the lock-in. All the other things are minor.
Krow weighs in on people complaining about lock-in. Initially I thought it was a problem too, as there is no equivalent to GQL, but then I remembered hbase and hypertable. Once some open source guy writes a GQL clone, the platform is open, and I can see multiple hosting providers offering it as an alternative. Personally I think the lack of joins is a bonus: it prevents web developers from writing slow apps ;-)
the lack of language support is temporary.. I mean how hard would it be to make java not be able to access the local file system or jni? just replace/overwrite some jar files (unless you have legal issues that preclude someone doing that).
but it is still a 3rd platform, and definitely a boon to python guys. now.. what to call the generic version? Python Hadoop, And Gql (PHAG?)
Posted in Development | Tags appengine, google, python | 2 comments
Posted by Ian Holsman
Mon, 09 Jul 2007 19:43:00 GMT
One of the things I’m responsible for at AOL is their use of Solr in their upcoming web developments.
A task that we keep on finding ourselves doing is taking an input feed (be it CSV, XML, or a DB table) and transforming that into a Solr index (we call it ingestion). It’s a boring and thankless task, but it is critical to get it done correctly, especially when you need to deal with real time and batch updates.
This gave me a reason to try out Kettle, which is an open source ETL engine for this kind of thing. But out of the box it had no support for Solr :-(
So I created this proof of concept plugin to show how easy it could be to just shove a data stream into solr, and am trying to get a demo going showing how easy it is to take some input data and make it into a Solr search engine (as well as other things at the same time).
It works well enough for me to do a proof of concept with a couple of different feeds and show the channel development teams how easy life could be.
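For a feel of what the transform step involves, here is a small standalone sketch that turns a CSV feed into a Solr XML `<add>` update message. It is not the Kettle plugin (which is a Java component inside Kettle); the function name and the sample columns are made up, and actually posting the result to Solr's update handler is left out.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_solr_add(csv_text):
    """Build a Solr <add> message with one <doc> per CSV row.
    Column names become field names (an assumption about the schema)."""
    add = ET.Element("add")
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            if name is None or not value:
                continue   # skip overflow columns and empty values
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")


# hypothetical two-row feed; the second row has no name
feed = "id,name\n1,Joe's Pizza\n2,\n"
message = csv_to_solr_add(feed)
```

A real feed also needs type conversion, deletes and commit handling, which is where most of the boring work hides.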
disclaimer: before you go and start using it in production, please be aware that it needs a lot more work when it comes to setting options and stability.
So if you’re interested in this type of thing.. feel free to ping me and I’ll add you to the project (with the aim that either Solr or Kettle take this and make it part of their standard packages).
Posted in Development | Tags kettle, solr | 1 comment
Posted by Ian Holsman
Tue, 03 Jul 2007 18:37:00 GMT
I’ve just created a new module for the apache webserver 2.2 which implements one of Brad’s features of perlbal: the ability to concatenate CSS or javascript files into a single HTTP request.
the request will look like:
http://hostname/cdn/??music2.js,mp.js,dir1/dalai_llama.js,ratings_widget.js,widget_config.js,common.js
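To show how a URL like this breaks down, here is a small sketch that splits it into the individual file paths. It is an illustration of the URL shape only, not the module's code (the module is an Apache module, and how it parses requests internally is not covered here); the function name and the path-escape check are assumptions.

```python
import posixpath
from urllib.parse import urlsplit


def split_concat_url(url):
    """Return the file paths named by a '??'-style concat request."""
    parts = urlsplit(url)
    # urlsplit cuts at the first '?', so a '??' request leaves a query
    # string that itself begins with '?'
    if not parts.query.startswith("?"):
        raise ValueError("not a concat request")
    base = parts.path if parts.path.endswith("/") else parts.path + "/"
    paths = []
    for name in parts.query[1:].split(","):
        full = posixpath.normpath(posixpath.join(base, name))
        # refuse empty names and anything that escapes the base directory
        if not name or not full.startswith(base):
            raise ValueError(f"bad file name: {name!r}")
        paths.append(full)
    return paths


files = split_concat_url(
    "http://hostname/cdn/??music2.js,mp.js,dir1/dalai_llama.js"
)
```

A server would then read each file in order and write them out as one response body.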
my initial testing shows a performance gain of about 1 second when I request a file from the other side of the Pacific Ocean.
you can try it out yourself: with concat / without concat
The multiple host names reflect the original page going to multiple hosts to retrieve the files (which are on multiple CDNs)
firebug shows the following: [Firebug screenshots: without concat, then with concat]
code is hosted on: googlecode
Posted in Development | Tags apache2 | 2 comments
Posted by Ian Holsman
Wed, 13 Jun 2007 17:52:00 GMT
It’s been a long time since I bought a book about mysql, so I thought I would ask what were some of the recent good books around about mysql.
The only good book I know is ‘high performance mysql’, but it is from 2004. so I’m concerned it’s a bit dated.
So.. what’s on your bookshelves?
Posted in Development | Tags mysql | 4 comments
Posted by Ian Holsman
Wed, 23 May 2007 20:02:00 GMT
I’m looking for a technical person with a background in search technologies to head up a small team of developers in Dulles, VA (no prizes for guessing who).
you will be working with products like:
- solr
- nutch / hadoop
- lucene
- apache httpd
- mysql
Ideally you will be a committer/member of the ASF, and have had experience in large web applications in publishing/search.
Your responsibilities will be to help make the products you support meet our application teams’ needs, and to coordinate your team with the OSS groups involved and a similar internal team in Bangalore.
Naturally, any changes will be contributed back to the OSS world if it makes sense.
like to know more?
mail me @ jobs@holsman.net
Posted in Development | Tags job | 2 comments
Posted by Ian Holsman
Sat, 28 Apr 2007 06:32:00 GMT
I got a new dev machine (a quad-core 4G box, with 500G disk running centos5) during the week, and took the opportunity to play with xen.
Once you get the hang of it, creating new linux instances is quite easy. I’ve actually got my main host running in a little instance where I just forwarded port 80 through to it (as I don’t have any spare IPs).
I just wish ‘virtmanager’ had a native mac build. I found using X over VNC a PITA, and it just didn’t work sometimes.
To make matters worse, the new longhorn beta DVD uses UDF instead of iso9660, and I can’t figure out how to get xen to recognize it either.
the main aim of the dev box was to get 10-15 machines up and running so I can play with things like hadoop, and replication things, and test out monitoring stuff.
Posted in Development | Tags xen | no comments
Posted by Ian Holsman
Sun, 15 Apr 2007 01:20:00 GMT
Hi.
I need a physical box in a colo somewhere in the US. something which will give me root access, and leave me alone to do my work. (RHEL4/5 as a base OS would be great).
anyone have any recommendations?
—Ian
Posted in Development | Tags colo | 7 comments
Posted by Ian Holsman
Tue, 21 Nov 2006 10:51:00 GMT
I was looking at six apart’s newly launched vox blog server, and noticed their javascript/css server.
aka-static.vox.com
what is neat about it is that it concatenates several files into one when it serves the request (as well as compressing javascript on the fly)
for example, instead of having 5-10 separate script references, they combine them into one call.
http://aka-static.vox.com/.shared:v17.1:vox:en_us/js/core.js,dom.js
this concatenates the core.js and dom.js files together before serving them.
if you append a ‘c’ it will serve the compressed version of it.
http://aka-static.vox.com/.shared:v17.1:vox:en_us/js/core.jsc,dom.jsc
from a performance viewpoint, this should speed up page delivery for high-latency clients, and makes it easy to upgrade your javascript versions (and locale) as you only need to do it in one place.
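As a rough sketch of how a server might interpret such a path, based only on the two example URLs above: split the last path segment on commas and treat a trailing ‘c’ as "serve the compressed copy". The function name is hypothetical, and vox's real rules may well differ.

```python
def parse_vox_bundle(url_path):
    """Return (file_path, compressed) pairs for a comma-joined request.
    Assumes a '.jsc' suffix means 'compressed version of the .js file'."""
    prefix, _, names = url_path.rpartition("/")
    bundle = []
    for name in names.split(","):
        compressed = name.endswith(".jsc")
        if compressed:
            name = name[:-1]   # drop the trailing 'c' to get the real file
        bundle.append((f"{prefix}/{name}", compressed))
    return bundle


bundle = parse_vox_bundle("/.shared:v17.1:vox:en_us/js/core.jsc,dom.js")
```

The version and locale sit in the shared prefix, so bumping either changes a single path segment for every script in the bundle.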
Posted in Development | Tags coolidea, javascript, vox | 2 comments | no trackbacks