27 March 2011

More juice

Recently, I discovered a problem with wordsinmedia.com.
Week 11 was not fun.
There are three main parts to this system:
  • a database that stores stuff
  • a set of perl programs that acquire and process the news and store the results in the database
  • and, a website that sits on top of the database whose backend executes within Jetty
A recent change I made increased the number of news sources being polled and analyzed, and that caused a significant spike in resource utilization.

The additional processing on the perl side is CPU intensive, and with more news to process, more CPU was being burned. With more data coming in from the perl side, the MySQL instance was growing faster too: queries that previously dealt with hundreds of rows were now dealing with tens of thousands, increasing the load on the database. Collectively, everything added up nicely to swamp the whole system, leaving any queries from the website dog slow- rendering the website very nearly unresponsive. And yes, it all lives together on one node- this was nothing more than an experiment that grew incrementally, so...

All of my hardware is virtually provisioned, and lives within a cloud. I'm biased toward a specific one, but anyway...

As a first step, I figured I should isolate the various parts to see if that helps things along- with everything contending for the same CPU, there was no way to make a deterministic call on what was going on. I figured I'd separate the perl processing from the database/web server first. Fairly simple to do:
Need more power?

Provision a new node
Extraordinarily easy, and in many cases, free if you want a small amount of horsepower. Get an OS booted up on it and call it good.
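
For instance, here's roughly what booting a node looks like from perl- a sketch using the Net::Amazon::EC2 module from CPAN, where the AMI id, key pair, and instance type are placeholders rather than anything I actually run:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Net::Amazon::EC2;

    # Credentials pulled from the environment; substitute your own.
    my $ec2 = Net::Amazon::EC2->new(
        AWSAccessKeyId  => $ENV{AWS_ACCESS_KEY_ID},
        SecretAccessKey => $ENV{AWS_SECRET_ACCESS_KEY},
    );

    # Boot a single instance. The AMI id and key pair name below
    # are placeholders.
    my $reservation = $ec2->run_instances(
        ImageId      => 'ami-xxxxxxxx',
        MinCount     => 1,
        MaxCount     => 1,
        KeyName      => 'my-keypair',
        InstanceType => 'm1.small',
    );

    print 'Launched: ', $reservation->instances_set->[0]->instance_id, "\n";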

Addressing
Since the perl programs now need node-to-node addressing to talk to the database node, you need a way to maintain address lookups. In my case, I rely on Elastic IPs, which, while publicly visible, also provide internal IPs when used within a security group.

Fortunately, I only needed to make one change: point the perl programs at the Elastic IP instead of at localhost.
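
In DBI terms, that's a one-line change to the connection string- a sketch, where the database name, credentials, and hostname are placeholders for my actual setup:

    use strict;
    use warnings;
    use DBI;

    # Before: database on the same box.
    #   dbi:mysql:database=news;host=localhost

    # After: point at the database node's Elastic IP via its public DNS
    # name (a placeholder below); from inside EC2 that name resolves to
    # the internal address, so the traffic never leaves the cloud.
    my $dbh = DBI->connect(
        'dbi:mysql:database=news;host=ec2-203-0-113-10.compute-1.amazonaws.com',
        $ENV{DB_USER}, $ENV{DB_PASS},
        { RaiseError => 1 },
    );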

That's it. Asynchronous news acquisition and analysis is on one node, while the database and web server are elsewhere. As is evident, separating those two would be trivial too- just get another node, drop the war into a web server there, futz with addressing, and call it good. If it doesn't work, scrap it- you've lost nothing other than the time it took to run your experiment.

There's no rocket science in any of this. But it's heartwarming that, in reality, it only takes a couple of hours (for the uninitiated like me) to get this done. Contrast that with trying to do this if you had to work with your own hardware- you'd either have to buy some, or hope there's some lying around, or make a case to your hardware team. Then you'd have to hope that this pans out well- since if it doesn't, you've just sunk your investment in hardware.

This is, admittedly, an almost contrived example of why on-demand virtual provisioning is awesome. But I think I got lucky in that my components were so inherently separable. My initial tendency might have been to do something horrible, like have the news acquisition/processing live within the scope of the same war that powered the web-end. One deployment/logs/build to worry about, right?

I've been part of many decisions where I suggested, or was persuaded to accept, that it was OK to stuff yet another component into an already large ball of yarn. Invariably, all of these would get knit together and become one inseparable bundle of pain.

With virtualization being so easy and cheap, I wonder how much easier things might be if you considered spinning up a fresh instance for every new component. Granted- it's a pendulum swing, and might not always be appropriate. But if you used that premise as a baseline assumption, how would that change the end quality of what you build, how it can scale, and how easy it is to maintain?