A Simple Overview of Distributed Computing
By Angsuman Chakraborty, Gaea News Network | Saturday, November 8, 2008
Any reasonable (web) application which deals with some subset of web data (RSS feeds, web pages, product pricing data etc.) has to use distributed data processing to go anywhere. Unless you have very deep pockets and/or strong VC funding (which is rarer than a bottle of Mouton Rothschild Pauillac Premier Cru First Growth these days), you will have to opt for consumer-grade hardware (instead of the big iron more companies opted for during the dotcom era) and use it with a distributed computing framework like Hadoop or GridGain, or with older (more mature) high-throughput computing systems like OpenPBS (ignore the marketing talk and download the free version) or Condor.
Perhaps Google can be credited with reigniting public interest in distributed computing through their seminal papers on the map-reduce architecture, a programming model for data processing on large clusters of several thousand consumer-grade computers (where hardware failure is the norm rather than the exception), and on their distributed database, BigTable.
Yahoo adopted Hadoop, an open source implementation of the map-reduce framework, as well as HBase, an open source implementation of BigTable. They run Hadoop on more than 100,000 CPUs across roughly 20,000 computers. Their biggest cluster has 2,000 nodes (2×4-CPU boxes with 4 TB of disk each). Yahoo primarily uses Hadoop to support research for Ad Systems and Web Search.
We used OpenPBS at DoubleTwist with forty 4-CPU Sun boxes and an E10000 server. Today, however, I opted for a Hadoop cluster of 16 consumer-grade machines for data mining. Why?
What makes today’s distributed computing frameworks different from (and better than?) the previous generation of software tools?
Designed for processing / data node failure
Any distributed computing software has to be designed so that the failure of individual nodes does not bring down the whole system. Ideally there will be a graceful degradation of service as more and more nodes go offline. A robust distributed computing system will recover from failed hardware and continue gracefully with the remaining hardware at its disposal.
Today we are dealing with several terabytes of web data; our computing requirements have increased exponentially, but the reliability of hardware hasn’t. Consumer-grade hardware is more likely to fail than the server-grade hardware that was more commonly used before. Today hardware failure is the norm. A good distributed system will let you hot-swap in new hardware, which is then added to the cluster dynamically as soon as it becomes available and ready for service.
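To make this concrete, here is a minimal Java sketch against Hadoop’s HDFS client API (my own illustration, not taken from any of the setups mentioned in this article). It writes a file with a replication factor of three, which is what lets the cluster survive the loss of a data node; the NameNode address and file path are placeholders.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicatedWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Keep three copies of every block; if a data node fails, the NameNode
    // re-replicates its blocks from the surviving copies, so the cluster
    // degrades gracefully instead of losing data.
    conf.setInt("dfs.replication", 3);

    // NameNode address and path below are placeholders for illustration.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
    FSDataOutputStream out = fs.create(new Path("/data/crawl/feeds.txt"));
    out.writeBytes("example record\n");
    out.close();
    fs.close();
  }
}
```

With three replicas per block, losing a single machine only triggers background re-replication; jobs keep running on the surviving nodes.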
Map-reduce
The map-reduce programming model, supported by today’s distributed frameworks like Hadoop and GridGain, makes it possible to distribute several kinds of tasks which were hard to distribute before. It helps Google analyze and index massive amounts of web data every day and also supports several of its services. Map-reduce is very well suited to processing this kind of web data.
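To see what the programming model looks like in practice, here is a minimal sketch of the canonical word-count job written against Hadoop’s Java MapReduce API (exact class names vary a little between Hadoop releases, so treat it as illustrative rather than exact): the map phase emits a (word, 1) pair for every token, the framework groups the pairs by key across the cluster, and the reduce phase sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The same split/shuffle/sum pattern spreads naturally over thousands of machines, which is exactly what makes it a good fit for web-scale data.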
Amazon EC2 - Cloud Computing
Amazon’s EC2 initiative in cloud computing deserves a mention. Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers.
Amazon EC2’s web service interface allows you to easily obtain and configure capacity. It reduces the time required to obtain and boot new server instances to minutes, allowing you to quickly scale capacity, both up and down, as your computing requirements change. You pay only for the capacity you actually use. Amazon EC2 also provides developers the tools to build failure-resilient applications and isolate themselves from common failure scenarios.
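As a rough illustration of that “resizable capacity”, here is a hedged sketch using the AWS SDK for Java (a client library that appeared after this post; at the time you would reach the same functionality through EC2’s SOAP/Query API or its command-line tools). The credentials, AMI ID and instance type below are placeholders, not real values.

```java
import java.util.List;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.Instance;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

public class ElasticCapacityDemo {
  public static void main(String[] args) {
    // Credentials and the AMI ID are placeholders for illustration only.
    AmazonEC2 ec2 = new AmazonEC2Client(
        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

    // Scale up: request one small instance from a (hypothetical) machine image.
    RunInstancesRequest request = new RunInstancesRequest()
        .withImageId("ami-12345678")
        .withInstanceType("m1.small")
        .withMinCount(1)
        .withMaxCount(1);
    List<Instance> launched =
        ec2.runInstances(request).getReservation().getInstances();
    String instanceId = launched.get(0).getInstanceId();
    System.out.println("Launched " + instanceId + "; billing starts now.");

    // Scale down: terminate the instance when the work is done,
    // so you pay only for the capacity you actually used.
    ec2.terminateInstances(
        new TerminateInstancesRequest().withInstanceIds(instanceId));
  }
}
```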
I can see huge resistance from companies to entrusting Amazon with their crucial business data. However, in some ways it is similar to dedicated web hosting.
My concerns are elsewhere. I am more comfortable with a constant cost factor than with a dynamic cost component which can vary rapidly with usage.
Having said that, consumer-grade cloud computing really has to be made much simpler, and Amazon’s EC2 is a step in the right direction.
Who needs distributed computing?
Everyone, including you. You can use it for anything from simple but time-consuming tasks like log file analysis to building complex pipelines that process web data or index it for specialized search engines. I would stake my reputation on the claim that anyone who provides some kind of web-based service relying on web data as a source needs to invest heavily in distributed computing.
Tags: Condor, Distributed Processing, GridGain, hadoop, High Throughput Computing, HTC, Open Source, OpenPBS