Massively Scaled Java Technology Deployment
By Angsuman Chakraborty, Gaea News Network
Wednesday, September 7, 2005
At Doubletwist Inc. we worked with 40 four-CPU Sun Ultra machines, each with 4 GB of RAM, to carry out annotation of the human genome. We were first, ahead of Celera and HGP.
At that time (2000-2001) it was possibly the largest massively scaled Java technology deployment. The first human genome annotation run took about 1.5 months. Even after several revisions, a run still took about a month with all that hardware plus an additional Sun UltraSPARC box.
Today I was reading about Become.com's web crawler deployment. It may be somewhat bigger in the data it handles, and it is an interesting example of a massively scaled deployment.
Become.com's decision to deploy Java technology followed the experience of the company's CTO, chairman, and cofounder, Yeogirl Yun, at Wisenut.com, where the team spent a year creating a C++ web crawler that had significant memory and threading problems.
“We needed to do it faster this time,” observes Yun. “So we made the radical decision to implement a crawler using Java technology. No one believed it was possible, but we were able to build the prototype crawler in three months using two developers, which was a major achievement. The built-in network library, multithreading framework, and RMI [remote method invocation] saved a lot of development time.”
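As a rough illustration of what the built-in multithreading framework buys a crawler team, here is a minimal sketch of a pooled fetcher. The seed URLs and pool size are placeholders, and this shows only the general approach, not Become.com's code.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Minimal multithreaded fetcher: a fixed pool of worker threads downloads
    // pages concurrently. Seed URLs and pool size are placeholders.
    public class SimpleFetcher {
        public static void main(String[] args) {
            List<String> seeds = Arrays.asList("http://example.com/", "http://example.org/");
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (String seed : seeds) {
                pool.submit(() -> {
                    try (BufferedReader in = new BufferedReader(
                            new InputStreamReader(new URL(seed).openStream()))) {
                        int chars = in.lines().mapToInt(String::length).sum();
                        System.out.println(seed + " fetched, " + chars + " chars");
                    } catch (Exception e) {
                        System.err.println(seed + " failed: " + e.getMessage());
                    }
                });
            }
            pool.shutdown();
        }
    }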
Become.com's crawler builds a web index, a searchable database, roughly every two weeks, and it searches for shopping-related information only. The fetcher, which itself stores no information, classifies every page it locates by running several checks on it: it determines the page type and language, filters out duplicates and spam, and identifies links, buying guides, expert reviews, forums, articles, and other relevant material. It then sends the results back to the crawl controller, which guides the crawl. Once the process is finished, it produces a database of all pages visited, ordered by URL. Although searches are currently limited to English, the crawler is built so that it can scale easily to other languages.
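Yun's mention of RMI hints at how the fetchers and the crawl controller might talk to each other. The interface below is a hypothetical sketch of such a contract; the names and method signatures are mine, not Become.com's.

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.util.List;

    // Hypothetical RMI contract between a fetcher and the crawl controller.
    // Method and parameter names are illustrative only.
    public interface CrawlController extends Remote {

        // Report the outcome of the checks run on one fetched page.
        void reportPage(String url, String pageType, String language,
                        boolean duplicateOrSpam, List<String> outgoingLinks)
                throws RemoteException;

        // Ask the controller for the next batch of URLs to fetch.
        List<String> nextUrls(int batchSize) throws RemoteException;
    }

A fetcher would look up a stub for such an interface from an RMI registry and call it after classifying each page, which is exactly the kind of plumbing RMI saves you from writing by hand.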
The gathered information then goes into an “inverted” index, currently covering 3.2 billion web pages, ordered not by URL but by keyword. The index is then fine-tuned using both expert feedback from the Become.com research team and page-value connectivity analysis, which notes how frequently other pages on the same topic link to a given page. The crawler takes about a week to complete its task, and all of this information feeds into the next crawl.
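To make the “inverted” part concrete, here is a toy example of the idea: instead of storing pages keyed by URL, the index maps each keyword to the pages that contain it. The class and the sample pages are made up, and a real index over billions of pages would be sharded and compressed.

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeSet;

    // Toy inverted index: maps each keyword to the set of URLs containing it.
    public class InvertedIndex {
        private final Map<String, Set<String>> index = new HashMap<>();

        // Tokenize the page text and record this URL under every keyword found.
        public void addPage(String url, String text) {
            for (String token : text.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, k -> new TreeSet<>()).add(url);
                }
            }
        }

        // Look up all URLs whose text contained the given keyword.
        public Set<String> lookup(String keyword) {
            return index.getOrDefault(keyword.toLowerCase(), Collections.emptySet());
        }

        public static void main(String[] args) {
            InvertedIndex idx = new InvertedIndex();
            idx.addPage("http://example.com/tv-guide", "Buying guide for flat screen TVs");
            idx.addPage("http://example.com/tv-review", "Expert review of a flat screen TV");
            System.out.println(idx.lookup("guide"));   // [http://example.com/tv-guide]
        }
    }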
In developing Crawler B, Bart Niechwiej tried out the java.nio library (NIO) and got better performance than with the multithreaded version. Unfortunately, some classes, such as URL, did not support NIO, so he implemented his own URL connection handling.
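For a feel of what writing URL handling directly on top of java.nio involves, here is a minimal sketch that fetches a page over a SocketChannel. The host is a placeholder, the channel is used in blocking mode for brevity, and a real crawler would register non-blocking channels with a Selector; this is not Niechwiej's implementation.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;
    import java.nio.charset.StandardCharsets;

    // Minimal NIO fetch: issue a raw HTTP GET over a SocketChannel and read
    // the response into a buffer. Blocking mode is used here for brevity.
    public class NioFetch {
        public static void main(String[] args) throws IOException {
            String host = "example.com";   // placeholder host
            try (SocketChannel channel = SocketChannel.open(new InetSocketAddress(host, 80))) {
                String request = "GET / HTTP/1.0\r\nHost: " + host + "\r\n\r\n";
                channel.write(ByteBuffer.wrap(request.getBytes(StandardCharsets.US_ASCII)));

                ByteBuffer buffer = ByteBuffer.allocate(8192);
                StringBuilder response = new StringBuilder();
                while (channel.read(buffer) != -1) {   // read until the server closes
                    buffer.flip();
                    response.append(StandardCharsets.US_ASCII.decode(buffer));
                    buffer.clear();
                }
                System.out.println(response.substring(0, Math.min(200, response.length())));
            }
        }
    }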
He used Tomcat for his statistics server, and the fetchers required 20 GB of memory in total, running on 10 separate 32-bit machines with 2 GB each.