Micro Guide To Distributed Batch Processing With PBS (Portable Batch System)
By Angsuman Chakraborty, Gaea News Network, Sunday, February 25, 2007
PBS is an old name in distributed batch processing circles. We used it earlier at DoubleTwist for genomic annotations. Portable Batch System (or simply PBS) is a software job scheduler that allocates networked resources to batch jobs. It can schedule jobs to execute on networked, multi-platform UNIX environments (and on Windows if you are using PBSPro). It is a superb environment for running discrete jobs, and it is language independent: jobs can be written in a wide variety of languages such as Java or Perl (we used both).
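To make the idea of a batch job concrete, here is a minimal sketch in Python of how a job script might be put together and handed to the scheduler. The job name, the Perl command, and the helper names build_pbs_script and submit are hypothetical and chosen only for illustration. The "#PBS" directives and the nodes/ppn resource syntax are the ones commonly used with OpenPBS and TORQUE, but your site may expect different resource requests.

```python
import shutil
import subprocess


def build_pbs_script(job_name, command, nodes=1, ppn=1, walltime="01:00:00"):
    """Return the text of a PBS job script that runs `command`.

    All argument values are illustrative; adjust them to your cluster.
    """
    lines = [
        "#!/bin/sh",
        f"#PBS -N {job_name}",               # job name shown by qstat
        f"#PBS -l nodes={nodes}:ppn={ppn}",  # nodes and processors per node
        f"#PBS -l walltime={walltime}",      # maximum run time
        "#PBS -j oe",                        # merge stderr into stdout
        'cd "$PBS_O_WORKDIR"',               # directory qsub was run from
        command,                             # the actual work (any language)
    ]
    return "\n".join(lines) + "\n"


def submit(script_text):
    """Pipe the script to qsub if it is installed; return its output or None."""
    if shutil.which("qsub") is None:
        return None  # no PBS client on this machine
    result = subprocess.run(
        ["qsub"], input=script_text, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()  # qsub prints the new job identifier


if __name__ == "__main__":
    # Hypothetical job: a Perl annotation script using two processors.
    print(build_pbs_script("annotate", "perl annotate.pl input.fa", ppn=2), end="")
```

Calling submit(script) on a machine with the PBS client tools installed would pass the script to qsub on standard input. On any other machine it simply returns None, so the script text can be inspected without a cluster.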
For a long time PBS was free and openly available. Recently, however, it has been hidden behind compulsory registration, which has to be manually reviewed (with the promise that approval takes several days). In short, lots of pain. An alternative is to buy PBSPro at a premium price. However, I have better alternatives for you.
You can easily download OpenPBS here. It contains the latest available version of OpenPBS. OpenPBS is not currently under active development by its owner, whose resources are focused on PBSPro. You can download PBS patches from Argonne National Laboratory.
A better alternative is to use Torque Resource Manager (TORQUE), a community effort based on the PBS project. It has been significantly improved over the years: it includes more than 1,200 patches and has incorporated significant advances in scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the U.S. Dept of Energy, Sandia, PNNL, U of Buffalo, TeraGrid, and many other leading-edge HPC organizations.
TORQUE provides enhancements over standard OpenPBS in the following areas:
* Fault Tolerance
    o Additional failure conditions checked/handled
    o Node health check script support
* Scheduling Interface
    o Extended query interface providing the scheduler with additional and more accurate information
    o Extended control interface allowing the scheduler increased control over job behavior and attributes
    o Allows the collection of statistics for completed jobs
* Scalability
    o Significantly improved server to MOM communication model
    o Ability to handle larger clusters (over 15 TF/2,500 processors)
    o Ability to handle larger jobs (over 2,000 processors)
    o Ability to support larger server messages
* Usability
    o Extensive logging additions
    o More human-readable logging (i.e. no more 'error 15038 on command 42')
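As an illustration of the node health check support listed above, here is a minimal sketch of what such a script could look like in Python. It assumes the TORQUE convention that a health check script reports a problem by printing a line beginning with ERROR, and that the script is registered through the $node_check_script setting in the MOM configuration. Verify both against the documentation for your TORQUE version. The scratch path and the 1 GiB threshold are hypothetical values.

```python
import shutil
import sys

MIN_FREE_BYTES = 1 * 1024**3  # hypothetical threshold: 1 GiB of free space
SCRATCH_PATH = "/tmp"         # hypothetical scratch directory used by jobs


def check_node(path=SCRATCH_PATH, min_free=MIN_FREE_BYTES):
    """Return an error message if the node looks unhealthy, else None."""
    try:
        free = shutil.disk_usage(path).free
    except OSError as exc:
        return f"cannot stat {path}: {exc}"
    if free < min_free:
        return f"low disk space on {path}: {free} bytes free"
    return None


if __name__ == "__main__":
    problem = check_node()
    if problem:
        # Assumed convention: a line starting with ERROR flags the node.
        print(f"ERROR {problem}")
    sys.exit(0)
```

A real check would usually cover more than disk space (memory, mounted filesystems, stray processes), but the shape stays the same: run a few cheap tests and print one line when something is wrong.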
My personal recommendation is that you should go ahead and adopt Torque if you foresee long involvement with distributed batch processing.
For short stints OpenPBS might suit you just fine. In that case, however, you may also consider CHAOS as a simpler solution.
Note: CHAOS is a secure, openMosix-based Linux clustering solution that is fundamentally different from PBS and potentially faster.