Apache ZooKeeper 3.1.1 Released - High-Performance Coordination Service for Distributed Applications

By Angsuman Chakraborty, Gaea News Network
Saturday, March 28, 2009

zookeeper - high performance distributed coordination serviceZooKeeper is a high-performance coordination service for distributed applications. It exposes common services - such as naming, configuration management, synchronization, and group services - in a simple interface so you don’t have to write them from scratch. You can use it off-the-shelf to implement consensus, group management, leader election, and presence protocols. And you can build on it for your own, specific needs.

The ZooKeeper implementation puts a premium on high performance, highly available, strictly ordered access. The performance aspects of ZooKeeper means it can be used in large, distributed systems. The reliability aspects keep it from being a single point of failure. The strict ordering means that sophisticated synchronization primitives can be implemented at the client.

ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchal namespace which is organized similarly to a standard file system. Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a sets of hosts called an ensemble.

Zookeeper, unlike many Open Source Software, is an excellent product and enhancement for Hadoop.

The key changes in this new release are:

Allow specialization of quorum config parsing (e.g. variable expansion in zoo.cfg)
We would like to define certain parts of ZooKeeper’s configuration using variables that get substituted. For example, we want the ZooKeeper quorum to be able to use a dataDir configured per user.
autoreconf fails for /zookeeper-3.0.1/src/c/
core dump using zoo_get_acl()
The zookeeper_process() function used to incorrectly call the c.acl_result member of the completion_list_t structure when handling the completion from a synchronous zoo_get_acl() request. The c.acl_result member is set to SYNCHRONOUS_MARKER, which is a null pointer.
Problem with successive leader failures when no client is connected
This was a very interesting bug, found by Sunanda Bera. Here is how you can reproduce it in older versions:
Create a 3 node cluster . Run some transactions and then stop all clients. Make sure no other clients connect for the duration of the test.Let L1 be the current leader. Bring down L1. Let L2 be the leader chosen. Let the third node be N3. Note that this will increase the txn id for N3’s snapshot without any transaction being logged. Now bring up L1 – same will happen for L1. Now bring down L2.

Both N3 and L1 now have snapshots with a transaction id greater than the last logged transaction. Whoever is elected leader will try to restore its state from the filesystem and fail.

add locking around auth info in zhandle_t
Looking over the zookeeper.c code it appears that the zoo_add_auth() function may be called at any time by the user in their “main” thread. This function alters the elements of the auth_info structure in the zhandle_t structure.Meanwhile, the IO thread may read those elements at any time in such functions as send_auth_info() and auth_completion_func(). It seems important, then, to add a lock which prevents data being read by the IO thread while only partially changed by the user’s thread.

call auth completion in free_completions()
If a client calls zoo_add_auth() with an invalid scheme (e.g., “foo”) the ZooKeeper server will mark their session expired and close the connection. However, the C client has returned immediately after queuing the new auth data to be sent with a ZOK return code.If the client then waits for their auth completion function to be called, they can wait forever, as no session event is ever delivered to that completion function. All other completion functions are notified of session events by free_completions(), which is called by cleanup_bufs() in handle_error() in handle_socket_error_msg().

In actual fact, what can happen (about 50% of the time, for me) is that the next call by the IO thread to flush_send_queue() calls send() from within send_buffer(), and receives a SIGPIPE signal during this send() call. Because the ZooKeeper C API is a library, it properly does not catch that signal. If the user’s code is not catching that signal either, they experience an abort caused by an untrapped signal. If they are ignoring the signal – which is common in context I’m working in, the Apache httpd server – then flush_send_queue()’s error return code is EPIPE, which is logged by handle_socket_error_msg(), and all non-auth completion functions are notified of a session event. However, if the caller is waiting for their auth completion function, they wait forever while the IO thread tries repeatedly to reconnect and is rejected by the server as having an expired session.

standalone server ignores tickTime configuration
When using the ZooKeeper server in standalone mode, it ignores the tickTime setting in the configuration file and uses the DEFAULT_TICK_TIME of 3000 coded into ZooKeeperServer.java.
zookeeper standalone server does not startup with just a port and datadir.
This was a regression defect.
c client issues (memory leaks) reported by valgrind
helgrind thread issues identified in mt c client code
Several race conditions.
regression in QuorumPeerMain, tickTime from config is lost, cannot start quorum
ZOOKEEPER 330/336 caused a regression in QuorumPeerMain – cannot reliably start a cluster due to missing tickTime.
will not be displayed