Blog entries published in 2013
Feeds: RSS | Atom

Fancontrol not working after resume on Ubuntu

Published: 2013-03-02 08:18 UTC. Tags: linux

Bought a new computer. Had some trouble with the fan controller built into the chassis, so got a couple of PWM fans instead since the motherboard can control 1 CPU and 3 chassis PWM fans.

The BIOS however was a bit limited when it came to how slow you could make the fan run. So turned to the fancontrol package in ubuntu, and after some fiddling it worked as intended, even turning off the case fan when the temperature was below the configured threshold.

However, after suspending then resuming, the fan would go at 100% again, and not spin down. There's a launchpad bug that tells me I'm not the only one with this problem.

Here's a workaround. Create /etc/pm/sleep.d/20_fancontrol with the following contents:


case "${1}" in
      /usr/sbin/service fancontrol restart

This will restart the fancontrol service after resume, which solves the problem. The fan will run at 100% for a little while at resume, since it takes a couple of seconds before this script is being run.


ZooKeeper failing to elect leader due to initLimit being too small

Published: 2013-02-09 19:00 UTC. Tags: zookeeper

At work, I use Apache ZooKeeper to coordinate a distributed service. I find ZooKeeper very easy to work with and program against, but as all software, it can be troublesome now and then.

I have a 3-node ZooKeeper cluster that was behaving very oddly the other day. It started with one of the nodes going down due to hardware trouble. This is supposed to be no problem since 2/3 nodes are still up and form a quorum. However, the whole service stopped serving clients.

At the time the node that went down crashed, it was the LEADING node of the cluster, with server id being 3. This meant another node needed to be elected as LEADER. The node with server id=2 was elected as leader, but failed to successfully establish leadership with a rather confusing error message in the log (/var/log/zookeeper/zookeeper.log):

2013-02-07 01:42:09,336 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@655] - LEADING
2013-02-07 01:42:09,336 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@154] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /var/zookeeper/version-2 snapdir /var/zookeeper/version-2
2013-02-07 01:42:09,342 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot /var/zookeeper/version-2/snapshot.70028d3e5
2013-02-07 01:42:13,407 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@256] - Snapshotting: 70028d3e5
2013-02-07 01:42:23,864 - INFO  [LearnerHandler-/] - Follower sid: 1 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@7ab2c6a6
2013-02-07 01:42:23,865 - INFO  [LearnerHandler-/] - Synchronizing with Follower sid: 1 maxCommittedLog =0 minCommittedLog = 0 peerLastZxid = 70028d3cb
2013-02-07 01:42:23,865 - INFO  [LearnerHandler-/] - Sending snapshot last zxid of peer is 0x70028d3cb  zxid of leader is 0x800000000sent zxid of db as 0x70028d3e5
2013-02-07 01:42:47,691 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Leader@413] - Shutdown called
java.lang.Exception: shutdown Leader! reason: Waiting for a quorum of followers, only synced with: 2:
        at org.apache.zookeeper.server.quorum.Leader.shutdown(
        at org.apache.zookeeper.server.quorum.Leader.lead(

Tried googling on this, but didn't get any helpful hits. I tried all kinds of tricks to get the service up, then started looking into the source code, especially the lines that are mentioned in the exception message.

Turns out I had a misconfiguration. In /etc/zookeeper/zoo.cfg there's a parameter initLimit described as follows:

# The number of ticks that the initial
# synchronization phase can take

In my setup, I had the default value (10) set for this parameter. Looking at the administrator's guide for the version of zookeeper I'm running, it describes initLimit as follows:

"Amount of time, in ticks (see tickTime), to allow followers to connect and sync to a leader. Increased this value as needed, if the amount of data managed by ZooKeeper is large."

That particular ZooKeeper cluster has several hundred thousand objects, with a database size of roughly 150MiB. I guess that is counted as a large amount of data in ZooKeeper.

I increased my initLimit to 100, which made the problem go away, the server started fine and my cluster was able to go into a healthy state and start serving data again.

What happened here was that the server that was being elected as leader (with server id 2) was elected leader. It started sending a snapshot of the database to its follower (with server id 1), but before that completed and the follower reported itself as ready and following, the initLimit timeout was reached, and the leader thread decided it had to give up, since it was only synced with server id 2 (itself). So increasing initLimit to a value that allowed the snapshot transfer to complete fixed this problem.