
Zookeeper + Kerberos Ticket Cache

Published: 2015-04-30 09:49 UTC. Tags: java kerberos

I was trying to connect to ZooKeeper with Kerberos authentication. This gave me the following "helpful" error message:

[main-SendThread($ClientCallbackHandler@459] -
Could not login: the client is being asked for a password, but the Zookeeper client code does
not currently support obtaining a password from the user. Make sure that the client is
configured to use a ticket cache (using the JAAS configuration setting 'useTicketCache=true)'
and restart the client. If you still get this message after that, the TGT in the ticket cache
has expired and must be manually refreshed. To do so, first determine if you are using a
password or a keytab. If the former, run kinit in a Unix shell in the environment of the
user who is running this Zookeeper client using the command 'kinit <princ>'
(where <princ> is the name of the client's Kerberos principal). If the latter,
do 'kinit -k -t <keytab> <princ>' (where <princ> is the name of the Kerberos principal,
and <keytab> is the location of the keytab file). After manually refreshing your cache,
restart this client. If you continue to see this message after manually refreshing your cache,
ensure that your KDC host's clock is in sync with this host's clock.

[Krb5LoginModule] authentication failed

I was using the following command line:

CLIENT_JVMFLAGS="" zookeeper-client -server

With jaas.conf containing:

Client { required

Not a very helpful error message. Adding to the cmdline gave more information:

unsupported key type found the default TGT: 18

This combined with made me realize I had forgotten to install the "Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files" on this particular machine.

Problem solved.


Fancontrol not working after resume on Ubuntu

Published: 2013-03-02 08:18 UTC. Tags: linux

Bought a new computer. Had some trouble with the fan controller built into the chassis, so got a couple of PWM fans instead since the motherboard can control 1 CPU and 3 chassis PWM fans.

The BIOS, however, was a bit limited when it came to how slow you could make the fans run. So I turned to the fancontrol package in Ubuntu, and after some fiddling it worked as intended, even turning off the case fan when the temperature was below the configured threshold.

However, after suspending then resuming, the fan would go at 100% again, and not spin down. There's a launchpad bug that tells me I'm not the only one with this problem.

Here's a workaround. Create /etc/pm/sleep.d/20_fancontrol with the following contents:


#!/bin/sh
case "${1}" in
    resume|thaw) /usr/sbin/service fancontrol restart ;;
esac

This restarts the fancontrol service after resume, which solves the problem. The fan will still run at 100% for a little while at resume, since it takes a couple of seconds before this script is run.


ZooKeeper failing to elect leader due to initLimit being too small

Published: 2013-02-09 19:00 UTC. Tags: zookeeper

At work, I use Apache ZooKeeper to coordinate a distributed service. I find ZooKeeper very easy to work with and program against, but like all software, it can be troublesome now and then.

I have a 3-node ZooKeeper cluster that was behaving very oddly the other day. It started with one of the nodes going down due to hardware trouble. This is supposed to be no problem since 2/3 nodes are still up and form a quorum. However, the whole service stopped serving clients.

The node that crashed, with server id 3, was the LEADING node of the cluster at the time. This meant another node needed to be elected as LEADER. The node with server id 2 was elected, but failed to establish leadership, leaving a rather confusing error message in the log (/var/log/zookeeper/zookeeper.log):

2013-02-07 01:42:09,336 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@655] - LEADING
2013-02-07 01:42:09,336 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@154] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /var/zookeeper/version-2 snapdir /var/zookeeper/version-2
2013-02-07 01:42:09,342 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot /var/zookeeper/version-2/snapshot.70028d3e5
2013-02-07 01:42:13,407 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@256] - Snapshotting: 70028d3e5
2013-02-07 01:42:23,864 - INFO  [LearnerHandler-/] - Follower sid: 1 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@7ab2c6a6
2013-02-07 01:42:23,865 - INFO  [LearnerHandler-/] - Synchronizing with Follower sid: 1 maxCommittedLog =0 minCommittedLog = 0 peerLastZxid = 70028d3cb
2013-02-07 01:42:23,865 - INFO  [LearnerHandler-/] - Sending snapshot last zxid of peer is 0x70028d3cb  zxid of leader is 0x800000000sent zxid of db as 0x70028d3e5
2013-02-07 01:42:47,691 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Leader@413] - Shutdown called
java.lang.Exception: shutdown Leader! reason: Waiting for a quorum of followers, only synced with: 2:
        at org.apache.zookeeper.server.quorum.Leader.shutdown(
        at org.apache.zookeeper.server.quorum.Leader.lead(

Tried googling on this, but didn't get any helpful hits. I tried all kinds of tricks to get the service up, then started looking into the source code, especially the lines that are mentioned in the exception message.

Turns out I had a misconfiguration. In /etc/zookeeper/zoo.cfg there's a parameter initLimit described as follows:

# The number of ticks that the initial
# synchronization phase can take

In my setup, I had the default value (10) set for this parameter. Looking at the administrator's guide for the version of zookeeper I'm running, it describes initLimit as follows:

"Amount of time, in ticks (see tickTime), to allow followers to connect and sync to a leader. Increased this value as needed, if the amount of data managed by ZooKeeper is large."

That particular ZooKeeper cluster has several hundred thousand objects, with a database size of roughly 150MiB. I guess that is counted as a large amount of data in ZooKeeper.

I increased my initLimit to 100, which made the problem go away: the server started fine, and my cluster was able to go into a healthy state and start serving data again.

What happened here was that the server with id 2 was elected leader. It started sending a snapshot of the database to its follower (with server id 1), but before that completed and the follower reported itself as ready and following, the initLimit timeout was reached, and the leader thread decided it had to give up, since it was only synced with server id 2 (itself). So increasing initLimit to a value that allowed the snapshot transfer to complete fixed this problem.
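
Since initLimit is counted in ticks, the time window it allows is initLimit multiplied by tickTime. The sketch below works that out for the tickTime of 2000 ms seen in the log above. The helper names and the headroom factor are my own assumptions for illustration, not anything ZooKeeper provides.

```python
import math


def sync_window_seconds(init_limit_ticks, tick_time_ms):
    """Time followers get to connect and sync to a leader, in seconds."""
    return init_limit_ticks * tick_time_ms / 1000.0


def min_init_limit(expected_sync_seconds, tick_time_ms, headroom=2.0):
    """Smallest initLimit (in ticks) that covers the expected sync time.

    headroom is an arbitrary safety factor (an assumption, tune as needed).
    """
    return int(math.ceil(expected_sync_seconds * headroom * 1000.0 / tick_time_ms))


TICK_TIME_MS = 2000  # tickTime as reported in the log above

print(sync_window_seconds(10, TICK_TIME_MS))   # default initLimit: 20.0 seconds
print(sync_window_seconds(100, TICK_TIME_MS))  # after the change: 200.0 seconds
```

With the default of 10 ticks the follower had about 20 seconds to receive and load the snapshot, which was not enough for a database of this size.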


Java SIGBUS - an unclear way of saying /tmp is full

Published: 2011-05-02 19:27 UTC. Tags: linux java

I had the following happen for every new java process on one of my servers the other day:

server:~$ java
# A fatal error has been detected by the Java Runtime Environment:
#  SIGBUS (0x7) at pc=0x00007f3e0c5aad9b, pid=17280, tid=139904457242368
# JRE version: 6.0_24-b07
# Java VM: Java HotSpot(TM) 64-Bit Server VM (19.1-b02 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  []  memset+0xa5b
# An error report file with more information is saved as:
# /home/user/hs_err_pid17280.log
Segmentation fault

Turns out this is Java's way of telling you that the /tmp directory is full. It tries to mmap some performance/hotspot-related file in /tmp, which succeeds, but when it then tries to access the mapped area, it gets the SIGBUS signal.

More info here
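
A quick way to rule this cause in or out is to look at the free space on the filesystem holding /tmp. This is a minimal sketch. The function name and the 64 MiB threshold are arbitrary assumptions of mine, not a limit the JVM documents.

```python
import shutil


def tmp_is_nearly_full(path="/tmp", min_free_bytes=64 * 1024 * 1024):
    """Return (nearly_full, free_bytes) for the filesystem containing path.

    min_free_bytes is an arbitrary threshold chosen for illustration.
    """
    usage = shutil.disk_usage(path)
    return usage.free < min_free_bytes, usage.free


if __name__ == "__main__":
    nearly_full, free_bytes = tmp_is_nearly_full()
    print("free bytes on /tmp:", free_bytes)
    if nearly_full:
        print("/tmp is (almost) full, new java processes may die with SIGBUS")
```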


Hadoop Streaming Error Codes

Published: 2011-01-31 08:12 UTC. Tags: hadoop

I'm using Hadoop Streaming a lot. Its exit codes have been something of a mystery, so today I decided to find out by looking at the source code.

The exit codes are listed in, and are as follows:

  0. Success
  1. Job not successful, i.e. something went wrong with M/R code.
  2. Bad input path
  3. Invalid jobconf
  4. Output path already exists
  5. Error launching job. Could be any error, for example some HDFS communication error.
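
For scripts that wrap streaming jobs, the list can be turned into a small lookup table. The sketch below assumes the codes run from 0 (success) to 5 in the order listed, which matches the usual convention that a successful process exits with 0. The function names are hypothetical, and the command line to run is left to the caller.

```python
import subprocess

# Assumption: zero-based exit codes, in the order of the list above.
STREAMING_EXIT_CODES = {
    0: "Success",
    1: "Job not successful",
    2: "Bad input path",
    3: "Invalid jobconf",
    4: "Output path already exists",
    5: "Error launching job",
}


def describe_exit_code(code):
    """Translate a streaming exit code into a human-readable message."""
    return STREAMING_EXIT_CODES.get(code, "Unknown exit code %d" % code)


def run_streaming(argv):
    """Run a streaming command line and return (exit code, description)."""
    code = subprocess.call(argv)
    return code, describe_exit_code(code)
```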

Continuous Integration with Hudson - embarrassingly simple!

Published: 2011-01-27 19:24 UTC. Tags: open source software testing

I'm working on a rather large reporting and analytics application that runs on top of Hadoop at work. It has tests. A whole bunch of them, actually. That's good.

So far, we've been running the tests manually when making new releases. But doing it more often is always better, since it gives you an indication of when things went wrong, and also forces you to keep your tests in a state where they pass. Some people call it Continuous Integration.

Now, you can do all the work getting your builds to build and run tests yourself, via cron and scripts and other types of messiness. Or you can try an existing solution. Today I decided to try Hudson.

That turned out to be embarrassingly simple to get started with. Basically, it's a matter of:

  1. Download hudson.war from their site.
  2. Start it by running java -jar hudson.war
  3. Go to http://localhost:8080 with a web browser of your choice. That would be Opera in my case. You have to eat your own dog-food.
  4. Go to the Hudson management screen and enable the git plugin
  5. Set up a new project. Tell it where the code is and on which branch.
  6. Configure what commands to run to build and test. Make the test command output an xunit xml file.
  7. Tell Hudson where that xml file is.

Result: Hudson will periodically poll git and run my build and test commands, then show a changelog and what tests failed. All this after 30 minutes of setup time. I'm impressed.


Slow Puppetmaster? Check your reverse DNS

Published: 2011-01-13 19:26 UTC. Tags: puppet

Yesterday some of the servers I care for at work were moved to a different network. After the move, all puppetd runs started to take a very long time. Where it would usually take 10-15 seconds, it now timed out with errors like:

Jan 12 19:39:16 host1 puppetd[15760]: Calling puppetmaster.getconfig
Jan 12 19:41:16 host1 puppetd[15760]: Configuration retrieval timed out

(Note the two minutes between the informational message about calling puppetmaster.getconfig, and the timeout)

Highly confusing, especially since puppetd was slow not only on hosts which had moved to the new network, but also on hosts which had not moved.

The reason turned out to be slow reverse DNS for the new network range. Puppetmaster, it seems, is doing lots and lots of DNS lookups for clients, and that seems to be a synchronous operation. I think what caused all hosts to slow down was that puppetmaster got busy looking up one of the hosts on the new network, and that would cause the request from a host that had not moved to be put on hold.
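
To check whether reverse DNS is the culprit, it helps to time a reverse lookup for a client address from the puppetmaster host. This is a minimal sketch using Python's standard socket module. The function name is mine, and the address in the example run is a documentation-range placeholder, not one of the hosts in question.

```python
import socket
import time


def time_reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return (hostname or None, elapsed seconds) for a reverse lookup of ip."""
    start = time.monotonic()
    try:
        name = resolver(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError
        name = None
    return name, time.monotonic() - start


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder; substitute a real client address.
    name, elapsed = time_reverse_lookup("192.0.2.10")
    print("%s resolved in %.2f seconds" % (name, elapsed))
```

A lookup that takes seconds rather than milliseconds points at the kind of DNS trouble described here.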

Fixing the DNS issue solved the problem.

This is on puppet 0.24.5. Later versions might behave better.


Hadoop lesson learnt: Restart datanodes after modifying dfs.balance.bandwidthPerSec

Published: 2010-09-10 13:17 UTC. Tags: hadoop

I was rebalancing one of the Hadoop clusters I run at work. It was not running very fast, so I modified the appropriate setting:

  <!-- 100Mbit/s -->

I restarted the namenode and thought that would do the trick. But no, you also need to restart all your datanodes for the setting to take effect. Now I can see some action on my network graphs :-).
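
One thing to watch out for is that dfs.balance.bandwidthPerSec is given in bytes per second, not bits. The sketch below converts a rate in Mbit/s and prints a property element in the usual Hadoop configuration layout. The helper names are my own, and the printed value is simply 100 Mbit/s expressed in bytes, not necessarily the exact number used on this cluster.

```python
def mbit_to_bytes_per_sec(mbit_per_sec):
    """Convert megabits per second (10**6 bits) to bytes per second."""
    return mbit_per_sec * 1000 * 1000 // 8


def property_xml(name, value):
    """Render one Hadoop configuration <property> element."""
    return ("<property>\n"
            "  <name>%s</name>\n"
            "  <value>%s</value>\n"
            "</property>" % (name, value))


print(property_xml("dfs.balance.bandwidthPerSec", mbit_to_bytes_per_sec(100)))
```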


Whenever You Need a Random Password

Published: 2010-04-14 20:28 UTC. Tags: open source

apt-get install pwgen


Command Line Copy and Paste in Gnome Terminal

Published: 2010-04-10 11:08 UTC. Tags: linux

In the category Stuff I really should have learned several years ago, I now know that the keyboard combinations for copying and pasting in gnome-terminal are Shift-Control-C and Shift-Control-V.

Now, if I could find out how to select text without using the mouse...


Forsberg's Law on Cron Jobs

Published: 2010-02-19 09:45 UTC. Tags: software

They never work as intended the first four times you run them.


Backup of MySQL via phpMyAdmin

Published: 2010-02-06 19:51 UTC. Tags: misctools python

My girlfriend runs a blog on a cheap hosting firm that doesn't provide any way of doing proper SQL dumps of the MySQL database used by the blogging software.

There are plugins for Wordpress that can do full backups, but I prefer doing raw SQL dumps + a filesystem backup. That way, you know what you get, and you don't have to trust the backup plugin author to do it right.

The hosting firm does provide access to a phpMyAdmin installation which you can use to download SQL dumps. The trick is of course to do this automatically, as good backups need to be unattended.

I wrote a python program that can do this, using what turned out to be an excellent library for programmatic web browsing: mechanize.

The backup script is available in my misctools project on GitHub.


Easy Update of Slicehost DNS Entries

Published: 2010-02-06 18:43 UTC. Tags: misctools python

This website runs on a virtual machine I buy from Slicehost. I've also chosen to use their DNS servers for my domain - the service is stable and included in the price.

The Slicehost DNS can be modified using the Slicehost API. I wrote two small scripts for easy modification of Slicehost DNS entries from the commandline or from scripts.

  • update_entry, for adding or updating existing entries.
  • dhclient_update_hook, which very easily can be used to update an entry from a dhclient script, to keep records that point to dynamic addresses updated automatically.

Both are available by cloning my misctools project at GitHub.


PostgreSQL/Python/psycopg2: Confusing error, port setting required for socket connections

Published: 2010-02-06 13:43 UTC. Tags: django python

When trying to get my local development copy of this website running after upgrading my Ubuntu, I got the following confusing error message from the psycopg2 python module:

psycopg2.OperationalError: could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket  "/var/run/postgresql/.s.PGSQL.5432"?

My django settings file was correct:

DATABASE_ENGINE = 'postgresql_psycopg2'  # 'postgresql_psycopg2', 'postgresql', 'mysql', 'sqlite3' or 'oracle'.
DATABASE_NAME = 'dbname'                 # Or path to database file if using sqlite3.
DATABASE_USER = 'dbuser'                 # Not used with sqlite3.
DATABASE_PASSWORD = 'dbpassword'         # Not used with sqlite3.
DATABASE_HOST = ''                       # Set to empty string for localhost. Not used with sqlite3.
DATABASE_PORT = ''                       # Set to empty string for default. Not used with sqlite3.

Confusing, since my Postgres server was running and I could connect using psql:

psql -U dbuser -W dbname

This turned out to be one of those problems where Google is of no help - others had the same problem, but I could only find posts where people asked the question, no posts where the actual solution was found.

The cause of the problem was that my PostgreSQL installation was configured to listen on port 5433 instead of the default 5432, and as seen in the error message, the port number is part of the path to the unix socket. The different port was probably set up when I upgraded my Ubuntu, since that installed PostgreSQL 8.4 without completely removing PostgreSQL 8.3. The latter is configured to listen on the default port.

The solution is either to configure the running PostgreSQL to listen on port 5432 by modifying /etc/postgresql/8.4/main/postgresql.conf, or to modify the Django configuration by setting the port:
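
A minimal sketch of the Django side, assuming the running server listens on port 5433 as described above. The helper pg_socket_path is hypothetical and only there to show how the port ends up in the socket path from the error message. The directory is the one from that message.

```python
# In the Django settings file: name the port explicitly instead of
# leaving it empty (assumption: the running cluster listens on 5433).
DATABASE_PORT = '5433'


def pg_socket_path(port, socket_dir="/var/run/postgresql"):
    """Build the Unix domain socket path the client will try to open."""
    return "%s/.s.PGSQL.%s" % (socket_dir, port or 5432)


print(pg_socket_path(''))             # the path from the error message
print(pg_socket_path(DATABASE_PORT))  # the path the running server uses
```

With an empty port setting, the client looks for the socket of the default port, which is why the error appears even though a server is running.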


Deleting Amazon S3 buckets using Python

Published: 2009-08-09 10:38 UTC. Tags: software misctools

For a while, I used Duplicity to make backups to an Amazon S3 bucket. That kind of worked, but I had to do a lot of scripting myself to get it working automatically, so after finding out about Jungledisk, I switched to that. Jungledisk has a nice little desktop applet that keeps track of doing my backups while my computer is on, etc. That's convenient.

Anyway, the Duplicity/S3 experiments left me with an Amazon S3 bucket with about 9000 objects. Getting rid of that proved to be something of a challenge - you have to delete all objects inside the bucket before you can delete the bucket itself, and there's no API call for doing that. I also tried the web application for managing buckets, S3FM, but that didn't cope too well with that many objects - my web browser just hung.

I have to admit I could have put more effort into googling before solving it by writing my own script - but writing my own script was more fun :-).

My script managed to delete all 9000 objects without trouble, although it did take quite a while to complete - I let it run overnight.
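
For illustration, here is a minimal sketch of the empty-then-delete approach. It is not the script from this post: it assumes a client object that behaves like a present-day boto3 S3 client (list_objects_v2, delete_objects, delete_bucket) and an unversioned bucket. The function name is my own.

```python
def empty_and_delete_bucket(client, bucket):
    """Delete every object in bucket, then the bucket itself.

    client is assumed to behave like a boto3 S3 client.
    Returns the number of objects deleted.
    """
    deleted = 0
    while True:
        # A listing returns at most 1000 keys, so keep asking until empty.
        page = client.list_objects_v2(Bucket=bucket)
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if not keys:
            break
        # One delete request also accepts at most 1000 keys.
        response = client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
        if response.get("Errors"):
            raise RuntimeError("could not delete: %r" % response["Errors"])
        deleted += len(keys)
    client.delete_bucket(Bucket=bucket)
    return deleted
```

It could be called as empty_and_delete_bucket(boto3.client("s3"), "my-old-backups"), where the bucket name is a placeholder. This sketch is independent of the script described in this post.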

If you need to do the same thing, it's available here:

StackOverflow has several other solutions: