CDH3 on Ubuntu Precise Pangolin

I find myself typing the most absurd search strings (they read like lexical Tourette’s or XKCD passwords):

pseudocluster ubuntu precise pangolin cloudera cdh3

I spent a while getting my new laptop set up with a Cloudera CDH3 Hadoop pseudo-distributed cluster. I really wish I’d had the following instructions, distilled from the web with help from some of my friends.

I hope these are helpful to someone besides me.

Install the Oracle JDK

sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java6-installer

Edit every user’s default environment to use the Oracle flavor of Java 6:

sudoedit /etc/profile.d/java_home.sh

Include the following content:

if [ "${JAVA_HOME+is_set}" = is_set ]; then return; fi
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
export PATH=${JAVA_HOME}/bin:${PATH}

I recommend rebooting at this point. Do not edit /etc/environment to set these, unless you want to spend a day evaluating mysterious problems.

Do not skip this step. Installation of the repo below depends on JAVA_HOME being set for the root user at install time.
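
After the reboot, a quick sanity check can save some head-scratching. This is a rough sketch, not part of the steps above: java_home_ok is a helper name I made up, and all it checks is that the directory contains an executable bin/java.

```shell
# Hypothetical helper: succeeds if the given directory looks like a JDK home,
# i.e. it contains an executable bin/java.
java_home_ok() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}

if java_home_ok "${JAVA_HOME:-}"; then
    echo "JAVA_HOME looks good: ${JAVA_HOME}"
else
    echo "JAVA_HOME is unset or has no bin/java: '${JAVA_HOME:-}'"
fi
```

Run it in a fresh login shell so that /etc/profile.d/java_home.sh has had a chance to be sourced.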

Read from the Cloudera Debian repo

Update your key manager to handle Cloudera’s signature:

curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -

Recommended: install Cloudera’s repository-setting Debian package. I’m using their Maverick package, because they haven’t published one for Precise Pangolin.

Alternatively, include their sources list manually:

sudoedit /etc/apt/sources.list.d/cloudera-cdh3.list

Use the following content:

deb http://archive.cloudera.com/debian maverick-cdh3u5 contrib
deb-src http://archive.cloudera.com/debian maverick-cdh3u5 contrib
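
If you’d rather not type the sources list by hand, here is a small sketch that generates the two lines for a given release name. cloudera_sources is a hypothetical helper of my own, and the call below only previews the content; it writes nothing.

```shell
# Hypothetical helper: print the two apt source lines for a Cloudera release.
cloudera_sources() {
    release="$1"
    echo "deb http://archive.cloudera.com/debian ${release} contrib"
    echo "deb-src http://archive.cloudera.com/debian ${release} contrib"
}

# Preview the file content; nothing is written here.
cloudera_sources maverick-cdh3u5
```

To actually create the file, pipe the output through sudo tee /etc/apt/sources.list.d/cloudera-cdh3.list.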

Install the missing dependencies

For reasons unknown, at least one library that the CDH3 packages insist on is missing from Ubuntu Precise.

sudo apt-get update
sudo apt-get install libzip1

If this doesn’t work (and it may not for you):

wget http://launchpadlibrarian.net/48191694/libzip1_0.9.3-1_amd64.deb
sudo dpkg -i libzip1_0.9.3-1_amd64.deb
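
Note that the .deb above is an amd64 build, so it only suits a 64-bit install. A minimal sketch for reading the architecture out of a package filename follows; deb_arch_of is a hypothetical helper, and it assumes the usual name_version_arch.deb naming.

```shell
# Hypothetical helper: pull the architecture field out of a .deb filename
# of the usual name_version_arch.deb shape.
deb_arch_of() {
    base="${1##*/}"      # strip any leading directory
    base="${base%.deb}"  # strip the extension
    echo "${base##*_}"   # keep what follows the last underscore
}

deb_arch_of libzip1_0.9.3-1_amd64.deb
```

Compare the result with what dpkg --print-architecture reports on your machine before running dpkg -i.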

Install the pseudo-cluster and the native libraries:

sudo apt-get update
sudo apt-get install hadoop-0.20-conf-pseudo hadoop-0.20-native

Tell the configuration where Java is

Even the funky work you did above to modify /etc/profile.d/java_home.sh doesn’t give Hadoop the right environment in non-interactive shells, which is where the services run. You can grief yourself by tweaking /etc/environment, like I did (pro tip: don’t do that), or just tune your Hadoop configuration:

sudoedit /etc/hadoop-0.20/conf.pseudo/hadoop-env.sh

and include the line:

export JAVA_HOME=/usr/lib/jvm/java-6-oracle
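
If you script this instead of using an editor, it helps to make the append idempotent so re-running the setup doesn’t stack up duplicate lines. A sketch under that assumption; ensure_line is a hypothetical helper, and the demo runs against a scratch file rather than the real hadoop-env.sh.

```shell
# Hypothetical helper: append a line to a file unless that exact line is
# already present, so re-running it doesn't pile up duplicates.
ensure_line() {
    file="$1"
    line="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demo against a scratch file rather than the real hadoop-env.sh.
scratch=$(mktemp)
ensure_line "$scratch" 'export JAVA_HOME=/usr/lib/jvm/java-6-oracle'
ensure_line "$scratch" 'export JAVA_HOME=/usr/lib/jvm/java-6-oracle'
cat "$scratch"
rm -f "$scratch"
```

The real file is owned by root, so you would need to run the helper from a root shell to point it at /etc/hadoop-0.20/conf.pseudo/hadoop-env.sh.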

Confirm that pseudo-HDFS exists

$ ls /var/lib/hadoop-0.20/
cache

Start the services

$ for service in /etc/init.d/hadoop-0.20-*
> do sudo $service start; done
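
The same loop works for stopping things, and if you’re nervous about what the glob will match, a dry run is cheap. This is a sketch only: hadoop_service_cmds is a hypothetical helper that prints the commands instead of running them.

```shell
# Hypothetical helper: print the command that would run for each
# hadoop-0.20-* init script in a directory (default /etc/init.d).
hadoop_service_cmds() {
    action="$1"
    dir="${2:-/etc/init.d}"
    for service in "$dir"/hadoop-0.20-*; do
        [ -e "$service" ] || continue   # glob matched nothing
        echo "sudo $service $action"
    done
}

hadoop_service_cmds stop
```

Once the printed list looks right, pipe it into sh to run it for real.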

See the web consoles

The NameNode provides a web console at http://localhost:50070/ for HDFS.
The JobTracker provides a web console at http://localhost:50030/ for what’s active.

Run a test job

$ hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar pi 2 100000
...
