I find myself typing the most absurd search strings (they read like lexical Tourette’s or XKCD passwords):
pseudocluster ubuntu precise pangolin cloudera cdh3
I spent a while getting my new laptop set up with a Cloudera CDH3 Hadoop pseudo-distributed cluster. I really wish I’d had the following instructions, distilled from the web with help from some of my friends.
I hope these are helpful to someone besides me.
Install the Oracle JDK
sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java6-installer
Edit the default environment for all users so that it uses the Oracle flavor of Java 6:
sudoedit /etc/profile.d/java_home.sh
Include the following content:
if [ "${JAVA_HOME+is_set}" = is_set ]; then return; fi
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
export PATH=${JAVA_HOME}/bin:${PATH}
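The guard on the first line of that file is easy to misread, so here is a minimal sketch of what `${JAVA_HOME+is_set}` does. The expansion yields `is_set` whenever the variable is set, even to an empty string, and yields nothing when it is unset. The function name `describe_java_home` is mine, purely for illustration.

```shell
# Hypothetical helper that mirrors the guard used in java_home.sh.
describe_java_home() {
    # ${JAVA_HOME+is_set} expands to "is_set" if JAVA_HOME is set (even if empty),
    # and to nothing at all if JAVA_HOME is unset.
    if [ "${JAVA_HOME+is_set}" = is_set ]; then
        echo "JAVA_HOME already set to: ${JAVA_HOME}"
    else
        echo "JAVA_HOME is unset"
    fi
}
```

Calling `describe_java_home` in a fresh login shell is a quick way to see whether the profile script was picked up.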
I recommend rebooting at this point. Do not edit /etc/environment to set these, unless you want to spend a day evaluating mysterious problems.
Do not skip this step. Installation of the repo below depends on JAVA_HOME being set for the root user at install time.
Read from the Cloudera Debian repo
Update your key manager to handle Cloudera’s signature:
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
Recommended: install Cloudera’s repository-setting Debian package. I’m using their Maverick package, because they haven’t set one up for Precise Pangolin.
Alternatively, include their sources list manually:
sudoedit /etc/apt/sources.list.d/cloudera-cdh3.list
Use the following content:
deb http://archive.cloudera.com/debian maverick-cdh3u5 contrib
deb-src http://archive.cloudera.com/debian maverick-cdh3u5 contrib
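If you would rather script this than open an editor, something like the following should do. It is only a sketch: the function name `write_cloudera_sources` is made up, and it writes to whatever path you hand it, so you can generate the file somewhere harmless first and copy it into place as root afterwards.

```shell
# Hypothetical helper: write the two Cloudera CDH3 apt source lines to the given file.
write_cloudera_sources() {
    target=$1
    # Quoted 'EOF' keeps the heredoc body literal (no variable expansion).
    cat > "$target" <<'EOF'
deb http://archive.cloudera.com/debian maverick-cdh3u5 contrib
deb-src http://archive.cloudera.com/debian maverick-cdh3u5 contrib
EOF
}
```

For example, run `write_cloudera_sources /tmp/cloudera-cdh3.list`, then `sudo install -m 644 /tmp/cloudera-cdh3.list /etc/apt/sources.list.d/cloudera-cdh3.list`.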
Install the missing dependencies
For reasons unknown, there’s at least one library missing from Ubuntu Precise, and the CDH3 packages insist on it.
sudo apt-get update
sudo apt-get install libzip1
If this doesn’t work (and it may not for you):
wget http://launchpadlibrarian.net/48191694/libzip1_0.9.3-1_amd64.deb
sudo dpkg -i libzip1_0.9.3-1_amd64.deb
Install the pseudo-cluster and the native libraries:
sudo apt-get update
sudo apt-get install hadoop-0.20-conf-pseudo hadoop-0.20-native
Tell the configuration where Java is
Even the funky work you did above to modify /etc/profile.d/java_home.sh doesn’t get Hadoop in the right environment for non-interactive shells, which are where the services run. You can grief yourself by tweaking /etc/environment, like I did (pro tip: don’t do that), or just tune your Hadoop configuration:
sudoedit /etc/hadoop-0.20/conf.pseudo/hadoop-env.sh
and include the line:
export JAVA_HOME=/usr/lib/jvm/java-6-oracle
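If you end up rebuilding this box more than once, it helps to add that line without duplicating it each time. Here is a minimal sketch of an idempotent append; `ensure_line` is a hypothetical helper of mine, not anything Hadoop or Cloudera ships. Since hadoop-env.sh belongs to root, you would run it from a root shell (for example after `sudo -s`).

```shell
# Hypothetical helper: append a line to a file only if that exact line is not already there.
ensure_line() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex), -e: pattern follows
    if ! grep -qxF -e "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

Usage would look like `ensure_line /etc/hadoop-0.20/conf.pseudo/hadoop-env.sh 'export JAVA_HOME=/usr/lib/jvm/java-6-oracle'`. Running it a second time leaves the file unchanged.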
Confirm that pseudo-HDFS exists
$ ls /var/lib/hadoop-0.20/
cache
Start the services
$ for service in /etc/init.d/hadoop-0.20-*; do sudo "$service" start; done
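You will want the same loop for stopping the services later, so here is a sketch that wraps it in a function. The name `hadoop_ctl`, the optional directory argument, and the `SUDO` variable are all my own inventions for illustration. I am also assuming the init scripts accept the usual actions such as `stop`.

```shell
# Hypothetical helper: run one action (start, stop, ...) on every CDH3 init script.
# The optional second argument is the script directory; it exists only so the
# function can be tried against a scratch directory.
hadoop_ctl() {
    action=$1
    dir=${2:-/etc/init.d}
    for service in "$dir"/hadoop-0.20-*; do
        # Skip the literal pattern when the glob matches nothing
        [ -x "$service" ] || continue
        # SUDO defaults to "sudo"; set SUDO to empty to run without it
        ${SUDO-sudo} "$service" "$action"
    done
}
```

With that defined, `hadoop_ctl start` does what the loop above does, and `hadoop_ctl stop` shuts everything down again.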
See the web consoles
- The NameNode provides a web console at http://localhost:50070/ for HDFS.
- The JobTracker provides a web console at http://localhost:50030/ for what’s active.
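If you are working over SSH or just want a quick check without a browser, the two consoles above can be probed with curl. This is a rough sketch: `check_console` is a hypothetical name, and all it tells you is whether something answered HTTP on that URL, not whether HDFS or the JobTracker is healthy.

```shell
# Hypothetical helper: report whether a web console answers on the given URL.
check_console() {
    name=$1
    url=$2
    # -s: silent, -f: treat HTTP errors (>= 400) as failure,
    # -o /dev/null: discard the body, --max-time: give up after 5 seconds
    if curl -s -f -o /dev/null --max-time 5 "$url"; then
        echo "$name: up ($url)"
    else
        echo "$name: not responding ($url)"
        return 1
    fi
}
```

Try `check_console NameNode http://localhost:50070/` and `check_console JobTracker http://localhost:50030/` a few seconds after starting the services.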
Run a test job
$ hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar pi 2 100000
...