Running Hadoop 0.20 in Pseudo-Distributed Mode on Mac OS X

Although Hadoop is designed to run distributed computing applications (MapReduce) on commodity hardware, it is possible to run Hadoop on a single machine in pseudo-distributed mode. Running Hadoop in pseudo-distributed mode is the first step towards running it in fully distributed mode.

To set up and run Hadoop in pseudo-distributed mode you need Java 6 installed on your system; also make sure that JAVA_HOME is set in your environment variables. Download Hadoop 0.20 before continuing.

Download and Extract Hadoop
Download and save hadoop-xx.xx.tar.gz. To extract the Hadoop archive, execute the command

tar xvf hadoop-0.20.0.tar.gz

This should extract the Hadoop binaries and source into the hadoop-0.20.0 directory.

By default Hadoop is configured to run in standalone mode. To view the Hadoop commands and options, execute bin/hadoop from the Hadoop root directory.
It prints the basic commands and options, as below.

matrix:Hadoop rphulari$ bin/hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar                  run a jar file
  distcp               copy file or directories recursively
  archive -archiveName NAME <src>* <dest> create a hadoop archive
  daemonlog            get/set the log level for each daemon
  CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.

Configuration changes
We are a few steps away from running Hadoop in pseudo-distributed mode.

Step 1 – Configure conf/

Update JAVA_HOME to point to your system's Java home directory. On Mac OS X it should point to /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/
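
As a sketch, the setting might look like the lines below. This assumes the file in question is conf/hadoop-env.sh, the environment script shipped in the stock Hadoop 0.20 tarball; that file name is an assumption and is not stated above. The sanity check at the end is only an illustration.

```shell
# Assumed location: conf/hadoop-env.sh (the file name is an assumption).
# Point JAVA_HOME at the Java 6 home directory named in the text.
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/

# Optional sanity check: report whether a java binary exists under that path.
if [ -x "$JAVA_HOME/bin/java" ]; then
  echo "java found under $JAVA_HOME"
else
  echo "no java binary under $JAVA_HOME" >&2
fi
```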

Step 2 – Configure conf/hdfs-site.xml
Add the following to conf/hdfs-site.xml
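
A minimal sketch of such a configuration, written out from the shell. It assumes the usual single-node setting of one replica per block (dfs.replication set to 1); the exact properties intended here are an assumption, and CONF_DIR is a hypothetical variable used only in this sketch.

```shell
# Assumption: run from the Hadoop root directory. CONF_DIR is a hypothetical
# variable for this sketch and defaults to ./conf.
CONF_DIR=${CONF_DIR:-conf}
mkdir -p "$CONF_DIR"

# Overwrite hdfs-site.xml with a single property: keep one copy of each block,
# since a pseudo-distributed setup has only one datanode.
cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
```

Note that this overwrites any existing hdfs-site.xml, so merge by hand if the file already holds other properties.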


Step 3 – Configure conf/core-site.xml
Add the following to conf/core-site.xml
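
A minimal sketch, assuming the namenode should listen on localhost port 9000. The property name fs.default.name is the Hadoop 0.20 setting for the default file system; the host and port are assumptions for illustration, and CONF_DIR is a hypothetical variable used only here.

```shell
# Assumption: run from the Hadoop root directory. CONF_DIR is a hypothetical
# variable for this sketch and defaults to ./conf.
CONF_DIR=${CONF_DIR:-conf}
mkdir -p "$CONF_DIR"

# Overwrite core-site.xml so that clients use HDFS on this machine as the
# default file system (host and port are assumed values).
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
```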


Step 4 – Configure conf/mapred-site.xml
Add the following to conf/mapred-site.xml
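
A minimal sketch, assuming the job tracker should run on localhost port 9001. The property name mapred.job.tracker is the Hadoop 0.20 setting for the job tracker address; the host and port are assumptions for illustration, and CONF_DIR is a hypothetical variable used only here.

```shell
# Assumption: run from the Hadoop root directory. CONF_DIR is a hypothetical
# variable for this sketch and defaults to ./conf.
CONF_DIR=${CONF_DIR:-conf}
mkdir -p "$CONF_DIR"

# Overwrite mapred-site.xml so that task trackers and job clients look for
# the job tracker on this machine (host and port are assumed values).
cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF
```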


Now you are all set to start Hadoop in pseudo-distributed mode. You can either start all Hadoop processes (HDFS and MapReduce) using bin/ from the Hadoop root directory, or start only HDFS (bin/) or only the MapReduce processes (bin/).
Before starting HDFS (the Hadoop Distributed File System) for the first time, we need to format it using the namenode format command.
matrix:Hadoop rphulari$ bin/hadoop namenode -format
This prints a lot of information on screen, including the Hadoop version, host name and IP address, and the namenode storage directory, which by default is under /tmp/hadoop-$username.
Once HDFS is formatted and ready for use, execute bin/ to start all processes.
All Hadoop processes will start, and you can see the startup log of the job tracker, task tracker, namenode and datanode on screen.
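
The start step could be wrapped as below. This is a sketch that assumes the start script is bin/start-all.sh, the name used in the stock Hadoop 0.20 tarball; start_pseudo_cluster is a hypothetical helper name, not part of Hadoop.

```shell
# Hypothetical helper: start all Hadoop daemons from a given Hadoop root.
# Assumption: the start script is bin/start-all.sh, as in the stock tarball.
start_pseudo_cluster() {
  hadoop_home=${1:-.}
  if [ ! -x "$hadoop_home/bin/start-all.sh" ]; then
    echo "bin/start-all.sh not found under $hadoop_home" >&2
    return 1
  fi
  # Run from the Hadoop root so relative paths in the scripts resolve.
  (cd "$hadoop_home" && bin/start-all.sh)
}
```

From the Hadoop root directory, calling start_pseudo_cluster with no argument would start the daemons; from elsewhere, pass the root directory as the first argument.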
You can also check that all processes are running by executing the Java jps command.
matrix-lm:Hadoop rphulari$ jps
12543 DataNode
12776 Jps
12677 JobTracker
12755 TaskTracker
12619 SecondaryNameNode
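
To check a listing like this automatically, a small filter can compare it against the daemons a pseudo-distributed Hadoop 0.20 setup is expected to run. This is a sketch; missing_daemons is a hypothetical helper name, and the list of five daemons is an assumption based on the processes named in this post.

```shell
# Hypothetical helper: read jps-style output ("<pid> <ClassName>" per line)
# on standard input and print each expected daemon that is not listed.
missing_daemons() {
  listing=$(cat)
  for daemon in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
    # Compare the second field exactly, so "SecondaryNameNode" does not
    # count as a match for "NameNode".
    printf '%s\n' "$listing" |
      awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
      echo "$daemon"
  done
}
```

Usage would be jps | missing_daemons, which prints nothing when all five daemons are up. Run against the listing above, it would report NameNode, which does not appear there; if that happens on your machine, check the namenode log before continuing.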

Playing with HDFS shell
HDFS, the Hadoop Distributed File System, resembles a Unix/POSIX file system and provides similar shell commands for file system operations, such as mkdir, ls and du.
HDFS – ls
ls is part of hadoop fs (the file system shell). The following command shows the contents of the root ( / ) directory.
matrix:Hadoop rphulari$ bin/hadoop fs -ls /
Found 1 items
drwxr-xr-x - rphulari supergroup 0 2009-05-13 22:04 /tmp
NOTE – By default, the user who formats the namenode and starts HDFS is the HDFS superuser.
HDFS – mkdir
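To create a directory on HDFS use fs -mkdir. A relative path such as user is resolved against your HDFS home directory (/user/<username>), and missing parent directories are created along the way, which is why /user appears in the listing below.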

matrix:Hadoop rphulari$ bin/hadoop fs -mkdir user
matrix:Hadoop rphulari$ bin/hadoop fs -ls /
Found 2 items
drwxr-xr-x - rphulari supergroup 0 2009-05-13 22:04 /tmp
drwxr-xr-x - rphulari supergroup 0 2009-05-13 22:06 /user
You can find the complete list of Hadoop shell commands in the Hadoop FS shell documentation.
In the next posts we will run our first MapReduce program on Hadoop.

Introduction to Hadoop

Apache Hadoop is a free distributed software framework, developed in Java, C++ and bash scripts, that supports data-intensive distributed applications. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.

Hadoop is a top-level Apache project, built and used by a community of contributors from all over the world. Yahoo! has been the largest contributor to the project and uses Hadoop extensively in its web search and advertising businesses.