Hadoop Cookbook – 4, How to run multiple Hadoop data nodes on one machine.

Although Hadoop is designed and developed for distributed computing, it can be run on a single node in pseudo-distributed mode, and even with multiple data nodes on a single machine. Developers often run multiple data nodes on one machine to develop and test distributed features, data node behavior, and name node interaction with data nodes.

If you want to get a feel for how Hadoop's data nodes and name node work together but have only one machine, you can run multiple data nodes on that single machine. You can see how the name node stores its metadata (fsimage, edits, fstime) and how the data nodes store data blocks on the local file system.
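Once HDFS is formatted and running (see the steps below), you can inspect those name node metadata files directly on the local file system. The path below assumes the default dfs.name.dir under a hadoop.tmp.dir of /hadoopTmp; adjust it to your own configuration.

ls /hadoopTmp/dfs/name/current/
edits  fsimage  fstime  VERSION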

Steps

To start multiple data nodes on a single machine, follow the steps below.

  1. Download the Hadoop binary, or build it from the Hadoop source.
  2. Prepare the Hadoop configuration to run on a single node; in particular, change the Hadoop default tmp directory (hadoop.tmp.dir) from /tmp to some other reliable location (a sample configuration sketch follows this list).
  3. Add the run-additionalDN.sh script given below to the $HADOOP_HOME/bin directory and chmod it to 744.
  4. Format HDFS: bin/hadoop namenode -format (Hadoop 0.20 and below) or bin/hdfs namenode -format (0.21 and later).
  5. Start HDFS with bin/start-dfs.sh. This starts the name node and one data node; the HDFS web UI is available at http://localhost:50070.
  6. Start additional data nodes using bin/run-additionalDN.sh.
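For step 2, a minimal single-node configuration sketch is shown below. The localhost:9000 address and the /hadoopTmp directory are illustrative values only; adjust them to your environment. In conf/core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <!-- keep HDFS metadata and blocks out of /tmp so they survive reboots and tmp cleanup -->
    <name>hadoop.tmp.dir</name>
    <value>/hadoopTmp</value>
  </property>
</configuration>

And, optionally, in conf/hdfs-site.xml (with several data nodes running you can also leave the default replication of 3):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>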

run-additionalDN.sh


#!/bin/sh
# Starts or stops additional data nodes on the same machine.
# Run it from the Hadoop install dir, just like 'bin/hadoop'.
#
# Usage: run-additionalDN.sh [start|stop] dnnumber ...
# e.g.   run-additionalDN.sh start 2

# Directory prefix under which each additional data node keeps its data, logs, and pid file.
DN_DIR_PREFIX="/path/to/store/data_and_log_of_additionalDN/"

if [ -z "$DN_DIR_PREFIX" ]; then
    echo "$0: DN_DIR_PREFIX is not set. Set it to something like /hadoopTmp/dn"
    exit 1
fi

run_datanode () {
    # $1 = start|stop, $2 = data node number
    DN=$2
    # Give each data node its own log/pid dir, tmp dir, and ports (suffixed with its number).
    export HADOOP_LOG_DIR=$DN_DIR_PREFIX$DN/logs
    export HADOOP_PID_DIR=$HADOOP_LOG_DIR
    DN_CONF_OPTS="\
-Dhadoop.tmp.dir=$DN_DIR_PREFIX$DN \
-Ddfs.datanode.address=0.0.0.0:5001$DN \
-Ddfs.datanode.http.address=0.0.0.0:5008$DN \
-Ddfs.datanode.ipc.address=0.0.0.0:5002$DN"
    # DN_CONF_OPTS is intentionally left unquoted so each -D option is passed separately.
    bin/hadoop-daemon.sh --script bin/hdfs $1 datanode $DN_CONF_OPTS
}

cmd=$1
shift

for i in "$@"
do
    run_datanode "$cmd" "$i"
done
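For example, to bring up two additional data nodes (numbered 1 and 2) and stop them again later:

bin/run-additionalDN.sh start 1 2
bin/run-additionalDN.sh stop 1 2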

Use jps or the NameNode web UI to verify that the additional data nodes have started.

I started a total of 3 data nodes (the default one plus 2 additional) on my single machine, running on ports 50010, 50011, and 50012.




Hadoop Cookbook – 3, How to build your own Hadoop distribution.

Problem: You want to build your own Hadoop distribution.

Often you need a particular feature, added through a patch, that is still in trunk and not yet available in Hadoop releases. In such cases you can build and distribute your own Hadoop distribution.

Solution: You can build your own Hadoop distribution by following the steps given below.

1. Check out the latest released branch (let's say we want to work on the Hadoop 0.20 branch):

  > svn checkout \
  http://svn.apache.org/repos/asf/hadoop/common/tags/release-X.Y.Z/ hadoop-common-X.Y.Z

2. Download the required patch.

3. Apply the required patch: patch -p0 -E < /path/to/patch

4. Test the patch:

 ant \
  -Dpatch.file=/patch/to/my.patch \
  -Dforrest.home=/path/to/forrest/ \
  -Dfindbugs.home=/path/to/findbugs \
  -Dscratch.dir=/path/to/a/temp/dir \ (optional)
  -Dsvn.cmd=/path/to/subversion/bin/svn \ (optional)
  -Dgrep.cmd=/path/to/grep \ (optional)
  -Dpatch.cmd=/path/to/patch \ (optional)
  test-patch

5. Build the Hadoop binary with documentation:
ant -Djava5.home=$Java5Home -Dforrest.home=/path_to/apache-forrest \
  -Dfindbugs.home=/path_to/findbugs/latest compile-core tar

Successful completion of the above command creates a Hadoop tarball, which can be used as your Hadoop distribution.
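The resulting tarball normally ends up under the build/ directory of the source tree (the exact file name depends on the version and branch you built), so a quick sanity check could look like:

ls build/*.tar.gz
tar -tzf build/hadoop-*.tar.gz | head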

Yahoo! giving away free tickets to 2010 Hadoop Summit.

Get ready for the 3rd Hadoop Summit, which will be held on June 29, 2010 at the Hyatt Regency in Santa Clara.

Yahoo! is giving away free tickets to Hadoop Summit 2010 to the winners of the Hadoop Summit Retweet Contest. To win, you just have to follow @YDN on Twitter and keep an eye out for the @YDN "fun fact" of the day tweet about Hadoop every Monday. Then, on the same day, retweet the "fun fact" about cloud computing along with the hashtag #Y!Hadoop. All RTs must be received by 11:59 pm EST that Monday. The very next day, Yahoo! will randomly select one lucky winner to receive 2 complimentary tickets to the Hadoop Summit.

Click here for the official posting on YDN.

Hadoop Cookbook – 2, How to build Hadoop with my custom patch?

Problem: You want to build your own version of Hadoop with a custom patch.

Solution: Apply the patch and build Hadoop.

You will need: Hadoop source code, your custom patch, Java 6, Apache Ant, Java 5 (for generating documentation), and Apache Forrest (for generating documentation).

Steps :

Check out the Hadoop source code:

> svn co https://svn.apache.org/repos/asf/hadoop/common/tags/release-X.Y.Z-rcR

Apply your patch to check its functionality using the following command:

> patch -p0 -E < ~/Path/To/Patch.patch

Test and compile the source code with the patch applied, using Ant:

> ant -Djava5.home=/System/Library/Frameworks/JavaVM.framework/Versions/1.5/Home/ -Dforrest.home=/Path/to/forrest/apache-forrest-0.8 -Dfindbugs.home=/Path/to/findbugs/latest compile-core tar

To build the documentation:

> ant -Dforrest.home=$FORREST_HOME -Djava5.home=$JAVA5 docs

Hadoop Cookbook – 1, How to transfer data between different HDFS clusters.

Problem: You have multiple Hadoop clusters running and you want to transfer several terabytes of data from one cluster to another.

Solution: DistCp (distributed copy).

It's common for Hadoop clusters to be loaded with terabytes of data (not all clusters are petabytes in size 🙂), and it would take forever to transfer terabytes of data from one cluster to another sequentially. Distributed, parallel copying of data is a good solution for this, and that is what DistCp does: it runs a MapReduce job to transfer your data from one cluster to another.

To transfer data using DistCp, you specify the HDFS paths of the source and destination, as shown below.

bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
hdfs://nn2:8020/bar/foo

You can also specify multiple source directories on the command line:

bash$ hadoop distcp hdfs://nn1:8020/foo/a \
hdfs://nn1:8020/foo/b \
hdfs://nn2:8020/bar/foo

Or, equivalently, from a file using the -f option:
bash$ hadoop distcp -f hdfs://nn1:8020/srclist \
hdfs://nn2:8020/bar/foo

Where srclist contains
hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b
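
Since DistCp reads srclist through the source cluster's file system, one way to stage it there first (using the paths from the example above) is:

bash$ hadoop fs -put srclist hdfs://nn1:8020/srclist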

Click here to learn more about DistCp.

Why Hadoop ?

Why is Hadoop the latest buzzword, not only in the Valley but everywhere else? In Europe, CERN scientists are evaluating HDFS for storing and processing the huge amount of data (approximately 40 TB/day) generated by the LHC. In Asia, China Mobile, the world's biggest telecom giant, is using Hadoop to process huge amounts of data. And in Silicon Valley, every other web company is already using Hadoop (Yahoo!, Facebook, LinkedIn, Netflix, IBM, Twitter, Zynga, Amazon... the list goes on); see the complete list on the Powered By page of the Hadoop wiki.

Last month Microsoft announced that it will support Hadoop on Azure (Microsoft's cloud computing platform, competing with Amazon EC2 and S3).

There can't be any argument that Hadoop gained this momentum because it's an open source project, and much of the credit goes to Yahoo!. Although Hadoop was the brainchild of Doug Cutting (formerly a Yahoo! employee), Yahoo! spent tremendous resources on making Hadoop what it is today and contributes more than 80% of Hadoop's code.

Who said there is no free lunch?

Being an open source project, Hadoop offers the best pricing for everybody: FREE. The very active community of Hadoop developers and users is also a useful resource for newbies. The beauty of open source software is that not only are its license and distribution free, but support from the user and developer community is free as well.

The Hadoop wiki is documented with detailed information on Hadoop cluster setup and tutorials on MapReduce programming. You don't have to pay anybody to set up your cluster or teach you how to write MapReduce programs.

Bend it like you want.

Open source  = source code is available to everyone. As Hadoop is an Apache project everybody can contribute in source code and everybody can express their opinion on how things should be done. As it’s source code is available to everyone , you can customize it as per your needs you can add your own functionalities and if you want you can contribute it back (Unfortunately there are some people out there who don’t contribute back to the community).

It's the complete package

Hadoop comes as a complete package, and more is being added to it all the time: new applications are being built on top of Hadoop that take advantage of its scalability and durability. A NoSQL stack is already rising to solve problems that traditional SQL cannot. Hadoop supports related projects such as Pig, Hive, and HBase, which can be used for data mining, web mining, ETL, and BI.