Adventures in Data

Archive for the ‘Hadoop cookbook’ Category

Hadoop Cookbook – 4: How to run multiple Hadoop data nodes on one machine.

Although Hadoop is designed and developed for distributed computing, it can be run on a single node in pseudo-distributed mode, and even with multiple data nodes on a single machine. Developers often run multiple data nodes on a single machine to develop and test distributed features, data node behavior, and name node interaction with data nodes, among other things.

If you want to get a feel for how Hadoop's distributed data nodes and name node work together and you have only one machine, you can run multiple data nodes on that single machine. You can see how the name node stores its metadata (fsimage, edits, fstime) and how each data node stores data blocks on the local file system.
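For example, once HDFS has been formatted and started (see the steps below), you can inspect both layouts directly on the local file system. The paths are only a sketch and assume hadoop.tmp.dir points at a hypothetical /hadoopTmp directory, with the Hadoop 0.20-style layout:

# NameNode metadata: fsimage, edits, fstime and VERSION live here
ls /hadoopTmp/dfs/name/current

# DataNode block storage: blk_* block files and their blk_*.meta checksum files
ls /hadoopTmp/dfs/data/current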

Steps

To start multiple data nodes on a single machine, first download or build the Hadoop binary, then follow the steps below.

  1. Download a Hadoop binary release, or build the binary from the Hadoop source.
  2. Prepare the Hadoop configuration to run on a single node (change the Hadoop default tmp dir location from /tmp to some other, more reliable location; a sketch of this change follows the list).
  3. Add the following script to the $HADOOP_HOME/bin directory and chmod it to 744.
  4. Format HDFS: bin/hadoop namenode -format (for Hadoop 0.20 and below) or bin/hdfs namenode -format (for 0.21 and later).
  5. Start HDFS with bin/start-dfs.sh (this starts the Namenode and 1 data node), which can be viewed at http://localhost:50070.
  6. Start additional data nodes using bin/run-additionalDN.sh.
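For step 2, a minimal sketch of the configuration change is shown below. It assumes a Hadoop 0.20-style conf/ directory; /hadoopTmp is a hypothetical location (any directory outside /tmp with enough space will do), and fs.default.name is the usual pseudo-distributed NameNode address:

mkdir -p /hadoopTmp
cat > $HADOOP_HOME/conf/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hadoopTmp</value>
  </property>
</configuration>
EOF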

run-additionalDN.sh


#!/bin/sh
# This is used for starting multiple datanodes on the same machine.
# Run it from hadoop-dir/ just like 'bin/hadoop'.

# Usage: run-additionalDN.sh [start|stop] dnnumber ...
# e.g.   run-additionalDN.sh start 2

DN_DIR_PREFIX="/path/to/store/data_and_log_of_additionalDN/"

if [ -z "$DN_DIR_PREFIX" ]; then
  echo "$0: DN_DIR_PREFIX is not set. Set it to something like /hadoopTmp/dn"
  exit 1
fi

run_datanode () {
  DN=$2
  # Each additional datanode gets its own log/pid dir, tmp dir and ports,
  # all derived from the datanode number (e.g. DN=1 -> ports 50011/50081/50021).
  export HADOOP_LOG_DIR=$DN_DIR_PREFIX$DN/logs
  export HADOOP_PID_DIR=$HADOOP_LOG_DIR
  DN_CONF_OPTS="\
-Dhadoop.tmp.dir=$DN_DIR_PREFIX$DN \
-Ddfs.datanode.address=0.0.0.0:5001$DN \
-Ddfs.datanode.http.address=0.0.0.0:5008$DN \
-Ddfs.datanode.ipc.address=0.0.0.0:5002$DN"
  bin/hadoop-daemon.sh --script bin/hdfs $1 datanode $DN_CONF_OPTS
}

cmd=$1
shift

for i in "$@"
do
  run_datanode "$cmd" "$i"
done

Use jps or the Namenode web UI to verify that the additional data nodes have started.
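As a usage sketch, assuming the script above has been saved as bin/run-additionalDN.sh and DN_DIR_PREFIX has been set:

bin/run-additionalDN.sh start 1 2   # start additional datanodes 1 and 2 (ports 50011 and 50012)
jps                                 # should now list NameNode, SecondaryNameNode and three DataNode processes
bin/run-additionalDN.sh stop 1 2    # stop the additional datanodes when done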

I started a total of 3 data nodes (2 additional data nodes) on my single-node machine, running on ports 50010, 50011 and 50012.



Written by Ravi

May 27, 2010 at 1:21 am

Posted in Hadoop cookbook, HDFS

Hadoop Cookbook – 3: How to build your own Hadoop distribution.

Problem: You want to build your own Hadoop distribution.

Often you need a particular feature, added through a patch, that is still in trunk and not yet available in a Hadoop release. In such cases you can build and distribute your own Hadoop distribution.

Solution: You can build your own Hadoop distribution by following the steps given below.

1. Check out the latest release tag (let's say we want to work on the Hadoop 0.20 branch).

  > svn checkout \
  http://svn.apache.org/repos/asf/hadoop/common/tags/release-X.Y.Z/ hadoop-common-X.Y.Z
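For instance, for the 0.20.2 release (an illustrative version; substitute whichever release tag you need) the checkout would look like:

  > svn checkout \
  http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.2/ hadoop-common-0.20.2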

2. Download the required patch.

3. Apply the required patch: patch -p0 -E < /path/to/patch

4. Test the patch:

 ant \
  -Dpatch.file=/path/to/my.patch \
  -Dforrest.home=/path/to/forrest/ \
  -Dfindbugs.home=/path/to/findbugs \
  -Dscratch.dir=/path/to/a/temp/dir \
  -Dsvn.cmd=/path/to/subversion/bin/svn \
  -Dgrep.cmd=/path/to/grep \
  -Dpatch.cmd=/path/to/patch \
  test-patch

(scratch.dir, svn.cmd, grep.cmd and patch.cmd are optional.)

5. Build the Hadoop binary with documentation:

 ant -Djava5.home=$Java5Home \
  -Dforrest.home=/path_to/apache-forrest \
  -Dfindbugs.home=/path_to/findbugs/latest \
  compile-core tar

Successful completion of the above command creates a Hadoop tar, which can be used as your Hadoop distribution.
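As a rough usage sketch (the exact tarball name depends on the version you built; ant places build artifacts under build/ by default, and /opt is just an example install location):

tar -xzf build/hadoop-*.tar.gz -C /opt     # unpack the freshly built distribution
/opt/hadoop-*/bin/hadoop version           # sanity check that the binary runs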

Written by Ravi

May 27, 2010 at 1:18 am

Posted in Hadoop cookbook

Hadoop Cookbook – 2: How to build Hadoop with my custom patch?

Problem: How do I build my own version of Hadoop with my custom patch?

Solution: Apply the patch and build Hadoop.

You will need: the Hadoop source code, your custom patch, Java 6, Apache Ant, Java 5 (for generating documentation), and Apache Forrest (for generating documentation).

Steps:

Check out the Hadoop source code:

> svn co https://svn.apache.org/repos/asf/hadoop/common/tags/release-X.Y.Z-rcR hadoop-X.Y.Z-rcR

Apply your patch to check its functionality using the following command:

> patch -p0 -E < ~/Path/To/Patch.patch
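If you want to verify that the patch applies cleanly before modifying any files, GNU patch's --dry-run option is handy:

> patch -p0 -E --dry-run < ~/Path/To/Patch.patch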

Compile the patched source code and build the tar with Ant:

> ant -Djava5.home=/System/Library/Frameworks/JavaVM.framework/Versions/1.5/Home/ \
  -Dforrest.home=/Path/to/forrest/apache-forrest-0.8 \
  -Dfindbugs.home=/Path/to/findbugs/latest \
  compile-core tar

To build the documentation:

> ant -Dforrest.home=$FORREST_HOME -Djava5.home=$JAVA5 docs

Written by Ravi

May 16, 2010 at 10:43 am

Posted in Hadoop, Hadoop cookbook

Hadoop Cookbook – 1: How to transfer data between different HDFS clusters.

Problem: You have multiple Hadoop clusters running and you want to transfer several terabytes of data from one cluster to another.

Solution: DistCp – distributed copy.

It's common for Hadoop clusters to be loaded with terabytes of data (not all clusters are petabytes in size :) ), and it would take forever to transfer that much data from one cluster to another serially. Distributed, parallel copying of the data is a good solution, and that is exactly what DistCp does: it runs a MapReduce job to transfer your data from one cluster to another.

To transfer data using DistCp, you specify the HDFS paths of the source and the destination, as shown below.

bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
          hdfs://nn2:8020/bar/foo

You can also specify multiple source directories on the command line:

bash$ hadoop distcp hdfs://nn1:8020/foo/a \
hdfs://nn1:8020/foo/b \
hdfs://nn2:8020/bar/foo

Or, equivalently, from a file using the -f option:
bash$ hadoop distcp -f hdfs://nn1:8020/srclist \
hdfs://nn2:8020/bar/foo

Where srclist contains
hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b
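
DistCp also accepts a few useful options. For example, -update copies only the files that are missing or differ at the destination, and -m caps the number of map tasks; the values below are illustrative:

bash$ hadoop distcp -update -m 20 \
          hdfs://nn1:8020/foo/a \
          hdfs://nn2:8020/bar/foo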

See the Hadoop DistCp documentation to learn more about DistCp.

Written by Ravi

May 16, 2010 at 10:25 am

Posted in Hadoop, Hadoop cookbook, HDFS
