Hadoop Cookbook – 2. How to build Hadoop with my custom patch?

Problem : How do I build my own version of Hadoop with my custom patch?

Solution : Apply the patch and build Hadoop.

You will need : Hadoop source code, your custom patch, Java 6, Apache Ant, Java 5 (for generating documentation), Apache Forrest (for generating documentation).

Steps :

Check out the Hadoop source code for the release tag you want to patch (svn checkout does not take a -m log message, so none is passed here):

> svn co https://svn.apache.org/repos/asf/hadoop/common/tags/release-X.Y.Z-rcR

Apply your patch from the root of the source tree, so you can check its functionality, using the following command:

> patch -p0 -E < ~/Path/To/Patch.patch
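Before touching the tree, it can help to confirm that the patch applies cleanly. The following is a minimal sketch, assuming GNU patch (which supports --dry-run) and assuming it is run from the root of the checked-out source. apply_patch_checked is a hypothetical helper name, not something Hadoop provides.

```shell
# Sketch: apply a patch only if a dry run succeeds.
# apply_patch_checked is a hypothetical helper, not part of Hadoop.
apply_patch_checked() {
    patch_file=$1
    # --dry-run (GNU patch) reports what would happen without changing files.
    if patch -p0 -E --dry-run < "$patch_file" > /dev/null; then
        # The dry run succeeded, so apply the patch for real.
        patch -p0 -E < "$patch_file"
    else
        echo "Patch does not apply cleanly: $patch_file" >&2
        return 1
    fi
}
```

Calling apply_patch_checked ~/Path/To/Patch.patch leaves the source untouched when any hunk fails, which is easier to recover from than a half-applied patch.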

Compile the source code with your patch applied and build the tarball with Ant (append the test target if you also want to run the unit tests):

> ant -Djava5.home=/System/Library/Frameworks/JavaVM.framework/Versions/1.5/Home/ -Dforrest.home=/Path/to/forrest/apache-forrest-0.8 -Dfindbugs.home=/Path/to/findbugs/latest compile-core tar
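The build depends on several home directories being passed in correctly, and a mistyped path tends to surface only partway through the build. As a minimal sketch, a small check like the one below could be run first. check_build_homes is a hypothetical helper name, and the directories you pass to it are whatever you intend to hand to Ant.

```shell
# Sketch: verify that every directory handed to the build actually exists.
# check_build_homes is a hypothetical helper, not part of Hadoop or Ant.
check_build_homes() {
    missing=0
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "Missing directory: $dir" >&2
            missing=1
        fi
    done
    # Returns 0 when all directories exist, 1 otherwise.
    return "$missing"
}
```

For example, check_build_homes "$JAVA5" "$FORREST_HOME" && ant -Djava5.home="$JAVA5" -Dforrest.home="$FORREST_HOME" compile-core tar runs the build only when both directories are present.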

Build the documentation:

> ant -Dforrest.home=$FORREST_HOME -Djava5.home=$JAVA5 docs


Hadoop Cookbook – 1. How to transfer data between different HDFS clusters?

Problem : You have multiple Hadoop clusters running and you want to transfer several terabytes of data from one cluster to another.

Solution : DistCp – Distributed copy.

It is common for Hadoop clusters to hold terabytes of data (not all clusters are petabytes in size 🙂), and copying terabytes from one cluster to another through a single process would take forever. Distributed, parallel copying is a good solution for this, and that is what DistCp does: it runs a MapReduce job to transfer your data from one cluster to another.
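When such copies are scripted, it can be useful to assemble and inspect the command line before launching a long-running job. The sketch below only prints a DistCp command and does not run it. distcp_command is a hypothetical helper name; it assumes one or more source URIs followed by a single destination URI, all fully qualified with hdfs://.

```shell
# Sketch: print (but do not run) a DistCp command line.
# distcp_command is a hypothetical helper. The last argument is treated
# as the destination and all earlier arguments as sources.
distcp_command() {
    if [ "$#" -lt 2 ]; then
        echo "usage: distcp_command SRC... DST" >&2
        return 2
    fi
    for uri in "$@"; do
        case "$uri" in
            hdfs://*) ;;  # accept fully qualified HDFS URIs only
            *) echo "Not an hdfs:// URI: $uri" >&2; return 1 ;;
        esac
    done
    echo "hadoop distcp $*"
}
```

Printing the command first makes it easy to spot a wrong namenode host or port before any data starts moving.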

To transfer data using DistCp, specify the HDFS path names of the source and the destination, as shown below.

bash$ hadoop distcp hdfs://nn1:8020/foo/bar \


You can also specify multiple source directories on the command line:

bash$ hadoop distcp hdfs://nn1:8020/foo/a \
hdfs://nn1:8020/foo/b \

Or, equivalently, from a file using the -f option:
bash$ hadoop distcp -f hdfs://nn1:8020/srclist \

Where srclist contains
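Since the -f form is described as equivalent to listing the sources on the command line, the source list is a plain text file with one fully qualified source URI per line. A minimal sketch of generating such a file locally follows; write_srclist is a hypothetical helper name, and the file would still need to be uploaded to the HDFS location that -f points at.

```shell
# Sketch: write one source URI per line into a local list file.
# write_srclist is a hypothetical helper, not part of Hadoop.
write_srclist() {
    list_file=$1
    shift
    # Each remaining argument becomes its own line in the file.
    printf '%s\n' "$@" > "$list_file"
}
```

For example, write_srclist srclist hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b followed by hadoop fs -put srclist hdfs://nn1:8020/srclist would place the list where the -f command above expects it.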

To learn more about DistCp, see the DistCp guide in the Apache Hadoop documentation.