Hadoop cookbook – 1. How to transfer data between different HDFS clusters.

Problem: You have multiple Hadoop clusters running and you want to transfer several terabytes of data from one cluster to another.

Solution: DistCp – distributed copy.

It’s common for Hadoop clusters to be loaded with terabytes of data (not every cluster is petabyte-scale 🙂 ), and a plain serial copy would take forever to move that much data from one cluster to another. Distributed, parallel copying is the practical solution, and that is exactly what DistCp does: it runs a MapReduce job that copies your data from one cluster to the other in parallel.
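Before launching a multi-terabyte copy, it can be worth confirming that the cluster where you run the job can actually reach the other cluster's NameNode. A minimal sanity check, assuming the nn1/nn2 hostnames and port 8020 used in the examples below:

# From a node on the source cluster, list the root of the remote HDFS.
# If this fails, fix connectivity/DNS before starting the big copy.
bash$ hadoop fs -ls hdfs://nn2:8020/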

To transfer data with DistCp you specify the HDFS paths of the source and the destination, as shown below.

bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
                    hdfs://nn2:8020/bar/foo
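Because the copy runs as a MapReduce job, you can also tune how many copy tasks run at once. The -m option is standard DistCp; the map count here is only illustrative:

# Cap the job at 20 simultaneous copy maps (illustrative number);
# by default DistCp sizes the job from the amount of data to copy.
bash$ hadoop distcp -m 20 hdfs://nn1:8020/foo/bar \
                          hdfs://nn2:8020/bar/foo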

You can also specify multiple source directories on the command line:

bash$ hadoop distcp hdfs://nn1:8020/foo/a \
                    hdfs://nn1:8020/foo/b \
                    hdfs://nn2:8020/bar/foo
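With the command above, the last path component of each source is preserved under the destination, so (assuming an empty target directory) you should end up with:

bash$ hadoop fs -ls hdfs://nn2:8020/bar/foo
# expect two entries: .../bar/foo/a and .../bar/foo/b

Note that if two sources would collide at the same destination path, DistCp aborts with an error rather than silently overwriting one with the other.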

Or, equivalently, from a file using the -f option:

bash$ hadoop distcp -f hdfs://nn1:8020/srclist \
                       hdfs://nn2:8020/bar/foo

Where srclist contains
hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b
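The -f option takes a URI, and in the example above the list lives on HDFS, so you would create it locally and upload it first. A minimal sketch, reusing the file name and paths from the example:

# Build the source list locally, then upload it to the path that
# the -f option points at (hdfs://nn1:8020/srclist in this example).
bash$ printf '%s\n' hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b > srclist
bash$ hadoop fs -put srclist hdfs://nn1:8020/srclist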

See the Apache Hadoop DistCp guide to learn more about DistCp.