Problem: You have multiple Hadoop clusters running and you want to transfer several terabytes of data from one cluster to another.
Solution: DistCp (distributed copy).
It’s common for Hadoop clusters to hold terabytes of data (not every cluster reaches petabyte scale 🙂 ), and a single-threaded copy would take forever to move that much data from one cluster to another. Copying the data in parallel is the natural fix, and that is exactly what DistCp does: it runs a MapReduce job whose map tasks share the copy work across the cluster.
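Because the copy is a MapReduce job, you can tune how much of the cluster it uses. A minimal sketch, assuming a live cluster with NameNodes nn1 and nn2 and the example paths used below (the paths and host names are placeholders, not real endpoints):

```shell
# Cap the copy at 20 map tasks with -m; each map task copies
# its share of the file list, so more maps = more parallelism
# (bounded by cluster capacity and the number of files).
bash$ hadoop distcp -m 20 hdfs://nn1:8020/foo/bar \
                          hdfs://nn2:8020/bar/foo
```

Raising `-m` beyond the number of files being copied buys nothing, since a single file is never split between map tasks.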
To transfer data using DistCp you specify the HDFS paths of the source and the destination, as shown below.
bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
                    hdfs://nn2:8020/bar/foo
You can also specify multiple source directories on the command line:
bash$ hadoop distcp hdfs://nn1:8020/foo/a \
                    hdfs://nn1:8020/foo/b \
                    hdfs://nn2:8020/bar/foo
Or, equivalently, from a file using the -f option:
bash$ hadoop distcp -f hdfs://nn1:8020/srclist \
                       hdfs://nn2:8020/bar/foo
Where srclist contains

hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b
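When the destination already contains data, DistCp’s `-update` and `-overwrite` flags control how existing files are handled. A hedged sketch, reusing the example nn1/nn2 hosts and paths from above as placeholders:

```shell
# -update: copy only files that are missing at the destination or
#          that differ from the source copy; matching files are skipped.
bash$ hadoop distcp -update hdfs://nn1:8020/foo/bar \
                            hdfs://nn2:8020/bar/foo

# -overwrite: copy everything, unconditionally replacing files that
#             already exist at the destination.
bash$ hadoop distcp -overwrite hdfs://nn1:8020/foo/bar \
                               hdfs://nn2:8020/bar/foo
```

Note that `-update` and `-overwrite` also change how source directory contents are laid out under the destination, so it is worth testing on a small path before a multi-terabyte run.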
See the Apache Hadoop DistCp Guide to learn more about DistCp.