A Beginner's View of Hadoop MiniDFSCluster

If you are new to the Hadoop source code and want to practice test-driven development, MiniDFSCluster is a good first step.

That said, many Hadoop developers argue that MiniDFSCluster is not a good way to write unit tests for Hadoop, and that there are more efficient alternatives (e.g. mock objects with Mockito). We will discuss this in another post.

The MiniDFSCluster class creates a single-process DFS cluster for JUnit testing. It supports both non-simulated and simulated DFS. The data directories for a non-simulated DFS are under the testing directory (/build/test/data). Simulated DataNodes use no underlying file system storage.

MiniDFSCluster is mostly used in the following four ways:

1. public MiniDFSCluster() {}

This no-argument constructor is used only when you want to start a DataNode cluster without a NameNode (i.e. when the NameNode is started elsewhere).

2. public MiniDFSCluster(Configuration conf, int numDataNodes, StartupOption nameNodeOperation)

Modify the config and start up the servers with the given operation. Servers will be started on free ports. The caller must manage the creation of the NameNode and DataNode directories and must have already set dfs.name.dir and dfs.data.dir in the given conf.

Here:

conf: the base configuration to use in starting the servers. This will be modified as necessary.

numDataNodes: number of DataNodes to start; may be zero.

nameNodeOperation: the operation with which to start the servers. If null or StartupOption.FORMAT, then StartupOption.REGULAR will be used.

3. public MiniDFSCluster(Configuration conf, int numDataNodes, boolean format, String[] racks)

Modify the config and start up the servers. The RPC and info ports for the servers are guaranteed to use free ports. NameNode and DataNode directory creation and configuration will be managed by this class.

Here:

conf: the base configuration to use in starting the servers. This will be modified as necessary.

numDataNodes: number of DataNodes to start; may be zero.

format: if true, format the NameNode and DataNodes before starting up.

racks: array of strings indicating the rack that each DataNode is on.

4. public MiniDFSCluster(Configuration conf, int numDataNodes, boolean format, String[] racks, String[] hosts)

Modify the config and start up the servers. The RPC and info ports for the servers are guaranteed to use free ports. NameNode and DataNode directory creation and configuration will be managed by this class.

Here:

conf: the base configuration to use in starting the servers. This will be modified as necessary.

numDataNodes: number of DataNodes to start; may be zero.

format: if true, format the NameNode and DataNodes before starting up.

racks: array of strings indicating the rack that each DataNode is on.

hosts: array of strings indicating the hostname for each DataNode.
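
The racks and hosts arguments describe one DataNode per array entry, so it can help to build them with a small helper before calling the third or fourth constructor. The sketch below is only an illustration and uses nothing from Hadoop itself: the class name DataNodeLayout, the round-robin rack assignment, the "/rackN" naming and the "hostN.example.com" hostnames are all assumptions made for this example.

```java
// Hypothetical helper (not part of Hadoop): builds the racks and hosts
// arrays that constructors 3 and 4 above expect, one entry per DataNode.
final class DataNodeLayout {

    private DataNodeLayout() {
        // static helpers only
    }

    // Assigns DataNodes to racks round-robin: /rack0, /rack1, ..., /rack0, ...
    static String[] racks(int numDataNodes, int numRacks) {
        if (numDataNodes < 0 || numRacks < 1) {
            throw new IllegalArgumentException(
                "numDataNodes must be >= 0 and numRacks must be >= 1");
        }
        String[] racks = new String[numDataNodes];
        for (int i = 0; i < numDataNodes; i++) {
            racks[i] = "/rack" + (i % numRacks);
        }
        return racks;
    }

    // Gives every DataNode a made-up hostname: host0.example.com, host1.example.com, ...
    static String[] hosts(int numDataNodes) {
        if (numDataNodes < 0) {
            throw new IllegalArgumentException("numDataNodes must be >= 0");
        }
        String[] hosts = new String[numDataNodes];
        for (int i = 0; i < numDataNodes; i++) {
            hosts[i] = "host" + i + ".example.com";
        }
        return hosts;
    }
}
```

With four DataNodes on two racks, DataNodeLayout.racks(4, 2) and DataNodeLayout.hosts(4) could then be passed as the racks and hosts arguments of the fourth constructor.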

Below is a simple example in which we configure and start MiniDFSCluster:
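
A minimal sketch of such a setup follows. It assumes an older Hadoop release (0.20/1.x style) in which the constructors listed above are available in org.apache.hadoop.hdfs, with the Hadoop core and HDFS test jars on the classpath. The class name MiniDfsClusterSketch and the path /sketch/hello.txt are made up for illustration. Newer Hadoop releases favor MiniDFSCluster.Builder over these constructors, so treat this as a starting point rather than a definitive recipe.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

// Hypothetical example class: starts a one-DataNode cluster, writes a
// small file, reads it back, and shuts the cluster down again.
class MiniDfsClusterSketch {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        MiniDFSCluster cluster = null;
        try {
            // Constructor 3 from the list above: one DataNode, format the
            // NameNode and DataNode storage, no rack information.
            cluster = new MiniDFSCluster(conf, 1, true, null);

            // Block until the NameNode and DataNode are up.
            cluster.waitActive();

            FileSystem fs = cluster.getFileSystem();
            Path file = new Path("/sketch/hello.txt"); // made-up path

            // Write a short string into the in-process HDFS.
            FSDataOutputStream out = fs.create(file);
            try {
                out.writeUTF("hello from MiniDFSCluster");
            } finally {
                out.close();
            }

            // Read it back to confirm the round trip works.
            FSDataInputStream in = fs.open(file);
            try {
                System.out.println(in.readUTF());
            } finally {
                in.close();
            }
        } finally {
            // Always stop the servers, even if the test body fails.
            if (cluster != null) {
                cluster.shutdown();
            }
        }
    }
}
```

In a JUnit test the same pattern usually moves the constructor call into a set-up method and the shutdown() call into a tear-down method, so every test starts from a clean cluster.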
