Class projects for Hadoop

The best way to learn anything is by doing it, and to master the Hadoop ecosystem you need to go beyond the Word Count program. Here is a list of projects I would work on if I had the time; it can also serve as a good list of class projects for Hadoop.

1) Matrix Decomposition routines (QR, Cholesky etc)

2) Decision Trees with ID3, C4.5 or other heuristic (https://issues.apache.org/jira/b… ).

Note: It looks like Mahout has a partial implementation of random decision forests; you may be able to use it to test your code (if questions arise, please ask on the Mahout mailing list; the community there is very helpful):
https://cwiki.apache.org/MAHOUT/…
https://cwiki.apache.org/MAHOUT/…
https://cwiki.apache.org/MAHOUT/…

3) Linear Regression https://cwiki.apache.org/conflue… ,

Ordinary Least Squares or other linear least squares methods: http://en.wikipedia.org/wiki/Ord…

4) Gradient Descent and other optimization and linear programming algorithms. See Convex Optimization: What are some good resources for learning about distributed optimization?, What are some fast gradient descent algorithms?, the Matlab Optimization Toolbox: http://www.mathworks.com/help/to… and Convex Optimization: Which optimization algorithms are good candidates for parallelization with MapReduce?

5) AdaBoost and other meta-algorithms: http://en.wikipedia.org/wiki/Ada…

6) SVM:

https://issues.apache.org/jira/b…

https://issues.apache.org/jira/b…

https://issues.apache.org/jira/b…

Support Vector Machines: What is the best way to implement an SVM using Hadoop?

7) Vector space models http://en.wikipedia.org/wiki/Vec…

8) Hidden Markov Models – an extremely popular method in NLP & bioinformatics.

9) Slope One by Daniel Lemire (http://en.wikipedia.org/wiki/Slo…) or other collaborative filtering algorithms.

See Mahout in Action by Sean Owen: http://www.manning.com/owen/

10) DFT/FFT, wavelets, the z-transform, and other popular signal and image processing transforms. See the Matlab Signal Processing Toolbox: http://www.mathworks.com/help/to… , the Image Processing Toolbox: http://www.mathworks.com/help/to… , and the Wavelet Toolbox: http://www.mathworks.com/help/to… ; also see the OpenCV catalog: http://opencv.willowgarage.com/w…

11) PageRank; here is a good tutorial: http://michaelnielsen.org/blog/u…

12) Build an eigensolver: http://www.cs.cmu.edu/~ukang/pap…

13) For a wealth of open-ended problems, see Programming Challenges: What are some good “toy problems” in data science?



Mozilla Foundation (makers of Firefox and Thunderbird) using Hadoop & HBase

This month’s Bay Area HBase user group meetup was held at the Mozilla Foundation in Mountain View. (The Mozilla Foundation is known for its open source products Firefox, Thunderbird, Bugzilla, etc.)

Andrew Purtell presented an interesting talk on installing HBase on an Amazon EC2 cluster. Mozilla’s metrics engineer Daniel talked about how HBase and Hadoop are used at Mozilla Corporation.

Daniel Einspanjer, a Metrics Software Engineer at Mozilla Corporation, mentioned that Mozilla is using Hadoop and HBase to generate usage and crash report metrics. According to Daniel, there are more than 350 million Firefox users, who report around 1 million crashes every day. These crash reports are valuable information for developers and can be used for debugging. The metrics also include information about plugin usage, which is valuable for plugin developers.

Mozilla wants to process this crash report data so that it can share important usage and debugging information with developers and users. Currently Mozilla uses only around 10-15% of the crash reports received daily.

By processing these crash reports Mozilla will be able to gain information such as:

  1. Total number of Firefox users.
  2. Add-on update pings.
  3. Average number of add-ons used per user.

Mozilla aims to process more than 15% of the crash reports it receives. To achieve this goal it is building a system to store crash data for the long term and to process it using Hadoop and HBase. Currently Mozilla runs a production cluster of 20 nodes with Hadoop 0.20 to process crash report data.

Source: http://blog.mozilla.com/data/tag/hadoop/

Beginner’s view of Hadoop MiniDFSCluster

If you are new to the Hadoop source code and want to write test-driven development code, then MiniDFSCluster is what you can use as your first step.

Many Hadoop developers will argue that MiniDFSCluster is not an ideal way to write unit tests for Hadoop, and that there are more efficient approaches (e.g. using mock objects with Mockito). We will discuss those in another post.

The MiniDFSCluster class creates a single-process DFS cluster for JUnit testing, which can include both non-simulated and simulated DFS data nodes. The data directories for non-simulated DFS are placed under the testing directory (build/test/data); for simulated data nodes, no underlying filesystem storage is used.

MiniDFSCluster is mostly used in the following four ways:

1. public MiniDFSCluster() {}

This empty constructor is used only when you wish to start a DataNode cluster without a NameNode (i.e. when the NameNode is started elsewhere).

2. public MiniDFSCluster(Configuration conf, int numDataNodes, StartupOption nameNodeOperation)

Modify the config and start up the servers with the given operation. Servers will be started on free ports. The caller must manage the creation of NameNode and DataNode directories and must have already set dfs.name.dir and dfs.data.dir in the given conf.

Here:

conf – the base configuration to use in starting the servers; this will be modified as necessary.

numDataNodes – the number of DataNodes to start; may be zero.

nameNodeOperation – the operation with which to start the servers. If null or StartupOption.FORMAT, then StartupOption.REGULAR will be used.

3. public MiniDFSCluster(Configuration conf,int numDataNodes,boolean format,String[] racks)

Modify the config and start up the servers. The RPC and info ports for the servers are guaranteed to use free ports. NameNode and DataNode directory creation and configuration will be managed by this class.

Here:

conf – the base configuration to use in starting the servers; this will be modified as necessary.

numDataNodes – the number of DataNodes to start; may be zero.

format – if true, format the NameNode and DataNodes before starting up.

racks – an array of strings indicating the rack that each DataNode is on.

4. public MiniDFSCluster(Configuration conf,int numDataNodes,boolean format,String[] racks,String[] hosts)

Modify the config and start up the servers. The RPC and info ports for the servers are guaranteed to use free ports. NameNode and DataNode directory creation and configuration will be managed by this class.

Here:

conf – the base configuration to use in starting the servers; this will be modified as necessary.

numDataNodes – the number of DataNodes to start; may be zero.

format – if true, format the NameNode and DataNodes before starting up.

racks – an array of strings indicating the rack that each DataNode is on.

hosts – an array of strings indicating the hostname of each DataNode.

Below is a simple example in which we configure and start a MiniDFSCluster.
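The following is a minimal sketch against the Hadoop 0.20-era test API (MiniDFSCluster ships with the Hadoop test sources in the org.apache.hadoop.hdfs package); treat it as illustrative rather than a drop-in test:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class MiniDFSClusterExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Start a cluster with 2 DataNodes, formatting NameNode and DataNodes, default racks.
    MiniDFSCluster cluster = new MiniDFSCluster(conf, 2, true, null);
    try {
      cluster.waitActive();                    // block until NameNode and DataNodes are up
      FileSystem fs = cluster.getFileSystem(); // file system backed by the mini cluster
      Path p = new Path("/test/hello.txt");
      fs.create(p).close();                    // create an empty file on the mini DFS
      System.out.println("exists: " + fs.exists(p));
    } finally {
      cluster.shutdown();                      // stop all NameNode and DataNode threads
    }
  }
}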


Running Hadoop 0.20 in Pseudo Distributed Mode on Mac OS X

Although Hadoop is designed for running distributed computing applications (MapReduce) on commodity hardware, it is possible to run Hadoop on a single machine in pseudo distributed mode. Running Hadoop in pseudo distributed mode is the first step towards running Hadoop in fully distributed mode.

To set up and run Hadoop in pseudo distributed mode you need Java 6 installed on your system; also make sure that JAVA_HOME is set in your environment variables. Download Hadoop 0.20 from here.

Download and Extract Hadoop
Download and save hadoop-xx.xx.tar.gz. To extract the Hadoop tarball, execute:

tar xvf hadoop-0.20.0.tar.gz

This should extract the Hadoop binaries and source into the hadoop-0.20.0 directory.

By default Hadoop is configured to run in standalone mode. To view Hadoop commands and options, execute bin/hadoop from the Hadoop root directory.
You will see the basic Hadoop commands and options as below.

matrix:Hadoop rphulari$ bin/hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
fs run a generic filesystem user client
version print the version
jar run a jar file
distcp copy file or directories recursively
archive -archiveName NAME <src>* <dest> create a hadoop archive
daemonlog get/set the log level for each daemon
or
CLASSNAME run the class named CLASSNAME

Most commands print help when invoked w/o parameters.

Configuration changes
We are five steps away from running Hadoop in pseudo distributed mode.

Step 1 – Configure conf/hadoop-env.sh

Uncomment and update the JAVA_HOME line so that it points to your system’s Java home directory. On Mac OS X it should point to /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/

Step 2 – Configure conf/hdfs-site.xml
Add following to conf/hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

Step 3 – Configure conf/core-site.xml
Add following to conf/core-site.xml

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>


Step 4 – Configure conf/mapred-site.xml
Add following to conf/mapred-site.xml

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>

Now you are all set to start Hadoop in pseudo distributed mode. You can either start all Hadoop processes (HDFS and MapReduce daemons) using bin/start-all.sh from the Hadoop root directory, or start only HDFS with bin/start-dfs.sh, or only the MapReduce processes with bin/start-mapred.sh.
Before starting HDFS (the Hadoop Distributed File System) we need to format it using the namenode format command:
matrix:Hadoop rphulari$ bin/hadoop namenode -format
This will print a lot of information on screen, including the Hadoop version, the host name and IP address, and the NameNode storage directory, which by default is under /tmp/hadoop-$username.
Once HDFS is formatted and ready for use, execute bin/start-all.sh to start all processes. You will see log messages on screen as the NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker start.
You can also make sure all processes are running by executing the Java jps command:
matrix-lm:Hadoop rphulari$ jps
12543 DataNode
12776 Jps
12677 JobTracker
12755 TaskTracker
12619 SecondaryNameNode

Playing with HDFS shell
HDFS, the Hadoop Distributed File System, is very similar to a Unix/POSIX file system. HDFS also provides shell commands for file system operations such as mkdir, ls, du, etc.
HDFS – ls
HDFS ls is part of hadoop fs (file system) and can be executed as follows; it shows the contents of the root (/) directory.
matrix:Hadoop rphulari$ bin/hadoop fs -ls /
Found 1 items
drwxr-xr-x - rphulari supergroup 0 2009-05-13 22:04 /tmp
NOTE – By default, the user who formats the NameNode and starts HDFS is the superuser of HDFS.
HDFS – mkdir
To create a directory on HDFS, use fs -mkdir.

matrix:Hadoop rphulari$ bin/hadoop fs -mkdir user
matrix:Hadoop rphulari$ bin/hadoop fs -ls /
Found 2 items
drwxr-xr-x - rphulari supergroup 0 2009-05-13 22:04 /tmp
drwxr-xr-x - rphulari supergroup 0 2009-05-13 22:06 /user
You can find the complete list of Hadoop shell commands here.
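The same operations can also be done programmatically. Below is a minimal sketch (assuming the pseudo distributed setup above, with fs.default.name set to hdfs://localhost:9000) that uses the Hadoop FileSystem API to create and list a directory; it is illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsShellEquivalent {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point at the pseudo distributed NameNode configured in core-site.xml.
    conf.set("fs.default.name", "hdfs://localhost:9000");
    FileSystem fs = FileSystem.get(conf);

    fs.mkdirs(new Path("/user"));                             // same as: bin/hadoop fs -mkdir /user

    for (FileStatus status : fs.listStatus(new Path("/"))) {  // same as: bin/hadoop fs -ls /
      System.out.println(status.getPath());
    }
  }
}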
In the next blog we will execute our first MapReduce program on Hadoop.