Adventures in Data

Class projects for Hadoop

The best way to learn anything is by doing it. To master the Hadoop ecosystem you need to go beyond the Word Count program. Here is a list of projects that I would like to work on if I get the time. It can also serve as a list of class projects for Hadoop.

1) Matrix Decomposition routines (QR, Cholesky, etc.)

2) Decision Trees with ID3, C4.5 or another heuristic (https://issues.apache.org/jira/b… ).

Note: Mahout appears to have a partial implementation of random decision forests, which you may be able to use to test your code (if questions arise, please ask on the Mahout mailing list; the community there is very helpful):
https://cwiki.apache.org/MAHOUT/…
https://cwiki.apache.org/MAHOUT/…
https://cwiki.apache.org/MAHOUT/…

3) Linear Regression: https://cwiki.apache.org/conflue…

Ordinary Least Squares or other linear least squares methods: http://en.wikipedia.org/wiki/Ord…
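
As a rough illustration of why ordinary least squares fits the MapReduce model, here is a minimal single-process sketch in Python. It assumes the normal-equations approach: each record contributes a small piece of X^T X and X^T y (the map step), the pieces are summed (the reduce step), and the small resulting system is solved on one machine. The function names (map_record, combine, solve, fit_ols) and the sample data are made up for this example and do not come from Hadoop or Mahout.

```python
from functools import reduce


def map_record(x, y):
    """Map step: one record's contribution to X^T X and X^T y."""
    xtx = [[xi * xj for xj in x] for xi in x]
    xty = [xi * y for xi in x]
    return xtx, xty


def combine(a, b):
    """Reduce step: element-wise sum of two partial results."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty


def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(records):
    """Sum the per-record pieces, then solve the normal equations."""
    parts = [map_record(x, y) for x, y in records]
    xtx, xty = reduce(combine, parts)
    return solve(xtx, xty)


# Hypothetical data generated from y = 1 + 2 * x; the leading 1.0 is the intercept term.
records = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0),
           ([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]
beta = fit_ols(records)
print(beta)
```

The partial sums are only as large as the number of features squared, so the reduce output stays small no matter how many records there are. Solving the normal equations directly can be numerically fragile for ill-conditioned data, which is one reason a QR-based variant would make a more interesting project.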

4) Gradient Descent and other optimization and linear programming algorithms. See "Convex Optimization: What are some good resources for learning about distributed optimization?", "What are some fast gradient descent algorithms?", the Matlab Optimization Toolbox: http://www.mathworks.com/help/to… and "Convex Optimization: Which optimization algorithms are good candidates for parallelization with MapReduce?"

5) AdaBoost and other meta-algorithms: http://en.wikipedia.org/wiki/Ada…

6) SVM:

https://issues.apache.org/jira/b…

https://issues.apache.org/jira/b…

https://issues.apache.org/jira/b…

Support Vector Machines: What is the best way to implement an SVM using Hadoop?

7) Vector space models: http://en.wikipedia.org/wiki/Vec…

8) Hidden Markov Models - an extremely popular method in NLP & bioinformatics.

9) Slope One by Daniel Lemire http://en.wikipedia.org/wiki/Slo… or other Collaborative Filtering algorithms.

See Mahout in Action by Sean Owen: http://www.manning.com/owen/
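
To give a feel for the Slope One project, here is a minimal in-memory sketch of the weighted Slope One scheme in Python. It is not Mahout's implementation; the function names and the tiny ratings table are made up for illustration. In a Hadoop version, accumulating the pairwise differences would be the natural map and reduce job, keyed by item pair.

```python
from collections import defaultdict


def build_deviations(ratings):
    """Accumulate rating differences and co-rating counts for every ordered item pair."""
    diffs = defaultdict(float)
    counts = defaultdict(int)
    for user_ratings in ratings.values():
        for i, ri in user_ratings.items():
            for j, rj in user_ratings.items():
                if i != j:
                    diffs[(i, j)] += ri - rj
                    counts[(i, j)] += 1
    return diffs, counts


def predict(user_ratings, item, diffs, counts):
    """Weighted Slope One prediction of `item` for one user, or None if no overlap."""
    numerator = 0.0
    denominator = 0
    for j, rj in user_ratings.items():
        c = counts.get((item, j), 0)
        if c:
            average_deviation = diffs[(item, j)] / c
            numerator += (average_deviation + rj) * c
            denominator += c
    return numerator / denominator if denominator else None


# Hypothetical ratings: user -> {item: rating}
ratings = {
    "john": {"A": 5.0, "B": 3.0, "C": 2.0},
    "mark": {"A": 3.0, "B": 4.0},
    "lucy": {"B": 2.0, "C": 5.0},
}
diffs, counts = build_deviations(ratings)
prediction = predict(ratings["lucy"], "A", diffs, counts)
print(prediction)
```

Here item A is rated on average 0.5 above B (two users) and 3 above C (one user), so the prediction for lucy on A is ((0.5 + 2) * 2 + (3 + 5) * 1) / 3, about 4.33.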

10) DFT/FFT, Wavelets, z-transform and other popular signal and image processing transforms. See the Matlab Signal Processing Toolbox: http://www.mathworks.com/help/to… , Image Processing Toolbox: http://www.mathworks.com/help/to… and Wavelet Toolbox: http://www.mathworks.com/help/to… Also see the OpenCV catalog: http://opencv.willowgarage.com/w…

11) PageRank. Here is a good tutorial: http://michaelnielsen.org/blog/u…
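
For a sense of how PageRank splits into map and reduce phases, here is a small single-process sketch in Python. It is not taken from the tutorial linked above; the function name, the damping factor of 0.85, the fixed iteration count and the toy graph are all assumptions made for illustration. On Hadoop, each pass of the loop would be one MapReduce job.

```python
from collections import defaultdict


def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration over an adjacency-list graph: node -> list of outgoing links."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        contributions = defaultdict(float)
        dangling = 0.0
        # "Map" phase: every node sends an equal share of its rank along each out-link.
        for node, links in graph.items():
            if links:
                share = rank[node] / len(links)
                for target in links:
                    contributions[target] += share
            else:
                # Nodes without out-links spread their rank evenly over all nodes.
                dangling += rank[node]
        # "Reduce" phase: sum the incoming shares per node and apply damping.
        rank = {
            node: (1.0 - damping) / n
            + damping * (contributions[node] + dangling / n)
            for node in nodes
        }
    return rank


# Hypothetical four-page web; every link target also appears as a key.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(ranks)
```

Page "c" collects links from three other pages and ends up with the highest rank, while "d" has no incoming links and keeps only the baseline share. A real job would also need a convergence check rather than a fixed number of iterations.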

12) Build an eigensolver: http://www.cs.cmu.edu/~ukang/pap…

13) For a wealth of open-ended problems, see "Programming Challenges: What are some good “toy problems” in data science?"


Written by Ravi

June 4, 2013 at 9:52 pm
