The best way to learn anything is by doing it. To master the Hadoop ecosystem you need to go beyond the Word Count program. Here is a list of projects I would like to work on if I get the time. It could also serve as a list of class projects for Hadoop.

1) Matrix decomposition routines (QR, Cholesky, etc.)

- Numerical Recipes: http://www.nr.com/
- Matrix factorization algorithms: http://bickson.blogspot.com/2011…

2) Decision trees with ID3, C4.5, or another heuristic (https://issues.apache.org/jira/b… ).

Note: Mahout appears to have a partial implementation of random decision forests, which you may be able to use to test your code. If questions arise, ask on the Mahout mailing list; the community there is very helpful:

https://cwiki.apache.org/MAHOUT/…

https://cwiki.apache.org/MAHOUT/…

https://cwiki.apache.org/MAHOUT/…

3) Linear regression: https://cwiki.apache.org/conflue…

Ordinary least squares or other linear least squares methods: http://en.wikipedia.org/wiki/Ord…

4) Gradient descent and other optimization and linear programming algorithms. See "Convex Optimization: What are some good resources for learning about distributed optimization?", "What are some fast gradient descent algorithms?", the Matlab Optimization Toolbox: http://www.mathworks.com/help/to… , and "Convex Optimization: Which optimization algorithms are good candidates for parallelization with MapReduce?"

5) AdaBoost and other meta-algorithms: http://en.wikipedia.org/wiki/Ada…

6) SVM:

https://issues.apache.org/jira/b…

https://issues.apache.org/jira/b…

https://issues.apache.org/jira/b…

Support Vector Machines: What is the best way to implement an SVM using Hadoop?

7) Vector space models: http://en.wikipedia.org/wiki/Vec…

8) Hidden Markov Models – an extremely popular method in NLP & bioinformatics.

9) Slope One by Daniel Lemire: http://en.wikipedia.org/wiki/Slo… or other collaborative filtering algorithms.

See *Mahout in Action* by Sean Owen: http://www.manning.com/owen/

10) DFT/FFT, wavelets, z-transform, and other popular signal and image processing transforms. See the Matlab Signal Processing Toolbox: http://www.mathworks.com/help/to… , the Image Processing Toolbox: http://www.mathworks.com/help/to… , and the Wavelet Toolbox: http://www.mathworks.com/help/to… Also see the OpenCV catalog: http://opencv.willowgarage.com/w…

11) PageRank. Here is a good tutorial: http://michaelnielsen.org/blog/u…

12) Build an eigensolver: http://www.cs.cmu.edu/~ukang/pap…

13) For a wealth of open-ended problems, see "Programming Challenges: What are some good “toy problems” in data science?"
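To make the step from Word Count to one of these projects concrete, here is a minimal Python sketch of how item 3 (ordinary least squares) might be cast as map and reduce steps. It is only an illustration under stated assumptions, not Hadoop code: the names `map_record`, `combine`, `solve` and `fit_ols` are hypothetical, the data is a toy example, and the "cluster" is simulated with Python's built-in `map` and `functools.reduce`. The idea is that each record contributes a small partial sum of X^T X and X^T y, the partial sums are added together, and the resulting normal equations are solved once at the end.

```python
from functools import reduce


def map_record(record):
    """Map step (hypothetical): one (features, target) record -> partial sums."""
    x, y = record
    xtx = [[xi * xj for xj in x] for xi in x]  # outer product x x^T
    xty = [xi * y for xi in x]                 # x * y
    return xtx, xty


def combine(a, b):
    """Reduce step (hypothetical): add two pairs of partial sums element-wise."""
    (xtx_a, xty_a), (xtx_b, xty_b) = a, b
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(xtx_a, xtx_b)]
    xty = [p + q for p, q in zip(xty_a, xty_b)]
    return xtx, xty


def solve(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting.

    Assumes a is square and non-singular; no check is made for a zero pivot.
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w


def fit_ols(records):
    """Sum the per-record statistics, then solve (X^T X) w = X^T y."""
    xtx, xty = reduce(combine, map(map_record, records))
    return solve(xtx, xty)


# Toy data on the line y = 1 + 2x; the constant first feature is the intercept.
data = [([1.0, float(x)], 1.0 + 2.0 * x) for x in range(5)]
weights = fit_ols(data)
print(weights)
```

Because `combine` is associative and commutative, it could in principle also serve as a combiner, so only small d-by-d matrices would have to move between nodes. On a real cluster the final solve would run in a single reducer or on the client, and for large or ill-conditioned problems a QR-based approach (item 1) may be preferable to the normal equations.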

Notes:

- See Jimmy Lin’s book *Data-Intensive Text Processing with MapReduce* for some good tips: http://www.umiacs.umd.edu/~jimmy… and Tom White’s book on Hadoop: http://www.hadoopbook.com/
- *Map-Reduce for Machine Learning on Multicore* by Chu et al.: www-cs.stanford.edu/~ang/papers/nips06-mapreducemulticore.pdf
- *Mining of Massive Datasets*, by Jeffrey Ullman: http://infolab.stanford.edu/~ull…
- Muthu Muthukrishnan’s resources: http://www.cs.rutgers.edu/~muthu…
- Top 10 algorithms in data mining: http://www.mendeley.com/research…
- Large Data Logistic Regression (with example Hadoop code): http://www.win-vector.com/blog/2…
- A Comparison of Eight MapReduce Languages: http://www.dataspora.com/2011/04…
- Seven data-mining algorithms which are 200-400x faster on GPUs: http://www.smedirector.com/2010/… via Michael E Driscoll
- RecLab Core by Darren Erik Vengroff: http://code.richrelevance.com/re…
- Amund Tveit’s links: http://atbrox.com/2011/05/16/map…
- Jeff Hammerbacher’s links: http://www.mendeley.com/groups/1…
- Scaling up machine learning: http://www.cs.umass.edu/~ronb/sc…
- Zero to Hadoop in 5 min with Common Crawl: http://www.commoncrawl.org/mapre…
- Antonio Piccolboni, Looking for a MapReduce language: http://blog.piccolboni.info/2011…
- Machine Learning: What are some good learning projects to teach oneself about machine learning?