My favorite features in Java 7
These are my favorite new features in Java 7 so far. I wonder why they were not added in an earlier version.
1. The try-with-resources Statement
Any object that implements java.lang.AutoCloseable, which includes all objects which implement java.io.Closeable, can be used as a resource.
static String readFirstLineFromFile(String path) throws IOException {
    try (BufferedReader br = new BufferedReader(new FileReader(path))) {
        return br.readLine();
    }
}
In this example, the resource declared in the try-with-resources statement is a BufferedReader. The declaration statement appears within parentheses immediately after the try keyword. The class BufferedReader, in Java SE 7 and later, implements the interface java.lang.AutoCloseable. Because the BufferedReader instance is declared in a try-with-resources statement, it will be closed regardless of whether the try statement completes normally or abruptly (as a result of the method BufferedReader.readLine throwing an IOException).
Prior to Java SE 7, you could use a finally block to ensure that a resource was closed regardless of whether the try statement completed normally or abruptly. The following example uses a finally block instead of a try-with-resources statement:
static String readFirstLineFromFileWithFinallyBlock(String path) throws IOException {
    BufferedReader br = new BufferedReader(new FileReader(path));
    try {
        return br.readLine();
    } finally { // this explicit close is not needed with try-with-resources in Java 7
        if (br != null) br.close();
    }
}
And you can declare more than one resource to close:
try (
    InputStream in = new FileInputStream(src);
    OutputStream out = new FileOutputStream(dest))
{
    // code
}
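To see what happens to several resources when the body fails, here is a minimal sketch. The class TryWithResourcesDemo and its Tracked resource are hypothetical names made up for illustration; only AutoCloseable and the try-with-resources syntax come from Java itself. It records the order of events so the closing order becomes visible.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; all names here are illustrative, not part of the JDK.
class TryWithResourcesDemo {

    // Records the order in which things happen.
    static final List<String> events = new ArrayList<>();

    // A minimal resource that logs when it is opened and closed.
    static class Tracked implements AutoCloseable {
        private final String name;

        Tracked(String name) {
            this.name = name;
            events.add("open " + name);
        }

        @Override
        public void close() {
            events.add("close " + name);
        }
    }

    // Opens two resources, then fails inside the try body.
    static List<String> run() {
        events.clear();
        try (Tracked in = new Tracked("in");
             Tracked out = new Tracked("out")) {
            events.add("body");
            throw new IllegalStateException("simulated failure");
        } catch (IllegalStateException e) {
            // Both resources are already closed by the time we get here.
            events.add("caught");
        }
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Running it shows that the resources are closed in the reverse order of their declaration, and that they are closed before the catch block runs.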
2. Catching Multiple Exception Types and Rethrowing Exceptions with Improved Type Checking
In Java SE 7 and later, a single catch block can handle more than one type of exception. This feature can reduce code duplication and lessen the temptation to catch an overly broad exception.
Consider the following example, which contains duplicate code in each of the catch blocks:
catch (IOException ex) {
    logger.log(ex);
    throw ex;
}
catch (SQLException ex) {
    logger.log(ex);
    throw ex;
}
In releases prior to Java SE 7, it is difficult to create a common method to eliminate the duplicated code because the variable ex has different types.
The following example, which is valid in Java SE 7 and later, eliminates the duplicated code:
catch (IOException|SQLException ex) {
    logger.log(ex);
    throw ex;
}
The catch clause specifies the types of exceptions that the block can handle, and each exception type is separated with a vertical bar (|).
Note: If a catch block handles more than one exception type, then the catch parameter is implicitly final. In this example, the catch parameter ex is final and therefore you cannot assign any values to it within the catch block.
Bytecode generated by compiling a catch block that handles multiple exception types will be smaller (and thus superior) than compiling many catch blocks that handle only one exception type each. A catch block that handles multiple exception types creates no duplication in the bytecode generated by the compiler; the bytecode has no replication of exception handlers.
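The catch fragments above are not complete programs, so here is a small self-contained sketch of both ideas from the heading: one handler for two exception types, and rethrowing with improved type checking. The class MultiCatchDemo and the method risky are hypothetical names used only for this illustration.

```java
import java.io.IOException;
import java.sql.SQLException;

// Hypothetical demo class; names are illustrative only.
class MultiCatchDemo {

    // Throws one of two unrelated checked exceptions, depending on the flag.
    static void risky(boolean io) throws IOException, SQLException {
        if (io) {
            throw new IOException("disk");
        }
        throw new SQLException("db");
    }

    // One catch block handles both exception types.
    static String describe(boolean io) {
        try {
            risky(io);
            return "ok";
        } catch (IOException | SQLException ex) {
            // ex is implicitly final here; "ex = null;" would not compile.
            return ex.getClass().getSimpleName() + ": " + ex.getMessage();
        }
    }

    // Improved type checking on rethrow: the catch parameter is declared as
    // Exception, yet the method only has to declare the two precise types,
    // because the compiler knows what the try body can actually throw.
    static void rethrow(boolean io) throws IOException, SQLException {
        try {
            risky(io);
        } catch (Exception ex) {
            throw ex;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(true));
        System.out.println(describe(false));
    }
}
```

Before Java 7 the rethrow method would have needed "throws Exception", since the type of the rethrown exception was taken from the catch parameter.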
3. Type Inference for Generic Instance Creation
You can replace the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>) as long as the compiler can infer the type arguments from the context. This pair of angle brackets is informally called the diamond.
For example, consider the following variable declaration:
Map<String, List> myMap = new HashMap<String, List>();
In Java SE 7, you can substitute the parameterized type of the constructor with an empty set of type parameters (<>):
Map<String, List> myMap = new HashMap<>();
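A slightly fuller sketch of the diamond in use, assuming a made-up task of grouping words by their first letter. DiamondDemo and group are hypothetical names; the point is only that both constructor calls leave the type arguments to the compiler.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; names are illustrative only.
class DiamondDemo {

    // Groups non-empty words by their first letter.
    static Map<String, List<String>> group(String... words) {
        // The compiler infers <String, List<String>> from the declared type.
        Map<String, List<String>> byInitial = new HashMap<>();
        for (String w : words) {
            String key = w.substring(0, 1);
            List<String> bucket = byInitial.get(key);
            if (bucket == null) {
                // Inferred as ArrayList<String>.
                bucket = new ArrayList<>();
                byInitial.put(key, bucket);
            }
            bucket.add(w);
        }
        return byInitial;
    }

    public static void main(String[] args) {
        System.out.println(group("ant", "ape", "bee"));
    }
}
```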
Class projects for Hadoop
The best way to learn anything is by doing it. To master the Hadoop ecosystem you need to go beyond the Word Count program. Here is a list of some projects that I would like to work on if I get time. It can also serve as a good list of class projects for Hadoop.
1) Matrix Decomposition routines (QR, Cholesky etc)
- Numerical Recipes: http://www.nr.com/
- Matrix factorization algorithms: http://bickson.blogspot.com/2011…
2) Decision Trees with ID3, C4.5 or other heuristic (https://issues.apache.org/jira/b… ).
Note: It looks like Mahout has a partial implementation of random decision forests, which you may be able to use to test your code (if questions arise, please ask on the Mahout mailing list; the community there is very helpful):
https://cwiki.apache.org/MAHOUT/…
https://cwiki.apache.org/MAHOUT/…
https://cwiki.apache.org/MAHOUT/…
3) Linear Regression https://cwiki.apache.org/conflue… ,
Ordinary Least Squares or other linear least squares methods: http://en.wikipedia.org/wiki/Ord…
4) Gradient Descent and other optimization and linear programming algorithms, see Convex Optimization: What are some good resources for learning about distributed optimization? , What are some fast gradient descent algorithms? , Matlab optimization toolbox: http://www.mathworks.com/help/to… , Convex Optimization: Which optimization algorithms are good candidates for parallelization with MapReduce?
5) AdaBoost and other meta-algorithms: http://en.wikipedia.org/wiki/Ada…
6) SVM:
https://issues.apache.org/jira/b…
https://issues.apache.org/jira/b…
https://issues.apache.org/jira/b…
Support Vector Machines: What is the best way to implement an SVM using Hadoop?
7) Vector space models http://en.wikipedia.org/wiki/Vec…
8) Hidden Markov Models - an extremely popular method in NLP & bioinformatics.
9) Slope One by Daniel Lemire: http://en.wikipedia.org/wiki/Slo… or other Collaborative Filtering algorithms.
See Mahout in Action by Sean Owen: http://www.manning.com/owen/
10) DFT/FFT, Wavelets, z-transform, other popular signal and image processing transforms, see Matlab Signal Processing toolbox: http://www.mathworks.com/help/to… , Image Processing toolbox: http://www.mathworks.com/help/to… Wavelet Toolbox http://www.mathworks.com/help/to… also see OpenCV catalog: http://opencv.willowgarage.com/w…
11) PageRank, here is a good tutorial: http://michaelnielsen.org/blog/u…
12) Build an eigensolver: http://www.cs.cmu.edu/~ukang/pap…
13) For a wealth of open ended problems see Programming Challenges: What are some good “toy problems” in data science?
Notes:
- See Jimmy Lin’s book Data-Intensive Text Processing with MapReduce for some good tips: http://www.umiacs.umd.edu/~jimmy… and Tom White’s book on Hadoop: http://www.hadoopbook.com/
- Map-Reduce for Machine Learning on Multicore by Chu et al.: www-cs.stanford.edu/~ang/papers/nips06-mapreducemulticore.pdf
- Mining of Massive Datasets, by Jeffrey Ullman: http://infolab.stanford.edu/~ull…
- Muthu Muthukrishnan’s resources: http://www.cs.rutgers.edu/~muthu…
- Top 10 algorithms in data mining: http://www.mendeley.com/research…
- Large Data Logistic Regression (with example Hadoop code): http://www.win-vector.com/blog/2…
- A Comparison of Eight MapReduce Languages: http://www.dataspora.com/2011/04…
- Seven data-mining algorithms which are 200-400x faster on GPUs: http://www.smedirector.com/2010/… via Michael E Driscoll
- RecLab Core by Darren Erik Vengroff: http://code.richrelevance.com/re…
- Amund Tveit’s links: http://atbrox.com/2011/05/16/map…
- Jeff Hammerbacher’s links: http://www.mendeley.com/groups/1…
- Scaling up machine learning: http://www.cs.umass.edu/~ronb/sc…
- Zero to Hadoop in 5 min with Common Crawl: http://www.commoncrawl.org/mapre…
- Antonio Piccolboni, Looking for a MapReduce language: http://blog.piccolboni.info/2011…
- Machine Learning: What are some good learning projects to teach oneself about machine learning?
Hadoop Cookbook – 4, How to run multiple hadoop data nodes on one machine.
Although Hadoop is designed and developed for distributed computing, it can be run on a single node in pseudo-distributed mode, and with multiple data nodes on a single machine. Developers often run multiple data nodes on a single node to develop and test distributed features, data node behavior, name node interaction with data nodes, and for other reasons.
If you want to get a feel for how Hadoop’s data nodes and name node work together and you have only one machine, you can run multiple data nodes on that machine. You can see how the name node stores its metadata (fsimage, edits, fstime) and how a data node stores data blocks on the local file system.
Steps
To start multiple data nodes on a single node:
- Download the Hadoop binary or build it from the Hadoop source.
- Prepare the Hadoop configuration to run on a single node (change Hadoop's default tmp dir location from /tmp to some other reliable location).
- Add the run-additionalDN.sh script shown below to the $HADOOP_HOME/bin directory and chmod it to 744.
- Format HDFS: bin/hadoop namenode -format (for Hadoop 0.20 and below) or bin/hdfs namenode -format (for Hadoop 0.21 and later).
- Start HDFS with bin/start-dfs.sh. This starts the name node and 1 data node, which can be viewed at http://localhost:50070
- Start additional data nodes using bin/run-additionalDN.sh
run-additionalDN.sh
#!/bin/sh
# This is used for starting multiple datanodes on the same machine.
# Run it from hadoop-dir/ just like 'bin/hadoop'.

# Usage: run-additionalDN.sh [start|stop] dnnumber
# e.g. run-additionalDN.sh start 2

DN_DIR_PREFIX="/path/to/store/data_and_log_of_additionalDN/"

if [ -z "$DN_DIR_PREFIX" ]; then
    echo $0: DN_DIR_PREFIX is not set. set it to something like "/hadoopTmp/dn"
    exit 1
fi

run_datanode () {
    DN=$2
    export HADOOP_LOG_DIR=$DN_DIR_PREFIX$DN/logs
    export HADOOP_PID_DIR=$HADOOP_LOG_DIR
    # Each option needs a space before the line continuation,
    # otherwise it is glued to the next option.
    DN_CONF_OPTS="\
-Dhadoop.tmp.dir=$DN_DIR_PREFIX$DN \
-Ddfs.datanode.address=0.0.0.0:5001$DN \
-Ddfs.datanode.http.address=0.0.0.0:5008$DN \
-Ddfs.datanode.ipc.address=0.0.0.0:5002$DN"
    bin/hadoop-daemon.sh --script bin/hdfs $1 datanode $DN_CONF_OPTS
}

cmd=$1
shift;

for i in $*
do
    run_datanode $cmd $i
done
Use jps or the name node web UI to verify that the additional data nodes have started.
I started a total of 3 data nodes (2 additional data nodes) on my single-node machine, which are running on ports 50010, 50011 and 50012 as shown in the screenshot below.
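The port numbers follow directly from the script: it appends the data node number to a fixed prefix for each of the three addresses. Here is a rough sketch of that arithmetic. DataNodePorts is a hypothetical helper written for this post, not part of Hadoop, and the 1..9 limit is my own assumption (a second digit would push the result past 65535, and 0 would reuse 50010, the port of the first data node).

```java
// Hypothetical helper; the class and method names are illustrative only.
class DataNodePorts {

    // The script appends the data node number to a fixed prefix,
    // e.g. "5001" followed by 2 gives port 50012.
    static int port(String prefix, int dn) {
        if (dn < 1 || dn > 9) {
            // Assumption: a single digit keeps the port valid and avoids 50010.
            throw new IllegalArgumentException("data node number must be 1..9: " + dn);
        }
        return Integer.parseInt(prefix + dn);
    }

    static int dataPort(int dn) { return port("5001", dn); } // dfs.datanode.address
    static int httpPort(int dn) { return port("5008", dn); } // dfs.datanode.http.address
    static int ipcPort(int dn)  { return port("5002", dn); } // dfs.datanode.ipc.address

    public static void main(String[] args) {
        for (int dn = 1; dn <= 2; dn++) {
            System.out.println("DN " + dn + ": data=" + dataPort(dn)
                    + " http=" + httpPort(dn) + " ipc=" + ipcPort(dn));
        }
    }
}
```

For data node numbers 1 and 2 this gives data ports 50011 and 50012, matching the ports listed above.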
Hadoop Cookbook – 3, How to build your own Hadoop distribution.
Problem : You want to build your own Hadoop distribution.
Often you need a particular feature, added through a patch, in your Hadoop build while it is still only in trunk and not available in any Hadoop release. In such cases you can build and distribute your own Hadoop distribution.
Solution: You can build your own version of the Hadoop distribution by following the steps given below.
1. Check out the latest released branch (let's say we want to work on the Hadoop 0.20 branch)
> svn checkout http://svn.apache.org/repos/asf/hadoop/common/tags/release-X.Y.Z/ hadoop-common-X.Y.Z
2. Download required patch
3. Apply required patch
> patch -p0 -E < /path/to/patch
4. Test patch
ant \
-Dpatch.file=/path/to/my.patch \
-Dforrest.home=/path/to/forrest/ \
-Dfindbugs.home=/path/to/findbugs \
-Dscratch.dir=/path/to/a/temp/dir \
-Dsvn.cmd=/path/to/subversion/bin/svn \
-Dgrep.cmd=/path/to/grep \
-Dpatch.cmd=/path/to/patch \
test-patch
The scratch.dir, svn.cmd, grep.cmd and patch.cmd properties are optional.
5. Build Hadoop binary with documentation
ant -Djava5.home=$Java5Home -Dforrest.home=/path_to/apache-forrest \
-Dfindbugs.home=/path_to/findbugs/latest compile-core tar
Successful completion of the above command creates a Hadoop tar file, which can be used as your Hadoop distribution.
Yahoo! giving away free tickets to 2010 Hadoop Summit.
Get ready for the 3rd Hadoop Summit, which will be held on 29th June 2010 at the Hyatt Regency in Santa Clara.
Yahoo! is giving free tickets to the 2010 Hadoop Summit to winners of the Hadoop Summit Retweet Contest. To win these tickets you just have to follow @YDN on Twitter and keep an eye out for the @YDN “fun fact” of the day tweet about Hadoop every Monday. Then, on the same day, retweet the “fun fact” about cloud computing along with the hash tag #Y!Hadoop. All RTs must be received by 11:59 pm EST on the same Monday. The very next day, Yahoo! will randomly select one lucky winner to receive 2 complimentary tickets to the Hadoop Summit.
Click here for the official posting on YDN.
Hadoop Cookbook – 2 , How to build Hadoop with my custom patch?
Problem : How do I build my own version of Hadoop with my custom patch?
Solution : Apply the patch and build Hadoop.
You will need : Hadoop source code, custom patch, Java 6, Apache Ant, Java 5 (for generating documents), Apache Forrest (for generating documents).
Steps :
Check out the Hadoop source code:
> svn co https://svn.apache.org/repos/asf/hadoop/common/tags/release-X.Y.Z-rcR
Apply your patch to check its functionality, using the following command:
> patch -p0 -E < ~/Path/To/Patch.patch
Test and compile the source code with the latest patch using Ant:
> ant -Djava5.home=/System/Library/Frameworks/JavaVM.framework/Versions/1.5/Home/ -Dforrest.home=/Path/to/forrest/apache-forrest-0.8 -Dfindbugs.home=/Path/to/findbugs/latest compile-core tar
Build the documents:
> ant -Dforrest.home=$FORREST_HOME -Djava5.home=$JAVA5 docs
Hadoop cookbook – 1. How to transfer data between different HDFS clusters.
Problem : You have multiple Hadoop clusters running and you want to transfer several terabytes of data from one cluster to another.
Solution : DistCp – Distributed copy.
It is common for Hadoop clusters to be loaded with terabytes of data (not all clusters are petabytes in size), and it would take forever to transfer terabytes of data from one cluster to another sequentially. Distributed, parallel copying of the data is a good solution for this, and that is what DistCp does. DistCp runs a MapReduce job to transfer your data from one cluster to another.
To transfer data using DistCp you need to specify the HDFS path names of the source and the destination, as shown below.
bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
hdfs://nn2:8020/bar/foo
You can also specify multiple source directories on the command line:
bash$ hadoop distcp hdfs://nn1:8020/foo/a \
hdfs://nn1:8020/foo/b \
hdfs://nn2:8020/bar/foo
Or, equivalently, from a file using the -f option:
bash$ hadoop distcp -f hdfs://nn1:8020/srclist \
hdfs://nn2:8020/bar/foo
Where srclist contains
hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b
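If you launch these copies from a Java program rather than by hand, the two command forms above can be assembled as plain argument lists. This is only a sketch: DistCpCommand is a hypothetical helper of mine, it does not use any Hadoop API, and it merely builds the same command lines shown above without running them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; it only assembles the command line, it does not run it.
class DistCpCommand {

    // hadoop distcp <source>... <destination>
    static List<String> build(List<String> sources, String destination) {
        if (sources.isEmpty()) {
            throw new IllegalArgumentException("at least one source is required");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("hadoop");
        cmd.add("distcp");
        cmd.addAll(sources);
        cmd.add(destination);
        return cmd;
    }

    // hadoop distcp -f <file listing the sources> <destination>
    static List<String> buildFromList(String sourceListUri, String destination) {
        List<String> cmd = new ArrayList<>();
        cmd.add("hadoop");
        cmd.add("distcp");
        cmd.add("-f");
        cmd.add(sourceListUri);
        cmd.add(destination);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> sources = new ArrayList<>();
        sources.add("hdfs://nn1:8020/foo/a");
        sources.add("hdfs://nn1:8020/foo/b");
        System.out.println(String.join(" ", build(sources, "hdfs://nn2:8020/bar/foo")));
        System.out.println(String.join(" ",
                buildFromList("hdfs://nn1:8020/srclist", "hdfs://nn2:8020/bar/foo")));
    }
}
```

The resulting list could be handed to java.lang.ProcessBuilder on a machine where the hadoop command is on the PATH, which is an assumption about your setup.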
Click here to learn more about DistCp

