TIL: Analytic functions in Oracle.

Oracle supports many useful analytic functions, but unless you unearth them from the Oracle documentation they are easy to overlook. This post is my attempt to describe the analytic functions Oracle supports and how to use them.


The Oracle/PLSQL LAG function is an analytic function that lets you access more than one row of a table in the same query without having to join the table to itself.

It returns a value from a previous row in the result set.

Example: Suppose we have the following table.

City            Population   Year
San Jose        1,000,000    2014
San Francisco   1,230,000    2014
San Diego         900,000    2014
Las Vegas         904,000    2014
San Jose        1,200,000    2015
San Francisco   1,330,000    2015
San Diego         910,000    2015
Las Vegas         909,000    2015

CityPopulation Table

And we need to find each city's current-year population alongside its previous year's population.
This can be done using the LAG (lagging) function.


select city, population,
       LAG (population, 1) over (partition by city order by year) AS prev_population
from CityPopulation;

City            Population   Prev_Population
San Jose        1,200,000    1,000,000
San Francisco   1,330,000    1,230,000
San Diego         910,000      900,000
Las Vegas         909,000      904,000

(2015 rows shown; the 2014 rows have no previous year, so their prev_population is NULL.)


As shown above, each row now carries the previous year's value. Similar to LAG, the LEAD function returns a value from the next row instead: LEAD(population, 1) over the same window would return each city's next year's population.
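To build intuition for what LAG does, here is the same per-partition lookup sketched in plain Java. The data mirrors one partition (city) of the CityPopulation table; this is only an illustration of the semantics, not how Oracle evaluates the query.

```java
import java.util.*;

public class LagDemo {
    // LAG(value, 1) within one partition: the value for the greatest key
    // strictly below the given key, or null if there is none.
    static Integer lagPrev(TreeMap<Integer, Integer> series, int year) {
        Map.Entry<Integer, Integer> prev = series.lowerEntry(year);
        return prev == null ? null : prev.getValue();
    }

    public static void main(String[] args) {
        // One partition of the CityPopulation table (city = San Jose)
        TreeMap<Integer, Integer> sanJose = new TreeMap<>();
        sanJose.put(2014, 1_000_000);
        sanJose.put(2015, 1_200_000);
        for (int year : sanJose.keySet()) {
            System.out.println("San Jose " + year + " " + sanJose.get(year)
                    + " prev=" + lagPrev(sanJose, year));
        }
    }
}
```

The first row of each partition has no predecessor, which is why LAG yields NULL there unless you pass a default as its third argument.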


My favorite features in Java 7

So far, the following are my favorite new features in Java 7. I wonder why they were not added in an earlier version.

1. The try-with-resources Statement

Any object that implements java.lang.AutoCloseable, which includes all objects that implement java.io.Closeable, can be used as a resource.

static String readFirstLineFromFile(String path) throws IOException {
    try (BufferedReader br = new BufferedReader(new FileReader(path))) {
        return br.readLine();
    }
}

In this example, the resource declared in the try-with-resources statement is a BufferedReader. The declaration statement appears within parentheses immediately after the try keyword. The class BufferedReader, in Java SE 7 and later, implements the interface java.lang.AutoCloseable. Because the BufferedReader instance is declared in a try-with-resource statement, it will be closed regardless of whether the try statement completes normally or abruptly (as a result of the method BufferedReader.readLine throwing an IOException).

Prior to Java SE 7, you can use a finally block to ensure that a resource is closed regardless of whether the try statement completes normally or abruptly. The following example uses a finally block instead of a try-with-resources statement:

static String readFirstLineFromFileWithFinallyBlock(String path) throws IOException {
    BufferedReader br = new BufferedReader(new FileReader(path));
    try {
        return br.readLine();
    } finally { // in Java 7, try-with-resources makes this explicit close unnecessary
        if (br != null) br.close();
    }
}

And you can declare more than one resource to close:

try (
    InputStream in = new FileInputStream(src);
    OutputStream out = new FileOutputStream(dest)) {
    // code
}
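When several resources are declared, they are closed in reverse order of declaration. A minimal runnable sketch (the Res class is hypothetical, standing in for any AutoCloseable):

```java
import java.util.*;

public class CloseOrder {
    static final List<String> log = new ArrayList<>();

    // Hypothetical resource that records its open/close events.
    static class Res implements AutoCloseable {
        Res(String name) { this.name = name; log.add("open " + name); }
        @Override public void close() { log.add("close " + name); }
        private final String name;
    }

    static void demo() {
        try (Res a = new Res("A"); Res b = new Res("B")) {
            log.add("body");
        }
    }

    public static void main(String[] args) {
        demo();
        // B was declared last, so it is closed first
        System.out.println(log);
    }
}
```

Reverse-order closing matters when a later resource wraps or depends on an earlier one, e.g. an OutputStream built on top of another stream.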

2. Catching Multiple Exception Types and Rethrowing Exceptions with Improved Type Checking

In Java SE 7 and later, a single catch block can handle more than one type of exception. This feature can reduce code duplication and lessen the temptation to catch an overly broad exception.

Consider the following example, which contains duplicate code in each of the catch blocks:

   catch (IOException ex) {
       throw ex;
   } catch (SQLException ex) {
       throw ex;
   }

In releases prior to Java SE 7, it is difficult to create a common method to eliminate the duplicated code because the variable ex has different types.
The following example, which is valid in Java SE 7 and later, eliminates the duplicated code:

  catch (IOException|SQLException ex) {
      throw ex;
  }

The catch clause specifies the types of exceptions that the block can handle, and each exception type is separated with a vertical bar (|).

Note: If a catch block handles more than one exception type, then the catch parameter is implicitly final. In this example, the catch parameter ex is final and therefore you cannot assign any values to it within the catch block.

Bytecode generated by compiling a catch block that handles multiple exception types is smaller (and thus superior) than the bytecode generated by compiling many catch blocks that each handle only one exception type: a multi-type catch block creates no duplication in the generated bytecode, since the exception handler is not replicated.
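A runnable sketch of a single catch handling two unrelated exception types (the handle method and its inputs are hypothetical):

```java
public class MultiCatch {
    // Hypothetical helper: parse s, treating null input as an I/O problem.
    static String handle(String s) {
        try {
            if (s == null) throw new java.io.IOException("no input");
            return "parsed " + Integer.parseInt(s);
        } catch (java.io.IOException | NumberFormatException ex) {
            // ex is implicitly final here: it cannot be reassigned in this block
            return "handled " + ex.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(handle("42"));   // parsed 42
        System.out.println(handle(null));   // handled IOException
        System.out.println(handle("abc"));  // handled NumberFormatException
    }
}
```

Note that the two exception types need no common ancestor other than Throwable; the compiler types ex as their least upper bound.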

3. Type Inference for Generic Instance Creation

You can replace the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>) as long as the compiler can infer the type arguments from the context. This pair of angle brackets is informally called the diamond.

For example, consider the following variable declaration:

    Map<String, List<String>> myMap = new HashMap<String, List<String>>();

In Java SE 7, you can substitute the parameterized type of the constructor with an empty set of type parameters (<>):

    Map<String, List<String>> myMap = new HashMap<>();

Class projects for Hadoop

The best way to learn anything is by doing it. To master the Hadoop ecosystem you need to go beyond the Word Count program. Here is a list of projects I would like to work on if I find the time; it can also serve as a good list of class projects for Hadoop.

1) Matrix Decomposition routines (QR, Cholesky, etc.)

2) Decision Trees with ID3, C4.5 or other heuristics (https://issues.apache.org/jira/b… ).

Note: It looks like Mahout has a partial implementation of random decision forests; you may be able to use it to test your code (if questions arise, ask on the Mahout mailing list, the community there is very helpful).

3) Linear Regression: https://cwiki.apache.org/conflue…

Ordinary Least Squares or other linear least squares methods: http://en.wikipedia.org/wiki/Ord…

4) Gradient Descent and other optimization and linear programming algorithms; see Convex Optimization: What are some good resources for learning about distributed optimization?, What are some fast gradient descent algorithms?, the Matlab optimization toolbox: http://www.mathworks.com/help/to… and Convex Optimization: Which optimization algorithms are good candidates for parallelization with MapReduce?

5) AdaBoost and other meta-algorithms: http://en.wikipedia.org/wiki/Ada…

6) SVM; see Support Vector Machines: What is the best way to implement an SVM using Hadoop?

7) Vector space models: http://en.wikipedia.org/wiki/Vec…

8) Hidden Markov Models – an extremely popular method in NLP & bioinformatics.

9) Slope One by Daniel Lemire: http://en.wikipedia.org/wiki/Slo… or other Collaborative Filtering algorithms.

See Mahout in Action by Sean Owen: http://www.manning.com/owen/

10) DFT/FFT, Wavelets, z-transform and other popular signal and image processing transforms; see the Matlab Signal Processing toolbox: http://www.mathworks.com/help/to… , Image Processing toolbox: http://www.mathworks.com/help/to… , Wavelet Toolbox: http://www.mathworks.com/help/to… ; also see the OpenCV catalog: http://opencv.willowgarage.com/w…

11) PageRank, here is a good tutorial: http://michaelnielsen.org/blog/u…

12) Build an eigensolver: http://www.cs.cmu.edu/~ukang/pap…

13) For a wealth of open-ended problems see Programming Challenges: What are some good “toy problems” in data science?
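For item 11, the core of PageRank is simple enough to prototype on one machine before porting it to MapReduce. A minimal power-iteration sketch (the graph and damping factor are illustrative, and this is a single-machine version, not a MapReduce job):

```java
import java.util.Arrays;

public class PageRank {
    // Power iteration: outLinks[u] lists the nodes u links to,
    // d is the damping factor, iters the number of iterations.
    public static double[] rank(int[][] outLinks, int n, double d, int iters) {
        double[] r = new double[n];
        Arrays.fill(r, 1.0 / n);
        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - d) / n);          // teleport term
            for (int u = 0; u < n; u++) {
                if (outLinks[u].length == 0) {       // dangling node: spread evenly
                    for (int v = 0; v < n; v++) next[v] += d * r[u] / n;
                } else {
                    for (int v : outLinks[u]) next[v] += d * r[u] / outLinks[u].length;
                }
            }
            r = next;
        }
        return r;
    }

    public static void main(String[] args) {
        // 3-node cycle 0 -> 1 -> 2 -> 0: by symmetry all ranks converge to 1/3
        double[] r = rank(new int[][] {{1}, {2}, {0}}, 3, 0.85, 50);
        System.out.printf("%.3f %.3f %.3f%n", r[0], r[1], r[2]);
    }
}
```

In a MapReduce port, each iteration becomes one job: the map phase emits each node's rank share to its out-links, and the reduce phase sums the shares per node.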


Hadoop Cookbook – 4, How to run multiple Hadoop data nodes on one machine.

Although Hadoop is designed and developed for distributed computing, it can be run on a single node in pseudo-distributed mode, and even with multiple data nodes on a single machine. Developers often run multiple data nodes on one machine to develop and test distributed features, data node behavior, and the name node's interaction with data nodes.

If you want to see Hadoop's distributed data node / name node interplay and you have only one machine, you can run multiple data nodes on that machine. You can observe how the name node stores its metadata (fsimage, edits, fstime) and how the data nodes store data blocks on the local file system.


To start multiple data nodes on a single machine, first download or build a Hadoop binary, then:

  1. Download a Hadoop binary release or build one from the Hadoop source.
  2. Prepare the Hadoop configuration to run on a single node (change Hadoop's default tmp dir location from /tmp to some other reliable location).
  3. Add the script below to the $HADOOP_HOME/bin directory and chmod it to 744.
  4. Format HDFS: bin/hadoop namenode -format (for Hadoop 0.20 and below) or bin/hdfs namenode -format (for 0.21 and later).
  5. Start HDFS with bin/start-dfs.sh (this starts the name node and 1 data node), which can be viewed on http://localhost:50070
  6. Start additional data nodes using bin/run-additionalDN.sh


#!/bin/sh
# This is used for starting multiple datanodes on the same machine.
# run it from hadoop-dir/ just like 'bin/hadoop'

#Usage: run-additionalDN.sh [start|stop] dnnumber
#e.g. run-additionalDN.sh start 2

# DN_DIR_PREFIX must point at a writable location; each additional datanode
# stores its data and logs under $DN_DIR_PREFIX<n>. Adjust the port scheme
# below (5001<n>, 5008<n>, 5002<n>) to your setup if those ports are taken.
if [ -z "$DN_DIR_PREFIX" ]; then
    echo "$0: DN_DIR_PREFIX is not set. set it to something like /hadoopTmp/dn"
    exit 1
fi

run_datanode () {
    DN=$2
    export HADOOP_LOG_DIR=$DN_DIR_PREFIX$DN/logs
    export HADOOP_PID_DIR=$HADOOP_LOG_DIR
    DN_CONF_OPTS="\
-Dhadoop.tmp.dir=$DN_DIR_PREFIX$DN \
-Ddfs.datanode.address=0.0.0.0:5001$DN \
-Ddfs.datanode.http.address=0.0.0.0:5008$DN \
-Ddfs.datanode.ipc.address=0.0.0.0:5002$DN"
    bin/hadoop-daemon.sh --script bin/hdfs $1 datanode $DN_CONF_OPTS
}

cmd=$1
shift

for i in $*
do
    run_datanode $cmd $i
done

Use jps or the name node web UI to verify that the additional data nodes have started.

I started a total of 3 data nodes (2 additional data nodes) on my single machine; they run on ports 50010, 50011 and 50012, as shown in the screenshot below.

Hadoop Cookbook – 3, How to build your own Hadoop distribution.

Problem : You want to build your own Hadoop distribution.

Often you need a particular feature that was added through a patch which is still in trunk and not yet available in a Hadoop release. In such cases you can build and distribute your own Hadoop distribution.

Solution: You can build your own version of Hadoop distribution by following steps given below.

1. Check out the latest release branch (let's say we want to work on the Hadoop 0.20 branch):

  > svn checkout \
    http://svn.apache.org/repos/asf/hadoop/common/tags/release-X.Y.Z/ hadoop-common-X.Y.Z

2. Download required patch

3. Apply the required patch: patch -p0 -E < /path/to/patch

4. Test patch

 ant \
  -Dpatch.file=/path/to/my.patch \
  -Dforrest.home=/path/to/forrest/ \
  -Dfindbugs.home=/path/to/findbugs \
  -Dscratch.dir=/path/to/a/temp/dir \ (optional)
  -Dsvn.cmd=/path/to/subversion/bin/svn \ (optional)
  -Dgrep.cmd=/path/to/grep \ (optional)
  -Dpatch.cmd=/path/to/patch \ (optional)
  test-patch

5. Build the Hadoop binary with documentation:

 ant -Djava5.home=$Java5Home -Dforrest.home=/path_to/apache-forrest \
  -Dfindbugs.home=/path_to/findbugs/latest compile-core tar

Successful completion of the above command creates a Hadoop tar which can be used as your Hadoop distribution.

Yahoo! giving away free tickets to 2010 Hadoop Summit.

Get ready for the 3rd Hadoop Summit, which will be held on 29th June 2010 at the Hyatt Regency in Santa Clara.

Yahoo is giving away free tickets to Hadoop Summit 2010 to the winners of the Hadoop Summit Retweet Contest. To win these tickets you just have to follow @YDN on Twitter and keep an eye on @YDN's "fun fact" of the day tweet about Hadoop every Monday. Then, on the same day, retweet the "fun fact" about cloud computing along with the hashtag #Y!Hadoop. All RTs must be received by 11:59 pm EST on the same Monday. The very next day, Yahoo will randomly select one lucky winner to receive 2 complimentary tickets to the Hadoop Summit.

Click here for the official posting on YDN.

Hadoop Cookbook – 2, How to build Hadoop with my custom patch?

Problem: How do I build my own version of Hadoop with my custom patch?

Solution: Apply the patch and build Hadoop.

You will need: Hadoop source code, your custom patch, Java 6, Apache Ant, Java 5 (for generating documentation), Apache Forrest (for generating documentation).

Steps :

Check out the Hadoop source code (svn checkout takes no -m flag; that option belongs to commit/tag commands):

> svn checkout https://svn.apache.org/repos/asf/hadoop/common/tags/release-X.Y.Z-rcR hadoop-X.Y.Z-rcR

Apply your patch to check its functionality using the following command:

> patch -p0 -E < ~/Path/To/Patch.patch

Test and compile the source code with the patch applied:

> ant -Djava5.home=/System/Library/Frameworks/JavaVM.framework/Versions/1.5/Home/ -Dforrest.home=/Path/to/forrest/apache-forrest-0.8 -Dfindbugs.home=/Path/to/findbugs/latest compile-core tar

To build the documentation:

> ant -Dforrest.home=$FORREST_HOME -Djava5.home=$JAVA5 docs