Hadoop cookbook – 1. How to transfer data between different HDFS clusters.

Problem: You have multiple Hadoop clusters running and you want to transfer several terabytes of data from one cluster to another.

Solution: DistCp (distributed copy).

It's common for Hadoop clusters to be loaded with terabytes of data (not all clusters are petabytes in size 🙂 ). Transferring that much data serially from one cluster to another would take forever. Distributed, parallel copying of the data is the answer, and that is exactly what DistCp does: it runs a MapReduce job to copy your data from one cluster to the other.

To transfer data using DistCp you need to specify the HDFS paths of the source and destination, as shown below.

bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
                    hdfs://nn2:8020/bar/foo


You can also specify multiple source directories on the command line:

bash$ hadoop distcp hdfs://nn1:8020/foo/a \
                    hdfs://nn1:8020/foo/b \
                    hdfs://nn2:8020/bar/foo

Or, equivalently, from a file using the -f option:
bash$ hadoop distcp -f hdfs://nn1:8020/srclist \
                       hdfs://nn2:8020/bar/foo

Where srclist contains:

hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b

See the Hadoop DistCp documentation to learn more.
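DistCp also has options that control how existing destination files are handled; `-update` copies only files that differ at the destination, while `-overwrite` replaces them unconditionally. The sketch below (nn1/nn2 are placeholder host names; it only assembles and prints the command, since running it needs a real cluster) shows how the `-f` source-list form fits together:

```shell
# Build a DistCp invocation from a generated source list (sketch only).
srclist=$(mktemp)
printf '%s\n' hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b > "$srclist"
dest=hdfs://nn2:8020/bar/foo
# -update: copy only files missing or different at the destination.
cmd="hadoop distcp -update -f $srclist $dest"
echo "$cmd"   # inspect before running on the cluster
```

On a real cluster you would copy the srclist file into HDFS first and pass its HDFS path to `-f`, as in the example above.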


Why Hadoop?

Why is Hadoop the latest buzzword, not only in the Valley but everywhere else? In Europe, CERN scientists are evaluating HDFS for storing and processing the huge volume of data (approximately 40 TB/day) generated by the LHC. In Asia, China Mobile, the world's biggest telecom company, is using Hadoop to process massive data sets. And in Silicon Valley, every other web company is already using Hadoop (Yahoo!, Facebook, LinkedIn, Netflix, IBM, Twitter, Zynga, Amazon… the list goes on); see the complete list on the Powered By page of the Hadoop wiki.

Last month Microsoft announced that it will support Hadoop on Azure (Microsoft's cloud computing platform, competing with Amazon EC2 and S3).

There can't be any argument that Hadoop gained this momentum because it's an open source project, and much of the credit goes to Yahoo!. Although Hadoop was the brainchild of Doug Cutting (formerly a Yahoo! employee), Yahoo! spent tremendous resources making Hadoop what it is today and contributes more than 80% of Hadoop's code.

Who said there is no free lunch?

Being an open source project, Hadoop offers the best price for everybody: FREE. The very active community of Hadoop developers and users is also a useful resource for newbies. The beauty of open source software is that not only are the license and distribution free, but support from the user and developer community is free as well.

The Hadoop wiki is documented with detailed information on cluster setup and tutorials for MapReduce programming. You don't have to pay anybody to set up your cluster or teach you how to write MapReduce programs.

Bend it like you want.

Open source = the source code is available to everyone. As Hadoop is an Apache project, everybody can contribute to the source code and express an opinion on how things should be done. Because the source is open, you can customize it to your needs, add your own functionality, and, if you want, contribute it back (unfortunately there are some people out there who don't contribute back to the community).

It's the complete package

Hadoop comes as a complete package, and more keeps being added to it: new applications are being built on top of Hadoop to take advantage of its scalability and durability. A NoSQL stack is already rising to solve problems that traditional SQL cannot. The Hadoop ecosystem includes projects like Pig, Hive, and HBase, which can be used for data mining, web mining, ETL, and BI.

My top 10 favorite iPhone apps.

I got my iPhone 3GS last year in the first week of its launch, and I have been in love with it ever since. For me it's more than a mobile phone: I use it as a phone, GPS, iPod, camera, PDA, computer, gaming device, and in many more ways.

Here are my top 10 favorite apps out of the hundreds of apps I have used so far.

1. Safari

I think Safari is the best internet browser; it works great on the Mac and equally great on the iPhone. With Safari on the iPhone, the web is at your fingertips every moment (except in places where the AT&T network sucks).

2. Pandora

Pandora streams my favorite music whenever I want. And I don't have to pay for it!

3. Yahoo! Finance

Yahoo! Finance is the best finance app in the App Store. It gives detailed information on stock quotes, history, news, and the latest updates.

4. Evernote

Evernote is a very handy app for taking notes (text, images, video, web pages), storing them in the cloud, accessing them from my Mac, and updating them anywhere.

5. Mint

Mint keeps me updated with my expenses and helps me in tracking my budget.

6. Facebook

Facebook's iPhone app keeps me connected with my friends when I am not near my laptop. I can post mobile updates and photos, and if I like something while traveling, I can post it immediately.

7. Flixster & IMDB

These two are the best apps for movie lovers; both are handy for finding the latest movies and showtimes at the theaters nearest your location. They also have reviews and trailers of upcoming movies.

8. Google Maps

Google Maps on the iPhone is as effective as a GPS, although it's not as reliable due to poor network connections in remote areas.

9. Shazam

I love the way Shazam is implemented; it's simply awesome. It can detect almost any song and gives you the option to play it back on YouTube, read the lyrics, or buy it from iTunes.

10. AroundMe

Whenever I am out for a drive or roaming around town, I can use AroundMe to find the nearest restaurants, interesting places, shops, malls, and more.

What are your favorite iPhone apps?

Security in Hadoop, Part 1.

How secure is your Hadoop cluster?

Currently, most production Hadoop clusters run one of the following versions: 0.17, 0.18, 0.19, or 0.20. Amazon Elastic MapReduce runs on Hadoop 0.18.3, Facebook uses Hadoop 0.19 with the append feature turned on, and Yahoo! runs the world's largest Hadoop cluster on version 0.20.

None of these versions is fully secure, and it is very easy to breach file permissions in HDFS and in MapReduce jobs running on a cluster. Most big companies run their Hadoop clusters behind a firewall, not exposed to the outside world. But what if your Hadoop cluster is deployed on a third-party cloud service?

If you are using a third-party cloud computing service to run your Hadoop cluster, then remember: "Your HDFS data is not secure enough!"

How do Hadoop file permissions & quotas work?

The Hadoop Distributed File System (HDFS) supports weak permission settings (chmod, chown, chgrp) and quota settings (file quota and disk space quota), explained in detail later.

The reason the permissions are weak lies in the way Hadoop identifies users and groups.

HDFS file permission & ownership.

Similar to a POSIX file system, HDFS gives administrators and users the ability to apply file permissions and restrict read/write access. You can use chmod to change file permissions, chown to change file ownership, and chgrp to change group ownership.

hadoop fs -chmod 744 fileName
hadoop fs -chmod -R 744 dirName
hadoop fs -chown ravi:hdfs fileName
hadoop fs -chown -R ravi:hdfs dirName
hadoop fs -chgrp group fileName
hadoop fs -chgrp -R group dirName
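The octal modes work exactly as in POSIX chmod: each digit encodes read (4), write (2), and execute (1) bits for owner, group, and other. A small local sketch (plain shell bit arithmetic, nothing Hadoop-specific; `decode` is a helper made up for illustration) shows what 744 means:

```shell
# Decode one octal permission digit into rwx form.
decode() {
  d=$1; s=""
  if [ $(( d & 4 )) -ne 0 ]; then s="${s}r"; else s="${s}-"; fi
  if [ $(( d & 2 )) -ne 0 ]; then s="${s}w"; else s="${s}-"; fi
  if [ $(( d & 1 )) -ne 0 ]; then s="${s}x"; else s="${s}-"; fi
  echo "$s"
}
# Mode 744: owner rwx, group r--, other r--
echo "$(decode 7)$(decode 4)$(decode 4)"   # rwxr--r--
```

So `hadoop fs -chmod 744 fileName` lets the owner read, write, and traverse, while everyone else can only read.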

Configuration parameters for permissions.

dfs.permissions = true

If true, use the permissions system as described here. If false, permission checking is turned off, but all other behavior is unchanged.

dfs.web.ugi = webuser,webgroup

The user name to be used by the web server.

dfs.permissions.supergroup = supergroup

The name of the group of super-users.
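For reference, these parameters would typically be set in the cluster's hdfs-site.xml (a sketch; the values simply mirror the defaults listed above):

```xml
<!-- Sketch of the corresponding hdfs-site.xml entries -->
<property>
  <name>dfs.permissions</name>
  <value>true</value>
</property>
<property>
  <name>dfs.web.ugi</name>
  <value>webuser,webgroup</value>
</property>
<property>
  <name>dfs.permissions.supergroup</name>
  <value>supergroup</value>
</property>
```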

HDFS quotas.

Quotas are managed by a set of commands available only to the administrator.

dfsadmin -setQuota <N> <directory>...<directory>

Set the name quota to be N for each directory.

dfsadmin -clrQuota <directory>...<directory>

Remove any name quota for each directory.

dfsadmin -setSpaceQuota <N> <directory>...<directory>

Set the space quota to be N bytes for each directory. This is a hard limit on the total size of all files under the directory tree. The space quota takes replication into account: one GB of data with replication 3 consumes 3 GB of quota.
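The replication multiplier is easy to forget when sizing space quotas; the accounting is simply logical file size × replication factor, as this small sketch (local shell arithmetic, made-up numbers) illustrates:

```shell
# Space-quota accounting sketch: raw bytes stored = logical size * replication.
logical_gb=1
replication=3
raw_gb=$(( logical_gb * replication ))
quota_gb=2
if [ "$raw_gb" -gt "$quota_gb" ]; then
  echo "would exceed quota: need ${raw_gb}GB, quota is ${quota_gb}GB"
fi
```

So a 1 GB file on a cluster with default replication 3 will not fit under a 2 GB space quota.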

dfsadmin -clrSpaceQuota <directory>...<directory>

Remove any space quota for each directory.

How does HDFS identify users?

HDFS uses the Unix `whoami` utility to identify users, and `bash -c groups` for groups. This is the weakest link, and it is why Hadoop's file permissions and quota settings exist in name only.

You can write your own whoami or groups script and add it to your PATH to impersonate someone else, including the super-user.
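To see how fragile this is, here is a local sketch (the identity "superman" is made up) showing that a shadowed binary earlier on PATH changes the reported identity, which is exactly the trick that fools these Hadoop versions:

```shell
# Put a fake whoami earlier on PATH than the real one (illustration only).
fakedir=$(mktemp -d)
printf '#!/bin/sh\necho superman\n' > "$fakedir/whoami"
chmod +x "$fakedir/whoami"
PATH="$fakedir:$PATH" whoami   # now reports: superman
```

Any Hadoop client that trusts the output of `whoami` will now happily act as "superman", no password required.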

HDFS super-user

The super-user is the user with the same identity as the name node process itself. If you started the name node, then you are the super-user. The super-user can do anything, in the sense that permission checks never fail for the super-user.

Other Security flaws in Hadoop.

In December 2009, Owen O'Malley, chair of the Hadoop PMC and software architect at Yahoo!, published the Hadoop security design document. The design team found the following security risks:

1. Hadoop services do not authenticate users or other services. As a result, Hadoop is subject to the following security risks.

(a) A user can access an HDFS or MapReduce cluster as any other user. This makes it impossible to enforce access control in an uncooperative environment. For example, file permission checking on HDFS can be easily circumvented.

(b) An attacker can masquerade as Hadoop services. For example, user code running on a MapReduce cluster can register itself as a new TaskTracker.

2. DataNodes do not enforce any access control on accesses to their data blocks. This makes it possible for an unauthorized client to read a data block as long as she can supply its block ID. It is also possible for anyone to write arbitrary data blocks to DataNodes.

Secure Hadoop is coming.

Yahoo! runs the world's largest Hadoop production application.[1] Yahoo! is also a major contributor to the Hadoop project (contributing more than 90% of the code).[2] Owen O'Malley of the Yahoo! Hadoop team will provide an overview of the upcoming Hadoop security release, describing the features and capabilities included as well as the operational benefits. Yahoo! is very excited about adding security capabilities to Hadoop and views this as a major milestone in making Hadoop an enterprise-grade platform.[3]

The next blog post will cover the Hadoop security release and its architecture.

Netflix + (Hadoop & Hive ) = Your Favorite Movie

I love watching movies and I am a huge fan of Netflix; I love the way it suggests new movies based on my viewing history and ratings. Established in 1997 and headquartered in Los Gatos, California, Netflix (NASDAQ: NFLX) offers online flat-rate DVD and Blu-ray disc rental-by-mail and video streaming in the United States. It has a collection of 100,000 titles and more than 10 million subscribers. The company has more than 55 million discs and, on average, ships 1.9 million DVDs to customers each day. On April 2, 2009, the company announced that it had mailed its two billionth DVD. Netflix also offers Internet video streaming, enabling films to be viewed directly on a PC or TV at home.

According to comScore, in December 2008 Netflix streamed 127,016,000 videos, making it a top-20 online video site.

Netflix's movie recommendation algorithm uses Hive (built on Hadoop, HDFS & MapReduce) for query processing and BI. Hive is used for scalable log analysis to gain business intelligence. Netflix collects streaming logs from its website using Chukwa (soon to be replaced by Honu, an alternative to Chukwa that will be open sourced very soon).

Currently Netflix processes its streaming logs on Amazon S3.

How Much Data?

  1. Parsing 0.6 TB of logs per day.
  2. Running 50 persistent nodes on an Amazon EC2 cluster.

How Often?

Hadoop jobs run every hour to parse the previous hour's logs and reconstruct sessions. The small files produced by each reducer are then merged and loaded into Hive.

How is the data processed?

Netflix web usage logs generated by web applications are collected by the Chukwa collector. These logs are dumped onto HDFS instances running on Amazon S3 and later consumed by Hive & Hadoop running in the cloud. For Netflix, processing data on S3 is not cheaper, but it is less risky because the data can be recovered.
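The per-hour merge step can be pictured with plain shell (the part-* file names mimic reducer output; local files and made-up session lines stand in for HDFS data):

```shell
# Simulate merging small reducer output files into one file for Hive loading.
outdir=$(mktemp -d)
echo "sessionA,3 streams" > "$outdir/part-00000"
echo "sessionB,1 stream"  > "$outdir/part-00001"
cat "$outdir"/part-* > "$outdir/merged-hour.log"
wc -l < "$outdir/merged-hour.log"   # both session records, now in one file
```

Merging matters because Hadoop and Hive handle one larger file far more efficiently than thousands of tiny per-reducer files.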

Useful data generated by log analysis:

  1. Streaming summary data
  2. CDN performance.
  3. Number of streams per day.
  4. Number of errors / session.
  5. Test cell analysis.
  6. Ad hoc query for further analysis.
  7. And BI processing using MicroStrategy.

Mozilla Foundation (makers of Firefox and Thunderbird) using Hadoop & HBase.

This month's Bay Area HBase user group meetup was held at the Mozilla Foundation in Mountain View. (The Mozilla Foundation is known for its open source products Firefox, Thunderbird, Bugzilla, etc.)

Andrew Purtell presented an interesting talk on installing HBase on an Amazon EC2 cluster, and Mozilla metrics engineer Daniel Einspanjer talked about how HBase and Hadoop are used at Mozilla.

Daniel Einspanjer, a metrics software engineer at Mozilla Corporation, mentioned that Mozilla is using Hadoop and HBase to generate usage and crash-report metrics. According to Daniel, more than 350 million Firefox users report around 1 million crashes every day. These crash reports are valuable information for developers, who can use them for debugging. The metrics also include information about plugin usage, which is valuable for plugin developers.

Mozilla wants to process this crash-report data so that it can share important usage and debugging information with developers and users. Currently Mozilla processes only around 10-15% of the crash reports received daily.

By processing these crash reports, Mozilla will be able to gain information such as:

  1. Total number of Firefox users.
  2. Add-on update pings.
  3. Average add-ons used by per user.

Mozilla aims to process more than 15% of crash reports. To achieve this goal, they are building a system to store crash data for the long term and to process it using Hadoop and HBase. Currently Mozilla runs a production cluster with Hadoop 0.20 on 20 nodes to process crash-report data.

Source: http://blog.mozilla.com/data/tag/hadoop/

Knowledge is precious: open source solutions for processing big data and extracting knowledge.

How much data is generated on the internet every year/month/day?

According to Nielsen Online, there are currently more than 1,733,993,741 internet users. How much data are these users generating?

A few numbers to understand how much data is generated every year:

* 90 trillion – The number of emails sent on the Internet in 2009.
* 247 billion – Average number of email messages per day.
* 1.4 billion – The number of email users worldwide.
* 100 million – New email users since the year before.
* 81% – The percentage of emails that were spam.
* 92% – Peak spam levels late in the year.
* 24% – Increase in spam since last year.
* 200 billion – The number of spam emails per day (assuming 81% are spam).
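The 200 billion figure follows directly from the two numbers above; a quick sanity check of the arithmetic:

```shell
# Spam volume estimate: 81% of the 247 billion daily email messages.
total_per_day=247000000000
spam=$(( total_per_day * 81 / 100 ))
echo "$spam"   # 200070000000, i.e. roughly 200 billion spam emails per day
```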

* 234 million – The number of websites as of December 2009.
* 47 million – Added websites in 2009.

Web servers

* 13.9% – The growth of Apache websites in 2009.
* -22.1% – The growth of IIS websites in 2009.
* 35.0% – The growth of Google GFE websites in 2009.
* 384.4% – The growth of Nginx websites in 2009.
* -72.4% – The growth of Lighttpd websites in 2009.

Domain names

* 81.8 million – .COM domain names at the end of 2009.
* 12.3 million – .NET domain names at the end of 2009.
* 7.8 million – .ORG domain names at the end of 2009.
* 76.3 million – The number of country code top-level domains (e.g. .CN, .UK, .DE, etc.).
* 187 million – The number of domain names across all top-level domains (October 2009).
* 8% – The increase in domain names since the year before.

Internet users

* 1.73 billion – Internet users worldwide (September 2009).
* 18% – Increase in Internet users since the previous year.
* 738,257,230 – Internet users in Asia.
* 418,029,796 – Internet users in Europe.
* 252,908,000 – Internet users in North America.
* 179,031,479 – Internet users in Latin America / Caribbean.
* 67,371,700 – Internet users in Africa.
* 57,425,046 – Internet users in the Middle East.
* 20,970,490 – Internet users in Oceania / Australia.

Social media

* 126 million – The number of blogs on the Internet (as tracked by BlogPulse).
* 84% – Percent of social network sites with more women than men.
* 27.3 million – Number of tweets on Twitter per day (November, 2009)
* 57% – Percentage of Twitter’s user base located in the United States.
* 4.25 million – People following @aplusk (Ashton Kutcher, Twitter’s most followed user).
* 350 million – People on Facebook.
* 50% – Percentage of Facebook users that log in every day.
* 500,000 – The number of active Facebook applications.


* 4 billion – Photos hosted by Flickr (October 2009).
* 2.5 billion – Photos uploaded each month to Facebook.
* 30 billion – At the current rate, the number of photos uploaded to Facebook per year.
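The "current rate" projection is just the monthly Facebook figure annualized:

```shell
# Annualize Facebook's monthly photo-upload figure.
per_month=2500000000              # 2.5 billion photos per month
per_year=$(( per_month * 12 ))
echo "$per_year"                  # 30000000000, i.e. 30 billion per year
```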


* 1 billion – The total number of videos YouTube serves in one day.
* 12.2 billion – Videos viewed per month on YouTube in the US (November 2009).
* 924 million – Videos viewed per month on Hulu in the US (November 2009).
* 182 – The number of online videos the average Internet user watches in a month (USA).
* 82% – Percentage of Internet users that view videos online (USA).
* 39.4% – YouTube online video market share (USA).
* 81.9% – Percentage of embedded videos on blogs that are YouTube videos.

Web browsers

* 62.7% – Internet Explorer
* 24.6% – Firefox
* 4.6% – Chrome
* 4.5% – Safari
* 2.4% – Opera
* 1.2% – Other

Malicious software

* 148,000 – New zombie computers created per day (used in botnets for sending spam, etc.)
* 2.6 million – Amount of malicious code threats at the start of 2009 (viruses, trojans, etc.)
* 921,143 – The number of new malicious code signatures added by Symantec in Q4 2009.

Data is abundant, Information is useful, Knowledge is precious.

Data. – Data is raw and abundant. It simply exists and has no significance beyond its existence. It can exist in any form, usable or not, and has no meaning by itself. Collecting user activity logs produces data.

Information. – Information is data that has been given meaning by way of relational connection.

Knowledge. – Knowledge is the appropriate collection of information, such that its intent is to be useful.

Internet users generate petabytes of data every day. Millions of users access billions of web pages, creating hundreds of server-log entries with every keystroke and mouse click. Raw user log data by itself is not useful; to serve users better and generate revenue, a business must process that raw data into information that can provide knowledge to users and advertisers.

Open source solutions for processing big data.

The following are some open source solutions for processing big data.

Hadoop: The Hadoop project develops open-source software for reliable, scalable, distributed computing. The Hadoop ecosystem consists of the following sub-projects:

HDFS – Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.

MapReduce – MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers.

Pig – Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Hive – Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data, along with a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to query the data. At the same time, the language allows traditional map/reduce programmers to plug in custom mappers and reducers for more sophisticated analysis that may not be supported by the built-in capabilities of the language.

HBase – HBase is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables (billions of rows × millions of columns) atop clusters of commodity hardware.

Voldemort – Voldemort is a distributed key-value storage system.

Cassandra – The Apache Cassandra project develops a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model.

Website and web server stats from Netcraft. Domain name stats from VeriSign and Webhosting.info. Internet user stats from Internet World Stats. Web browser stats from Net Applications. Email stats from the Radicati Group. Spam stats from McAfee. Malware stats from Symantec and McAfee. Online video stats from comScore, Sysomos and YouTube. Photo stats from Flickr and Facebook. Social media stats from BlogPulse, Pingdom, Twittercounter, Facebook and GigaOm.