Security in Hadoop, Part 1.

How secure is your Hadoop cluster?

Currently, most production Hadoop clusters run one of the following versions: 0.17, 0.18, 0.19, or 0.20. Amazon Elastic MapReduce runs on Hadoop 0.18.3. Facebook is using Hadoop 0.19 with the append feature turned on. Yahoo! is running the world's largest Hadoop cluster on version 0.20.

None of these versions is fully secure, and it is very easy to bypass file permissions in HDFS and in MapReduce jobs running on the cluster. Most big companies run their Hadoop clusters behind a firewall, so the clusters are not exposed to the outside world. But what if your Hadoop cluster is deployed on a third-party cloud service?

If you are using third-party cloud computing to run your Hadoop cluster, then remember: “Your HDFS data is not secure enough!”

How do Hadoop file permissions and quotas work?

The Hadoop Distributed File System (HDFS) supports weak permission settings (chmod, chown, chgrp) and quota settings (fileQuota and diskSpaceQuota), explained in detail later.

They are weak because of the way Hadoop identifies users and groups.

HDFS file permission & ownership.

Like a POSIX file system, the Hadoop file system gives administrators and users the ability to apply file permissions and restrict read and write access. You can use chmod to change file permissions, chown to change file ownership, and chgrp to change the group.

hadoop fs -chmod 744 fileName
hadoop fs -chmod -R 744 dirName
hadoop fs -chown ravi:hdfs fileName
hadoop fs -chown -R ravi:hdfs dirName
hadoop fs -chgrp group fileName
hadoop fs -chgrp -R group dirName
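To make the 744 mode used above concrete, here is a minimal Python sketch that expands an octal mode into the familiar rwx triplets (owner, group, others). The function name `mode_to_symbolic` is made up for illustration and is not part of Hadoop.

```python
def mode_to_symbolic(mode: int) -> str:
    """Render the nine permission bits of an octal mode as rwx triplets."""
    flags = "rwxrwxrwx"
    # Bit 8 is owner-read and bit 0 is others-execute, so walk from high to low.
    return "".join(
        flag if mode & (1 << (8 - position)) else "-"
        for position, flag in enumerate(flags)
    )


if __name__ == "__main__":
    # 744: the owner may read, write and execute; group and others may only read.
    print(mode_to_symbolic(0o744))
```

So `chmod 744` leaves the file readable by everyone but writable only by its owner.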

Configuration parameters for permissions.

dfs.permissions = true

If true, use the permissions system as described here. If false, permission checking is turned off, but all other behavior is unchanged.

dfs.web.ugi = webuser,webgroup

The user name to be used by the web server.

dfs.permissions.supergroup = supergroup

The name of the group of super-users.

HDFS quotas.

Quotas are managed by a set of commands available only to the administrator.

dfsadmin -setQuota <N> <directory>...<directory>

Set the name quota to be N for each directory.

dfsadmin -clrQuota <directory>...<directory>

Remove any name quota for each directory.

dfsadmin -setSpaceQuota <N> <directory>...<directory>

Set the space quota to be N bytes for each directory. This is a hard limit on total size of all the files under the directory tree. The space quota takes replication also into account, i.e. one GB of data with replication of 3 consumes 3GB of quota.

dfsadmin -clrSpaceQuota <directory>...<directory>

Remove any space quota for each directory.
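The replication rule for space quotas is easy to get wrong, so here is a small Python sketch of the arithmetic. It is a simplification: it assumes quota usage is just file size times replication factor and ignores any block-level accounting HDFS does internally. The helper names are hypothetical.

```python
GB = 1024 ** 3


def space_consumed(file_size_bytes: int, replication: int) -> int:
    """Quota charged for a file: every replica counts against the space quota."""
    return file_size_bytes * replication


def fits_in_space_quota(used_bytes: int, file_size_bytes: int,
                        replication: int, quota_bytes: int) -> bool:
    """True if writing the file would keep the directory within its space quota."""
    return used_bytes + space_consumed(file_size_bytes, replication) <= quota_bytes


if __name__ == "__main__":
    # 1 GB of data with replication 3 is charged as 3 GB of quota.
    print(space_consumed(1 * GB, 3) // GB, "GB charged")
    # With a 2 GB space quota, that same file does not fit.
    print(fits_in_space_quota(0, 1 * GB, 3, 2 * GB))
```

In practice this means a space quota should be sized as expected data volume multiplied by the replication factor.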

How does HDFS identify users?

HDFS uses the Unix `whoami` utility to identify users and `bash -c groups` for groups. This is the weakest link: because of it, Hadoop file permissions and quota settings exist in name only.

You can write your own whoami or groups script and put it in your PATH to impersonate someone else, including the super-user.

HDFS Super user

The super-user is the user with the same identity as the name node process itself. If you started the name node, then you are the super-user. The super-user can do anything, because permission checks never fail for the super-user.
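The following toy model in Python shows why client-asserted identity defeats permission checks. It is not Hadoop code: the `Inode` class, the `is_allowed` function and the super-user name "hdfs" are all assumptions made for illustration. The point is only that the check trusts whatever user name the client claims.

```python
from dataclasses import dataclass

READ, WRITE, EXECUTE = 4, 2, 1


@dataclass(frozen=True)
class Inode:
    owner: str
    group: str
    mode: int  # octal permission bits, e.g. 0o700


def is_allowed(inode, claimed_user, claimed_groups, want, superuser="hdfs"):
    """POSIX-style check that trusts the identity the client reports."""
    # Permission checks never fail for the super-user.
    if claimed_user == superuser:
        return True
    if claimed_user == inode.owner:
        bits = (inode.mode >> 6) & 7
    elif inode.group in claimed_groups:
        bits = (inode.mode >> 3) & 7
    else:
        bits = inode.mode & 7
    return (bits & want) == want


if __name__ == "__main__":
    private_file = Inode(owner="ravi", group="hdfs", mode=0o700)
    # An honest outsider is refused.
    print(is_allowed(private_file, "mallory", ["users"], READ))
    # The same outsider, with a fake whoami that prints "ravi", gets in.
    print(is_allowed(private_file, "ravi", ["users"], READ))
    # Claiming the (assumed) super-user name bypasses every check.
    print(is_allowed(private_file, "hdfs", [], WRITE))
```

Nothing in the check verifies that the caller really is who they say they are, which is exactly the gap described above.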

Other Security flaws in Hadoop.

In December 2009, Owen O’Malley, Chair of the Hadoop PMC and Software Architect at Yahoo!, published the Hadoop security design document. The design team found the following security risks:

1. Hadoop services do not authenticate users or other services. As a result, Hadoop is subject to the following security risks.

(a) A user can access an HDFS or MapReduce cluster as any other user. This makes it impossible to enforce access control in an uncooperative environment. For example, file permission checking on HDFS can be easily circumvented.

(b) An attacker can masquerade as Hadoop services. For example, user code running on a MapReduce cluster can register itself as a new TaskTracker.

2. DataNodes do not enforce any access control on accesses to their data blocks. This makes it possible for an unauthorized client to read a data block as long as she can supply its block ID. It’s also possible for anyone to write arbitrary data blocks to DataNodes.

Secure Hadoop is coming.

Yahoo! runs the world’s largest Hadoop production application.[1] Yahoo! is also a major contributor to the Hadoop project (contributing more than 90%).[2] Owen O’Malley from the Yahoo! Hadoop Team will provide an overview of the upcoming Hadoop Security release. Owen will describe the features and capabilities included as well as operational benefits. Yahoo! is very excited about adding security capabilities to Hadoop and views this as a major milestone in continuing to make Hadoop an enterprise-grade platform.[3]

The next blog post will cover the Hadoop security release and its architecture.


Netflix + (Hadoop & Hive ) = Your Favorite Movie

I love watching movies and I am a huge fan of Netflix. I love the way Netflix suggests new movies based on my viewing history and ratings.

Established in 1997 and headquartered in Los Gatos, California, Netflix (NASDAQ: NFLX) offers online flat-rate DVD and Blu-ray disc rental by mail and video streaming in the United States. It has a collection of 100,000 titles and more than 10 million subscribers. The company has more than 55 million discs and, on average, ships 1.9 million DVDs to customers each day. On April 2, 2009, the company announced that it had mailed its two billionth DVD. Netflix also offers Internet video streaming, enabling the viewing of films directly on a PC or TV at home.

According to Comscore, Netflix streamed 127,016,000 videos in December 2008, which makes it a top-20 online video site.

Netflix’s movie recommendation algorithm uses Hive (which runs on Hadoop, HDFS and MapReduce underneath) for query processing and BI. Hive is used for scalable log analysis to gain business intelligence. Netflix collects its website streaming logs using Chukwa (soon to be replaced by Honu, an alternative to Chukwa that will be open sourced very soon).

Currently Netflix is processing its streaming logs on Amazon S3.

How Much Data?

  1. Parsing 0.6 TB logs per day.
  2. Running 50 persistent nodes on an Amazon EC2 cluster.

How Often?

Hadoop jobs are run every hour to parse the last hour's logs and reconstruct sessions. The small files from each reducer are then merged and loaded into Hive.

How is data processed?

Netflix web usage logs generated by web applications are collected by the Chukwa Collector. These logs are dumped on Amazon S3 running HDFS instances. Later these logs are consumed by Hive and Hadoop running on the cloud. For Netflix, processing data on Amazon S3 is not cheaper, but it is less risky because they can recover data.
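To give a feel for what "reconstructing sessions" from an hour of logs can mean, here is a rough single-machine sketch in Python. The post does not describe Netflix's actual log format or session rule, so the `(user_id, timestamp)` event shape and the 30-minute inactivity gap below are purely assumptions for illustration.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity threshold, not a Netflix value


def reconstruct_sessions(events, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp_seconds) events into per-user sessions.

    A new session starts whenever a user is silent for longer than `gap`.
    """
    timestamps_by_user = defaultdict(list)
    for user_id, timestamp in events:
        timestamps_by_user[user_id].append(timestamp)

    sessions = {}
    for user_id, timestamps in timestamps_by_user.items():
        timestamps.sort()
        user_sessions = [[timestamps[0]]]
        for timestamp in timestamps[1:]:
            if timestamp - user_sessions[-1][-1] > gap:
                user_sessions.append([timestamp])  # long silence: new session
            else:
                user_sessions[-1].append(timestamp)
        sessions[user_id] = user_sessions
    return sessions


if __name__ == "__main__":
    log = [("a", 0), ("b", 100), ("a", 600), ("a", 4000)]
    print(reconstruct_sessions(log))
```

In a real Hadoop job the grouping by user would happen in the shuffle and the gap logic in the reducer. The sketch only shows the idea.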

Useful data generated using log analysis.

  1. Streaming summary data
  2. CDN performance.
  3. Number of streams per day.
  4. Number of errors / session.
  5. Test cell analysis.
  6. Ad hoc query for further analysis.
  7. And BI processing using MicroStrategy.

Mozilla Foundation (makers of Firefox, Thunderbird) using Hadoop & HBase.

This month’s Bay Area HBase user group meetup was held at the Mozilla Foundation in Mountain View. (The Mozilla Foundation is known for its open source products Firefox, Thunderbird, Bugzilla, etc.)

Andrew Purtell presented an interesting talk on installing HBase on an Amazon EC2 cluster. Mozilla’s metrics engineer Daniel talked about how HBase and Hadoop are used at Mozilla Corporation.

Daniel Einspanjer, Metrics Software Engineer at Mozilla Corporation, mentioned that Mozilla is using Hadoop and HBase to generate usage and crash report metrics. According to Daniel, there are more than 350 million Firefox users, who submit around 1 million crash reports every day. These crash reports are valuable information for developers and can be used for debugging. The metrics also include information about plugin usage, which is valuable for plugin developers.

The Mozilla Foundation wants to process this crash report data so that it can share important usage and debug information with developers and users. Currently Mozilla is processing around 10-15% of the crash reports received daily.

By processing these crash reports Mozilla will be able to gain information such as

  1. Total number of Firefox users.
  2. Add-on update pings.
  3. Average number of add-ons used per user.

Mozilla aims to process more than 15% of crash reports. To achieve this goal, they are building a system to store crash data for a long time and process it using Hadoop and HBase. Currently Mozilla is running a production cluster with Hadoop 0.20 on 20 nodes to process crash report data.

Knowledge is precious. Open source solutions for processing big data and getting Knowledge.

How much data is generated on internet every year/month/day?

According to Nielsen Online, there are currently more than 1,733,993,741 Internet users. How much data are these users generating?

Here are a few numbers to help understand how much data is generated every year.

Email

* 90 trillion – The number of emails sent on the Internet in 2009.
* 247 billion – Average number of email messages per day.
* 1.4 billion – The number of email users worldwide.
* 100 million – New email users since the year before.
* 81% – The percentage of emails that were spam.
* 92% – Peak spam levels late in the year.
* 24% – Increase in spam since last year.
* 200 billion – The number of spam emails per day (assuming 81% are spam).
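As a quick sanity check, the email figures above are consistent with each other. The short Python sketch below only redoes the arithmetic on the numbers quoted in the list; the function names are made up for illustration.

```python
EMAILS_PER_YEAR = 90e12   # 90 trillion emails sent in 2009
EMAILS_PER_DAY = 247e9    # quoted daily average
SPAM_SHARE = 0.81         # quoted share of spam


def implied_daily_average(emails_per_year: float, days: int = 365) -> float:
    """Daily average implied by the yearly total."""
    return emails_per_year / days


def spam_per_day(emails_per_day: float, spam_share: float) -> float:
    """Spam messages per day at the given spam share."""
    return emails_per_day * spam_share


if __name__ == "__main__":
    # Roughly 247 billion per day, matching the quoted average.
    print(f"{implied_daily_average(EMAILS_PER_YEAR) / 1e9:.1f} billion emails/day")
    # Roughly 200 billion per day, matching the quoted spam volume.
    print(f"{spam_per_day(EMAILS_PER_DAY, SPAM_SHARE) / 1e9:.1f} billion spam/day")
```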

Websites

* 234 million – The number of websites as of December 2009.
* 47 million – Added websites in 2009.

Web servers

* 13.9% – The growth of Apache websites in 2009.
* -22.1% – The growth of IIS websites in 2009.
* 35.0% – The growth of Google GFE websites in 2009.
* 384.4% – The growth of Nginx websites in 2009.
* -72.4% – The growth of Lighttpd websites in 2009.

Domain names

* 81.8 million – .COM domain names at the end of 2009.
* 12.3 million – .NET domain names at the end of 2009.
* 7.8 million – .ORG domain names at the end of 2009.
* 76.3 million – The number of country code top-level domains (e.g. .CN, .UK, .DE, etc.).
* 187 million – The number of domain names across all top-level domains (October 2009).
* 8% – The increase in domain names since the year before.

Internet users

* 1.73 billion – Internet users worldwide (September 2009).
* 18% – Increase in Internet users since the previous year.
* 738,257,230 – Internet users in Asia.
* 418,029,796 – Internet users in Europe.
* 252,908,000 – Internet users in North America.
* 179,031,479 – Internet users in Latin America / Caribbean.
* 67,371,700 – Internet users in Africa.
* 57,425,046 – Internet users in the Middle East.
* 20,970,490 – Internet users in Oceania / Australia.

Social media

* 126 million – The number of blogs on the Internet (as tracked by BlogPulse).
* 84% – Percent of social network sites with more women than men.
* 27.3 million – Number of tweets on Twitter per day (November, 2009)
* 57% – Percentage of Twitter’s user base located in the United States.
* 4.25 million – People following @aplusk (Ashton Kutcher, Twitter’s most followed user).
* 350 million – People on Facebook.
* 50% – Percentage of Facebook users that log in every day.
* 500,000 – The number of active Facebook applications.

Images

* 4 billion – Photos hosted by Flickr (October 2009).
* 2.5 billion – Photos uploaded each month to Facebook.
* 30 billion – At the current rate, the number of photos uploaded to Facebook per year.

Videos

* 1 billion – The total number of videos YouTube serves in one day.
* 12.2 billion – Videos viewed per month on YouTube in the US (November 2009).
* 924 million – Videos viewed per month on Hulu in the US (November 2009).
* 182 – The number of online videos the average Internet user watches in a month (USA).
* 82% – Percentage of Internet users that view videos online (USA).
* 39.4% – YouTube online video market share (USA).
* 81.9% – Percentage of embedded videos on blogs that are YouTube videos.

Web browsers

* 62.7% – Internet Explorer
* 24.6% – Firefox
* 4.6% – Chrome
* 4.5% – Safari
* 2.4% – Opera
* 1.2% – Other

Malicious software

* 148,000 – New zombie computers created per day (used in botnets for sending spam, etc.)
* 2.6 million – Amount of malicious code threats at the start of 2009 (viruses, trojans, etc.)
* 921,143 – The number of new malicious code signatures added by Symantec in Q4 2009.

Data is abundant, Information is useful, Knowledge is precious.

Data. – Data is raw and it’s abundant. It simply exists and has no significance beyond its existence. It can exist in any form, usable or not. It does not have meaning of itself. Collecting users’ activity logs produces data.

Information. – Information is data that has been given meaning by way of relational connection.

Knowledge. – Knowledge is the appropriate collection of information, such that its intent is to be useful.

Internet users are generating petabytes of data every day. Millions of users access billions of web pages, creating hundreds of server log entries with every keystroke and mouse click. Having only user log data is not useful. To give better service to users and generate money for the business, it is necessary to process the raw data and extract information that can be used to provide knowledge to users and advertisers.

Open source solutions for processing big data.

The following are some of the open source solutions for processing big data.

Hadoop: The Hadoop project develops open-source software for reliable, scalable, distributed computing. The Hadoop ecosystem includes these sub-projects:

HDFS – Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.

MapReduce – MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers.

Pig – Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Hive – Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, ad hoc querying and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data, and it also provides a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to query this data. At the same time, this language allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

HBase – HBase is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables (billions of rows X millions of columns) atop clusters of commodity hardware.

Voldemort – Voldemort is a distributed key-value storage system.

Cassandra – The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo’s fully distributed design and Bigtable’s ColumnFamily-based data model.
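For readers new to the MapReduce model mentioned in the list above, here is a tiny single-process sketch of the idea in Python, using the classic word count. It is not Hadoop's API. The function names are illustrative, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

Tools like Pig and Hive generate this kind of map and reduce work for you from a higher-level script or query.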

Website and web server stats from Netcraft. Domain name stats from Verisign and Internet user stats from Internet World Stats. Web browser stats from Net Applications. Email stats from Radicati Group. Spam stats from McAfee. Malware stats from Symantec and McAfee. Online video stats from Comscore, Sysomos and YouTube. Photo stats from Flickr and Facebook. Social media stats from BlogPulse, Pingdom, Twittercounter, Facebook and GigaOm.