Security in Hadoop, Part 1

How secure is your Hadoop cluster?

Currently, most production Hadoop clusters run one of the following versions: 0.17, 0.18, 0.19, or 0.20. Amazon Elastic MapReduce runs on Hadoop 0.18.3. Facebook is using Hadoop 0.19 with the append feature turned on. Yahoo! is running the world's largest Hadoop cluster on version 0.20.

None of these versions is fully secure, and it is very easy to bypass file permissions in HDFS and to interfere with MapReduce jobs running on the cluster. Most big companies run their Hadoop clusters behind a firewall, so the clusters are not exposed to the outside world. But what if your Hadoop cluster is deployed on a third-party cloud service?

If you are using a third-party cloud service to run your Hadoop cluster, remember: “Your HDFS data is not secure enough!”

How do Hadoop file permissions and quotas work?

The Hadoop Distributed File System (HDFS) supports weak permission settings (chmod, chown, chgrp) and quota settings (fileQuota and diskSpaceQuota), both explained in detail later.

These settings are weak because of the way Hadoop identifies users and groups.

HDFS file permission & ownership.

Like a POSIX file system, HDFS lets administrators and users apply file permissions to restrict read and write access. Use chmod to change file permissions, chown to change file ownership, and chgrp to change the group.

hadoop fs -chmod 744 fileName
hadoop fs -chmod -R 744 dirName
hadoop fs -chown ravi:hdfs fileName
hadoop fs -chown -R ravi:hdfs dirName
hadoop fs -chgrp group fileName
hadoop fs -chgrp -R group dirName
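To make the octal mode in the chmod commands above easier to read, here is a small sketch that expands a mode such as 744 into the familiar rwx notation. It is plain shell and does not call Hadoop at all; the function name mode_to_string is made up for this illustration.

```shell
# Expand an octal permission mode (e.g. 744) into rwx notation.
# mode_to_string is a hypothetical helper, not part of Hadoop.
mode_to_string() {
  mode=$1
  out=""
  while [ -n "$mode" ]; do
    rest=${mode#?}            # everything after the first digit
    digit=${mode%"$rest"}     # the first digit (owner, then group, then others)
    if [ $(( $digit & 4 )) -ne 0 ]; then out="${out}r"; else out="${out}-"; fi
    if [ $(( $digit & 2 )) -ne 0 ]; then out="${out}w"; else out="${out}-"; fi
    if [ $(( $digit & 1 )) -ne 0 ]; then out="${out}x"; else out="${out}-"; fi
    mode=$rest
  done
  printf '%s\n' "$out"
}

mode_to_string 744   # prints rwxr--r--
```

So 744 gives the owner full access and everyone else read-only access.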

Configuration parameters for permissions.

dfs.permissions = true

If true, the permissions system described here is used. If false, permission checking is turned off, but all other behavior is unchanged.

dfs.web.ugi = webuser,webgroup

The user name to be used by the web server.

dfs.permissions.supergroup = supergroup

The name of the group of super-users.

HDFS quotas.

Quotas are managed by a set of commands available only to the administrator.

dfsadmin -setQuota <N> <directory>...<directory>

Set the name quota to be N for each directory.

dfsadmin -clrQuota <directory>...<directory>

Remove any name quota for each directory.

dfsadmin -setSpaceQuota <N> <directory>...<directory>

Set the space quota to be N bytes for each directory. This is a hard limit on total size of all the files under the directory tree. The space quota takes replication also into account, i.e. one GB of data with replication of 3 consumes 3GB of quota.
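The replication arithmetic can be sketched in a few lines of shell. This is only an illustration of the rule stated above (logical size times replication factor is charged against the quota) and a simplification of whatever accounting the NameNode actually performs; the function names space_charged and fits_in_quota are hypothetical.

```shell
# Space charged against a space quota: logical size times replication.
# Both helpers are hypothetical and do not talk to HDFS.
space_charged() {
  logical_bytes=$1
  replication=$2
  echo $(( $logical_bytes * $replication ))
}

# Succeeds (exit status 0) if the new file would stay within the quota.
fits_in_quota() {
  quota_bytes=$1
  used_bytes=$2
  charged=$(space_charged "$3" "$4")
  [ $(( $used_bytes + $charged )) -le "$quota_bytes" ]
}

GB=$(( 1024 * 1024 * 1024 ))
space_charged "$GB" 3   # prints 3221225472, i.e. 3 GB for 1 GB of data
```

With these helpers, a 1 GB file at replication 3 does not fit into an empty directory with a 2 GB space quota, but does fit under a 4 GB quota.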

dfsadmin -clrSpaceQuota <directory>...<directory>

Remove any space quota for each directory.

How does HDFS identify users?

HDFS uses the Unix `whoami` utility to identify users and `bash -c groups` to identify groups. This is the weakest link: because of it, Hadoop file permissions and quota settings exist in name only.

You can write your own whoami or groups script and put it ahead of the real one in your PATH to impersonate someone else, including the super-user.
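The following sketch shows why a PATH lookup is such a weak way to establish identity. It creates a throwaway directory containing a fake whoami and runs whoami with that directory first in the PATH. Nothing here touches Hadoop; the function name fake_whoami_demo and the user name hadoopadmin are made up for the example.

```shell
# Show that any program resolving `whoami` through PATH can be fed a forged name.
# fake_whoami_demo and the name "hadoopadmin" are hypothetical.
fake_whoami_demo() {
  forged_name=$1
  fake_bin=$(mktemp -d) || return 1
  # The fake utility simply prints whatever name we choose.
  printf '#!/bin/sh\necho %s\n' "$forged_name" > "$fake_bin/whoami"
  chmod +x "$fake_bin/whoami"
  # Run in a subshell so the modified PATH does not leak into the caller.
  ( PATH="$fake_bin:$PATH"; export PATH; whoami )
  rm -rf "$fake_bin"
}

fake_whoami_demo hadoopadmin   # prints hadoopadmin, not your real user name
```

Any client that trusts the output of whoami found this way will accept the forged name.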

HDFS Super user

The super-user is the user with the same identity as the name node process itself. If you started the name node, then you are the super-user. The super-user can do anything, because permission checks never fail for the super-user.

Other Security flaws in Hadoop.

In December 2009, Owen O’Malley, chair of the Hadoop PMC and software architect at Yahoo!, published the Hadoop security design document. The design team found the following security risks:

1. Hadoop services do not authenticate users or other services. As a result, Hadoop is subject to the following security risks.

(a) A user can access an HDFS or MapReduce cluster as any other user. This makes it impossible to enforce access control in an uncooperative environment. For example, file permission checking on HDFS can be easily circumvented.

(b) An attacker can masquerade as Hadoop services. For example, user code running on a MapReduce cluster can register itself as a new TaskTracker.

2. DataNodes do not enforce any access control on accesses to their data blocks. This makes it possible for an unauthorized client to read a data block as long as she can supply its block ID. It’s also possible for anyone to write arbitrary data blocks to DataNodes.

Secure Hadoop is coming.

Yahoo! runs the world’s largest Hadoop production application.[1] Yahoo! is also a major contributor to the Hadoop project (contributing more than 90%).[2] Owen O’Malley from the Yahoo! Hadoop Team will provide an overview of the upcoming Hadoop Security release. Owen will describe the features and capabilities included, as well as the operational benefits. Yahoo! is very excited about adding security capabilities to Hadoop and views this as a major milestone in continuing to make Hadoop an enterprise-grade platform.[3]

The next blog post will cover the Hadoop security release and its architecture.