Hadoop Cookbook – 4, How to run multiple hadoop data nodes on one machine.
Although Hadoop is designed and developed for distributed computing it can be run on a single node in pseudo distributed mode and with multiple data node on single machine . Developers often run multiple data nodes on single node to develop and test distributed features,data node behavior, Name node interaction with data node and for other reasons.
If you want to feel Hadoop’s distributed data node – name node working and you have only one machine then you can run multiple data nodes on single machine. You can see how Name node stores it’s metadata , fsimage,edits , fstime and how data node stores data blocks on local file system.
Steps
To start multiple data nodes on a single node first download / build hadoop binary.
- Download hadoop binary or build hadoop binary from hadoop source.
- Prepare hadoop configuration to run on single node (Change Hadoop default tmp dir location from /tmp to some other reliable location)
- Add following script to the $HADOOP_HOME/bin directory and chmod it to 744.
- Format HDFS – bin/hadoop namenode -format (for Hadoop 0.20 and below), bin/hdfs namenode -format (for version > 0.21)
- Start HDFS bin/start-dfs.sh (This will start Namenode and 1 data node ) which can be viewed on http://localhost:50070
- Start additional data nodes using bin/run-additionalDN.sh
run-additionalDN.sh
#!/bin/sh # This is used for starting multiple datanodes on the same machine. # run it from hadoop-dir/ just like 'bin/hadoop'#Usage: run-additionalDN.sh [start|stop] dnnumber #e.g. run-datanode.sh start 2DN_DIR_PREFIX="/path/to/store/data_and_log_of_additionalDN/" if [ -z $DN_DIR_PREFIX ]; then echo $0: DN_DIR_PREFIX is not set. set it to something like "/hadoopTmp/dn" exit 1 fi run_datanode () { DN=$2 export HADOOP_LOG_DIR=$DN_DIR_PREFIX$DN/logs export HADOOP_PID_DIR=$HADOOP_LOG_DIR DN_CONF_OPTS="\ -Dhadoop.tmp.dir=$DN_DIR_PREFIX$DN\ -Ddfs.datanode.address=0.0.0.0:5001$DN \ -Ddfs.datanode.http.address=0.0.0.0:5008$DN \ -Ddfs.datanode.ipc.address=0.0.0.0:5002$DN" bin/hadoop-daemon.sh --script bin/hdfs $1 datanode $DN_CONF_OPTS } cmd=$1 shift; for i in $* do run_datanode $cmd $i done
Use jps or Namenode Web UI to verify if additional data nodes are started.
I started total 3 data nodes ( 2 additional data nodes) on my single node machine which are running on ports 50010,50011 and 50012 as shown in screen shot below.
Hadoop Cookbook – 3, How to build your own Hadoop distribution.
Problem : You want to build your own Hadoop distribution.
Often you need particular feature added through patch in your Hadoop build and it’s still in trunk and not available in Hadoop releases . In such cases you can build and distribute your own Hadoop distribution.
Solution: You can build your own version of Hadoop distribution by following steps given below.
1. Checkout latest released branch (lets say we want to work on Hadoop 0.20 branch)
> svn checkout \ http://svn.apache.org/repos/asf/hadoop/common/tags/release-X.Y.Z/ hadoop-common-X.Y.Z
2. Download required patch
3. Apply required patch -> patch -p0 -E < /path/to/patch
4. Test patch
ant \
-Dpatch.file=/patch/to/my.patch \
-Dforrest.home=/path/to/forrest/ \
-Dfindbugs.home=/path/to/findbugs \
-Dscratch.dir=/path/to/a/temp/dir \ (optional)
-Dsvn.cmd=/path/to/subversion/bin/svn \ (optional)
-Dgrep.cmd=/path/to/grep \ (optional)
-Dpatch.cmd=/path/to/patch \ (optional)
test-patch
5. Build Hadoop binary with documentation
ant -Djava5.home=$Java5Home -Dforrest.home=/path_to/apache-forrest
-Dfindbugs.home=/path_to/findbugs/latest compile-core tar
Successful completion of above command will create hadoop tar which can be used as hadoop distribution.
Yahoo! giving away free tickets to 2010 Hadoop Summit.
Get ready for 3rd Hadoop Summit which will be held on 29th June 2010 at Hyatt Regency in Santa Clara .
Yahoo is giving free tickets for Hadoop summit 2010 to Hadoop Summit Retweet Contest winners. To win these tickets you just have to follow @YDN on twitter and keep eye on @YDN “fun fact” of the day tweet about Hadoop every Monday. Then on the same day, retweet “fun fact” about cloud computing along with the hash tag, #Y!Hadoop. All RTs must be received by 11:59 pm EST on the same Monday.Very next day Yahoo will randomly select one lucky winner to receive 2 complimentary tickets to the Hadoop Summit.
Click here for the official posting on YDN.
Hadoop Cookbook – 2 , How to build Hadoop with my custom patch?
Problem : How do I build my own version of Hadoop with my custom patch.
Solution : Apply patch and build hadoop.
You will need : Hadoop Source code, Custom Patch, Java 6 , Apache Ant, Java 5 (for generating Documents), Apache Forrest (for generating documents).
Steps :
Checkout hadoop source code,
> svn co https://svn.apache.org/repos/asf/hadoop/common/tags/release-X.Y.Z-rcR -m “Hadoop-X.Y.Z-rcR.release.”
Apply your patch for checking it’s functionality using following command
> patch -p0 -E < ~/Path/To/Patch.patch
Ant test and compile source code with latest patch.
> ant ant -Djava5.home=/System/Library/Frameworks/JavaVM.framework/Versions/1.5/Home/ -Dforrest.home=/Path/to/forrest/apache-forrest-0.8 -Dfindbugs.home=/Path/to/findbugs/latest compile-core compile-core tar
How to build documents.
> ant -Dforrest.home=$FORREST_HOME -Djava5.home=$JAVA5 docs
Hadoop cookbook – 1. How to transfer data between different HDFS clusters.
Problem : You have multiple Hadoop clusters running and you want to transfer several tera bytes of data from one cluster to another.
Solution : DistCp – Distributed copy.
It’s common that hadoop clusters are loaded with tera bytes of data (not all clusters are of Petabytes of size
), It will take forever to transfer terabytes of data from one cluster to another. Distributed or parallel copying of data can be a good solution for this and that is what Distcp does. Distcp runs map reduce job to transfer your data from one cluster to another.
To transfer data using DistCp you need to specify hdfs path name of source and destination as shown below.
bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
hdfs://nn2:8020/bar/foo
You can also specify multiple source directories on the command line:
bash$ hadoop distcp hdfs://nn1:8020/foo/a \
hdfs://nn1:8020/foo/b \
hdfs://nn2:8020/bar/foo
Or, equivalently, from a file using the -f option:
bash$ hadoop distcp -f hdfs://nn1:8020/srclist \
hdfs://nn2:8020/bar/foo
Where srclist contains
hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b
Click here to learn more about DistCp
Why Hadoop ?
Why Hadoop is latest buzz word? Not only in valley but everywhere else. In Europe CERN scientist are evaluating HDFS file system for storing huge data (approximately 40TB / day ) generated by LHC to store and process on HDFS. In Asia China Mobile worlds biggest telecom giant is using Hadoop for processing huge data. And in Silicon Valley every other web company is already using Hadoop (Yahoo! , Facebook, LinkedIn, Netflix, IBM , Twitter , Zynga, Amazon….. list goes on) see complete list on Powered by page on Hadoop wiki.
Last month Microsoft announced that they will be supporting Hadoop on Azure (Cloud computing platform by MS, competing with Amazon EC2 and S3).
There can’t be any argument that Hadoop gained this momentum because it’s an open source project. And credit goes to Yahoo! Although Hadoop was brain child of Dough Cutting (Formerly Yahoo! employee). Yahoo! spent tremendous resources in making Hadoop what it is today. Yahoo! contributes more than 80% code of Hadoop.
Who said there is no free lunch?
Being open source project Hadoop offers the best pricing for everybody which is FREE. Also very active community of Hadoop developers and users are useful resource for newbies. The beauty of open source software is that not only it’s license and distribution is free but it’s support is also free by user and developer community.
Hadoop twiki is documented with detailed information on Hadoop cluster setup and tutorials for Map reduce programming. You don’t have to pay anybody to setup your cluster or teach you how to write Map reduce programming.
Bend it like you want.
Open source = source code is available to everyone. As Hadoop is an Apache project everybody can contribute in source code and everybody can express their opinion on how things should be done. As it’s source code is available to everyone , you can customize it as per your needs you can add your own functionalities and if you want you can contribute it back (Unfortunately there are some people out there who don’t contribute back to the community).
It’s the complete package
Hadoop comes with complete package and there are many more things being added in this packaged , There are new applications being added on top of Hadoop which will be using Hadoop’s scalability and durability. Already there is NoSql stack rising for solving problems which traditional SQL can not solve. Hadoop supports other NoSql projects like Pig, Hive, Hbase which can be used for data mining, web mining , ETL , BI .
My top 10 favorite iPhone apps.
I got my iPhone 3GS last year in the first week of it’s launch, since then I have been in love with my iPhone. For me it’s more than a mobile phone . I have been using it as Phone, GPS , iPod , Camera , PDA , Computer , Gaming device and many more ways.
Here are my top 10 favorite apps out of hundreds of apps I used so far .
1. Safari
I think Safari is best internet browser, it works great on Mac and equally great on iPhone . With Safari on iPhone web is at your finger tips every moment (Except places where AT&T network sucks)
2. Pandora
Pandora streams my favorite music tracks whenever I want. And I don’t have to pay for it !!
3. Yahoo! Finance
Yahoo! Finance is the best finance app in app store. It gives detail information about all stock quotes , history , news and latest updates.
4. Evernote
Evernote is very handy app for taking notes(Text, Images,Video, Webpages) and storing it in cloud , accessing it from Mac and updating it anywhere.
5. Mint
Mint keeps me updated with my expenses and helps me in tracking my budget.
6. Facebook
Facebook’s iPhone app keeps me connected with my friends when I am not near my laptop. I can post mobile updates and photos. If I am traveling and I liked something I can post it immediately.
7. Flixster & IMDB
These are two best apps for movie lovers, both are handy for finding latest movies and show times in theaters nearest your location. It also has reviews and trailers of upcoming movies.
8. Google Maps
Google Maps on iPhone is equally effective as GPS , although it’s not equally reliable due to bad network connection in remote areas.
9 . Shazam
I love the way Sahazm is implemented it’s simply awesome . It can detect any song and gives you option to play back it on YouTube , read lyrics or buy it from iTunes.
10. AroundMe
Whenever I am out on drive or roaming in town I can use AroundMe to found nearest restaurants , interesting places , shops , malls and many more.
What are your favorite iPhone apps.
Security in Hadoop, Part – 1.
How secure is your Hadoop cluster?
Currently most of the production clusters running Hadoop are using one of the following versions 0.17, 0.18, 0.19, 0.20 . Amazon Elastic MapReduce runs on Hadoop 0.18.3. Facebook is using Hadoop 0.19 with Append feature turned on. Yahoo! is running worlds largest Hadoop cluster with version 0.20.
None of these versions are fully secure and it’s very easy to breach file permissions in HDFS and Map Reduce jobs running on cluster. Although most big companies run their Hadoop clusters behind firewall and they are not exposed to external world. But what if your Hadoop cluster is deployed on third party cloud services?
If you are using third party cloud computing to run your Hadoop cluster then remember “Your HDFS data is not secure enough!”
How Hadoop file permission & Quotas works?
Hadoop distributed file system (HDFS). Supports weak permission settings (chmod , chown ,chgrp)and quota settings (fileQuota and diskSpaceQuota) explained in details later.
The reason it’s weak because of the way Hadoop identifies users and groups.
HDFS file permission & ownership.
Similar to POSIX file system Hadoop file system also gives administrators and user ability to apply file permissions and restrict read write access. You can use chmod to change file permissions and chown to change file ownership.
hadoop fs –chmod 744 fileName
hadoop fs –chmod 744 –R dirName
hadoop fs -chown ravi :hdfs filename
hadoop fs –chown -R ravi:hdfs dirName
hadoop fs –chgrp group filename
hadoop fs –chgrp –R group dirName
Configuration parameters for permissions.
dfs.permissions = true
If yes use the permissions system as described here. If no, permission checking is turned off, but all other behavior is unchanged.
dfs.web.ugi = webuser,webgroup
The user name to be used by the web server.
dfs.permissions.supergroup = supergroup
The name of the group of super-users.
Quotas are managed by a set of commands available only to the administrator.
dfsadmin -setQuota ...
Set the name quota to be N for each directory.
dfsadmin -clrQuota ...
Remove any name quota for each directory.
dfsadmin -setSpaceQuota ...
Set the space quota to be N bytes for each directory. This is a hard limit on total size of all the files under the directory tree. The space quota takes replication also into account, i.e. one GB of data with replication of 3 consumes 3GB of quota.
dfsadmin -clrSpaceQuota ...
Remove any space quota for each directory.
How HDFS identifies users?
HDFS uses unix ` whoami` utility to identify users , and `bash –c groups` for groups. And this is the weakest link because of which Hadoop file permissions and quota settings are for namesake.
You can write your own whoami script or groups script and add it in your path to impersonate some one else including super user.
HDFS Super user
The super-user is the user with the same identity as name node process itself. If you started the name node, then you are the super-user. The super-user can do anything in that permissions checks never fail for the super-user.
Other Security flaws in Hadoop.
In December 2009 Owen O’Malley Chair of Hadoop PMC and Software Architect at Yahoo! published Hadoop security design document. Design team found following security risks
1. Hadoop services do not authenticate users or other services. As a result, Hadoop is subject to the following security risks.
(a) A user can access an HDFS or MapReduce cluster as any other user. This makes it impossible to enforce access control in an uncooperative environment. For example, file permission checking on HDFS can be easily circumvented.
(b) An attacker can masquerade as Hadoop services. For example, user code running on a MapReduce cluster can register itself as a new TaskTracker.
2. DataNodes do not enforce any access control on accesses to its data blocks. This makes it possible for an unauthorized client to read a data block as long as she can supply its block ID. It’s also possible for anyone to write arbitrary data blocks to DataNodes.
Secure Hadoop is coming.
Yahoo! runs world’s largest Hadoop production application.[1] And Yahoo! is a major contributor in Hadoop project. (contributing more than 90%)[2] . Owen O’Malley from the Yahoo! Hadoop Team will provide an overview of the upcoming Hadoop Security release. Owen will describe the features and capabilities included as well as operational benefits. Yahoo! is very excited about adding security capabilities to Hadoop and views this as major milestone in continuing to make Hadoop an enterprise-grade platform.[3]
Next blog post will cover Hadoop security release and it’s architecture.
Netflix + (Hadoop & Hive ) = Your Favorite Movie
I love watching movies and I am a huge fan of Netflix , I love the way Netflix suggests new movies based on my previous watching history and ratings. Netflix established in 1997 and headquartered in Los Gatos, California, Netflix (NASDAQ: NFLX) is a service offering online flat rate DVD and Blu-ray disc rental-by-mail and video streaming in the United States.It has a collection of 100,000 titles and more than10 million subscribers. The company has more than 55 million discs and, on average, ships 1.9 million DVDs to customers each day. On April 2, 2009, the company announced that it had mailed its two billionth DVD Netflix offers Internet video streaming, enabling the viewing of films directly on a PC or TV at home.
According to Comscore in the month of December 2008 Netflix streamed 127,016,000 videos which makes Netflix top 20 online video site.
Netflix’s movie recommendation algorithm uses Hive ( underneath using Hadoop, HDFS & MapReduce ) for query processing and BI. Hive is used for scalable log analysis to gain business intelligence. Netflix collects all logs from website which are streaming logs collected using Chukwa (Soon to be replaced by Hunu alternative to Chukwa, Which will be open sourced very soon).
Currently Netflix is processing there streaming logs on Amazon S3.
How Much Data?
- Parsing 0.6 TB logs per day.
- Running 50 persistent nodes on Amazon EC3 cluster.
How Often?
Hadoop jobs are run every hour to parse last hour logs and reconstruct sessions. Then small files are merged from each reducer which are loaded to Hive. How Data is processed? Netflix web usage logs generated by web applications are collected by Chukwa Collector. These logs are dumped on Amazon S3 running HDFS instances. Later these logs are consumed by Hive & Hadoop running on the cloud. For NetFlix processing data on AS3 is not cheaper but it’s less risky because they can recover data.
Useful data generated using log analysis.
- Streaming summary data
- CDN performance.
- Number of streams per day.
- Number of errors / session.
- Test cell analysis.
- Ad hoc query for further analysis.
- And BI processing using MicroStrategy.
Mozilla Foundation(makers of Firefox,Thunderbird) using Hadoop & HBase.
This month’s Bay area HBase user group meet up was held at Mozilla Foundation in Mountain View. (Mozilla foundation is known for there open source products Firefox, Thunderbird, Bugzilla etc).
Andrew Purtel presented interesting talk on installing HBase using Amazon EC2 cluster. Mozilla’s metrics engineer Daniel talked about how HBase and Hadoop are used at Mozilla corporation.
Daniel Einspanjer who is Metrics Software Engineer at Mozilla Corporation mentioned Mozilla is using Hadoop and HBase for generating usage and crash report metrics. According to Daniel there are more than 350 million Firefox users who reports around 1 million crash reports every day. These crash reports are valuable information for developers, which can be used for debugging purposes. Also these metrics includes information about plugin usage which is valuable for plugin developers.
Mozilla foundation wants to process these crash report data so that they can share important usage and debug information with developers and users. Currently Mozilla is using around 10-15% of crash reports received daily.
By processing these crash reports Mozilla will be able to gain information such as
- Total number of Firefox users.
- Add-on update pings.
- Average add-ons used by per user.
- Source : http://blog.mozilla.com/data/tag/hadoop/
Mozilla aims to process more than 15 % crash reports. To achieve this goal they are building a system to store crashes data for long time and process these data using Hadoop and HBase. Currently Mozilla is running production cluster with Hadoop 0.20 on 20 nodes to process crash report data.

