Netflix + (Hadoop & Hive) = Your Favorite Movie

I love watching movies and I am a huge fan of Netflix. I love the way Netflix suggests new movies based on my previous watching history and ratings. Established in 1997 and headquartered in Los Gatos, California, Netflix (NASDAQ: NFLX) is a service offering online flat-rate DVD and Blu-ray disc rental-by-mail and video streaming in the United States. It has a collection of 100,000 titles and more than 10 million subscribers. The company has more than 55 million discs and, on average, ships 1.9 million DVDs to customers each day. On April 2, 2009, the company announced that it had mailed its two billionth DVD. Netflix also offers Internet video streaming, enabling the viewing of films directly on a PC or TV at home.

According to Comscore, Netflix streamed 127,016,000 videos in December 2008, which makes it a top-20 online video site.

Netflix’s movie recommendation algorithm uses Hive (which runs on top of Hadoop, HDFS & MapReduce) for query processing and BI. Hive is used for scalable log analysis to gain business intelligence. Netflix collects the streaming logs from its website using Chukwa (soon to be replaced by Honu, an alternative to Chukwa that will be open sourced very soon).

Currently Netflix is processing its streaming logs on Amazon S3.

How Much Data?

  1. Parsing 0.6 TB logs per day.
  2. Running 50 persistent nodes on an Amazon EC2 cluster.
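To put those two numbers in perspective, here is a quick back-of-the-envelope calculation. It assumes decimal terabytes and an even spread of the log volume across all 50 nodes, neither of which is stated anywhere above, so treat the result as a rough order of magnitude only.

```python
# Back-of-the-envelope load estimate from the two figures above.
# Assumptions (not from the article): 1 TB = 10**12 bytes, and the
# daily log volume is spread evenly over time and over all nodes.
TB = 10**12
daily_bytes = 6 * TB // 10      # 0.6 TB of logs per day
nodes = 50                      # persistent nodes in the cluster
seconds_per_day = 24 * 60 * 60

bytes_per_second = daily_bytes / seconds_per_day
bytes_per_node_per_day = daily_bytes / nodes

print(f"{bytes_per_second / 10**6:.1f} MB/s sustained ingest")
print(f"{bytes_per_node_per_day / 10**9:.1f} GB per node per day")
```

Under these assumptions that works out to roughly 6.9 MB/s of sustained log ingest and about 12 GB per node per day.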

How Often?

Hadoop jobs run every hour to parse the last hour's logs and reconstruct sessions. The small files from each reducer are then merged and loaded into Hive.

How Is Data Processed?

Netflix web usage logs generated by web applications are collected by the Chukwa Collector. These logs are dumped on Amazon S3 running HDFS instances. Later, these logs are consumed by Hive & Hadoop running on the cloud. For Netflix, processing data on Amazon S3 is not cheaper, but it is less risky because the data can be recovered.
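To make the hourly "parse the logs and reconstruct sessions" step under "How Often?" a little more concrete, here is a minimal single-process sketch in Python. It is not Netflix's actual job: the tab-separated log format, the field names, and the 30-minute inactivity gap are all assumptions made up for illustration. In a real MapReduce job the grouping by user would presumably happen in the shuffle, with the splitting logic in the reducer.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed inactivity gap that ends a session; the article does not state one.
SESSION_GAP = timedelta(minutes=30)

def parse_line(line):
    """Parse a hypothetical tab-separated log line: timestamp, user_id, event."""
    ts, user_id, event = line.rstrip("\n").split("\t")
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S"), user_id, event

def reconstruct_sessions(lines, gap=SESSION_GAP):
    """Group events by user, then split each user's events into sessions."""
    by_user = defaultdict(list)
    for line in lines:
        ts, user_id, event = parse_line(line)
        by_user[user_id].append((ts, event))

    sessions = []
    for user_id, events in by_user.items():
        events.sort(key=lambda e: e[0])  # order by timestamp
        current = [events[0]]
        for prev, cur in zip(events, events[1:]):
            if cur[0] - prev[0] > gap:
                # Too long since the previous event: close the session.
                sessions.append((user_id, current))
                current = []
            current.append(cur)
        sessions.append((user_id, current))
    return sessions

# One hour of made-up log lines.
hour_of_logs = [
    "2009-06-01T10:00:05\tu1\tplay",
    "2009-06-01T10:02:10\tu2\tplay",
    "2009-06-01T10:03:00\tu1\terror",
    "2009-06-01T10:50:00\tu1\tplay",
]
sessions = reconstruct_sessions(hour_of_logs)
for user_id, events in sessions:
    print(user_id, [name for _, name in events])
```

With the sample lines, user u1 ends up with two sessions because of the 47-minute pause between the second and third events, and user u2 gets one.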

Useful data generated using log analysis:

  1. Streaming summary data.
  2. CDN performance.
  3. Number of streams per day.
  4. Number of errors / session.
  5. Test cell analysis.
  6. Ad hoc query for further analysis.
  7. And BI processing using MicroStrategy.
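Two of the items above, streams per day and errors per session, are simple aggregates over the reconstructed sessions. In the pipeline described here they would presumably be written as Hive queries. The Python below is only a sketch of the arithmetic, using a made-up record layout (day, user, list of event names) and made-up event names "play" and "error".

```python
from collections import Counter

# Hypothetical session records: (day, user_id, list of event names).
session_records = [
    ("2009-06-01", "u1", ["play", "error"]),
    ("2009-06-01", "u1", ["play"]),
    ("2009-06-01", "u2", ["play"]),
    ("2009-06-02", "u2", ["play", "error", "error", "play"]),
]

def streams_per_day(records):
    """Count 'play' events per day across all sessions."""
    counts = Counter()
    for day, _user, events in records:
        counts[day] += events.count("play")
    return dict(counts)

def errors_per_session(records):
    """Average number of 'error' events per session."""
    if not records:
        return 0.0
    total_errors = sum(events.count("error") for _, _, events in records)
    return total_errors / len(records)

daily_streams = streams_per_day(session_records)
avg_errors = errors_per_session(session_records)
print(daily_streams)
print(avg_errors)
```

For the sample records this gives 3 streams on the first day, 2 on the second, and 0.75 errors per session on average.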