Knowledge is precious. Open source solutions for processing big data and getting Knowledge.

How much data is generated on the Internet every year/month/day?

According to Nielsen Online, there are currently more than 1,733,993,741 Internet users. How much data are these users generating?

Here are a few numbers that show how much data is generated every year.

* 90 trillion – The number of emails sent on the Internet in 2009.
* 247 billion – Average number of email messages per day.
* 1.4 billion – The number of email users worldwide.
* 100 million – New email users since the year before.
* 81% – The percentage of emails that were spam.
* 92% – Peak spam levels late in the year.
* 24% – Increase in spam since last year.
* 200 billion – The number of spam emails per day (assuming 81% are spam).
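The spam figure in the last item follows from two other numbers in the list. As a quick check, here is a small sketch that uses only the figures quoted above; the variable names are illustrative.

```python
# Figures taken from the list above (2009).
emails_per_year = 90e12   # 90 trillion emails sent in 2009
emails_per_day = 247e9    # 247 billion emails per day on average
spam_share = 0.81         # 81% of emails were spam

# Daily average implied by the yearly total.
implied_per_day = emails_per_year / 365

# Spam per day, assuming the 81% share applies to the daily average.
spam_per_day = emails_per_day * spam_share

print(f"Implied emails per day: {implied_per_day / 1e9:.1f} billion")
print(f"Spam emails per day:    {spam_per_day / 1e9:.1f} billion")
```

Both results round to the values in the list: about 247 billion emails and about 200 billion spam emails per day.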

* 234 million – The number of websites as of December 2009.
* 47 million – Added websites in 2009.

Web servers

* 13.9% – The growth of Apache websites in 2009.
* -22.1% – The growth of IIS websites in 2009.
* 35.0% – The growth of Google GFE websites in 2009.
* 384.4% – The growth of Nginx websites in 2009.
* -72.4% – The growth of Lighttpd websites in 2009.

Domain names

* 81.8 million – .COM domain names at the end of 2009.
* 12.3 million – .NET domain names at the end of 2009.
* 7.8 million – .ORG domain names at the end of 2009.
* 76.3 million – The number of country code top-level domains (e.g. .CN, .UK, .DE, etc.).
* 187 million – The number of domain names across all top-level domains (October 2009).
* 8% – The increase in domain names since the year before.

Internet users

* 1.73 billion – Internet users worldwide (September 2009).
* 18% – Increase in Internet users since the previous year.
* 738,257,230 – Internet users in Asia.
* 418,029,796 – Internet users in Europe.
* 252,908,000 – Internet users in North America.
* 179,031,479 – Internet users in Latin America / Caribbean.
* 67,371,700 – Internet users in Africa.
* 57,425,046 – Internet users in the Middle East.
* 20,970,490 – Internet users in Oceania / Australia.
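The regional counts above add up exactly to the 1,733,993,741 users quoted at the start of this post. The sketch below sums them and prints each region's share; the dictionary and variable names are illustrative.

```python
# Regional Internet user counts from the list above.
users_by_region = {
    "Asia": 738_257_230,
    "Europe": 418_029_796,
    "North America": 252_908_000,
    "Latin America / Caribbean": 179_031_479,
    "Africa": 67_371_700,
    "Middle East": 57_425_046,
    "Oceania / Australia": 20_970_490,
}

total_users = sum(users_by_region.values())
print(f"Total: {total_users:,}")

# Share of worldwide users per region, in percent.
for region, users in users_by_region.items():
    share = 100 * users / total_users
    print(f"{region:<27} {share:5.1f}%")
```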

Social media

* 126 million – The number of blogs on the Internet (as tracked by BlogPulse).
* 84% – Percent of social network sites with more women than men.
* 27.3 million – Number of tweets on Twitter per day (November, 2009)
* 57% – Percentage of Twitter’s user base located in the United States.
* 4.25 million – People following @aplusk (Ashton Kutcher, Twitter’s most followed user).
* 350 million – People on Facebook.
* 50% – Percentage of Facebook users that log in every day.
* 500,000 – The number of active Facebook applications.


* 4 billion – Photos hosted by Flickr (October 2009).
* 2.5 billion – Photos uploaded each month to Facebook.
* 30 billion – At the current rate, the number of photos uploaded to Facebook per year.


* 1 billion – The total number of videos YouTube serves in one day.
* 12.2 billion – Videos viewed per month on YouTube in the US (November 2009).
* 924 million – Videos viewed per month on Hulu in the US (November 2009).
* 182 – The number of online videos the average Internet user watches in a month (USA).
* 82% – Percentage of Internet users that view videos online (USA).
* 39.4% – YouTube online video market share (USA).
* 81.9% – Percentage of embedded videos on blogs that are YouTube videos.

Web browsers

* 62.7% – Internet Explorer
* 24.6% – Firefox
* 4.6% – Chrome
* 4.5% – Safari
* 2.4% – Opera
* 1.2% – Other

Malicious software

* 148,000 – New zombie computers created per day (used in botnets for sending spam, etc.)
* 2.6 million – Number of malicious code threats at the start of 2009 (viruses, trojans, etc.)
* 921,143 – The number of new malicious code signatures added by Symantec in Q4 2009.

Data is abundant, Information is useful, Knowledge is precious.

Data. – Data is raw and abundant. It simply exists and has no significance beyond its existence. It can exist in any form, usable or not, and has no meaning in itself. Collecting users’ activity logs, for example, produces data.

Information. – Information is data that has been given meaning by way of relational connection.

Knowledge. – Knowledge is the appropriate collection of information, such that its intent is to be useful.

Internet users generate petabytes of data every day. Millions of users access billions of web pages, and every keystroke and mouse click adds entries to server logs. Raw user log data alone is not useful. To give users better service and to generate revenue for the business, the raw data must be processed into information, which can then be turned into knowledge for users and advertisers.
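To make the step from data to information to knowledge concrete, here is a minimal sketch on a tiny, made-up access log. The log format, user names, and page paths are assumptions for illustration only; real server logs carry far more fields.

```python
from collections import Counter

# Data: raw, hypothetical access-log lines in "user page" form.
raw_log = [
    "alice /home",
    "bob /home",
    "alice /products/phone",
    "carol /products/phone",
    "alice /products/phone",
    "bob /products/laptop",
]

# Information: the same data given meaning by relating users to pages.
page_views = Counter(line.split()[1] for line in raw_log)
visitors = {}
for line in raw_log:
    user, page = line.split()
    visitors.setdefault(page, set()).add(user)

# Knowledge: information organized to be useful, e.g. which page to promote.
most_viewed_page, views = page_views.most_common(1)[0]
print(f"Most viewed page: {most_viewed_page} ({views} views, "
      f"{len(visitors[most_viewed_page])} distinct visitors)")
```

On a real site the same aggregation would run over billions of log lines, which is where the distributed tools described in the rest of this post come in.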

Open source solutions for processing big data.

The following are some of the open source solutions for processing big data.

Hadoop – The Hadoop project develops open-source software for reliable, scalable, distributed computing. The Hadoop ecosystem includes the following sub-projects.

HDFS – Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
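The idea of splitting a file into blocks and keeping several replicas of each block on different nodes can be sketched in a few lines. This is a toy model, not HDFS code: the function names, the tiny block size, and the round-robin placement are all assumptions for illustration, and real HDFS uses much larger blocks and its own placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Toy sizes: a 10-byte "file", 4-byte blocks, 3 replicas on 4 nodes.
blocks = split_into_blocks(b"a" * 10, block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)
for block_id, holders in placement.items():
    print(block_id, holders)
```

Because every block lives on several nodes, losing one node does not lose data, and a computation can be scheduled on any node that already holds the block it needs.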

MapReduce – MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers.
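The programming model is easiest to see with the classic word count. The sketch below runs the map, shuffle, and reduce steps in a single Python process. It illustrates the model only and does not use the Hadoop API; the function names are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


documents = ["big data is big", "data is abundant"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

In a real cluster the map calls run in parallel on the nodes that hold the input, and the framework handles the shuffle across the network before the reducers run.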

Pig – Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Hive – Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for easy data summarization, ad hoc querying, and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data, along with a simple query language called Hive QL, which is based on SQL and lets users familiar with SQL query this data. The language also allows traditional map/reduce programmers to plug in their custom mappers and reducers for more sophisticated analysis that the built-in capabilities of the language may not support.

HBase – HBase is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables (billions of rows by millions of columns) atop clusters of commodity hardware.
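A table with billions of rows and millions of possible columns is only practical if each row stores just the cells it actually uses, and if a cell can be reached directly by its row key. The sketch below models that idea with nested dictionaries. It is a toy model and not the HBase client API; the `put` and `get` helpers and the `family:qualifier` column names are assumptions for illustration.

```python
from collections import defaultdict

# table[row_key]["family:qualifier"] = value; rows hold only the columns they use.
table = defaultdict(dict)


def put(row_key, column, value):
    """Write one cell, addressed by row key and column name."""
    table[row_key][column] = value


def get(row_key, column, default=None):
    """Read one cell directly by row key, without scanning other rows."""
    return table.get(row_key, {}).get(column, default)


put("user42", "profile:name", "Alice")
put("user42", "activity:last_page", "/home")
put("user7", "profile:name", "Bob")   # this row has no activity columns at all

print(get("user42", "activity:last_page"))
print(get("user7", "activity:last_page"))
```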

Voldemort – Voldemort is a distributed key-value storage system.

Cassandra – The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo’s fully distributed design and Bigtable’s ColumnFamily-based data model.
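One building block of Dynamo-style distributed stores is consistent hashing, which decides which node owns a given key without a central coordinator. The sketch below is a minimal illustration of that technique under simplifying assumptions: there are no virtual nodes and no replication, and the class and node names are hypothetical. It is not taken from Cassandra's or Voldemort's code.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Hash a string to a position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes):
        # Each node owns the arc of the ring that ends at its position.
        self._ring = sorted((ring_position(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the first node at or after the key's position, wrapping around."""
        index = bisect.bisect_left(self._positions, ring_position(key))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
for key in ["user:1", "user:2", "user:3"]:
    print(key, "->", ring.node_for(key))
```

Because a key's owner depends only on hash positions, adding or removing a node moves just the keys on the neighbouring arc instead of reshuffling everything.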

Website and web server stats from Netcraft. Domain name stats from Verisign and Internet user stats from Internet World Stats. Web browser stats from Net Applications. Email stats from Radicati Group. Spam stats from McAfee. Malware stats from Symantec and McAfee. Online video stats from Comscore, Sysomos and YouTube. Photo stats from Flickr and Facebook. Social media stats from BlogPulse, Pingdom, Twittercounter, Facebook and GigaOm.