Monday, December 16, 2013

Using Cloudera parcels with no internet connectivity.

The easiest way to set up CDH (Cloudera's Hadoop distribution) is to use parcels.

The parcels are normally downloaded from the internet (the Cloudera archives).

When there is no internet connectivity, we need to download the parcels separately, set up a local web server, and point Cloudera Manager (CM) to that local server.

Download the parcels from the Cloudera archive.

Depending on your Linux flavor, download the appropriate parcel file along with the manifest.json file.

For Ubuntu (Precise) I downloaded CDH-4.5.0-1.cdh4.5.0.p0.30-precise.parcel.

Setting up Apache server

Install the Apache web server.

On Ubuntu I used the command below.

sudo apt-get install apache2

Open http://localhost:80 in your browser.

It should show a success page, which means the Apache server is running on your machine.

Under the /var/www/ directory there is a file called index.html.

Please delete that file.

Then make a directory structure inside /var/www/ for the parcels, for example:

    /var/www/cdh4.5.0/parcels/
Copy the parcel file and manifest.json to that location.
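The steps above can be sketched as shell commands. This is only an illustration: in a real install the web root would be /var/www itself and the parcel would be the real file downloaded from Cloudera; here a temp directory and empty placeholder files stand in so the commands can be run anywhere.

```shell
WORK="$(mktemp -d)" && cd "$WORK"
DOCROOT="$WORK/var/www"        # stands in for the real /var/www web root

# Placeholders for the files downloaded from the Cloudera archive.
touch CDH-4.5.0-1.cdh4.5.0.p0.30-precise.parcel manifest.json

# Create the repository layout and copy the files into it.
mkdir -p "$DOCROOT/cdh4.5.0/parcels"
cp CDH-4.5.0-1.cdh4.5.0.p0.30-precise.parcel manifest.json \
   "$DOCROOT/cdh4.5.0/parcels/"

ls "$DOCROOT/cdh4.5.0/parcels"
```

On a real system, run the mkdir and cp against /var/www with sudo.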

Now you will be able to access the parcels from your local URL (http://localhost:80).

Now open Cloudera Manager.

Select Parcels.

Choose Edit Settings.

Change the Remote Parcel Repository URLs entry to your local URL.

Now you can download the parcels from your local server and distribute them.

Wednesday, February 20, 2013



Before we try to understand MapReduce, I would like you to understand the Hadoop Distributed File System (HDFS). Please read my previous blog, Understanding HDFS.

Let's look at a simple requirement: I have to analyze the messages sent on WhatsApp on Feb 14 to find out how many people celebrated Valentine's Day. For the sake of simplicity, let's assume we have all the messages on a single system. The basic steps involved in finding the word valentine in each message would be:

Search through each message for the word valentine.
Have a counter that increments each time the word is found in a message.
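The two steps above can be sketched on a single machine in a few lines of Python (the sample messages are invented for illustration):

```python
# Count how many messages mention "valentine" (sample data is made up).
messages = [
    "happy valentines day my love",
    "meeting at 5 today",
    "will you be my valentine?",
]

counter = 0
for message in messages:
    if "valentine" in message.lower():  # step 1: search each message
        counter += 1                    # step 2: increment the counter

print(counter)  # 2
```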

If I write such a program and run it on one system, we can imagine how long it could take to compute even such simple information on a single machine. So what happens when the data is spread across multiple systems? The straightforward answer is parallel processing, but that comes with its own problems.

One problem is that we cannot be sure when the processing will finish on each system. The other problem is consolidating all the outputs. Hadoop helps with parallel processing by providing same-sized blocks of data through HDFS; it takes care of running the tasks on each data block and consolidating the results. To make Hadoop understand our logic, the logic should be provided in the MapReduce format.

Hadoop expects our program to be in the form of MapReduce, so let's try to understand how it works. MapReduce helps process data in parallel, using a divide-and-conquer approach.

MapReduce consists of two functions: a map function and a reduce function. Let's look at the map function first. It accepts its input as key-value pairs; in our case, each input is a (line offset, message) pair.

The key, the line offset, is generated by Hadoop and is unique. We only care about the values, which here are the messages. So we will have the logic below in our map method.

for each (key, value) pair
    search the message (the value) for the word valentine
    if found, increment the counter
return the counter
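The pseudocode above can be sketched in Python. This is only an illustration of the idea, not Hadoop's actual API; the name map_fn and the sample inputs are mine:

```python
# Map step: for each (offset, message) pair, emit a ("valentine", count)
# pair when the message contains the word.
def map_fn(offset, message):
    count = message.lower().count("valentine")
    return [("valentine", count)] if count else []

inputs = [(0, "valentine wishes!"), (18, "lunch at noon"), (32, "my valentine")]

map_outputs = []
for offset, message in inputs:
    map_outputs.extend(map_fn(offset, message))

print(map_outputs)  # [('valentine', 1), ('valentine', 1)]
```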

The next important thing to note is that the output of the map function is also a key-value pair. Now let's understand, from the HDFS perspective, where this map function actually runs. In HDFS we have a TaskTracker at each DataNode, and the map function runs on a block of data in the DataNode. The output of the map function would be (valentine, count) pairs.

Hadoop also provides a phase called shuffle and sort. Let's try to understand how it works. We will have several DataNodes, and our map function can run in parallel on several data blocks across them. The shuffle-and-sort phase merges all the outputs of the map functions and delivers them to the reduce function. The entire process is monitored by the JobTracker of HDFS; all MapReduce jobs are submitted through the JobTracker.

So the final input to the reduce function would be a key together with the list of counts the map tasks produced for it.

So let's try to understand what the reduce function actually does. Like the map function, it accepts input as key-value pairs and produces its output as key-value pairs. Here is what happens in our reduce method.

for each (key, value) pair
    add the value to totalNoOfLovers
return (key, totalNoOfLovers)
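A Python sketch of the shuffle-and-sort step followed by the reduce step above. The per-block counts are invented so the total matches the final output in the text:

```python
from collections import defaultdict

# Outputs of map tasks from different data blocks (invented counts).
map_outputs = [("valentine", 4), ("valentine", 6), ("valentine", 3)]

# Shuffle and sort: group all values by key.
grouped = defaultdict(list)
for key, value in map_outputs:
    grouped[key].append(value)

# Reduce: sum the values for each key (totalNoOfLovers in the pseudocode).
def reduce_fn(key, values):
    return (key, sum(values))

results = [reduce_fn(key, values) for key, values in grouped.items()]
print(results)  # [('valentine', 13)]
```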

So basically our final output would be (valentine, 13).

This is exactly how we use MapReduce in Hadoop.

The logic is mainly written in Java; however, Hadoop also supports other languages.

Since writing programs in the MapReduce model using Java can be difficult (Java is a general-purpose language, so development time can be long), Hadoop helps us with an ecosystem tool called Pig, which is focused on and built for data processing.

We can see more about PIG in my next blog.

Please leave your suggestions and feedback below.

Tuesday, February 19, 2013

Big Data Unrevealed : HDFS


What is a distributed filesystem?

Filesystems that manage storage across a network of machines are called distributed filesystems. When data grows in size, we need multiple machines, and to manage files across those machines we need a distributed filesystem. Networking comes into the picture in this model, and with it problems like fault tolerance and performance bottlenecks. Hadoop comes with its own distributed filesystem, called HDFS, which manages these issues.

Design of HDFS:

HDFS is designed to deal with large chunks of data, with data access running on commodity hardware. It follows a write-once, read-many-times model. Data from different sources is copied into HDFS and analysis is carried out over the large dataset. The use of commodity hardware has its pros and cons: the pro is that it is less expensive; the con is a high failure rate, since commodity machines are not specialized and may come from different vendors. The beauty of HDFS is that it manages this vulnerability and does not let the user know about hardware failures. HDFS is, in short, fault tolerant.

An HDFS system has two different types of nodes: the NameNode and the DataNode. Every HDFS cluster has one NameNode and many DataNodes.

Before looking at the node types, let's see how the Hadoop filesystem stores data. Basically, it stores data in blocks. The size of each block is 64 MB by default, though it can be changed. The 64 MB block size ensures that a data block is neither too big nor too small for data computation.

So if we push a 1 GB file into the Hadoop filesystem, it is stored as 16 blocks across the underlying machines.

There is more to the splitting of files. We learned that our Hadoop filesystem is fault tolerant; let's see how it actually achieves that. It follows the concept of replication: there is a replication factor, with a default of 3. So when we store our 1 GB file in HDFS, it actually occupies 3 GB on disk. We may ask whether this increases the data load. It does, but it brings important advantages. Replication makes the system fault tolerant: no two replicas of the same block are kept on the same machine, so if one machine's hardware becomes corrupt, our data is always safe in another replica.
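The arithmetic in the two paragraphs above, as a quick check:

```python
import math

file_size_mb = 1024   # a 1 GB file
block_size_mb = 64    # default HDFS block size
replication = 3       # default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)
stored_mb = blocks * block_size_mb * replication

print(blocks)     # 16 blocks
print(stored_mb)  # 3072 MB, i.e. 3 GB on disk for 1 GB of data
```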

The other important advantage is performance. Since replicas live on different machines, if one physical system containing a block is busy, the computation can run on another system's replica of that block, improving performance.

I mentioned two different types of nodes; let's look at them. First, the DataNode: it is the physical system that holds our data blocks. An HDFS cluster can have any number of DataNodes.

The other node is the NameNode. After splitting the data, we need to keep track of exactly where each data block resides among the DataNodes in the cluster. All this metadata is kept on one system, and that physical system is called the NameNode. The NameNode acts as the master node, and there is only one NameNode in an HDFS cluster; all the other physical systems act as slave nodes.

There are two more important pieces to look at: the JobTracker and the TaskTracker. The JobTracker is a centralized service that assigns and manages your MapReduce jobs.

Each DataNode has its own TaskTracker, which is responsible for carrying out the MapReduce tasks on that node.

So the next question is ....

What is map-reduce all about?

Please check my blog below on the Hadoop MapReduce algorithm.

Hadoop MapReduce Algorithm

Please leave your valuable suggestions and comments.

Tuesday, February 12, 2013

Big Data Unrevealed : A revelation


Is Big Data just hype around us, or are we already into it? Let's find out more.

Let's go back to the history of computers. Why were computers used at first? To do simple calculations. Then came the need to store data, so computers were given the capability to store it, and many applications started using this new-found ability.

The use of computers increased with innovations in hardware and falling hardware costs. New programming languages and tools emerged that helped in building software. Computers came to be used in fields like retail, healthcare, and automobiles to store inventory, employee, sales, and other important details, paving the way for data storage systems like the RDBMS. This was successful until recently; an RDBMS stores data in tables, and the data is structured.

Over the years, the volume of data has grown enormously, from gigabytes to terabytes to petabytes. With the advent of social networking sites like Facebook, Twitter, and Google+, new kinds of data in the form of posts, images, and videos are produced every second. This new data is unstructured, so our traditional database systems find it difficult to handle. Companies also started tracking internet users' footprints to analyze information about them.

Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone.

The huge amount of data gathered provides valuable insights into the history of organizations, the interests of consumers, and so on, which helps management make strategic decisions.

The question that arises is: how do we use these huge amounts of data?

Google researched this problem and came up with a solution: the MapReduce algorithm. MapReduce, first described in a research paper from Google, is used by Google and Yahoo to power their web search.

The algorithm is built on the concept of parallel processing: the use of multiple computers to compute the desired result instead of relying on a single powerful computer. MapReduce consists of two parts; as the name suggests, they are "Map" and "Reduce". I explain the algorithm in more detail in a different blog.

Since other companies would find it difficult to implement the algorithm for their own problems (and it would also be like reinventing the wheel), an implementation of the algorithm called Hadoop emerged, open sourced under the Apache Foundation. The name itself is unique; try to guess what it could actually mean...




Yes, you read it right: it is the name of a toy elephant. The project was named after it; Doug Cutting was one of the core members of the project, and Hadoop was the name of his kid's toy elephant. Hadoop works over HDFS, the Hadoop Distributed File System.

Lets discuss more about Hadoop and HDFS in coming blogs.


Note: Big Data investments are growing enormously, and the need for developers is also increasing.

To know more check the below blogs

Hadoop Distributed File System 

Hadoop MapReduce Algorithm 


Leave your valuable comments to help improve the blog.





Saturday, February 2, 2013


We are about to develop an app that shows the Image of the Day from the NASA website.

Before developing the app, we need to understand what RSS feeds are and why we use AsyncTask.

What is an RSS feed?

RSS is a format for delivering regularly changing web content. Many news sites, blogs, and online publishers syndicate their content as RSS feeds.

What is AsyncTask?

AsyncTask is used for running long processes, like fetching an RSS feed from the internet or fetching data from a database. Recent versions of Android do not allow long-running work on the UI thread. Heavy tasks on the UI thread make the application feel hung, and the Android system will terminate an app it considers unresponsive.

AsyncTask provides four callback methods: onPreExecute(), doInBackground(), onProgressUpdate(), and onPostExecute().

We will be using doInBackground() and onPostExecute() in our application.

Choosing the RSS feed

NASA provides different RSS feeds; let's choose the NASA Image of the Day feed for our app.

Choosing the parser

Parsers are programs that read through XML files, and RSS feeds are simple XML files. Android provides a number of XML parsers; let's choose the SAX parser for our app.

Analyzing the RSS feed

Save the RSS feed as an XML file on your local system. There are different tags inside the XML file; out of all of them, we will use the ones below.



Title --> the title of the image.

Description --> a description of the image.

Published date (pubDate) --> the date the image was published.

Url --> the link from which the image can be downloaded.
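To illustrate how a SAX handler pulls these fields out of the feed, here is a sketch in Python (the app itself uses Java's SAX API; the sample XML below is a made-up, simplified item, and the field handling is an assumption about the feed's structure):

```python
import xml.sax

# Simplified, invented sample of one feed item (the real feed has more tags).
SAMPLE = b"""<rss><channel><item>
<title>Image of the Day</title>
<description>A nebula captured by Hubble.</description>
<pubDate>Sat, 02 Feb 2013 00:00:00 GMT</pubDate>
<enclosure url="http://example.com/image.jpg"/>
</item></channel></rss>"""

class IotdHandler(xml.sax.ContentHandler):
    """Collects title, description, pubDate, and the image URL."""
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None

    def startElement(self, name, attrs):
        if name == "enclosure":          # the image URL lives in an attribute
            self.fields["url"] = attrs.get("url")
        elif name in ("title", "description", "pubDate"):
            self._current = name
            self.fields[name] = ""

    def characters(self, content):
        if self._current:                # text inside a tag we care about
            self.fields[self._current] += content

    def endElement(self, name):
        self._current = None

handler = IotdHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.fields["title"])  # Image of the Day
print(handler.fields["url"])    # http://example.com/image.jpg
```

The handler is streamed through the document once, collecting only the fields the app displays.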

Let's start with the app.

Create a new project in Eclipse. I am using the Eclipse ADT and the Android 4.2 SDK.


Your new project will have the standard structure generated by the ADT.

Let's create the UI for our app.

We need three TextViews to display the title, published date, and description, and an ImageView to display the image.

Your layout XML lives in the res folder, while the basic methods like onCreate() live in MainActivity under the src folder.

The app uses a handler (IotdHandler) that parses the RSS feed received from the NASA Image of the Day feed using the SAX parser.

Then you have your MainActivity Java file in the src folder.

In the MainActivity file there is an inner class that extends AsyncTask, with implementations of the doInBackground() and onPostExecute() methods.

In doInBackground() we connect to the NASA site, parse the XML using the SAX parser, and let the IotdHandler hold all the objects required by the app.

In onPostExecute() we set the values on the UI, i.e., we fill the TextViews and the ImageView of the app. onPostExecute() is an AsyncTask method that receives the result of doInBackground() as its parameter.

Take a look at the code in onCreate() of MainActivity: we start the task with the execute() method. You should not invoke the AsyncTask callback methods directly yourself.


In order for your app to connect to the internet, we need to add a permission to the AndroidManifest.xml file.
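The required entry is the standard INTERNET permission, placed inside the manifest element:

```xml
<uses-permission android:name="android.permission.INTERNET" />
```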





This is my first blog , and it is still in progress. Please share your comments and views to improve the blog.