Can someone more qualified kindly explain Hadoop to me?

18    15 Jul 2015 02:34 by u/SevereHeebieJeebies

I am, by both education and experience, a web/app developer who knows a good bit of SQL. I am not, however, adept at the concepts of big data. My job is moving towards a system that can handle large amounts of data more efficiently than SQL, and to the execs, open-source Hadoop is preferred over the pricier alternatives. I tried reading the Hadoop documentation but got lost, ended up at map-reduce, and gave up on the formal documentation. Can one of you fine programmers give me an explanation of what Hadoop is and how to use it, or point me towards Hadoop learning resources?

14 comments

4

I am tired and on my phone, so I will write something brief.

Hadoop is, in essence, a distributed way of doing things. Underneath it all is a distributed file system: a bunch of commodity-level interconnected computers. This is called HDFS (Hadoop Distributed File System). When you store some data, the data is split into smaller chunks and then stored in HDFS. HDFS distributes these chunks across the disks of different computers and creates an index. This index is held in a separate server called the namenode.
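To make that concrete, here is a toy Python sketch of the idea (this is not the real HDFS API; the block size, node names, and replication factor are all made up for illustration — real HDFS uses 128 MB blocks and 3 replicas by default):

```python
# Toy model of HDFS-style block storage (illustration only, not the real API).
BLOCK_SIZE = 8          # real HDFS defaults to 128 MB blocks
REPLICATION = 2         # real HDFS defaults to 3 copies of each block
NODES = ["node1", "node2", "node3"]

def store_file(name, data, namenode_index):
    """Split data into blocks, place copies on nodes, record the locations."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for block_id, block in enumerate(blocks):
        # round-robin placement of REPLICATION copies across the nodes
        placed_on = [NODES[(block_id + r) % len(NODES)] for r in range(REPLICATION)]
        namenode_index[(name, block_id)] = placed_on
    return blocks

index = {}                         # this dict plays the role of the namenode
store_file("log.txt", "a" * 20, index)
print(index)
# {('log.txt', 0): ['node1', 'node2'], ('log.txt', 1): ['node2', 'node3'],
#  ('log.txt', 2): ['node3', 'node1']}
```

The point is just that no single machine holds the whole file: the namenode only knows *where* the blocks live, and losing one node still leaves a copy of every block somewhere else.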

Now when you want to do some processing, you submit a job with details like what files you want to operate on and what operations you want to perform on those files. This job is then split into many smaller jobs (map), and the output from all these jobs is combined into a single result (reduce).
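Here is the classic word-count example of that map/reduce split, simulated in plain Python on one machine (in real Hadoop, each chunk's map runs on the node holding that chunk, and the shuffle step moves data between nodes):

```python
from collections import defaultdict

# Word count, the classic MapReduce example, simulated on a single machine.
chunks = ["the quick brown fox", "the lazy dog", "the end"]

# Map: each "smaller job" turns its chunk into (word, 1) pairs.
mapped = []
for chunk in chunks:                      # in Hadoop, each chunk runs on its own node
    mapped.extend((word, 1) for word in chunk.split())

# Shuffle: group pairs by key (Hadoop does this between map and reduce).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each group into a single result.
totals = {word: sum(counts) for word, counts in groups.items()}
print(totals["the"])  # 3
```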

The advantage of this is that off-the-shelf commodity hardware can be leveraged to build bigger and bigger clusters, adding more processing power. So your cluster grows horizontally. Redundancy is also built in very cheaply: the failure of one node (computer) can be mitigated by just replacing that node with similar hardware.

Anything more, you can just ask.

0

So if I have a dev environment that is just one machine for testing, and I build my app, and then want to deploy to Amazon Web Services, is transferring the HDFS a "simple" task?

1

No. "App" is not the right terminology. You write jobs that do certain things. For example, with a database you would write SQL to extract data and perform certain operations on that data. Similarly, your data resides in Hadoop, so you write a job and submit it. Hadoop will take this job, split it, and then run it on the nodes where the data resides. So unlike a database, you are sending your job to the data. In the case of a database, you would pull the data out and do the processing on a local computer. A different way of doing things.

As far as HDFS is concerned, you don't migrate your HDFS cluster. In AWS, you will create a new HDFS cluster and then store the same data that you had stored locally.

1

As far as I was able to figure out, Hadoop is like running a clustered database/filesystem with distributed RAID. So really, I learned nothing. Of course the execs like hadoop... it looks "free"

0

Lol yah, until they have to pay for professional support...

1

I helped some exchange students over the summer set up and modify Hadoop to use a real-time scheduler.

Here is my high-level description (I'll leave the in-depth details to others).

 

High Level Description

Hadoop is software which is installed on multiple machines for the purpose of running jobs which can be divided into multiple parts.

What is it for

You have a problem and it takes a long time to solve. Lucky for you, your problem can be broken up into parts and worked on separately.

Hadoop wants

You write a program that describes how to break your problem up (map) and also how to put the different solutions back together (reduce).

How

Hadoop takes your program and uses the map portion to tell each machine what it should do. When done, the reduce portion is used to take these partial solutions and create the actual solution you want.
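The steps above can be sketched in plain Python (a single-machine stand-in for the cluster; in real Hadoop each map would run on a different node, and the numbers here are just made-up sample data):

```python
# Problem: sum a big list, which takes too long on one machine.
# Map: break the problem up and let each "machine" sum its own part.
# Reduce: put the partial solutions back together into the final answer.
data = list(range(1, 101))                                  # 1..100
parts = [data[i:i + 25] for i in range(0, len(data), 25)]   # break the problem up

partial_sums = [sum(part) for part in parts]   # "map": each part solved separately
total = sum(partial_sums)                      # "reduce": combine partial solutions
print(total)  # 5050
```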