Comment on: Can someone more qualified kindly explain Hadoop to me?
1 15 Jul 2015 03:49 u/Sud35 in v/programming
I am tired and on my phone, so I will write something brief.
Hadoop is, in essence, a distributed way of doing things. Underneath it all is a distributed file system running on a bunch of interconnected, commodity-level computers. This is called HDFS (Hadoop Distributed File System). When you store some data, the data is split into smaller chunks and then stored in HDFS. HDFS distributes these chunks across the disks of different computers and creates an index. This index is held on a separate server called the NameNode.
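To make the chunk-and-index idea concrete, here is a toy sketch in plain Python. It is not the HDFS API. The names store and read, the tiny block size, and the round-robin placement are all made up for illustration, and real HDFS uses much larger blocks and its own placement policy.

```python
from itertools import cycle

def store(data, nodes, block_size=4):
    """Split data into fixed-size chunks and spread them over the nodes.

    Hypothetical helper: returns the per-node "disks" and an index that
    plays the role of the NameNode (which chunk lives on which node).
    """
    disks = {node: {} for node in nodes}  # node name -> {chunk id: bytes}
    index = []                            # ordered list of (chunk id, node name)
    next_node = cycle(nodes)              # simple round-robin placement (an assumption)
    for chunk_id, start in enumerate(range(0, len(data), block_size)):
        node = next(next_node)
        disks[node][chunk_id] = data[start:start + block_size]
        index.append((chunk_id, node))
    return disks, index

def read(disks, index):
    """Rebuild the original data by walking the index in order."""
    return b"".join(disks[node][chunk_id] for chunk_id, node in index)
```

For example, storing b"hello hadoop!" on three nodes with this sketch produces four chunks, and read(disks, index) gives the original bytes back. No single node holds the whole file; only the index knows how to put it together.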
Now, when you want to do some processing, you submit a job with details like which files you want to operate on and what operations you want to perform on them. This job is then split into many smaller jobs (map), and the output from all these jobs is combined into a single result (reduce).
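The map and reduce steps can be sketched in a few lines of plain Python. This only shows the shape of the idea on one machine, using a word count as the job. It is not Hadoop's actual API, and map_chunk, reduce_counts and run_job are names invented for this example.

```python
from collections import Counter
from functools import reduce

def map_chunk(chunk):
    """Map step: count the words in one chunk, independently of all others."""
    return Counter(chunk.split())

def reduce_counts(left, right):
    """Reduce step: merge two partial word counts into one."""
    return left + right

def run_job(chunks):
    """Run the map step on every chunk, then fold the partial results together."""
    partials = [map_chunk(chunk) for chunk in chunks]  # on a cluster these run in parallel
    return reduce(reduce_counts, partials, Counter())
```

Calling run_job(["a b a", "b c"]) counts each chunk separately and then merges the counts. On a real cluster, each map call would run on a different machine.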
The advantage of this is that off-the-shelf commodity hardware can be used to build bigger and bigger clusters, adding more processing power as you go. So your cluster grows horizontally. Redundancy is also built in very cheaply: HDFS keeps copies of each chunk on more than one node, so the failure of one node (computer) can be mitigated by just replacing that node with similar hardware.
Anything more, you can just ask.
No. "App" is not the right terminology. You write jobs that do certain things. With a database, for example, you would write SQL queries to extract data and do certain operations on that data. Similarly, your data resides in Hadoop, so you write a job and submit it. Hadoop will take this job, split it, and then run each piece on the node where the data resides. So unlike a database, you are sending your job to the data. In the case of a database, you would pull the data out and do the processing on your local computer. It is a different way of doing things.
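A rough sketch of "sending the job to the data", again in plain Python rather than anything Hadoop-specific. The disks dictionary stands in for the chunks each node already holds, and count_errors and run_where_data_lives are hypothetical names. The point is that only the small per-node results travel, never the chunks themselves.

```python
def count_errors(chunk):
    """Example task: count the lines in one chunk that mention 'error'."""
    return sum("error" in line for line in chunk.splitlines())

def run_where_data_lives(disks, task):
    """Apply the task on each node to the chunks that node already stores.

    disks maps a node name to the list of text chunks held on that node.
    Returns the combined total and the small per-node partial results.
    """
    partial = {
        node: sum(task(chunk) for chunk in chunks)  # runs "on" the node, next to its data
        for node, chunks in disks.items()
    }
    return sum(partial.values()), partial
```

With a database client you would instead fetch every row to your own machine and loop over it there. Here the function is what moves, and each node reports back a single number.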
As far as HDFS is concerned, you don't migrate your HDFS cluster. In AWS, you would create a new HDFS cluster and then store in it the same data that you had stored locally.