BIG DATA : Hadoop

Hadoop is not a database. HBase or Impala may be considered databases, but Hadoop itself is essentially a distributed file system (HDFS) with built-in redundancy and parallelism.
Traditional databases (RDBMS) have ACID properties: Atomicity, Consistency, Isolation and Durability. You get none of these out of the box with Hadoop. So if, for example, you have to write code that takes money from one bank account and puts it into another, you have to (painfully) code all the failure scenarios yourself, such as what happens if the money is taken out but a failure occurs before it is moved into the other account.
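To make the bank transfer scenario concrete, here is a minimal sketch of what atomicity gives you in a traditional database. It uses SQLite from Python's standard library purely as a stand-in for an RDBMS; the `accounts` table, the `transfer` function and the `fail_midway` flag are hypothetical names chosen for illustration.

```python
import sqlite3

def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` from src to dst inside a single transaction."""
    try:
        # `with conn:` commits if the block succeeds and rolls back on an exception.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            if fail_midway:
                # Simulated failure after the debit but before the credit.
                raise RuntimeError("simulated crash after debit")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
        return True
    except RuntimeError:
        return False

def balances(conn):
    """Return the current balances as a dict keyed by account id."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))

# Hypothetical starting data: two accounts in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()
```

Calling `transfer(conn, "A", "B", 30, fail_midway=True)` leaves both balances untouched, because the database rolls the debit back. On plain Hadoop you would have to detect and undo that half-finished state in your own code.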
Hadoop offers massive scale in processing power and storage at a much lower cost than a comparable RDBMS.
Hadoop offers tremendous parallel processing capabilities. You can run jobs in parallel to crunch large volumes of data.
Some people argue that traditional databases do not work well with unstructured data, but it's not as simple as that. I have come across many applications built on traditional RDBMS that handle a lot of unstructured data, such as video files or PDFs, and work well.
Typically an RDBMS will keep a large chunk of the data in its cache for faster processing while at the same time maintaining read consistency across sessions. I would argue Hadoop does a better job of using the memory cache to process the data, but it does so without offering guarantees like read consistency.
Hive SQL is almost always an order of magnitude slower than the SQL you can run in a traditional database. So if you are expecting SQL in Hive to be faster than in a database, you are in for a disappointment. It does not scale well for complex analytics.
Hadoop is very good for parallel processing problems, like finding a set of keywords in a large set of documents (this operation can be parallelized). However, RDBMS implementations will typically be faster for comparable data sets.
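The keyword search is easy to parallelize because each document can be scanned independently and the partial results simply added up. Below is a rough single-machine sketch of that map-and-reduce shape in Python. It is not Hadoop code: the thread pool only stands in for tasks that Hadoop would spread across many machines, and `KEYWORDS`, `map_count`, `keyword_counts` and the sample documents are made-up names and data for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical set of keywords to look for (lower case).
KEYWORDS = {"hadoop", "hive"}

def map_count(document):
    """Map step: count keyword hits in one document, independent of all others."""
    words = document.lower().split()
    return Counter(w for w in words if w in KEYWORDS)

def keyword_counts(documents, workers=4):
    """Run the map step concurrently, then reduce by summing the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(map_count, documents):
            total.update(partial)  # Reduce step: merge one partial result.
    return total

# Made-up sample documents.
docs = [
    "Hadoop stores files in HDFS",
    "Hive runs SQL on Hadoop",
    "An RDBMS offers ACID",
]
```

Here `keyword_counts(docs)` counts "hadoop" twice and "hive" once. Because no map task depends on another, adding more workers (or, in Hadoop's case, more nodes) lets the same job cover more documents in the same time.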

Tuesday, 6 September 2016
