Wednesday, 7 September 2016

Definitions:

Big data is an extremely large data set that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions.

Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate.

Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.


Characteristics:

Volume:
                The quantity of generated and stored data.

Velocity:
                The speed at which the data is generated and processed to meet the demands and challenges of growth and development.

Variety:
                The type and nature of the data.

Variability:
                Inconsistency in the data set, which can hamper the processes that handle and manage it.

Veracity:
                The quality of captured data, which can vary greatly and affect the accuracy of analysis.

Visualization:
                The process of interpreting data in visual terms, or of putting it into visible form.

Value:
                The importance, worth, or usefulness that can be extracted from the data.

Validity:
                Data quality, governance, and master data management (MDM) on massive, diverse, distributed, heterogeneous, "unclean" data collections.

Venue:
                Distributed, heterogeneous data from multiple platforms and from different owners' systems, with different access and formatting requirements, in private vs. public clouds.

Vocabulary:
                Schema, data models, semantics, ontologies, taxonomies, and other content- and context-based metadata that describe the data's structure, syntax, content, and provenance.

Vagueness:
                Confusion over the meaning of big data itself.





Tuesday, 6 September 2016

Hadoop


  1. Hadoop is not a database.  HBase or Impala may be considered databases, but Hadoop itself is just a file system (HDFS) with built-in redundancy and parallelism.
  2. Traditional databases/RDBMS have ACID properties - Atomicity, Consistency, Isolation and Durability.  You get none of these out of the box with Hadoop.  So if, for example, you have to write code to take money from one bank account and put it into another, you have to (painfully) code all the failure scenarios yourself, like what happens if money is taken out but a failure occurs before it is moved into the other account.
  3. Hadoop offers massive scale in processing power and storage at a very low cost compared to an RDBMS.
  4. Hadoop offers tremendous parallel processing capabilities.  You can run jobs in parallel to crunch large volumes of data.
  5. Some people argue that traditional databases do not work well with unstructured data, but it is not that simple.  I have come across many applications built on traditional RDBMSs that use a lot of unstructured data, video files, or PDFs and work well.
  6. Typically an RDBMS will manage a large chunk of the data in its cache for faster processing while at the same time maintaining read consistency across sessions.  I would argue Hadoop does a better job of using the memory cache to process the data, but without offering extras like read consistency.
  7. Hive SQL is almost always an order of magnitude slower than SQL you can run in traditional databases.  So if you are expecting SQL in Hive to be faster than in a database, you are in for a disappointment.  It will not scale at all for complex analytics.
  8. Hadoop is very good for parallel processing problems - like finding a set of keywords in a large set of documents (an operation that can be parallelized).  However, a typical RDBMS implementation will be faster for comparable data sets.
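Point 2 above can be made concrete with a small sketch. Below, SQLite stands in for a traditional RDBMS; the table, column, and function names are invented for illustration. The transaction gives you atomicity for free: either both balance updates apply, or neither does. With bare Hadoop/HDFS you would have to hand-code this rollback behaviour yourself.

```python
# Hypothetical sketch of an atomic bank transfer, assuming SQLite as the RDBMS.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: either both updates apply, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            cur = conn.execute("SELECT balance FROM accounts WHERE id = ?", (src,))
            if cur.fetchone()[0] < 0:
                raise ValueError("insufficient funds")  # triggers the rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
    except ValueError:
        pass  # balances are unchanged thanks to the rollback

transfer(conn, 1, 2, 30)   # succeeds
transfer(conn, 1, 2, 500)  # fails and is rolled back; balances stay intact
balances = [row[1] for row in
            conn.execute("SELECT id, balance FROM accounts ORDER BY id")]
print(balances)  # [70, 80]
```

The failed second transfer leaves the data exactly as it was, which is the "A" in ACID that point 2 says Hadoop does not provide out of the box.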

Hadoop 1.x vs Hadoop 2.x



Hadoop 1

Hadoop 1.x supports only the MapReduce (MR) processing model. It does not support non-MR tools.
MR handles both processing and cluster resource management.
1.x has limited scaling of nodes: limited to 4,000 nodes per cluster.
Works on the concept of slots; a slot can run either a Map task or a Reduce task only.
A single Namenode manages the entire namespace.
1.x has a Single Point of Failure (SPOF) because of the single Namenode; in case of Namenode failure, manual intervention is needed to recover.
The MR API is compatible with Hadoop 1.x. A program written for Hadoop 1 executes in Hadoop 1.x without any additional files.
1.x is limited as a platform for event processing, streaming and real-time operations.
Hadoop 2

Hadoop 2.x allows working with MR as well as other distributed computing models like Spark, Hama, Giraph, Message Passing Interface (MPI) and HBase coprocessors.
YARN (Yet Another Resource Negotiator) handles cluster resource management, and processing is done using different processing models.
2.x has better scalability: scalable up to 10,000 nodes per cluster.
Works on the concept of containers; containers can run generic tasks.
Multiple Namenode servers manage multiple namespaces.
2.x overcomes the SPOF with a standby Namenode; in case of Namenode failure, it can be configured for automatic recovery.
The MR API requires additional files for a program written for Hadoop 1.x to execute in Hadoop 2.x.
Can serve as a platform for a wide variety of data analytics; it is possible to run event processing, streaming and real-time operations.

Traditional DBMS vs DFS

Traditional DBMS & RDBMS:


DBMS vs RDBMS
1) DBMS applications store data as files. RDBMS applications store data in tabular form.
2) In a DBMS, data is generally stored in either a hierarchical or a navigational form. In an RDBMS, the tables have an identifier called a primary key and the data values are stored in the form of tables.
3) Normalization is not present in DBMS. Normalization is present in RDBMS.
4) DBMS does not apply any security with regard to data manipulation. RDBMS defines integrity constraints for the purpose of the ACID (Atomicity, Consistency, Isolation and Durability) properties.
5) DBMS uses the file system to store data, so there are no relations between the tables. In RDBMS, data values are stored in the form of tables, so the relationships between these data values are stored in the form of tables as well.
6) A DBMS has to provide some uniform methods to access the stored information. An RDBMS supports a tabular structure of the data and relationships between the tables to access the stored information.
7) DBMS does not support distributed databases. RDBMS supports distributed databases.
8) DBMS is meant for small organizations dealing with small amounts of data, and it supports a single user. RDBMS is designed to handle large amounts of data, and it supports multiple users.
9) Examples of DBMS: file systems, XML, etc. Examples of RDBMS: MySQL, PostgreSQL, SQL Server, Oracle, etc.
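Points 2 and 5 can be illustrated with a tiny sketch: in an RDBMS, the relationship between values is itself stored as data in tables, keyed by primary keys. SQLite is used here purely for illustration, and the table and column names are invented.

```python
# Hypothetical sketch: primary keys and a relationship stored as table data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE employee (
    id INTEGER PRIMARY KEY,
    name TEXT,
    dept_id INTEGER REFERENCES department(id)  -- the relationship, as data
);
INSERT INTO department VALUES (1, 'Analytics');
INSERT INTO employee VALUES (10, 'Asha', 1);
""")

# A join resolves the stored relationship at query time.
row = conn.execute("""
    SELECT e.name, d.name FROM employee e
    JOIN department d ON e.dept_id = d.id
""").fetchone()
print(row)  # ('Asha', 'Analytics')
```

In a plain file-based DBMS there is no such declared relationship; any linkage between files has to be managed by application code.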


RDBMS vs HADOOP:


Hadoop                                     RDBMS
Scale out                                  Scale up
Key-value pair                             Record
MapReduce (functional style)               SQL (declarative)
De-normalized                              Normalized
All varieties of data                      Structured data
OLAP / batch / analytical queries          OLTP / real-time / point queries
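The "key-value pair" and "MapReduce (functional style)" rows in the table can be sketched in-process: map each input to (key, value) pairs, shuffle (group by key), then reduce each group. This is plain Python for illustration, not Hadoop; the function and variable names are invented.

```python
# Minimal in-process sketch of the MapReduce functional style (no Hadoop).
from itertools import groupby

def map_phase(text):
    # Emit (key, value) pairs: one ("word", 1) per word.
    return [(word, 1) for word in text.lower().split()]

def reduce_phase(word, counts):
    # Combine all values emitted for one key.
    return (word, sum(counts))

docs = {1: "big data big ideas", 2: "data beats ideas"}

# Map over every document, then shuffle (sort and group by key), then reduce.
pairs = [kv for text in docs.values() for kv in map_phase(text)]
pairs.sort(key=lambda kv: kv[0])
result = dict(reduce_phase(word, [v for _, v in group])
              for word, group in groupby(pairs, key=lambda kv: kv[0]))
print(result)  # {'beats': 1, 'big': 2, 'data': 2, 'ideas': 2}
```

The equivalent declarative SQL would be a single `SELECT word, COUNT(*) ... GROUP BY word`; the functional style spells out the same grouping explicitly, which is what lets Hadoop spread the map and reduce phases across many machines.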

Architecture


Typical Analytical Architecture:


Holistic view of a Big Data system:


Architecture: