Avijeet Dash: October 2014

Sunday, October 05, 2014

The Hadoop Puzzle

"Big data is at the foundation of all the megatrends happening today." When I saw this quote here, it made perfect sense, given all the different applications of Hadoop technology. I explore a few of those trends in this post.

Hadoop core

At the heart of Hadoop technology there are two parts:

The distributed file system, HDFS
The data processing part, which runs MR (Map-Reduce) on HDFS

An overall master-worker clustering technology makes everything work.
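To make the file system part a little more concrete: HDFS stores a large file as fixed-size blocks and keeps several copies of each block on different worker nodes. The sketch below only illustrates that idea. The function name `plan_blocks`, the node names and the round-robin placement are my own assumptions for illustration; the real HDFS placement policy is more elaborate.

```python
def plan_blocks(file_size, block_size, nodes, replication=3):
    """Return a list of (block_index, block_length, replica_nodes).

    Hypothetical helper: it only plans where blocks would go,
    it does not store any data.
    """
    if replication > len(nodes):
        raise ValueError("replication cannot exceed the number of nodes")
    plan = []
    offset = 0
    index = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        # Round-robin placement: a simplistic stand-in for a real policy.
        replicas = [nodes[(index + r) % len(nodes)] for r in range(replication)]
        plan.append((index, length, replicas))
        offset += length
        index += 1
    return plan
```

For example, `plan_blocks(300, 128, ["n1", "n2", "n3", "n4"])` plans three blocks, the last one shorter than the others, each with three replicas on different nodes.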

where does Hadoop fit in the new world?

Typically we have applications that follow the OLTP + OLAP pattern.

OLTP is the transaction part, based on an RDBMS.

OLAP is the DW + analytics part, based either on an RDBMS or on an advanced MPP database such as Teradata.

So the typical flow of data is

OLTP DB --> (ETL) --> DW --> Analytics and Reporting

When you put Hadoop in this flow, you get at least 3 types of new flows

1. OLTP DB, Other sources --> (ETL) --> Hadoop --> DW --> Analytics and Reporting

2. OLTP DB, Other sources --> (ETL) --> Hadoop --> NoSQL DB --> Online Applications

3. OLTP DB, Other sources --> (ETL) --> Hadoop --> real-time Analytics

This can also be explained as

1. batch processing

2. online processing

3. real-time processing

batch processing

This is the classic application of MR (Map-Reduce) to process HDFS data. MR code is typically written in Java. Pig Latin is a language developed by Yahoo whose scripts are compiled into MR jobs.

Hive is a SQL-like data warehouse layer on top of HDFS that also uses MR.

Mahout is a machine learning library on top of HDFS that also uses MR.

This application fills a gap in existing technologies: processing large datasets.
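For readers who have not seen Map-Reduce before, here is a minimal sketch of the pattern as a word count. It runs in a single Python process rather than on a cluster and is not Hadoop code; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are my own, chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in real MR)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

On a real cluster the mappers and reducers run in parallel on different workers, close to the HDFS blocks they read. The sketch keeps only the data flow.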

online processing

This is the scenario where a NoSQL database such as HBase or Cassandra is used. Unlike a traditional RDBMS, NoSQL databases are distributed by design, so they scale out by adding nodes. HBase is built on HDFS but does not use MR to serve reads and writes; Cassandra runs standalone on its own storage and can also be integrated with Hadoop. Such a data store can serve as the backend of a web application, and web log analysis can be based on this architecture.

This application fills a gap in existing technologies: storing large datasets for faster online access.
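The reason a distributed store can keep growing is that each key lives on only one of many nodes, so adding nodes adds capacity. The toy sketch below shows that idea with simple hash partitioning. The class `PartitionedStore` is hypothetical, and HBase and Cassandra each use their own, more elaborate partitioning schemes.

```python
import hashlib


class PartitionedStore:
    """Toy key-value store that spreads keys across a fixed number of nodes."""

    def __init__(self, node_count):
        # Each "node" is just a dict here; a real store uses separate machines.
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # A stable hash, so the same key always maps to the same node.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        # Only one node has to be consulted for any lookup.
        return self.nodes[self._node_for(key)].get(key)
```

A web application would call `put` and `get` without knowing which node holds the data, which is roughly how such a store sits behind an online application.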

real-time processing

This is the in-memory option for faster data processing, using products such as Spark and Storm, typically alongside HDFS; they do not use MR. Spark can also work with Cassandra without using HDFS.

This is the most exciting application of Hadoop, as it can enable many new application styles and trends.
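To give a feel for what in-memory, real-time processing means, here is a small sketch that keeps running counts over the most recent events instead of re-reading stored data in a batch. It is plain Python, not Spark or Storm code. The class `SlidingWindowCounter` is my own illustration and assumes events arrive in timestamp order.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Keeps per-key counts over the last `window_seconds` of events, in memory."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, key) pairs in arrival order
        self.counts = Counter()

    def add(self, timestamp, key):
        """Record one event, then drop anything older than the window."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Events at or before (now - window_seconds) have left the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def count(self, key):
        return self.counts.get(key, 0)
```

Each new event updates the answer immediately, so a dashboard could ask for `count("/home")` at any moment without waiting for a batch job to finish.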

In summary, the Hadoop core can be permuted and combined in many ways to create many kinds of applications and megatrends.

(My apologies for not expanding many of the acronyms and not providing links. Please google any term you find interesting, and leave me a comment if you have any questions.)
