Sunday, October 05, 2014

The Hadoop Puzzle

"Big data is at the foundation of all the megatrends happening today." When I saw this here, it made perfect sense, given all the different applications of Hadoop technology. I try to explore a few of those trends in this blog post.

hadoop core
At the heart of Hadoop technology there are 2 parts:
  • The distributed file system, HDFS
  • The data processing part, which runs MR (Map-Reduce) on HDFS
Both sit on an overall master-worker clustering technology that makes everything work.
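To make the storage part a bit more concrete, here is a minimal sketch in plain Python of the idea behind HDFS: a file is cut into fixed-size blocks, and each block is copied to several workers. This is not the HDFS API. The function names, the worker names, the tiny block size and the round-robin placement are all my assumptions for illustration; real HDFS uses much larger blocks and a rack-aware placement policy.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, workers: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct workers, round-robin.

    Assumption: round-robin stands in for the real placement policy.
    """
    if replication > len(workers):
        raise ValueError("replication factor exceeds number of workers")
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive workers (wrapping around) hold the copies of this block.
        placement[block_id] = [
            workers[(block_id + r) % len(workers)] for r in range(replication)
        ]
    return placement


# A 10-byte "file" with 4-byte blocks gives 3 blocks, each stored on 2 workers.
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), ["worker-1", "worker-2", "worker-3"], replication=2)
```

In this sketch the master only needs to remember the `placement` table, while the workers hold the actual block bytes.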

where does hadoop fit in the new world?
Typically we have applications that follow the OLTP + OLAP pattern.
OLTP is the transaction part, based on an RDBMS.
OLAP is the DW + analytics part, also based on an RDBMS or on an advanced MPP database such as Teradata.
So the typical flow of data is:
OLTP DB --> (ETL) --> DW --> Analytics and Reporting
When you put Hadoop in this flow, you get at least 3 types of new flows:
1.     OLTP DB, Other sources --> (ETL) --> Hadoop --> DW --> Analytics and Reporting
2.     OLTP DB, Other sources --> (ETL) --> Hadoop --> NoSQL DB --> Online Applications
3.     OLTP DB, Other sources --> (ETL) --> Hadoop --> real-time Analytics
This can also be explained as
1.     batch processing
2.     online processing
3.     real-time processing

batch processing
This is the classic application of MR (Map-Reduce) to process HDFS data. MR code is typically written in Java. Pig Latin is the language of Pig, developed at Yahoo; Pig scripts are compiled into MR jobs.
Hive is a SQL-like data warehouse on top of HDFS that also uses MR.
Mahout is a machine learning library on top of HDFS that also uses MR.
This application basically fills the gap in existing technologies for processing large datasets.
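For readers new to MR, the classic word count shows the three steps: map, shuffle and reduce. The sketch below is a single-process simulation in plain Python, not Hadoop's Java API; the function names are hypothetical and chosen only to mirror the phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big trends", "big clusters"])
```

On a real cluster the map calls run in parallel on the workers that hold the HDFS blocks, and the shuffle moves data over the network; here everything runs in one process to keep the idea visible.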

online processing
This is the scenario where a NoSQL database such as HBase or Cassandra is used. Unlike a traditional RDBMS, NoSQL databases are distributed, so they scale out by adding more nodes. HBase stores its data on HDFS but doesn't use MR to serve reads and writes. Cassandra is standalone with its own storage, and it can be integrated with Hadoop. Such a data store can be used as the backend of a web application. Web log analysis can be based on such an architecture.
This application basically fills the gap in existing technologies for storing large datasets with faster online access.
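The reason such stores scale out is that every key lives on one of many nodes, so adding nodes adds capacity. Here is a toy sketch of that idea in plain Python. `PartitionedStore` and the node names are hypothetical, and the simple hash-modulo scheme is my simplification: Cassandra uses consistent hashing and HBase splits tables into key ranges (regions).

```python
import hashlib


class PartitionedStore:
    """A toy key-value store that spreads keys over several in-memory 'nodes'."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        # Each node is just a dict here; in a real cluster it is a separate server.
        self.nodes = {name: {} for name in self.node_names}

    def node_for(self, key):
        """Pick the node for a key from a stable hash of the key."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.node_names[int(digest, 16) % len(self.node_names)]

    def put(self, key, value):
        self.nodes[self.node_for(key)][key] = value

    def get(self, key):
        return self.nodes[self.node_for(key)].get(key)


store = PartitionedStore(["node-a", "node-b", "node-c"])
store.put("user:42", {"last_page": "/home"})
profile = store.get("user:42")
```

A web application would call `get` and `put` directly for single keys, which is why no MR job is needed to serve a page.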

real-time processing
This is the in-memory option for faster data processing, using products such as Spark and Storm on top of HDFS; they don't use MR. Spark can also work on top of Cassandra without using HDFS.
This is the most exciting application of Hadoop, as it can enable many new application styles and trends.
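To show what "in-memory" buys you, here is a small sketch in plain Python of a sliding-window counter that is updated as each event arrives, instead of waiting for a batch job. It does not use the Spark or Storm APIs. `SlidingWindowCounter` is a hypothetical name, and the sketch assumes that events arrive in timestamp order.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Keep per-key counts for the last `window_seconds` of events, all in memory."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, key) pairs, oldest first
        self.counts = Counter()  # current count per key inside the window

    def add(self, timestamp, key):
        """Record one event, then drop events that fell out of the window."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Assumption: timestamps never go backwards, so the oldest event is always at the left.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def count(self, key):
        return self.counts.get(key, 0)


window = SlidingWindowCounter(window_seconds=10)
window.add(0, "/home")
window.add(5, "/home")
window.add(12, "/cart")  # the event at t=0 is now older than 10 seconds and is dropped
```

Every `add` does only a little work, so the answer is always current; a batch MR job would instead re-read the data from disk each time it runs.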

In summary, the Hadoop core can be combined in many ways to create many kinds of applications and megatrends.

(My apologies for not expanding a lot of acronyms and not providing links. Please google any term you find interesting, and please leave me a comment if you have any questions.)