"Big data is at the
foundation of all the megatrends happening today." When I saw this quote, it made perfect
sense, given all the different applications of Hadoop technology. I explore a few of
these trends in this post.
Hadoop core
At the heart of Hadoop technology there are 2 parts:
- The distributed file system, HDFS
- The data processing part, which runs MR (MapReduce) jobs on HDFS
Underneath both is the master-worker clustering technology that makes everything work.
where does Hadoop fit in the new world?
Typically we have applications that follow the pattern OLTP + OLAP.
OLTP is the transaction part, based on an RDBMS.
OLAP is the DW + analytics part, also based on an RDBMS or on advanced MPP databases such as Teradata.
So the typical flow of data is:
OLTP DB --> (ETL) --> DW --> Analytics and Reporting
When you put Hadoop in this flow, you get at least 3 types of new flows:
1. OLTP DB, Other sources --> (ETL) --> Hadoop --> DW --> Analytics and Reporting
2. OLTP DB, Other sources --> (ETL) --> Hadoop --> NoSQL DB --> Online Applications
3. OLTP DB, Other sources --> (ETL) --> Hadoop --> Real-time Analytics
This can also be explained as:
1. batch processing
2. online processing
3. real-time processing
batch processing
This is the classic application of MR (MapReduce) to processing HDFS data. MR code is
written in Java. Pig Latin is a language developed by Yahoo whose scripts are compiled
into MR jobs.
Hive is a SQL-like data warehouse layer on top of HDFS that also uses MR.
Mahout is a machine learning library on top of HDFS that also uses MR.
This application basically fills the gap in existing technologies for processing large
datasets.
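To make the MR idea concrete, here is a minimal sketch of the three phases (map, shuffle, reduce) as a word count in plain Python. It runs in a single process and does not use the Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration. On a real cluster the map and reduce steps would run in parallel on many workers, and the framework would do the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key (done by the framework on a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big trends"]))
```

Tools like Pig and Hive save you from writing these steps by hand: you describe the query and they produce the equivalent jobs.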
online processing
This is a scenario where a NoSQL database such as HBase or Cassandra is used. Unlike a
traditional RDBMS, NoSQL databases are distributed by design, so they scale out by
adding more nodes. HBase is based on HDFS but doesn't need MR to serve reads and writes.
Cassandra runs standalone with its own storage layer, though it can be integrated with
Hadoop. Such a data store can be used as the backend of a web application. Web log
analysis can be based on such an architecture.
This application basically fills the gap in existing technologies for storing large
datasets with faster online access.
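As a rough illustration of how a distributed key-value store spreads data over nodes, here is a toy sketch that picks a node by hashing the key. This is an assumption-laden simplification: the class `TinyShardedStore` is hypothetical, the "nodes" are just in-memory dictionaries, and it uses simple modulo hashing. Real systems do this differently (HBase splits rows into key ranges called regions, Cassandra uses a token ring with consistent hashing).

```python
import hashlib


class TinyShardedStore:
    """Toy key-value store that spreads keys across in-memory 'nodes' (hypothetical)."""

    def __init__(self, num_nodes):
        # Each dict stands in for one machine in the cluster.
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # Use a stable hash so the same key always maps to the same node.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        # Only the owning node is consulted, which is what keeps lookups fast.
        return self.nodes[self._node_for(key)].get(key)


if __name__ == "__main__":
    store = TinyShardedStore(num_nodes=3)
    store.put("user:1", {"name": "alice"})
    print(store.get("user:1"))
```

The point of the sketch is that a lookup touches only one node, so adding nodes adds capacity without slowing down individual reads.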
real-time processing
This is the in-memory option for faster data processing, using products such as Spark
and Storm on top of HDFS. They don't use MR. Spark can also work with Cassandra, without
using HDFS.
This is the most exciting application of Hadoop, as it can enable many new application
styles and trends.
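The difference from batch processing can be sketched in a few lines: instead of waiting for the whole dataset and then running a job, each event updates state held in memory and a result is available immediately. This is only a plain Python illustration of the idea, not Spark or Storm code, and the function name `running_counts` is made up.

```python
from collections import Counter


def running_counts(events):
    """Yield (event, count_so_far) as each event arrives, keeping state in memory."""
    counts = Counter()
    for event in events:
        counts[event] += 1
        # The up-to-date count is available right away, without a batch job.
        yield event, counts[event]


if __name__ == "__main__":
    for event, count in running_counts(["login", "click", "login"]):
        print(event, count)
```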
In summary, the Hadoop core can be combined in many ways to create many kinds of
applications and megatrends.
(My apologies for not expanding a lot of the acronyms and not providing links. Please
search for any term you find interesting, and leave me a comment if you have any
questions.)