Wednesday, June 26, 2013

Hadoop - It's All About The Data

A key point to understand about Hadoop is that it's all about the data.  Don't lose focus.  It's easy to get hung up on Hive, Pig, HBase, and HCatalog and lose sight of designing the right data architecture.  Also, if you have a strong background in data warehouse design, BI, analytics, etc., those skills are all transferable to Hadoop.  Hadoop just takes data warehousing to new levels of scalability and agility, reducing business latency while working with data sets ranging from structured to unstructured.  Hadoop 2.0 and YARN are going to move Hadoop deep into the enterprise and allow organizations to make faster and more accurate business decisions.  The ROI of Hadoop is multiple factors higher than that of the traditional data warehouse.  Companies should be extremely nervous about being out-Hadooped by their competition.

Newbies often look at Hadoop with wide eyes instead of recognizing that Hadoop is built from components they already understand: clustering, distributed file systems, parallel processing, and batch and stream processing.

A few key success factors for a Hadoop project are:
  • Start with a good data design using a scalable reference architecture.
  • Build successful analytical models that provide business value.
  • Be aggressive in reducing the latency between data hitting the disk and extracting business value from that data.
The ETL strategies and data set generation you use in your existing data warehouse are similar to what you are going to want to do in your Hadoop cluster.  It's important to look at your data warehouse and understand how your enterprise data strategy is going to evolve with Hadoop now a part of the ecosystem.
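As a rough illustration, here is a minimal, hypothetical sketch of a map-only cleansing job using the Hadoop 1.x MapReduce API. The pipe-delimited input format, field layout, and input/output paths are assumptions for the example, not something from this post.

// Hypothetical map-only ETL sketch: read raw delimited log lines and write
// cleaned, tab-separated records for downstream analytics.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RawToServingEtl {

    public static class CleanseMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed raw record layout: timestamp|user|event|payload
            String[] fields = value.toString().split("\\|");
            if (fields.length < 3) {
                return; // drop malformed lines instead of failing the job
            }
            String cleaned = fields[0].trim() + "\t" + fields[1].trim() + "\t" + fields[2].trim();
            context.write(NullWritable.get(), new Text(cleaned));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "raw-to-serving-etl");
        job.setJarByClass(RawToServingEtl.class);
        job.setMapperClass(CleanseMapper.class);
        job.setNumReduceTasks(0); // map-only: cleansing, no aggregation
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. raw data directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. serving data directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same cleansing step could just as easily be expressed in Pig or Hive; the point is that the transformation logic you already have in your warehouse ETL carries over.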



"Hadoop cannot be an island, it must integrate with Enterprise Data Architecture". - HadoopSummit





"Apache Hadoop is a set of standard open-source software projects that provide a framework for using massive amounts of data across a distributed network." - Merv Adrian at Gartner Research


This is a sample Hadoop 1.x cluster so you can see the key software processes that make up Hadoop (NameNode, Secondary NameNode, DataNodes, JobTracker, and TaskTrackers).  The good point of this diagram is that if you understand it, you are probably worth another $20-30k.  :) 
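To give a feel for how a client talks to those processes, here is a small sketch using the HDFS FileSystem API: the listing call goes to the NameNode for metadata, while actual file reads stream from the DataNodes. The NameNode URI and the /data path are assumptions for illustration.

// Small HDFS client sketch: list a directory via the NameNode.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:8020"); // Hadoop 1.x style URI (assumed host)
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
        fs.close();
    }
}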




YARN (Hadoop 2.0) is the distributed operating system of the future. YARN allows you to run multiple applications in Hadoop, all sharing common resource management.  YARN is going to disrupt the data industry to a level not seen since the dot-com days. 
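To make the "multiple applications sharing one resource manager" idea concrete, here is a hedged sketch using the YARN client API to list the applications currently running on the cluster. It assumes a standard yarn-site.xml is on the classpath and is illustrative rather than a full YARN application.

// List the applications sharing the cluster's ResourceManager.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();
        // Every framework (MapReduce, Tez, Storm, etc.) shows up as a YARN application
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + "\t"
                    + report.getApplicationType() + "\t"
                    + report.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}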

A Hadoop cluster will usually have multiple data layers.

  • Batch Layer: Raw data is loaded into an immutable data set that becomes your source of truth. Data scientists and analysts can start working with this data as soon as it hits the disk.  
  • Serving Layer: Just as in a traditional data warehouse, this data is often massaged, filtered, and transformed into a data set that is easier to run analytics on.  Unstructured and semi-structured data are put into a data set that is easier to work with. Metadata is then attached to this data layer using HCatalog so users can access the data in the HDFS files through abstract table definitions.   
  • Speed Layer: To optimize data access and performance, additional data sets (views) are often calculated to create a speed layer.  HBase can be used for this layer depending on the requirements (see the sketch after this list).
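As an illustration of the speed layer, here is a minimal sketch that pushes a pre-computed view into HBase using the classic (0.9x-era) client API. The table name, column family, and row-key scheme are assumptions made up for the example.

// Write one pre-aggregated row into an HBase table serving as the speed layer.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SpeedLayerWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "pageviews_by_day"); // assumed table name
        // Row key = page + day; one cell holds the pre-aggregated count
        Put put = new Put(Bytes.toBytes("home|2013-06-26"));
        put.add(Bytes.toBytes("v"), Bytes.toBytes("count"), Bytes.toBytes(42L));
        table.put(put);
        table.close();
    }
}

The batch layer would recompute these views from the immutable raw data on a schedule, while the speed layer keeps them cheap to read.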


This diagram emphasizes two key points:
  • The different data layers you will have in your Hadoop cluster.
  • The importance of building your metadata layer (HCatalog).
With the massive scalability of Hadoop, you need to automate as much as possible and manage the data in your cluster.  This is where Falcon is going to play a key role.  Falcon is a data lifecycle management framework that provides the data orchestration, disaster recovery, and data retention capabilities you need to manage your data.
