Newcomers often look at Hadoop with wide eyes, not realizing that it is built from components they already understand: clustering, distributed file systems, parallel processing, and batch and stream processing.
A few key success factors for a Hadoop project are:
- Start with a good data design using a scalable reference architecture.
- Build analytical models that deliver business value.
- Be aggressive in reducing the latency between data hitting the disk and leveraging business value from that data.
"Hadoop cannot be an island; it must integrate with the Enterprise Data Architecture." - Hadoop Summit
"Apache Hadoop is a set of standard open-source software projects that provide a framework for using massive amounts of data across a distributed network." - Merv Adrian at Gartner Research
This diagram shows a sample Hadoop 1.x cluster so you can see the key software processes that make up Hadoop. If you understand this diagram, you are probably worth another $20-30k. :)
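To make the batch-processing role of those daemons concrete, here is a minimal word-count job written for Hadoop Streaming, which lets the cluster run a plain Python script as both mapper and reducer by piping HDFS file contents through stdin/stdout. This is a sketch, not production code; the script name and paths below are hypothetical.

```python
#!/usr/bin/env python
# wordcount.py -- a minimal Hadoop Streaming job (sketch; names are hypothetical).
# Hadoop runs this script twice: once as the mapper, once as the reducer.
import sys

def map_words(lines):
    """Mapper: emit one tab-separated (word, 1) pair per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()

def reduce_counts(lines):
    """Reducer: input arrives sorted by key, so identical words are adjacent
    and their counts can be summed in a single pass."""
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

if __name__ == "__main__" and len(sys.argv) > 1:
    # Stage is chosen on the command line: "wordcount.py map" or "wordcount.py reduce".
    step = reduce_counts if sys.argv[1] == "reduce" else map_words
    for out in step(sys.stdin):
        print(out)
```

A job like this would be submitted with the streaming jar, e.g. `hadoop jar hadoop-streaming.jar -mapper "wordcount.py map" -reducer "wordcount.py reduce" -input /raw/logs -output /out/wordcount` (paths are placeholders); Hadoop sorts the mapper output by key before it reaches the reducer.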
YARN (Hadoop 2.0) is the distributed operating system of the future. YARN allows you to run multiple applications in Hadoop, all sharing common resource management. YARN is going to disrupt the data industry to a level not seen since the dot-com days.
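As a concrete example, the scheduler that arbitrates cluster resources among those applications is configured in yarn-site.xml; a minimal fragment looks like this (the hostname is a placeholder):

```xml
<configuration>
  <!-- Where application masters and NodeManagers find the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm.example.com</value>
  </property>
  <!-- Pluggable scheduler that shares resources among running applications -->
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>
  <!-- Auxiliary service MapReduce jobs need for the shuffle phase -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```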
A Hadoop cluster will usually have multiple data layers.
- Batch Layer: Raw data is loaded into an immutable data set, which becomes your source of truth. Data scientists and analysts can start working with this data as soon as it hits the disk.
- Serving Layer: Just as in a traditional data warehouse, this data is often massaged, filtered, and transformed into data sets that are easier to analyze. Unstructured and semi-structured data is reshaped into a more tractable form. Metadata is then attached to this layer using HCatalog, so users can access the data in the HDFS files through abstract table definitions.
- Speed Layer: To optimize data access and performance, additional data sets (views) are often precomputed to create a speed layer. HBase can be used for this layer, depending on the requirements.
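The interplay of these three layers can be sketched in a few lines of Python. This is an illustration of the data flow only, with invented sample data: the list below stands in for immutable files in HDFS, the serving view for an HCatalog-backed table, and the speed view for something like an HBase table.

```python
# Toy sketch of the three data layers (illustration only).
from collections import Counter

# Batch layer: raw events are appended, never updated -- the source of truth.
master_data = [
    {"user": "alice", "action": "click"},
    {"user": "bob",   "action": "view"},
    {"user": "alice", "action": "view"},
]

def build_serving_view(events):
    """Serving layer: a precomputed, query-friendly view (events per user)."""
    return Counter(e["user"] for e in events)

serving_view = build_serving_view(master_data)

# Speed layer: events that arrived after the last batch run are kept in a
# separate view, so results stay fresh between batch recomputations.
recent_events = [{"user": "bob", "action": "click"}]
speed_view = Counter(e["user"] for e in recent_events)

def query(user):
    """A query merges the precomputed batch view with the speed view."""
    return serving_view[user] + speed_view[user]
```

For example, `query("bob")` returns 2: one event from the batch-built serving view plus one recent event from the speed layer.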
This diagram emphasizes two key points:
- The different data layers you will have in your Hadoop cluster.
- The importance of building your metadata layer (HCatalog).