Learn the Basic Concepts First
To understand Hadoop, you have to start by understanding Big Data.
- Disruptive Possibilities: How Big Data Changes Everything - A must-read for anyone getting started in Big Data. It is a short, easy read that lays the foundation for Big Data and explains why Hadoop is needed.
These are foundational whitepapers that explain the design decisions behind Hadoop's distributed storage and processing. They are not easy reads, but working through them will help all your future learning around Hadoop because they define the "context for Hadoop". Some of the papers are older, but the core concepts they discuss, and the reasons behind them, are invaluable keys to understanding Hadoop.
- Apache Hadoop YARN: Yet Another Resource Negotiator
- MapReduce: Simplified Data Processing on Large Clusters
- The Hadoop Distributed File System
- Hive - A Petabyte Scale Data Warehouse using Hadoop
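The MapReduce paper above is easier to follow once you see the model in miniature. Here is a minimal, single-process sketch of the map, shuffle, and reduce phases using the classic word-count example; on a real cluster Hadoop distributes each phase across many machines, and the function names here are just illustrative placeholders, not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "hadoop processes big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

The key idea the paper formalizes is that, because map and reduce are pure functions over key/value pairs, the framework can parallelize, retry, and relocate them freely across a cluster.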
Hadoop clusters are built to process and analyze data. A Hadoop cluster becomes an important component in any enterprise's data platform, so you need to understand Hadoop from a data perspective.
- Big Data, by Nathan Marz - The book does a great job of teaching core concepts and fundamentals, and provides a great perspective on the data architecture of Hadoop. This book will build a solid foundation and help you understand the Lambda architecture. You may need to get this book through MEAP if it has not been released yet (http://www.manning.com/marz/). The print release is scheduled for March 28, 2014. Any DBA, Data Architect, or anyone with a background in data warehousing and business intelligence should consider this a must-read.
We are in a transition period with Hadoop. Most of the books out today cover Hadoop 1.x and MapReduce v1 (classic). Hadoop 2 is GA, YARN is its distributed processing model, and Tez will be an important part of processing data with Hadoop in the future. There are not a lot of books out yet on YARN and the Hadoop 2 frameworks, so you'll need to spend some time with the Hadoop documentation. :)
- Apache Hadoop Yarn, by Arun Murthy, Jeffrey Markham, Vinod Vavilapalli, Doug Eadline
- Hadoop MapReduce v2 Cookbook (2nd Edition)
- Hadoop: The Definitive Guide (4th Edition), by Tom White
A great way to start getting hands-on experience and learning Hadoop through tutorials, videos and demonstrations is with the virtual machines available from the different Hadoop distribution vendors. These virtual machines, or sandboxes, are an excellent platform for learning and skill development, and their tutorials, videos and demonstrations are updated on a regular basis. The sandboxes are usually available as Virtualbox, Hyper-V or VMware virtual machines. An additional 4GB of RAM and 2GB of storage are recommended for the virtual machines. If your laptop does not have a lot of memory, you can go into the VM settings and cut the VM's RAM down to about 1.5 - 2GB. This will likely impact the VM's performance, but it will at least let it run on a minimally configured laptop.
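If you are using Virtualbox, the RAM adjustment described above can also be done from the command line instead of the GUI. This is a sketch using the real VBoxManage tool; the VM name "Hadoop-Sandbox" is a placeholder, so substitute whatever name `VBoxManage list vms` reports for your imported sandbox.

```shell
# List the registered VMs to find your sandbox's exact name
VBoxManage list vms

# Lower the sandbox's RAM allocation; --memory takes megabytes,
# so 2048 is roughly the 2GB mentioned above ("Hadoop-Sandbox" is a placeholder name)
VBoxManage modifyvm "Hadoop-Sandbox" --memory 2048
```

The VM must be powered off before `modifyvm` will apply the change.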
There are now a lot of books out on Hadoop, the different frameworks, and the NoSQL databases, so you can find the right book that fits your personal reading style. There are also lots of YouTube videos; with a little time you can find high-quality ones.
- Mark Madsen - http://www.insideanalysis.com/2012/12/what-hadoop-is-what-is-isnt/
- Jim Walker - http://www.youtube.com/watch?v=j6toE6Ke7k4
- Expert panel - http://www.infoq.com/articles/HadoopVirtualPanel
Have fun and I look forward to any additional recommendations.