Saturday, January 4, 2014

How to Learn YARN and Hadoop 2

I previously wrote a blog post on How to Learn Hadoop that got a lot of positive feedback. Since then I've been getting requests to update it for YARN and Hadoop 2. Everyone wants to learn the cool secrets and tricks, but knowledge always starts with learning the fundamentals. My recommendations here are meant for the reader who is serious about learning Hadoop.

Learn the Basic Concepts First
To understand Hadoop you have to start by understanding Big Data.
The foundational whitepapers explain the reasons behind Hadoop's distributed storage and processing. They are not easy reads, but getting through them will really help all your future learning around Hadoop because they define the "context for Hadoop". Some of the papers are older, but the core concepts they discuss, and the reasons behind them, are invaluable keys for understanding Hadoop.
You Have to Understand the Data
Hadoop clusters are built to process and analyze data. A Hadoop cluster becomes an important component of any enterprise's data platform, so you need to understand Hadoop from a data perspective.
  • Big Data, by Nathan Marz - The book does a great job of teaching core concepts and fundamentals, and it provides a great perspective on the data architecture of Hadoop. It will build a solid foundation and help you understand the Lambda architecture. You may need to get this book from MEAP if it has not been released yet (http://www.manning.com/marz/). This book is scheduled for print release on March 28, 2014. Any DBA, Data Architect or anyone with a background in data warehousing and business intelligence should consider this a must read.
Additional Reading
We are in a transition period with Hadoop. Most of the books out today are on Hadoop 1.x and MapReduce v1 (classic). Hadoop 2 is GA, the distributed processing model is YARN, and Tez will be an important part of processing data with Hadoop in the future. There are not a lot of books out yet on YARN and the Hadoop 2 frameworks, so you'll need to spend some time with the Hadoop documentation. :)
  • Apache Hadoop YARN, by Arun Murthy, Jeffrey Markham, Vinod Vavilapalli, Doug Eadline
  • Hadoop MapReduce v2 Cookbook (2nd Edition)
  • Hadoop: The Definitive Guide (4th Edition), by Tom White
Getting Hands-On Experience and Learning Hadoop in Detail
A great way to start getting hands-on experience and learning Hadoop through tutorials, videos and demonstrations is with the virtual machines available from the different Hadoop distribution vendors. These virtual machines, or sandboxes, are an excellent platform for learning and skill development. The tutorials, videos and demonstrations will be updated on a regular basis. The sandboxes are usually available as a VirtualBox, Hyper-V or VMware virtual machine. An additional 4GB of RAM and 2GB of storage are recommended for the virtual machines. If you have a laptop that does not have a lot of memory, you can go to the VM settings and cut the RAM for the VM down to about 1.5 - 2GB. This is likely to impact the performance of the VM, but it will at least help it run on a minimally configured laptop.
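If you are using VirtualBox, the RAM change described above can also be made from the command line instead of the settings dialog. Here is a minimal sketch in Python that only builds and prints the VBoxManage command. The VM name "Hadoop Sandbox" and the helper name memory_command are my own placeholders, not names used by any vendor, so substitute whatever name your sandbox shows in VirtualBox.

```python
def memory_command(vm_name, memory_gb):
    """Build the VBoxManage call that sets a VM's RAM.

    VBoxManage takes the size in megabytes, so gigabytes are converted here.
    """
    memory_mb = int(memory_gb * 1024)
    return ["VBoxManage", "modifyvm", vm_name, "--memory", str(memory_mb)]


if __name__ == "__main__":
    # "Hadoop Sandbox" is a hypothetical VM name used for illustration.
    cmd = memory_command("Hadoop Sandbox", 2)
    # Print the command rather than running it; the VM has to be powered
    # off before VirtualBox will accept a memory change.
    print(" ".join(cmd))
```

To apply the change, run the printed command in a terminal while the VM is shut down, or pass the list to subprocess.run. Hyper-V and VMware have their own tools, which this sketch does not cover.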
Other books to consider:
There are now a lot of books out on Hadoop and the different frameworks, as well as on the NoSQL databases. You can find the right book to fit your personal reading style. There are also lots of YouTube videos. With a little time you can find ones of high quality.

Engineering Blogs:
  • /hortonworks.com/community/
  • /blog.cloudera.com/blog/
  • /engineering.linkedin.com/hadoop
  • /engineering.twitter.com
  • /developer.yahoo.com/hadoop/

Have fun and I look forward to any additional recommendations.
