Sunday, February 17, 2013

How to Learn Hadoop

Big data is one of the hottest areas in IT, so everyone wants to learn Hadoop.  I am constantly asked how to get started, so I want to share an approach I have been recommending that has gotten a lot of positive feedback.  Learning a new technology is often hard because much of the terminology, the concepts, and the architecture are unfamiliar.  Books, white papers, and blogs usually teach Hadoop from the perspective of someone who already knows it, using terms and context that a newcomer does not understand, which can make a completely new subject very hard to learn.  So here is my recommendation for a way to learn Hadoop.

Learn the Basic Concepts First
Everyone gets in a hurry to learn a new technology, so they try to learn all the tricks and fancy stuff right away without building a solid foundation first.

Big data books.  Hadoop is all about the data, so learn big data concepts before looking at Hadoop in any depth.  These books will build core data concepts around Hadoop.
  • Disruptive Possibilities: How Big Data Changes Everything - This is a must read for anyone getting started in big data.
  • Big Data Now, 2012 Edition - An easy read with good insights on big data.  Some of the content on companies is out of date, but there is a lot of valuable information here, so it is still a good read.
  • Big Data, by Nathan Marz - This book does a great job of teaching core concepts and fundamentals, and it provides a great perspective on the data architecture of Hadoop.  It will build a solid foundation and help you understand the Lambda architecture.  You may need to get this book through MEAP if it has not been released yet.
The two books below are a good way to learn basic concepts and terminology before looking at the technology in more depth.  Both are short reads that gently introduce you to the subject, so when you read the more technical books you will understand them better.
  • Hadoop for Dummies, by Tim Jones - An easy introduction to the basic concepts and terms.
  • Big Data for Dummies - A very gentle introduction to big data and the concepts and technologies surrounding it.
Three Defining Whitepapers to Read
These papers are excellent for building fundamental knowledge around Hadoop and Hive.  Even though they are a few years old, the concepts and perspectives they discuss still hold up, and they will provide foundational insights into Hadoop.
Professional Training
Professional training is the quickest and easiest way to learn core concepts and fundamentals and to get some hands-on experience.  I do work at Hortonworks, but there are specific reasons I recommend Hortonworks University.  Hortonworks is 100% open source, so you are not learning someone's proprietary or open-proprietary distribution; what you learn from the open source base is applicable to any distribution.  Also, Hadoop 2 includes YARN, a key foundational component, and Hortonworks is driving the innovation and roadmap around YARN.

Additional Resources
Once you have the fundamental concepts down, you will want to learn in more detail.  The books below are good for taking that next step.  However, I recommend reading them in parallel and bouncing back and forth, because each covers certain areas better than the others.  Each book has sections that I prefer, and using them together was very helpful for me.
  • Apache Hadoop YARN (not released yet), by Arun Murthy, Jeffrey Markham, Vinod Vavilapalli, Doug Eadline
  • Hadoop The Definitive Guide (3rd Edition), by Tom White
  • Hadoop Operations, by Eric Sammer
Getting Hands on Experience and Learning Hadoop in Detail
A great way to start getting hands-on experience and learning Hadoop through tutorials, videos, and demonstrations is the Hortonworks Sandbox.   The Hortonworks Sandbox is designed for beginners, so it is an excellent platform for learning and skill development, and its tutorials, videos, and demonstrations will be updated on a regular basis.   The sandbox is available as a VirtualBox or VMware virtual machine.  An additional 4GB of RAM and 2GB of storage are recommended for either virtual machine.  If your laptop does not have a lot of memory, you can go to the VM settings and cut the RAM for the VM down to about 1.5 - 2GB.  This is likely to impact the performance of the VM, but it will at least let it run on a minimally configured laptop.
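If you are running the VirtualBox version, the memory adjustment described above can also be done from the command line with VirtualBox's VBoxManage tool instead of the settings dialog. This is just a sketch: the VM name "Hortonworks Sandbox" is a placeholder, so substitute whatever name the sandbox image registered as on your machine.

```shell
# List the registered VMs to find the exact name of the sandbox VM
VBoxManage list vms

# The VM must be powered off before its settings can be changed
# (the name "Hortonworks Sandbox" below is a placeholder - use yours)
VBoxManage controlvm "Hortonworks Sandbox" poweroff

# Lower the VM's RAM allocation to 2GB (value is in MB)
VBoxManage modifyvm "Hortonworks Sandbox" --memory 2048

# Start the VM back up with the reduced memory
VBoxManage startvm "Hortonworks Sandbox"
```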

Other books to consider:
  • Programming Hive, by Edward Capriolo, Dean Wampler, ...
  • Programming Pig, by Alan Gates
Engineering Blogs:
Hadoop Ecosystem
What is Hadoop?

Have fun and I look forward to any additional recommendations.

1 comment:

  1. Hi,
    can you please tell me how essential Java knowledge is in learning Big Data & Hadoop?

    Can you please let me know if a Hadoop admin also needs Java expertise?