Learn the Basic Concepts First
To understand Hadoop, you have to start by understanding Big Data.
- Disruptive Possibilities: How Big Data Changes Everything - A must-read for anyone getting started in Big Data. It is a short, easy read that lays the foundation for Big Data and explains why Hadoop is needed.
These are foundational whitepapers that explain the design decisions behind Hadoop's distributed storage and processing. They are not easy reads, but working through them will help all your future learning around Hadoop because they define the "context for Hadoop". Some of the papers are older, but the core concepts they discuss, and the reasons behind them, are invaluable keys to understanding Hadoop.
- Apache Hadoop YARN: Yet Another Resource Negotiator
- MapReduce: Simplified Data Processing on Large Clusters
- The Hadoop Distributed File System
- Hive - A Petabyte Scale Data Warehouse using Hadoop
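The MapReduce paper above is easier to follow once you see the model in miniature. Here is a minimal, single-process sketch of the map, shuffle, and reduce phases using the classic word-count example; on a real cluster Hadoop distributes each phase across many machines, and the function names here are just illustrative placeholders, not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "hadoop processes big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

The key idea the paper formalizes is that, because map and reduce are pure functions over key/value pairs, the framework can parallelize, retry, and relocate them freely across a cluster.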
Hadoop clusters are built to process and analyze data. A Hadoop cluster becomes an important component in any enterprise's data platform, so you need to understand Hadoop from a data perspective.
- Big Data, by Nathan Marz - The book does a great job of teaching core concepts and fundamentals, and provides a great perspective on the data architecture of Hadoop. This book will build a solid foundation and help you understand the Lambda architecture. You may need to get this book through MEAP if it has not been released yet (http://www.manning.com/marz/). The print release is scheduled for March 28, 2014. Any DBA, Data Architect, or anyone with a background in data warehousing and business intelligence should consider this a must-read.
We are in a transition period with Hadoop. Most of the books out today cover Hadoop 1.x and MapReduce v1 (classic). Hadoop 2 is GA, YARN is its distributed processing model, and Tez will be an important part of processing data with Hadoop in the future. There are not a lot of books out yet on YARN and the Hadoop 2 frameworks, so you'll need to spend some time with the Hadoop documentation. :)
- Apache Hadoop Yarn, by Arun Murthy, Jeffrey Markham, Vinod Vavilapalli, Doug Eadline
- Hadoop MapReduce v2 Cookbook (2nd Edition)
- Hadoop: The Definitive Guide (4th Edition), by Tom White
A great way to start getting hands-on experience and learning Hadoop through tutorials, videos and demonstrations is with the virtual machines available from the different Hadoop distribution vendors. These virtual machines, or sandboxes, are an excellent platform for learning and skill development, and their tutorials, videos and demonstrations are updated on a regular basis. The sandboxes are usually available as Virtualbox, Hyper-V or VMware virtual machines. An additional 4GB of RAM and 2GB of storage are recommended for the virtual machines. If your laptop does not have a lot of memory, you can go into the VM settings and cut the VM's RAM down to about 1.5 - 2GB. This will likely impact the VM's performance, but it will at least let it run on a minimally configured laptop.
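If you are using Virtualbox, the RAM adjustment described above can also be done from the command line instead of the GUI. This is a sketch using the real VBoxManage tool; the VM name "Hadoop-Sandbox" is a placeholder, so substitute whatever name `VBoxManage list vms` reports for your imported sandbox.

```shell
# List the registered VMs to find your sandbox's exact name
VBoxManage list vms

# Lower the sandbox's RAM allocation; --memory takes megabytes,
# so 2048 is roughly the 2GB mentioned above ("Hadoop-Sandbox" is a placeholder name)
VBoxManage modifyvm "Hadoop-Sandbox" --memory 2048
```

The VM must be powered off before `modifyvm` will apply the change.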
There are now a lot of books out on Hadoop, the different frameworks, and the NoSQL databases, so you can find the right book that fits your personal reading style. There are also lots of YouTube videos; with a little time you can find high-quality ones.
- Mark Madsen - http://www.insideanalysis.com/2012/12/what-hadoop-is-what-is-isnt/
- Jim Walker - http://www.youtube.com/watch?v=j6toE6Ke7k4
- Expert panel - http://www.infoq.com/articles/HadoopVirtualPanel
Have fun and I look forward to any additional recommendations.