
Sunday, September 21, 2014

The Evolving Enterprise Solution for Data

Innovation Around Big Data is Creating Choice
The modern enterprise big data platform has been referred to by many names: Modern Data Lake, Enterprise Data Hub, Marshal Data Yard and Virtual Data Lake, to name a few. Each name reflects a defining characteristic, philosophy or goal. Big data platforms are evolving at an amazing speed, due in large part to the interest around big data as well as the pace of open source innovation. This innovation is creating a lot of choice as well as a lot of confusion. The decisions are not easy when it comes to distributions, frameworks, reference architectures, NoSQL databases, real-time access, data governance, etc.

A Blended Solution around Data?
Hadoop and NoSQL are adding functionality that currently exists in RDBMS and EDW platforms. RDBMS and EDW platforms are adding features and functionality that exist in Hadoop and NoSQL, as well as connectors that support data integration with big data platforms. MapReduce applications or R scripts can run in some relational databases. It's now possible to execute a join where some of the data resides in an RDBMS/EDW and other data resides in Hadoop or NoSQL. Where should the data reside? Who should own the SQL statement? The modern enterprise data platform is not a static platform; it is a platform that is taking on new forms and functionality. Organizations need to look at how to design a flexible enterprise environment that can leverage the features and functionality of all their data platforms and meet the current and future needs of the organization.
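To make the blended-access idea above a little more concrete, here is a minimal Python sketch that joins a dimension table pulled from an RDBMS with clickstream detail exported from Hadoop. It assumes pandas, SQLAlchemy and a pymysql driver are available; the connection string, table name and file path are hypothetical.

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical connection to the RDBMS side of the join (pymysql driver assumed).
    engine = create_engine("mysql+pymysql://user:password@rdbms-host/sales")

    # Dimension data stays in the relational database...
    customers = pd.read_sql("SELECT customer_id, segment FROM customers", engine)

    # ...while high-volume clickstream detail was exported from Hadoop (e.g. via Hive or Sqoop).
    clicks = pd.read_csv("/data/exports/clickstream.csv")  # columns: customer_id, url, ts

    # In this sketch the join itself happens on the client side.
    joined = clicks.merge(customers, on="customer_id", how="left")
    print(joined.groupby("segment").size())

In practice the same question the post raises applies here: the join could just as easily be pushed into Hive or into the RDBMS, and where it runs is a design decision.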

Data Needs to be Consumable and Actionable
The problems to be solved are not just about Hadoop, NoSQL, NewSQL, RDBMSs, EDWs or even about the data. The goal is to improve decision making and business insight. Organizations need to be able to make business decisions faster, improving the accuracy and reducing the risk of those decisions, and to handle the volume, velocity and variety of data cost-effectively and efficiently. The management and governance of data needs to take into consideration the evolution of these data platforms and how to ensure the data is consumable and actionable.

Improving Business Insight
Increasing business insight by improving analytics is one of the goals of big data. One step toward this goal is reducing the number of data silos. It's also important to make sure we do not rebuild those silos inside big data platforms. Be aware that the core designs of RDBMS, EDW, Hadoop cluster and NoSQL database platforms were created for different reasons. A Hadoop cluster is not ACID compliant, a NoSQL database is not relational and an RDBMS cannot scale at cost the way a Hadoop cluster can. Look at the key business goals and use cases to leverage the features of all the data platforms and achieve the strategic goals around data.

Saturday, April 5, 2014

Open Source Driving Innovation of Enterprise Hadoop


In the last seven months we have seen a tremendous level of innovation and maturity in the enterprise Hadoop platform. Hortonworks' HDP 2.0 and HDP 2.1 releases show the tremendous innovation being driven by open source today. This innovation is significantly improving the enterprise capabilities of Hadoop and is changing the Hadoop landscape. It is difficult for proprietary releases of Hadoop to compete with the hundreds of thousands of lines of code being written by the Hadoop open source community. Organizations ranging from Microsoft to Yahoo are adding their expertise and knowledge to the open source community. Proprietary and mixed open source/proprietary Hadoop solutions are being put under tremendous pressure by the innovation of open source, and Hadoop distributions that are not 100% open source are beginning to disappear.

With HDP 2.0 and 2.1 a number of game-changing capabilities have been added to Hadoop. These new releases add comprehensive capabilities in areas such as scalability, multi-tenancy, performance, security, data lifecycle management, data governance, encryption, interactive query, high availability and fault tolerance. Key additions include:
HDP 2.0:
  • YARN - a distributed data operating system supporting applications with different run-time characteristics. YARN also adds scalability and improved fault tolerance to Hadoop.
  • NameNode High Availability.
  • Hadoop scalability to 10,000+ nodes.
  • New releases of Hadoop frameworks in key areas such as Hive and HBase. 
HDP 2.1:
  • Interactive query capability in Hadoop. The Stinger project has increased the performance of interactive queries by 100 times through Hive optimization, container optimization, Tez integration and an in-memory cache (a small connection sketch follows this list).
  • Hive has improved SQL compliance. 
  • Perimeter security added to Hadoop with Knox.  Enterprise Hadoop offers authorization, authentication and encryption. 
  • Data Lifecycle Management and data governance with Falcon.
  • Enhanced HDFS security and multi-tenancy capabilities.
  • Resource Manager High Availability
  • NameNode Federation, improving scalability and multi-tenancy, with stronger support for different run-time characteristics.
  • Linux and Windows releases synched.
  • HDP Search with Apache Solr increases the search capabilities of Hadoop.
  • Storm provides scalable streaming to Hadoop.
  • Spark is available as a Tech Preview to provide real-time in-memory processing.
Ambari:
  • Splitting the Ambari management interface from the HDP distribution, so the management tool and the Hadoop software distribution can be rev'd separately.
  • Support for the Storm, Tez and Falcon software stacks.
  • Maintenance mode silences alerts for services, hosts and components for administration work.
  • Rolling restarts.
  • Service and component restarts.
  • Support for ZooKeeper configurations.
  • Supports decommissioning of NodeManagers and RegionServers.
  • Ability to refresh client-only configurations.
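To give a feel for the interactive query capability mentioned under HDP 2.1 above, here is a minimal sketch that runs a Hive query over HiveServer2 with Tez as the execution engine. It assumes the PyHive client is installed; the hostname and the web_logs table are placeholders.

    from pyhive import hive

    # Connect to HiveServer2 (hostname, port and database are hypothetical).
    conn = hive.Connection(host="hiveserver2.example.com", port=10000, database="default")
    cursor = conn.cursor()

    # Ask Hive to use the Tez execution engine for this session.
    cursor.execute("SET hive.execution.engine=tez")

    # A simple aggregate query against a placeholder table.
    cursor.execute("SELECT status, COUNT(*) FROM web_logs GROUP BY status")
    for status, cnt in cursor.fetchall():
        print(status, cnt)

    cursor.close()
    conn.close()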

Saturday, March 15, 2014

Succeeding with Big Data Projects: The Secret Sauce

The architectures and software frameworks being used for big data projects are constantly evolving. Modern data lakes are consistently using Storm for real-time streaming, NoSQL databases like HBase, Accumulo and Cassandra for low-latency data access, and Kafka for message processing. Open source software such as CentOS, MySQL, Ganglia and Nagios is making deeper penetration into large enterprises. I am also seeing Python and JavaScript becoming more popular. Linux containers and Docker are being looked at as a way to increase hardware consolidation and utilization in the future.
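As a small illustration of the Kafka piece of that pattern, the sketch below pushes JSON events onto a topic using the kafka-python client; the broker address and topic name are assumptions, not part of any particular reference architecture.

    import json
    from kafka import KafkaProducer

    # Broker address and topic are hypothetical.
    producer = KafkaProducer(
        bootstrap_servers=["kafka-broker:9092"],
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # A clickstream-style event headed for the data lake.
    event = {"user_id": 42, "action": "page_view", "url": "/pricing"}
    producer.send("clickstream", event)

    # Block until buffered messages have been delivered.
    producer.flush()

Downstream, Storm or HBase would consume from the same topic; that wiring is outside the scope of this sketch.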

The Netflix data architecture is reflective of the design patterns organizations are looking at.

Over the next two years we will see a blending of SQL and NoSQL databases. The Stinger project (Hive optimization and Tez) has brought interactive query capability to the batch processing environment of Hadoop, which means the way organizations use Hadoop is changing quickly as well. Real-time query and ACID capabilities are next on the list of customer requests. As data lakes define the modern data architecture platform and more and more data gets stored in Hadoop, organizations want to use that data in lots of different ways.

Successful big data projects have consistent patterns of success (the secret sauce). The technical infrastructure teams will be able to work with vendors to get the right hardware, stand up big data platforms and maintain them. However, big data projects can easily become science projects if the following are not addressed:
  • Thought leadership that creates cultural change so an organization can innovate successfully.  Big data is about making better business decisions faster with higher degrees of accuracy.  A sense of urgency needs to exist.
  • An environment of collaboration and teamwork with everyone believing in a vision.   The modern data lake helps to eliminate a lot of the technology and data silos that exist across different platforms and business units.   Successful big data project environments eliminate the social, territorial and political silos that often exist in traditional teams. 
  • A strong emphasis in data/schema design and ETL reference architectures.  It's still all about the data.  :) 
  • The ability to build a plane while flying it.  Big data technologies, environments, frameworks and methodologies are evolving quickly. Organizations need to be able to adapt and learn fast. 
"Extinction is the rule.  Survival is the exception." was a quote from Carl Sagan.  Being able to transform an organization into big data is one the biggest challenges an organization faces.  Everyone is concerned about the development of the technical skills to succeed with big data, however the development of the internal people is just as important.

Saturday, January 25, 2014

Choosing MySQL or Oracle for your Hadoop Repositories

When setting up a Hadoop cluster the administration team has to decide which relational databases to use for the Hadoop metadata repositories.  I strongly recommend that one type of relational database be used for all the repositories instead of using different database vendors for different frameworks.

Hadoop requires metadata repositories (relational databases) for Ambari (management), HiveServer2 (SQL), Oozie (scheduler and workflow tool) and Hue (Hadoop UI). Choices include Postgres, MySQL, Oracle or Derby. The databases holding the Hadoop metadata repositories have to be backed up and maintained like any other database server.

I recommend using MySQL for the following reasons:
  • Oracle is too heavyweight a database server for this role; its full resources will not be utilized. An Oracle database server will take extra memory, disk space and CPU that will not be taken advantage of.
  • Postgres is a good, solid database, but it has not reached a tipping point in adoption. I do not see a lot of Postgres databases when I visit customers, and I do not see Postgres increasing its presence in the market.
  • Derby (used with Oozie) and SQLite (used with Hue) are not robust enough to be used in a heavy production environment. I would only use these databases for a small Hadoop cluster for personal development.
MySQL has a lot of features that make it ideal as a database repository for different Hadoop frameworks.  They include:
  • Extremely fast and lightweight.
  • Relatively easy to administer and backup.
  • Replication is very easy to set up and maintain.
  • MySQL has extremely high adoption and it is easy to find resources to manage it.
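If you do go with MySQL, the repository setup is small. Here is a minimal sketch, using the mysql-connector-python client, of creating a database and user for the Hive metastore; the host, credentials, database name and grant scope are placeholders and should follow your own standards.

    import mysql.connector

    # Connect as an administrative user (host and credentials are hypothetical).
    conn = mysql.connector.connect(host="db-host", user="root", password="secret")
    cursor = conn.cursor()

    # One schema per Hadoop framework repository; Hive's metastore is shown here.
    cursor.execute("CREATE DATABASE IF NOT EXISTS hive_metastore")
    cursor.execute("CREATE USER 'hive'@'%' IDENTIFIED BY 'hive_password'")
    cursor.execute("GRANT ALL PRIVILEGES ON hive_metastore.* TO 'hive'@'%'")
    cursor.execute("FLUSH PRIVILEGES")

    cursor.close()
    conn.close()

The same pattern repeats for the Ambari, Oozie and Hue repositories.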
If Oracle is the corporate standard and the database and Hadoop administration team prefer to use Oracle, I have provided links for setting up Oracle for the primary Hadoop frameworks.



Saturday, January 4, 2014

How to Learn YARN and Hadoop 2

I previously wrote a blog on How to Learn Hadoop that got a lot of positive feedback.  I've been getting a number of requests to update it for how to learn YARN and Hadoop 2.   Everyone wants to learn the cool secrets and tricks but knowledge always starts with learning the fundamentals.   My recommendations here are meant for the reader who is serious about learning Hadoop.

Learn the Basic Concepts First
To understand Hadoop you have to start by understanding Big Data.
These foundational whitepapers explain the reasons behind the processing and distributed storage design of Hadoop. They are not easy reads, but getting through them will really help all your future learning around Hadoop because they define the "context for Hadoop." Some of the papers are older, but the core concepts they discuss, and the reasons behind them, contain invaluable keys for understanding Hadoop.
You Have to Understand the Data
Hadoop clusters are built to process and analyze data.   A Hadoop cluster becomes an important component in any enterprises data platforms, so you need to understand Hadoop from a data perspective.
  • Big Data, by Nathan Marz - The book does a great job of teaching core concepts and fundamentals, and provides a great perspective on the data architecture of Hadoop. This book builds a solid foundation and helps you understand the Lambda architecture. You may need to get it from MEAP if it has not been released yet (http://www.manning.com/marz/); it is scheduled for print release on March 28, 2014. Any DBA, data architect or anyone with a background in data warehousing and business intelligence should consider this a must read.
Additional Reading
We are in a transition period with Hadoop.  Most of the books out today are on Hadoop 1.x  and MapReduce v1 (classic).  Hadoop 2 is GA, the distributed processing model is YARN and Tez will be an important part of processing data with Hadoop in the future. There are not a lot of books out yet on YARN and Hadoop 2 frameworks.  You'll need to spend some time with the Hadoop documentation. :)
  • Apache Hadoop YARN, by Arun Murthy, Jeffrey Markham, Vinod Vavilapalli, Doug Eadline
  • Hadoop MapReduce v2 Cookbook (2nd Edition)
  • Hadoop: The Definitive Guide (4th Edition), by Tom White
Getting Hands on Experience and Learning Hadoop in Detail
A great way to start getting hands-on experience and learning Hadoop through tutorials, videos and demonstrations is with the virtual machines available from the different Hadoop distribution vendors. These virtual machines, or sandboxes, are an excellent platform for learning and skill development, and their tutorials, videos and demonstrations are updated on a regular basis. The sandboxes are usually available as VirtualBox, Hyper-V or VMware virtual machines. An additional 4GB of RAM and 2GB of storage are recommended for the virtual machines. If your laptop does not have a lot of memory, you can go into the VM settings and cut the RAM for the VM down to about 1.5 - 2GB. This is likely to impact the performance of the VM, but it will at least let it run on a minimally configured laptop.
Other books to consider:
There are now a lot of books out on Hadoop, the different frameworks and the NoSQL databases, so you can find the right book that fits your personal reading style. There are also lots of YouTube videos, and with a little time you can find ones of high quality.

Engineering Blogs:
  • hortonworks.com/community/
  • blog.cloudera.com/blog/
  • engineering.linkedin.com/hadoop
  • engineering.twitter.com
  • developer.yahoo.com/hadoop/

Have fun and I look forward to any additional recommendations.

Sunday, November 10, 2013

Thoughts on HDP2 and the Evolving Ecosystem around Hadoop

I've been working with in-memory databases and Hadoop since my days at VMware as a Tier One Specialist. I've spent the last year focusing 100% of my attention on the Hortonworks Data Platform (HDP) and NoSQL databases. In the last few months I've done a very deep immersion in HDP2 and all the new features around Apache Hadoop 2 from the HDP perspective, as well as watching the changes in the ecosystem around Hadoop.

The analogy I've given my Oracle friends is that HDP2 is transformational to HDP1 the same way Oracle 8 was to Oracle 7. Oracle 8 opened up lots of new functionality and features that changed what Oracle could do for businesses; Oracle RAC, Streams, Data Guard and Partitioning were the beginning of lots of new features that changed the way companies could use database software. HDP2 will have the same type of transformational effect on Apache Hadoop 1 customers. It's not just that HDP2 has new features, scalability, performance enhancements and high availability; it's that HDP2 is going to change how customers use Hadoop. When I look at features like YARN, Knox (security), Tez (real-time queries), Falcon (data lifecycle management) and Accumulo, they completely change the potential of Hadoop and the way it will be used. HDP2 is definitely not your grandfather's version of Hadoop. :) Then you look at the growth of the Hadoop ecosystem, with new features and products from Spark, Storm, Kafka, Splunk, WanDisco, Rackspace, etc.; software products in the Hadoop ecosystem are transforming and evolving as fast as HDP. You also look at Microsoft (HDInsight) and Rackspace (OpenStack) and you see the needle moving on Hadoop being used in the cloud. Lastly, you look at the connectors, loaders and interfaces being written by the database vendors, as well as the products coming from Informatica, Ab Initio and Quest, and you see everyone is all in with Hadoop.

I don't know what Hadoop will look like a year from now, but with the speed at which open source is changing the landscape, we know that a year from now Hadoop will be used in ways we haven't even imagined yet. An old quote: "The race is not always to the swift, but to those who keep on running." For those who have jumped onto the Hadoop highway, you'd better keep running because things are not slowing down. :)

Monday, November 4, 2013

Starting HDP2 Services in the Right Order

One of the top ten issues new administrators have with Hadoop is starting and stopping Hadoop services in the right order.  When starting and stopping a Hadoop 2 cluster (HDP2) use the following order to keep you between the yellow lines.


  • HDFS - Storage
  • YARN - Processing
  • ZooKeeper - Coordination service
  • HBase - Columnar database
  • Hive Metastore - Metadata
  • HiveServer2 - JDBC connectivity
  • WebHCat - Metadata resources
  • Oozie - Workflow / scheduler
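For clusters managed by Ambari, the start order above can be scripted against Ambari's REST API. The sketch below is a rough illustration; the Ambari host, cluster name, credentials and service names are placeholders, and you should confirm the endpoint details against your Ambari version.

    import json
    import requests

    AMBARI = "http://ambari-host:8080/api/v1/clusters/mycluster"   # hypothetical
    AUTH = ("admin", "admin")                                      # hypothetical credentials
    HEADERS = {"X-Requested-By": "ambari"}

    # Services in the start order described above. Depending on the Ambari version,
    # the Hive Metastore, HiveServer2 and WebHCat are components of the HIVE service
    # or separate services.
    START_ORDER = ["HDFS", "YARN", "ZOOKEEPER", "HBASE", "HIVE", "OOZIE"]

    for service in START_ORDER:
        body = {"RequestInfo": {"context": "Start %s" % service},
                "Body": {"ServiceInfo": {"state": "STARTED"}}}
        resp = requests.put("%s/services/%s" % (AMBARI, service),
                            data=json.dumps(body), auth=AUTH, headers=HEADERS)
        print(service, resp.status_code)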

Take care of your Hadoop cluster and it will take care of you.  :)

Saturday, October 26, 2013

Demystifying Hadoop for Data Architects, DBAs, Devs and BI Teams

Introduction
I've been doing Demystifying series on subjects such as database technologies, infrastructure and Java since back in the Oracle 8.0 days. The topics have ranged from Demystifying Oracle RAC and Demystifying Oracle Fusion to Demystifying MySQL. So I guess it's time to demystify Hadoop. :)

Whether you are talking Oracle RAC, Oracle Exadata, MySQL, SQL Server, DB2, Teradata or application servers, it's really all about the data. Companies are constantly striving to make faster business decisions with higher degrees of accuracy. Traditional systems such as Oracle, SQL Server, IBM and Teradata are scaling to store hundreds of terabytes and even petabytes, with hardware that keeps getting faster and faster. However, these traditional systems were designed for transaction processing and have a lot of difficulty working with big data. I'm going to talk about why these traditional systems were not designed for big data, and why Hadoop is the right technology at the right time to address it.

What's The Deal About Big Data

Across the board, industry analyst firms consistently report almost unimaginable numbers on the growth of data. The traditional data in relational databases and data warehouses is growing at incredible rates, and that traditional data is a challenge by itself (shown as Enterprise Data below).
The big news, though, is that VoIP, social media and machine data are growing at exponential rates and are completely dwarfing the data growth of traditional systems. Most organizations are learning that this data is just as critical to making business decisions as their traditional data. This non-traditional data is usually semi-structured or unstructured. Examples include web logs, mobile web, clickstream, spatial and GPS coordinates, sensor data, RFID, video, audio and image data. The chart below (source: IDC) shows the growth of non-traditional data (Machine Data, Social Media, VoIP) relative to traditional data (Enterprise Data).

Data becomes big data when the volume, velocity and/or variety of data gets to the point where it is too difficult or too expensive for traditional systems to handle. Big data is not about reaching a specific volume, ingestion velocity or type of data; big data is when traditional systems are no longer viable solutions because of the volume, velocity and/or variety of data. A good first book to read on big data is Disruptive Possibilities.

The Big Data Challenge

The reason traditional systems have a problem with big data is they were not designed for big data.
  • Problem - Schema-On-Write: Traditional systems are schema-on-write, which requires the data to be validated when it is written. This means a lot of work has to be done before new data sources can be analyzed. For example, when a company wants to start analyzing a new source of unstructured or semi-structured data, it will usually spend months (3 - 6 months is common) designing schemas to store the data in a data warehouse. That is 3 - 6 months during which the data cannot be used to make business decisions, and by the time the data warehouse design is complete the data has often changed again. Data structures from social media, for example, change on a regular basis. The schema-on-write environment is too slow and inflexible to deal with the dynamics of semi-structured and unstructured data. The other problem is that traditional systems usually use BLOBs to handle unstructured data, and anyone who has worked with BLOBs for big data would rather get their gums scraped than work with BLOB data types in traditional systems.
  • Solution - Schema-On-Read: Hadoop systems are schema-on-read, which means any data can be written to the storage system immediately and is not validated until it is read. This allows Hadoop systems to load any type of data and begin analyzing it quickly. Hadoop systems have extremely short data latency compared to traditional systems, where data latency is the gap between data hitting the disk and the data being able to provide business value. Schema-on-read gives Hadoop a tremendous advantage in the area that matters most: being able to analyze the data faster to make business decisions (a small schema-on-read sketch follows this list).
  • Problem - Cost of Storage: Traditional systems use SAN storage.  As organizations start to ingest larger volumes of data, SAN storage is cost prohibitive.
  • Solution - Local Storage: Hadoop uses HDFS, a distributed file system that leverages local disks on commodity servers. SAN storage is about $1.20/GB while local storage is about $0.04/GB. HDFS creates three replicas by default for high availability, so at roughly $0.12/GB it is still a fraction of the cost of traditional SAN storage. As organizations store much larger volumes of data, traditional SAN storage becomes too expensive to be a viable solution.
  • Problem - Cost of Proprietary Hardware: Large proprietary hardware solutions can be cost prohibitive when deployed to process extremely large volumes of data. Organizations are spending millions of dollars in hardware and software licensing costs to support large data environments, and are often growing their hardware in million-dollar increments to handle the increasing data.
  • Solution - Commodity Hardware: People new to Hadoop do not realize that it is possible to build a high performance supercomputing environment using Hadoop on commodity hardware. One customer was looking at a proprietary hardware vendor for a solution; the vendor's proposal was $1.2 million in hardware costs and $3 million in software licensing. The Hadoop solution for the same processing power was $400,000 for hardware, the software was free and the support costs were included. Since data volumes would be constantly increasing, the proprietary solution would grow in $500,000 to $1 million increments while the Hadoop solution would grow in $10,000 to $100,000 increments.
  • Problem - Complexity: When you look at any traditional proprietary solution it is full of extremely complex silos of system administrators, DBAs, application server teams, storage teams and network teams.  Often there is one DBA for every 40 - 50 database servers.  Anyone running traditional systems knows that complex systems fail in complex ways.   
  • Solution - Simplicity:  Since Hadoop uses commodity hardware, it is a hardware stack that one person can understand.   Numerous organizations running Hadoop have one administrator for every 1000 data nodes.  
  • Problem - Causation:   Because data is so expensive to store in traditional systems, data is filtered, aggregated and large volumes are thrown out due to the cost of storage.  Minimizing the data to be analyzed reduces the accuracy and confidence of the results. 
  • Solution - Correlation: Due to the relatively low cost of storage in Hadoop, the detailed records are stored in Hadoop's storage system, HDFS. Traditional data can then be analyzed together with non-traditional data to find correlation points that provide much higher accuracy of analysis. We are moving to a world of correlation because the accuracy and confidence of the results are factors higher than with traditional systems. As an example, the Centers for Disease Control (CDC) used to take 28 - 30 days to identify an outbreak. The CDC traditionally obtained data from doctors and hospitals, analyzed it in large volumes and cross-referenced sources in order to validate it. The CDC then went back a number of years and correlated that data with data from social media sources such as Twitter and Facebook, validating the accuracy of the correlation results over those years. Now, using big data, the CDC can identify an outbreak in 5 - 6 hours. Organizations are seeing big data as transformational.
  • Problem - Bringing Data to the Programs: In relational databases and data warehouses, data is usually loaded in 8KB - 16KB blocks into memory so programs can process it. When you need to process tens, hundreds or thousands of terabytes, this model completely breaks down or becomes extremely expensive to implement.
  • Solution - Bringing Programs to the Data: With Hadoop, the programs are moved to the data. Hadoop data is spread across all the disks of the local servers that make up the Hadoop cluster in 128MB (default) increments. Individual programs, one for every block, run in parallel across the cluster, delivering a very high level of parallelization and IOPS. This means Hadoop systems can process extremely large volumes of data much faster than traditional systems, at a fraction of the cost, because of this architecture model.
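To make the schema-on-read point concrete, here is a tiny, self-contained sketch: raw lines are landed untouched and a structure is imposed only at read time. The field names and sample records are hypothetical.

    import csv
    from io import StringIO

    # Raw data is written as-is -- no upfront schema, no validation at load time.
    raw_lines = [
        "2014-01-04T10:00:00,42,page_view,/pricing",
        "2014-01-04T10:00:05,42,click,/signup",
    ]

    # The "schema" is applied only when the data is read.
    FIELDS = ["ts", "user_id", "action", "url"]

    def read_events(lines):
        reader = csv.reader(StringIO("\n".join(lines)))
        for row in reader:
            yield dict(zip(FIELDS, row))

    for event in read_events(raw_lines):
        print(event["user_id"], event["action"])

If the upstream feed adds a field tomorrow, nothing breaks at load time; only the readers that care about the new field need to change.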
Successfully leveraging big data is transforming how organizations analyze data and make business decisions. The "value" of big data results has most companies racing to build Hadoop solutions for data analysis. The diagram below shows how significant big data is. Customers often bring in Hortonworks and say, "we need you to make sure we out-Hadoop our competitors." Hadoop is not just a transformational technology; it's the strategic difference between success and failure.

Examples of New Types of Data

Hadoop is being used by every type of organization, ranging from Internet companies and telecommunications firms to banks, credit card companies, gaming companies and online retailers. Anyone that needs to analyze data is moving to Hadoop. Here are examples of the data being processed by organizations.

Hadoop Distributions - The Hortonworks Data Platform (HDP)

A Hadoop distribution is made up of a number of different open source frameworks. An organization can build its own distribution from the different versions of each framework, but anyone running a production system needs an enterprise version of a distribution. Since Hortonworks has key committers and project leaders on the different open source framework projects, we use that expertise to pick the latest version of each framework that works reliably with the others; Hortonworks then tests the combination and builds an enterprise distribution of Hadoop. For example, Hadoop 2 went GA the week of October 15th, 2013, but it had been running on Hadoop clusters with thousands of nodes since January of 2013, tested by the large set of Hortonworks partners.

The Hortonworks distribution is called the Hortonworks Data Platform (HDP). The new GA release of Hadoop 2 from Hortonworks is called HDP 2. Hortonworks runs on a true open source model: every line of code written by Hortonworks for Hadoop is given back to the Apache Software Foundation (ASF), which means every Hortonworks distribution is only a few patch sets off the main Apache baseline. The result is that HDP 2 is extremely stable from a support perspective and protects an organization from vendor lock-in. Here is an example of the HDP 2 distribution and the key frameworks associated with it. Hortonworks builds its reputation on the "enterprise" quality of its distribution, and the industry is recognizing the platform expertise of Hortonworks.

There are a number of different Hadoop distributions, and some of them have been around longer than Hortonworks. In my expert opinion, the reasons to choose Hortonworks are:
  • Platform Expertise - Hadoop is a platform that frameworks run on, and Hortonworks' entire focus is the enterprise Hadoop platform. Hortonworks is not trying to be everything for everybody; its focus is the Hadoop platform, and it has demonstrated this in a number of ways. Hadoop is open source and developed as a community, and Hortonworks is by far the largest contributor of source lines of code to Hadoop.
  • Defining the Roadmap - More and more large vendors see Hortonworks as defining the roadmap for Hadoop. Hortonworks, working with the open source community, has been a key leader in the design and architecture of YARN, the foundational processing component of Hadoop 2. The platform expertise demonstrated by Hortonworks is moving a number of the largest vendors in the world to the Hortonworks Data Platform (HDP). This is why you see major vendors such as Microsoft and Teradata choosing HDP as their Hadoop distribution of choice.
  • Enterprise Version of Hadoop - Hortonworks is focused on being the definitive enterprise distribution of Hadoop.
  • Open Source - Hortonworks is based on an open source model; every line of code created goes back into the Apache Software Foundation. Other distributions are proprietary or a mix of open source and proprietary. Proprietary solutions create the vendor lock-in that more and more companies are trying to avoid, and contributing all code back to the Apache Software Foundation minimizes support issues.
  • Windows and Linux - HDP is the only Hadoop distribution that runs on Linux and Windows.

The two main frameworks of Hadoop are the Hadoop Distributed File System (HDFS), which provides the storage and I/O, and YARN, which is a distributed parallel processing framework.

YARN
YARN (Yet Another Resource Negotiator) is the foundation for parallel processing in Hadoop.  YARN is:
  • Scalable to 10,000+ data nodes.
  • Supports different types of workloads such as batch, real-time queries (Tez), streaming, graph data, in-memory processing, messaging systems, streaming video, etc. You can think of YARN as a highly scalable, parallel-processing operating system that supports all kinds of workloads (a small REST sketch follows this list).
  • Supports batch processing providing high throughput performing sequential read scans.
  • Supports real time interactive queries with low latency and random reads.
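One simple way to see YARN acting as that resource layer (the REST sketch mentioned in the list above) is to ask the ResourceManager for its cluster metrics. The hostname is a placeholder; 8088 is the default ResourceManager web port, and the field names follow the Hadoop 2 ResourceManager REST API.

    import requests

    # ResourceManager hostname is a placeholder; 8088 is the default web port.
    RM = "http://resourcemanager.example.com:8088"

    # Cluster-wide metrics: memory, vcores, running applications, node counts.
    metrics = requests.get(RM + "/ws/v1/cluster/metrics").json()["clusterMetrics"]

    print("Active nodes: ", metrics["activeNodes"])
    print("Apps running: ", metrics["appsRunning"])
    print("Allocated MB: ", metrics["allocatedMB"])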


HDFS 2
HDFS uses NameNodes (master servers) and DataNodes (slave servers) to provide the I/O for Hadoop. The NameNodes manage the metadata and can be federated (multiple NameNodes) for scalability. Each NameNode can have a standby NameNode for failover (active-passive). All the user data is stored on the DataNodes, distributed across all the disks in 128MB - 1GB block sizes, with three replicas by default for high availability. HDFS provides a solution similar to striping and mirroring, but using local disks.
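A quick way to see replication and block size from the outside is WebHDFS. The sketch below lists a directory and reports each file's replication factor and block size; the NameNode host and path are placeholders, and 50070 is the default NameNode HTTP port in Hadoop 2.

    import requests

    # NameNode host and the directory to inspect are hypothetical.
    NAMENODE = "http://namenode.example.com:50070"
    PATH = "/data/clickstream"

    resp = requests.get(NAMENODE + "/webhdfs/v1" + PATH, params={"op": "LISTSTATUS"})
    for f in resp.json()["FileStatuses"]["FileStatus"]:
        if f["type"] == "FILE":
            print(f["pathSuffix"],
                  "replication=%d" % f["replication"],
                  "blockSizeMB=%d" % (f["blockSize"] // (1024 * 1024)))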


Additional Frameworks
Here is a summary of some of the key frameworks that make up HDP 2.
  • Hive - A data warehouse infrastructure that runs on top of Hadoop. Hive supports SQL queries, star schemas, partitioning, join optimizations, caching of data, etc. - all the standard features you'd expect in a data warehouse. Hive lets you process Hadoop data using a SQL language.
  • Pig - A scripting language for processing Hadoop data in parallel.
  • MapReduce - Java applications that process data in parallel (a streaming-style Python sketch follows this list).
  • Ambari - An open source management interface for installing, monitoring and managing a Hadoop cluster. Ambari has also been selected as the management interface for OpenStack.
  • HBase - A NoSQL columnar database providing extremely fast scanning of column data for analytics.
  • Sqoop, Flume and WebHDFS - Tools providing large-scale data ingestion for Hadoop using SQL, streaming and REST API interfaces.
  • Oozie - A workflow manager and scheduler.
  • ZooKeeper - A coordination service.
  • Mahout - A machine learning library supporting recommendation, clustering, classification and frequent itemset mining.
  • Hue - A web interface that contains a file browser for HDFS, a job browser for YARN, an HBase browser, query editors for Hive, Pig and Sqoop, and a ZooKeeper browser.
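MapReduce jobs can also be written in scripting languages via Hadoop Streaming, which simply pipes records through stdin and stdout. The word-count style sketch below (the one referenced in the MapReduce bullet) runs locally as a single process to show the map, sort and reduce steps; pointing it at a real cluster through the streaming jar is left out.

    import sys
    from itertools import groupby

    def mapper(lines):
        # Map step: emit (word, 1) for every word in the input.
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    def reducer(pairs):
        # Reduce step: pairs arrive grouped by key; sum the counts per word.
        for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        # Local, single-process stand-in for the map -> shuffle/sort -> reduce flow.
        for word, total in reducer(mapper(sys.stdin)):
            sys.stdout.write("%s\t%d\n" % (word, total))

Saved as, say, wordcount.py, running echo "to be or not to be" | python wordcount.py prints each word with its count.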

Hadoop - A Super Computing Platform

Hadoop is a solution that leverages commodity hardware to build a high performance supercomputing environment. Hadoop contains master nodes and data nodes. HDFS is the distributed file system that provides high availability and high performance; it is made up of a number of data nodes that break a file into multiple blocks, usually 128MB - 1GB in size, with each block replicated for high availability. YARN is a distributed processing architecture that can distribute the workload across the data nodes in a Hadoop cluster. People new to Hadoop do not realize the massive amount of IOPS that commodity x86 servers can generate with local disks.

In the diagram below:
HDFS distributes data blocks across all the local disks in a cluster, allowing the cluster to leverage the IOPS that local disks can generate across all the local servers. When a process needs to run, the programs are distributed in parallel across all the data nodes to create an extremely high performance parallel environment. Without going into the details, the main point is that this is a supercomputing environment that leverages parallel processing and the massive amounts of IOPS that local disks can generate when running across multiple data nodes as a single distributed file system. The diagram shows multiple parallel processes running across a large volume of local disks acting as one distributed file system.

Hadoop is linearly scalable with commodity hardware. If a Hadoop cluster cannot handle the workload, an administrator can add data node servers with local disks to increase processing power and IOPS, all at commodity hardware pricing.

Summary - Demystifying Hadoop

Hadoop is not replacing anything. Hadoop has become another component in an organization's enterprise data platform. This diagram shows that Hadoop (the Big Data Refinery) can ingest data from all types of different sources. Hadoop then interacts and has data flows with the traditional systems that provide transactions and interactions (relational databases) and business intelligence and analytics (data warehouses).

Wednesday, August 14, 2013

Hadoop Reference Architectures

Some initial key factors for success when building a Hadoop cluster come down to building a solid foundation. This includes:
  • Selecting the right hardware. When working with a hardware vendor, make sure you are working from their hardware compatibility lists (HCL) and making the right decisions for your cluster. Commodity hardware does not have to be generic; you can select commodity hardware that is customized for running Hadoop.
  • Build an enterprise OS platform.  Whether you are using Linux or Windows, customize and tune your operating system using enterprise best practices and standards for Hadoop.
  • Design your Hadoop clusters using reference architectures.  Don't reinvent the wheel unless you have to.  Vendors are publishing Hadoop reference architectures that give you a great starting place.
I've included a few HDP reference architectures to give you a feel for what a Hadoop platform may look like.
I'd like to keep building this reference architecture list.

Saturday, June 29, 2013

Weaknesses in Traditional Data Platforms

Everyone understands that Hadoop brings high performance commercial computing to organizations using relatively low cost commodity storage.   What is accelerating the move to Hadoop are weaknesses in traditional relational and data warehouse platforms in meeting today's business needs.  Some key weaknesses of traditional platforms include:

  • Late binding with schemas greatly increases the latency between receiving new data sources and deriving business value from that data.
  • The significantly high cost and complexity of SAN storage.  This high cost forces organizations to aggregate and remove a lot of data that contains high business value.  Important details and information are getting thrown out or hidden in aggregated data.
  • The complexity of working with semi-structured and unstructured data.
  • The incredible cost, complexity and ramifications of maintaining database administration, storage and networking teams for traditional platforms. There are lots of silos of expertise and software required in traditional environments that have dramatic effects on agility and cost. It's gotten to the point that vendors are now delivering extremely expensive engineered systems to deal with the complexity of these silos. These expensive engineered systems require even more specialized expertise to maintain and make customers ever more dependent on the vendors. What's funny is you hear the old phrase "one throat to choke," but it's the customer who's choking on the cost. With Hadoop's self-healing and fault tolerance, a small team can manage thousands of servers; a single Hadoop administrator can manage 1,000 - 3,000 nodes, all on relatively inexpensive commodity hardware.
While the above highlights the need for Hadoop, it's also important to understand that traditional relational databases and data warehouses still have the same role and are still needed. A relational database provides a completely different function than a Hadoop cluster. Also, a company is not going to throw out all its existing data warehouses or the expertise and reporting it has built around them. Hadoop today is usually used to add new capabilities to an enterprise data environment, not to replace existing platforms.

The old line that no one ever gets fired for buying IBM is a thing of the past with Hadoop. An entire organization may go under if its competition is effectively using big data and it is not. Hadoop is the most disruptive technology since the .com days.

Friday, June 28, 2013

Hadoop Summit 2013 in San Jose

It has been a privilege to present at the Hadoop Summits this year in Amsterdam and San Jose.   This week was one of the best networking weeks I've ever had at a conference.  Great seeing all my Oracle, VMware, Rackspace and MySQL friends as well as meeting a lot of new friends in the Hadoop ecosystem.

Key takeaways:
  • Hadoop's disruption of the IT industry is accelerating.
  • Hadoop 2.0 will significantly increase enterprise adoption.
  • YARN is the distributed operating system of the future.
  • Incredible success stories of the ROI around Hadoop.
  • Open source community is about innovation, community and sharing.
  • Lots of analytics software competing to run on Hadoop.  This will be the big battleground.
  • Hortonworks reinforces its innovation and leadership in defining the roadmap for Hadoop.
  • Hortonworks constantly demonstrated its platform expertise with Hadoop.
  • Hadoop is a high performance commercial computing environment.
Two coolest things I liked at the conference:

  • An 8-node Raspberry PI Hadoop cluster.
  • Creating a multi-node (VM) Hadoop cluster on your laptop using the Hortonworks Sandbox.


I was also able to barter for a Yahoo soccer ball (all it cost me was a Hortonworks water bottle). :)

Wednesday, June 26, 2013

Hadoop - It's All About The Data

A key point to understand about Hadoop is that it's all about the data. Don't lose focus. It's easy to get hung up on Hive, Pig, HBase and HCatalog and lose sight of designing the right data architecture. Also, if you have a strong background in data warehouse design, BI, analytics, etc., all those skills are transferable to Hadoop. Hadoop just takes data warehousing to new levels of scalability and agility, reducing business latency while working with data sets ranging from structured to unstructured. Hadoop 2.0 and YARN are going to move Hadoop deep into the enterprise and allow organizations to make faster and more accurate business decisions. The ROI of Hadoop is multiple factors higher than that of the traditional data warehouse. Companies should be extremely nervous about being out-Hadooped by their competition.

Newbies often look at Hadoop with wide eyes instead of realizing that Hadoop is built from components they already understand: clustering, distributed file systems, parallel processing, and batch and stream processing.

A few key success factors for a Hadoop project are:
  • Start with a good data design using a scalable reference architecture.
  • Build successful analytical models that provide business value.
  • Be aggressive in reducing the latency between data hitting the disk and leveraging business value from that data. 
The ETL strategies and data set generation you use in your existing data warehouse are similar to what you are going to want to do in your Hadoop cluster. It's important to look at your data warehouse and understand how your enterprise data strategy is going to evolve with Hadoop now a part of the ecosystem.



"Hadoop cannot be an island, it must integrate with Enterprise Data Architecture". - HadoopSummit





"Apache Hadoop is a set of standard open-source software projects that provide a framework for using massive amounts of data across a distributed network." - Merv Adrian at Gartner Research


This is a sample Hadoop 1.x cluster so you can see the key software processes that make up Hadoop.  The good point of this diagram is that if you understand it you are probably worth another $20-30k.  :) 




YARN (Hadoop 2.0) is the distributed operating system of the future. YARN allows you to run multiple applications in Hadoop, all sharing a common resource management layer. YARN is going to disrupt the data industry to a level not seen since the .com days.

A Hadoop cluster will usually have multiple data layers.

  • Batch Layer: Raw data is loaded into a data set that is immutable so it becomes your source of truth. Data scientists and analysts can start working with this data as soon as it hits the disk.  
  • Serving Layer: Just as in a traditional data warehouse, this data is often massaged, filtered and transformed into a data set that is easier to do analytics on.  Unstructured and semi-structured data will be put into a data set that is easier to work with. Metadata is then attached to this data layer using HCatalog so users can access the data in the HDFS files using abstract table definitions.   
  • Speed Layer: To optimize data access and performance, additional data sets (views) are often precomputed to create a speed layer. HBase can be used for this layer depending on the requirements (a toy sketch of these layers follows this list).
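As a toy end-to-end illustration of those three layers, the sketch below keeps an immutable batch layer of raw events, derives a cleaner serving set and precomputes a small speed-layer view. Everything is in-memory Python purely for illustration; in a real cluster these would be HDFS data sets, HCatalog-described tables and HBase views.

    from collections import Counter

    # Batch layer: raw, immutable events -- the source of truth.
    raw_events = (
        {"user": "a", "action": "page_view", "url": "/pricing"},
        {"user": "b", "action": "click", "url": "/signup"},
        {"user": "a", "action": "click", "url": "/signup"},
    )

    # Serving layer: a cleaned/filtered data set that is easier to analyze.
    serving = [e for e in raw_events if e["action"] == "click"]

    # Speed layer: a precomputed view for low-latency lookups.
    clicks_per_url = Counter(e["url"] for e in serving)

    print(clicks_per_url["/signup"])   # -> 2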


This diagram emphasizes two key points:
  • The different data layers you will have in your Hadoop cluster.
  • The importance of building your metadata layer (HCatalog).
With the massive scalability of Hadoop, you need to automate as much as possible and manage the data in your cluster. This is where Falcon is going to play a key role. Falcon is a data lifecycle management framework that provides the data orchestration, disaster recovery and data retention capabilities you need to manage your data.