A DBA's Journey into the Cloud and Big Data: a blog focusing on Big Data, Hadoop, Virtualization and the Cloud.<br />
<br />
<b>The Evolving Enterprise Solution for Data</b> (2014-09-21)<br />
<div class="p1">
<b>Innovation Around Big Data is Creating Choice</b><br />
The modern enterprise big data platform goes by many names: Modern Data Lake, Enterprise Data Hub, Marshal Data Yard and Virtual Data Lake, to name a few. Each name is associated with a defining characteristic, philosophy or goal. Big data platforms are evolving at amazing speed, due in large part to the interest in big data as well as the innovation of open source. That innovation is creating a lot of choice, and with it a lot of confusion: the decisions around distributions, frameworks, reference architectures, NoSQL databases, real-time access, data governance and so on are not easy.<br />
<br />
<b>A Blended Solution around Data?</b><br />
Hadoop and NoSQL platforms are adding functionality that currently exists in RDBMS and EDW platforms. RDBMS and EDW platforms are adding features and functionality that exist in Hadoop and NoSQL, as well as connectors that support data integration with big data platforms. MapReduce applications or R scripts can run inside some relational databases. It is now possible to execute a join where some of the data resides in an RDBMS/EDW and other data resides in Hadoop or NoSQL. Where should the data reside? Who should own the SQL statement? The modern enterprise data platform is not a static platform; it is a platform that is taking on new forms and functionality. Organizations need to look at how to design a flexible enterprise environment that can leverage the features and functionality of all data platforms and meet the current and future needs of the organization. </div>
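To make the "join across platforms" idea concrete, here is a minimal, hypothetical sketch in plain Python: SQLite stands in for the RDBMS/EDW side, and an in-memory list stands in for records pulled from Hadoop or NoSQL. No real federation engine is involved, and all table, column and record names are illustrative.

```python
import sqlite3

# Stand-in for the RDBMS/EDW side: a small customer dimension in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])

# Stand-in for the Hadoop/NoSQL side: clickstream events that, in a real
# deployment, would be read from HDFS or a NoSQL store.
events = [
    {"customer_id": 1, "page": "/pricing"},
    {"customer_id": 2, "page": "/docs"},
    {"customer_id": 1, "page": "/signup"},
]

# The "federated join": resolve each event's customer against the relational side.
names = {row[0]: row[1] for row in conn.execute("SELECT id, name FROM customers")}
joined = [(names[e["customer_id"]], e["page"]) for e in events]
print(joined)
```

In a production environment this lookup would be pushed into a query engine (Hive, a SQL-on-Hadoop tool, or an RDBMS connector) rather than application code; the sketch only shows where the two halves of the data live.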
<div class="p2">
<br />
<b>Data Needs to be Consumable and Actionable</b></div>
<div class="p1">
The problems to be solved are not just about Hadoop, NoSQL, NewSQL, RDBMSs, EDWs or even the data itself. The goal is to improve decision making and business insight. Organizations need to make business decisions faster while improving the accuracy and reducing the risk of those decisions, and they need to handle data volume, velocity and variety cost-effectively and efficiently. The management and governance of data needs to take into consideration the evolution of these data platforms and how to ensure the data is consumable and actionable. </div>
<div class="p2">
<br /></div>
<b>Improving Business Insight</b><br />
<div class="p1">
Increasing business insight by improving analytics is one of the goals of big data. One step toward this goal is reducing the number of data silos; it is equally important not to rebuild those silos inside big data platforms. Be aware that the core designs of RDBMS, EDW, Hadoop and NoSQL database platforms were created for different reasons: a Hadoop cluster is not ACID compliant, a NoSQL database is not relational, and an RDBMS cannot scale at cost the way a Hadoop cluster can. One needs to look at the key business goals and use cases in order to leverage the features of all the data platforms and achieve the strategic goals around data. </div>
<b>Collaborate 2014 - What You Need to Know - Emerging Skills</b> (2014-04-11)<br />
<b>Collaborate 2014 Las Vegas - IOUG, OAUG and Quest</b><br />
The Collaborate 2014 conference brings together three of the key Oracle user group communities: the IOUG (Independent Oracle Users Group), the OAUG (Oracle Applications Users Group) and Quest (PeopleSoft, JD Edwards). Exhibition hall traffic and overall attendance seemed to be up, and there was plenty of energy at the conference. I focus on the technology side, so I will share my thoughts and insights from the IOUG side of the conference.<br />
<br />
<b>Cloud, Big Data and NoSQL </b><br />
On the database side, top speakers like Rich Niemiec, Charles Kim, Arup Nanda and Tony Jambu were rocking the house as always, delivering great insight, sharing knowledge and showing vision and direction. From my perspective, three areas had the most energy and interest and created the most buzz between sessions: Cloud, Big Data and NoSQL. These are also three areas in which most Oracle professionals have a lot to learn. Everyone was looking to get a much better understanding of these areas and their roadmaps, as well as how they will impact attendees' current positions and their futures.<br />
<br />
Despite excellent presentations on Cloud, Big Data and NoSQL, attendees realized that there is a lot to learn in these areas. In the cloud space, one needs to learn the cloud business model, driving factors, goals and objectives, orchestration and deployment models, as well as the skill sets and best practices around cloud solutions. In the big data area, everyone saw that it takes time to learn big data, the Hadoop platform, data architecture and ETL models, as well as the related skills and best practices. In the NoSQL area, attendees learned there are a number of different NoSQL solutions available, all with different features and functionality. Everyone also saw that cloud, big data and NoSQL are constantly evolving and maturing at rates much faster than Oracle's major releases. <br />
<br />
Oracle has always evolved to meet market and customer needs, but these areas are a little different. Oracle DBAs and developers already recognize the importance of acquiring skills in RAC, replication, RMAN, Java, Fusion Middleware, OEM, Exadata and so on. There are also a number of key products coming from strategic vendors; for example, the SharePlex Connector for Hadoop can load data directly from Oracle into HDFS or HBase, replicating data in near real-time (HDFS) or real-time (HBase). <br />
<br />
Cloud, Big Data and NoSQL are extremely valuable and marketable skills to acquire, since there is so much current and future demand. As in any evolution of IT, there are always visionaries, early adopters and evangelists, and at this conference we saw the next generation of in-demand skill sets emerging from these areas. I also find it interesting that, just as RAC and Exadata in their infancy drew the top leaders to them, the top leaders in the Oracle user community see the future in Hadoop, NoSQL and the Cloud. The top RAC and Exadata people have always been experts in infrastructure, architecture and understanding the business, and it is the infrastructure, architecture and business experts in the Oracle user community who are gravitating to these emerging areas. <br />
<br />
I believe the cloud, big data and NoSQL areas are going to create a lot of energy in the Oracle user community in the next year. I look forward to the discussions with the user community moving forward.<br />
<br />
<br />
<br />
<br />
<br />
<b>Cloud Sessions at Collaborate 2014 in Las Vegas</b> (2014-04-05)<br />
<span style="color: blue;"><b>Great Time to Get Involved in Cloud SIG</b></span><br />
Oracle Cloud solutions are one of the hottest topics coming into the Collaborate conference starting April 7, 2014 in Las Vegas. The IOUG Cloud Computing SIG will be supporting many of the cloud activities at the conference this week. Now is a great time to join the IOUG Cloud Special Interest Group (SIG); you can do so by clicking <a href="http://www.ioug.org/sigs">here</a>.<br />
<br />
<b><span style="color: blue;">Cloud SIG Meeting at Collaborate</span></b><br />
The IOUG Cloud SIG is growing significantly, and now is a great time to get involved. The <b>IOUG Cloud SIG meeting is April 8, 2014, 12:30 - 1:00 PM, in Level 3, Toscana 3701</b>. <br />
<br />
<b><span style="color: blue;">Cloud SIG Answering Cloud Services and DBaaS Questions</span></b><br />
Oracle offers many different cloud solutions across its products. The IOUG Cloud SIG will help answer the following questions. What does Database as a Service (DBaaS) mean? Does DBaaS have to be only a consolidation play? Does DBaaS involve self-provisioning or rapid provisioning? Does DBaaS mean that virtualization has to be involved? Does DBaaS mean that we have to consolidate multiple databases in a single operating system? Does DBaaS mean that we have to provide isolation for certain databases? Does DBaaS have to include showback and chargeback? Can we do DBaaS with schema-level consolidation? Does DBaaS involve Exadata? Does DBaaS involve Oracle RAC? With the introduction of Oracle Database 12c and the new pluggable database option, does DBaaS mean that we have to leverage pluggable databases? Does DBaaS mean that someone else has to host it for us, or can it be done in our own data centers? The IOUG Cloud SIG addresses understanding the different options and the different ways of deploying DBaaS. <br />
<br />
We are looking for new leaders, speakers for webinars, writers for use cases, volunteers and enthusiasts. One of the best things about user groups is the networking, interaction and sharing of knowledge. We look forward to meeting you. <br />
<br />
<b><span style="color: blue;">Cloud Sessions at Collaborate 2014</span></b><br />
Here is a list of Cloud sessions at Collaborate. When you look at these sessions you will see Oracle industry leaders in Exadata, RAC, Oracle Applications and Big Data. The user community leadership around Oracle Cloud solutions shows that cloud solutions span all Oracle platforms.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQD2cI0nBaEsUDr_WtWnf8pXZm8hEBA12ZvaTjyu3vE9ksS-zJD4lH1Z_WM1odoJG6lVUC8OAZNFK9lfjQowC0I6tkSnH8pp2PttNoSp_YMi3CpPzEQtmQcow6dqhEQUMg_gipxMQNGiRU/s1600/Screen+Shot+2014-04-05+at+10.04.33+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQD2cI0nBaEsUDr_WtWnf8pXZm8hEBA12ZvaTjyu3vE9ksS-zJD4lH1Z_WM1odoJG6lVUC8OAZNFK9lfjQowC0I6tkSnH8pp2PttNoSp_YMi3CpPzEQtmQcow6dqhEQUMg_gipxMQNGiRU/s1600/Screen+Shot+2014-04-05+at+10.04.33+AM.png" height="300" width="640" /></a></div>
<br />
<br />
<br />
<br />
<br />
<br />Anonymoushttp://www.blogger.com/profile/00291214843973166978noreply@blogger.com0tag:blogger.com,1999:blog-8585595729814983645.post-34298961757943264432014-04-05T04:59:00.001-06:002014-04-05T09:14:28.877-06:00Open Source Driving Innovation of Enterprise Hadoop<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEFCmOdMLVYgQBmEHU9EGCZVV0ELj-RnjbGaaibzH8-GtBoxsqcB-hPSARlSq9DFaNO2UD-OIZP8s9JzhCX_J1qyX2D-Rh2VmE1YuAGM3sYpATcRQTSmQJwI1jljq13lHNXNaeYqmPWMZ1/s1600/HortonworksElephant.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhEFCmOdMLVYgQBmEHU9EGCZVV0ELj-RnjbGaaibzH8-GtBoxsqcB-hPSARlSq9DFaNO2UD-OIZP8s9JzhCX_J1qyX2D-Rh2VmE1YuAGM3sYpATcRQTSmQJwI1jljq13lHNXNaeYqmPWMZ1/s1600/HortonworksElephant.jpg" /></a></div>
<br />
In the last seven months we have seen a tremendous level of innovation and maturity in the enterprise Hadoop platform. Hortonworks' HDP 2.0 and <span style="color: #3d85c6;"><a href="http://hortonworks.com/hdp/whats-new/">HDP 2.1</a></span> releases show the tremendous innovation being driven by open source today. This innovation is significantly improving the enterprise capabilities of Hadoop and changing the Hadoop landscape. It is difficult for proprietary releases of Hadoop to compete with the hundreds of thousands of lines of code being written by the Hadoop <a href="http://hadoop.apache.org/who.html"><span style="color: #3d85c6;">open source community</span></a>. Organizations ranging from Microsoft to Yahoo are adding their expertise and knowledge to the community. Proprietary and mixed open source/proprietary Hadoop solutions are under tremendous pressure from this pace of innovation, and Hadoop distributions that are not 100% open source are beginning to disappear.<br />
<br />
HDP 2.0 and 2.1 add a number of game-changing capabilities to Hadoop. These releases add comprehensive capabilities in areas such as scalability, multi-tenancy, performance, security, data lifecycle management, data governance, encryption, interactive query, high availability and fault tolerance. Key additions include:<br />
HDP 2.0:<br />
<ul>
<li>YARN, a distributed data operating system supporting applications with different run-time characteristics. YARN also adds scalability and improved fault tolerance to Hadoop.</li>
<li>NameNode High Availability.</li>
<li>Hadoop scalability to 10,000+ nodes.</li>
<li>New releases of Hadoop frameworks in key areas such as Hive and HBase. </li>
</ul>
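YARN generalizes Hadoop beyond the MapReduce model, but MapReduce remains the canonical example of the kind of application YARN schedules. As a toy, single-process illustration of the map/shuffle/reduce pattern (plain Python, not Hadoop API code):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) pairs, as a Hadoop mapper would.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big insight", "big platforms"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 1, 'insight': 1, 'platforms': 1}
```

In a real cluster, the map and reduce phases run on many nodes and the shuffle moves data across the network; YARN's contribution is managing the resources for these (and non-MapReduce) workloads.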
HDP 2.1:<br />
<ul>
<li>Interactive query capability in Hadoop. The Stinger project has increased the performance of interactive queries by 100 times through Hive optimization, container optimization, Tez integration and in-memory caching.</li>
<li>Hive has improved SQL compliance. </li>
<li>Perimeter security added to Hadoop with Knox. Enterprise Hadoop offers authorization, authentication and encryption. </li>
<li>Data Lifecycle Management and data governance with Falcon.</li>
<li>Enhanced HDFS security and multi-tenancy capabilities.</li>
<li>Resource Manager High Availability</li>
<li>NameNode federation, improving scalability and multi-tenancy, with stronger support for different run-time characteristics. </li>
<li>Linux and Windows releases synched.</li>
<li>HDP search with Apache Solr increases the capabilities of Hadoop.</li>
<li>Storm, providing scalable stream processing in Hadoop.</li>
<li>Spark, available as a tech preview, providing real-time in-memory processing.</li>
</ul>
<div>
Ambari:</div>
<div>
<ul>
<li>The Ambari management interface is now split from the HDP distribution, so the management tool and the Hadoop software distribution can be rev'd separately.</li>
<li>Support of software stacks Storm, Tez and Falcon.</li>
<li>Maintenance mode silences alerts for services, hosts and components for administration work.</li>
<li>Rolling restarts.</li>
<li>Service and component restarts.</li>
<li>Support of ZooKeeper configurations.</li>
<li>Supports decommissioning of NodeManagers and RegionServers.</li>
<li>Ability to refresh client-only configurations</li>
</ul>
</div>
<b>Succeeding with Big Data Projects: The Secret Sauce</b> (2014-03-15)<br />
The architectures and software frameworks being used for big data projects are constantly evolving. Modern data lakes increasingly use Storm for real-time streaming; NoSQL databases such as HBase, Accumulo and Cassandra for low-latency data access; and Kafka for message processing. Open source software such as CentOS, MySQL, Ganglia and Nagios is making deeper penetration into large enterprises. I am also seeing Python and JavaScript become more popular. Linux containers and Docker are being evaluated as future ways to increase hardware consolidation and utilization. <br />
<br />
The <a href="http://gigaom.com/2014/03/04/why-the-internet-of-things-is-big-datas-latest-killer-app-if-you-do-it-right/">Netflix data architecture </a>is reflective of the design patterns organizations are looking at.<br />
<br />
Over the next two years we will see a blending of SQL and NoSQL databases. The Stinger project (Hive optimization and Tez) has brought interactive query capability to the batch processing environments of Hadoop, which means the way organizations use Hadoop is changing quickly as well. Real-time query and ACID capabilities are next on the list of customer requests. As data lakes define the modern data architecture platform and more and more data gets stored in Hadoop, organizations want to use that data in many different ways. <br />
<br />
Successful big data projects share consistent patterns of success (the secret sauce). Technical infrastructure teams will be able to work with vendors to get the right hardware, stand up big data platforms and maintain them. However, big data projects can easily become science projects if the following are not addressed.<br />
<ul>
<li>Thought leadership that creates cultural change so an organization can innovate successfully. Big data is about making better business decisions faster with higher degrees of accuracy. A sense of urgency needs to exist.</li>
<li>An environment of collaboration and teamwork with everyone believing in a vision. The modern data lake helps to eliminate a lot of the technology and data silos that exist across different platforms and business units. Successful big data project environments eliminate the social, territorial and political silos that often exist in traditional teams. </li>
<li>A strong emphasis in data/schema design and ETL reference architectures. It's still all about the data. :) </li>
<li>The ability to build a plane while flying it. Big data technologies, environments, frameworks and methodologies are evolving quickly. Organizations need to be able to adapt and learn fast. </li>
</ul>
"Extinction is the rule. Survival is the exception," as Carl Sagan said. Transforming an organization around big data is one of the biggest challenges it will face. Everyone is concerned about developing the technical skills needed to succeed with big data, but the development of the internal people is just as important.<br />
<br />
<b>Oracle Database Infrastructure as a Service Handbook</b> (2014-02-14)<br />
The <i>Oracle Database Infrastructure as a Service Handbook</i> has been in the making for the last three years. Charles, Steve and I have been key evangelists in promoting the virtualization of tier-one platforms on-premise over that time. We are taking all of our presentations, insights, experiences and best practices and putting them into the book.<br />
<br />
<ul>
<li>Charles Kim - Viscosity NA, Oracle ACE Director, VCP, vExpert; known as "Mr. CNN" in the VMware ecosystem, a key person to bring into extremely high-profile Oracle environments being considered for virtual and cloud platforms.</li>
<li>George Trujillo - Hortonworks, VCP, vExpert, Double Oracle ACE, former Tier One Specialist for VMware Center of Excellence team.</li>
<li>Steven Jones - VMware, VCP, internationally recognized expert in VMware infrastructures. </li>
<li>Sudhir Balasubramanian - VMware vExpert, specializes in the deployment of Oracle infrastructures on virtual platforms.</li>
</ul>
<br />
The <i>Oracle Database Infrastructure as a Service Handbook</i> is in final review before release. We look forward to announcing when the book is available.<br />
<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTcNBnbVB2XReE8Mwz1PoxY6taD7gtExLVlNScpebYaT06KA5OZSqG7ha0qJde_SwwcNt2cPqClge5yWCjHocg9XPsZEagcOEDucqcVxTO9tYuR07Ndpw2z9Hg1xHTUxdFV8t4EyKNzxfa/s1600/OracleDBaaS_bookcover.png" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTcNBnbVB2XReE8Mwz1PoxY6taD7gtExLVlNScpebYaT06KA5OZSqG7ha0qJde_SwwcNt2cPqClge5yWCjHocg9XPsZEagcOEDucqcVxTO9tYuR07Ndpw2z9Hg1xHTUxdFV8t4EyKNzxfa/s1600/OracleDBaaS_bookcover.png" height="400" width="306" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Database Infrastructure as a Service</td></tr>
</tbody></table>
<br />
<b>Choosing MySQL or Oracle for your Hadoop Repositories</b> (2014-01-25)<br />
When setting up a Hadoop cluster, the administration team has to decide which relational database to use for the Hadoop metadata repositories. I strongly recommend using one type of relational database for all of the repositories instead of different database vendors for different frameworks.<br />
<br />
Hadoop requires metadata repositories (relational databases) for Ambari (management), HiveServer2 (SQL), Oozie (scheduler and workflow tool) and Hue (Hadoop UI). Choices include Postgres, MySQL, Oracle and Derby. The databases holding the Hadoop metadata repositories have to be backed up and maintained like any other database server.<br />
<br />
I recommend using MySQL for the following reasons:<br />
<ul>
<li>Oracle is too heavyweight a database server; its full resources will not be utilized. The Oracle database server takes extra memory, disk space and CPU that will not be taken advantage of.</li>
<li>Postgres is a good, solid database, but it has not reached a tipping point in adoption. I do not see a lot of Postgres databases when I visit customers, and I do not see Postgres gaining in the market.</li>
<li>Derby (used with Oozie) and SQLite (used with Hue) are not robust enough for a heavy production environment. I would only use these databases for a small Hadoop cluster for personal development.</li>
</ul>
MySQL has a lot of features that make it ideal as a database repository for different Hadoop frameworks. They include:<br />
<ul>
<li>Extremely fast and lightweight.</li>
<li>Relatively easy to administer and backup.</li>
<li>Replication is very easy to set up and maintain.</li>
<li>MySQL has extremely high adoption and it is easy to find resources to manage it.</li>
</ul>
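As a sketch of how simple the replication side can be, the binary-log settings on a source server replicating the repository databases might look like the fragment below. The server ID, log name and database name (`hive_metastore`) are placeholders, not a tested production configuration; a replica would still need its own server-id and a `CHANGE MASTER TO` setup.

```ini
[mysqld]
server-id     = 1
log-bin       = mysql-bin
binlog-do-db  = hive_metastore
```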
If Oracle is the corporate standard and the database and Hadoop administration team prefer to use Oracle, I have provided links for setting up Oracle for the primary Hadoop frameworks.<br />
<div class="p1">
</div>
<ul>
<li>Main documents page: <span class="s2"><a href="http://docs.hortonworks.com/">http://docs.hortonworks.com/</a></span></li>
<li>Reference Guides - Supported Database Matrix: <span class="s2"><a href="http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.0/bk_reference/content/db-support-matrix.html">http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.0/bk_reference/content/db-support-matrix.html</a></span></li>
<li>Oracle Steps for Ambari: <span class="s2"><a href="http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.0/bk_using_Ambari_book/content/ambari-chaplast-3.html">http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.0/bk_using_Ambari_book/content/ambari-chaplast-3.html</a></span></li>
<li>Oracle Steps for Hive: <span class="s2"><a href="http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.0/bk_using_Ambari_book/content/ambari-chaplast-1.html">http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.0/bk_using_Ambari_book/content/ambari-chaplast-1.html</a></span></li>
<li>Oracle for Oozie: <span class="s2"><a href="http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.0/bk_using_Ambari_book/content/ambari-chaplast-2.html">http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.0/bk_using_Ambari_book/content/ambari-chaplast-2.html</a></span></li>
</ul>
<br />
<br />
<div class="p1">
<br /></div>
<b>How to Learn YARN and Hadoop 2</b> (2014-01-04)<br />
I previously wrote a blog post on <a href="http://tinyurl.com/bgogsdf">How to Learn Hadoop</a> that received a lot of positive feedback, and I have been getting a number of requests to update it for YARN and Hadoop 2. Everyone wants to learn the cool secrets and tricks, but knowledge always starts with the fundamentals. The recommendations here are meant for readers who are serious about learning Hadoop.<br />
<br />
<b>Learn the Basic Concepts First</b><br />
To understand Hadoop you have to start by understanding Big Data.<br />
<ul>
<li><i><a href="http://www.amazon.com/s/ref=nb_sb_ss_i_1_15?url=search-alias%3Daps&field-keywords=disruptive%20possibilities%20how%20big%20data%20changes%20everything&sprefix=disruptive+poss%2Caps%2C178">Disruptive Possibilities: How Big Data Changes Everything</a></i> - a must read for anyone getting started in big data. It is a short, easy read that lays the foundation for big data and emphasizes why Hadoop is needed. </li>
</ul>
<div>
These foundational whitepapers explain the reasoning behind Hadoop's processing model and distributed storage. They are not easy reads, but getting through them will really help all of your future learning around Hadoop, because they define the "context for Hadoop". Some of the papers are older, but the core concepts they discuss, and the reasons behind them, contain invaluable keys to understanding Hadoop.<br />
<ul>
<li><a href="https://54e57bc8-a-62cb3a1a-s-sites.googlegroups.com/site/2013socc/home/program/a5-vavilapalli.pdf?attachauth=ANoY7cq-loM3Z92Mzt-paa1kFpcK32YjK4UEx8uCoYrtWKX3IAPaURJPtd_ybfBDcFAQAICnxbsuNCspYVfJAYasTRcUvMydRMgll-Es_c2lB9HEL3bRaf9CqVyHqOCiBUdbQ5wvSzErf5U9mNRVfrd05JGC4Wf_Wbs1bDkuewr33CB8gkXirjQzRcbdJXMSKDLBUCzh-_p1sZm4haIaRKapVzj7Q7XCikFUEA84Wj2EhRUS30kLroE%3D&attredirects=0">Apache Hadoop YARN: Yet Another Resource Negotiator</a></li>
<li><a href="http://tinyurl.com/llweodb">MapReduce: Simplified Data Processing on Large Clusters</a> </li>
<li><a href="http://storageconference.org/2010/Papers/MSST/Shvachko.pdf">The Hadoop Distributed File System</a></li>
<li><a href="http://infolab.stanford.edu/~ragho/hive-icde2010.pdf">Hive - A Petabyte Scale Data Warehouse using Hadoop</a></li>
</ul>
<b>You Have to Understand the Data</b></div>
<div>
Hadoop clusters are built to process and analyze data. A Hadoop cluster becomes an important component in any enterprises data platforms, so you need to understand Hadoop from a data perspective.<br />
<ul>
<li><i><a href="http://www.amazon.com/Big-Data-Principles-practices-scalable/dp/1617290343/ref=sr_1_1?s=books&ie=UTF8&qid=1395634273&sr=1-1&keywords=Big+data+nathan+marz">Big Data</a></i>, by Nathan Marz - the book does a great job of teaching core concepts and fundamentals, and it provides a great perspective on the data architecture of Hadoop. It builds a solid foundation and helps you understand the Lambda architecture. You may need to get it from MEAP (http://www.manning.com/marz/) if it has not been released yet; the print release is scheduled for March 28, 2014. Any DBA, data architect, or anyone with a background in data warehousing and business intelligence should consider this a must read.</li>
</ul>
</div>
<div>
<b>Additional Reading</b></div>
<div>
We are in a transition period with Hadoop. Most of the books out today are on Hadoop 1.x and MapReduce v1 (classic). Hadoop 2 is GA, the distributed processing model is YARN and Tez will be an important part of processing data with Hadoop in the future. There are not a lot of books out yet on YARN and Hadoop 2 frameworks. You'll need to spend some time with the Hadoop documentation. :)</div>
<div>
<ul>
<li><i><a href="http://www.amazon.com/Apache-Hadoop-YARN-Processing-Addison-Wesley/dp/0321934504/ref=sr_1_1?s=books&ie=UTF8&qid=1380731696&sr=1-1&keywords=Hadoop+YARN">Apache Hadoop Yarn</a></i>, by Arun Murthy, Jeffrey Markham, Vinod Vavilapalli, Doug Eadline</li>
<li><i>Hadoop MapReduce v2 Cookbook (2nd Edition)</i></li>
<li><i>Hadoop: The Definitive Guide (4th Edition)</i>, by Tom White </li>
</ul>
<b>Getting Hands on Experience and Learning Hadoop in Detail</b></div>
<div>
A great way to start getting hands-on experience and learning Hadoop through tutorials, videos and demonstrations is with the virtual machines available from the different Hadoop distribution vendors. These virtual machines, or sandboxes, are an excellent platform for learning and skill development, and their tutorials, videos and demonstrations are updated on a regular basis. The sandboxes are usually available as VirtualBox, Hyper-V or VMware virtual machines. An additional 4GB of RAM and 2GB of storage is recommended for the virtual machines. If your laptop does not have a lot of memory, you can go into the VM settings and cut the VM's RAM down to about 1.5 - 2GB. This is likely to impact the performance of the VM, but it will at least let it run on a minimally configured laptop.<br />
<ul>
<li><a href="http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-4-x.html">Cloudera VM</a></li>
<li><a href="http://hortonworks.com/hdp/downloads/">Hortonworks VM</a></li>
<li><a href="https://www.mapr.com/products/mapr-sandbox-hadoop">MapR VM</a></li>
<li><a href="http://blog.pivotal.io/pivotal/products/pivotal-hd-ga">Pivotal VM</a></li>
<li><a href="https://www-01.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.product.doc/doc/download.html">IBM BigInsights VM</a></li>
</ul>
<b>Other books to consider:</b></div>
<div>
<div style="direction: ltr; margin-bottom: 0pt; margin-left: 0.38in; margin-top: 6.72pt; text-indent: -0.38in; unicode-bidi: embed; word-break: normal;">
<span style="text-indent: -0.38in;">There are now a lot of books on Hadoop, its different frameworks and the NoSQL databases, so you can find one that fits your personal reading style. There are also lots of YouTube videos, and with a little time you can find ones of high quality. </span><br />
<b><br /></b>
<b>Engineering Blogs:</b><br />
<ul>
<li><span style="text-indent: -0.38in;">hortonworks.com/community/</span></li>
<li><span style="text-indent: -0.38in;">blog.cloudera.com/blog/</span></li>
<li><span style="text-indent: -0.38in;">engineering.linkedin.com/hadoop</span></li>
<li><span style="text-indent: -0.38in;">engineering.twitter.com</span></li>
<li><span style="text-indent: -0.38in;">developer.yahoo.com/hadoop/</span></li>
</ul>
</div>
<div style="direction: ltr; margin-bottom: 0pt; margin-left: 0.38in; margin-top: 6.72pt; text-indent: -0.38in; unicode-bidi: embed; word-break: normal;">
<b>Hadoop Ecosystem</b><br />
<ul>
<li><a href="http://nosql.mypopescu.com/post/44639201775/the-hadoop-ecosystem-infographic" style="font-family: Helvetica;">http://nosql.mypopescu.com/post/44639201775/the-hadoop-ecosystem-infographic</a></li>
</ul>
<b>What is Hadoop?</b><br />
<div>
<ul>
<li><span style="text-indent: 0in;">Mark Madsen - </span><a href="http://www.insideanalysis.com/2012/12/what-hadoop-is-what-is-isnt/" style="text-indent: 0in;">http://www.insideanalysis.com/2012/12/what-hadoop-is-what-is-isnt/</a></li>
<li><span style="text-indent: 0in;">Jim Walker - </span><span style="text-indent: 0in;"> </span><a href="http://www.youtube.com/watch?v=j6toE6Ke7k4" style="text-indent: 0in;">http://www.youtube.com/watch?v=j6toE6Ke7k4</a></li>
<li><span style="text-indent: 0in;">Expert panel - </span><a href="http://www.infoq.com/articles/HadoopVirtualPanel" style="text-indent: 0in;">http://www.infoq.com/articles/HadoopVirtualPanel</a></li>
</ul>
</div>
<div>
<div apple-content-edited="true">
<div style="-webkit-line-break: after-white-space; word-wrap: break-word;">
<div style="-webkit-line-break: after-white-space; word-wrap: break-word;">
<div>
<div style="font-family: Times;">
<b>Best Practices for Apache Hadoop Hardware</b></div>
</div>
<div style="font-family: Times;">
</div>
<ul>
<li><span style="font-family: Calibri, sans-serif; font-size: 14px; text-indent: -0.38in;"><a href="http://www.cloudera.com/content/cloudera/en/documentation/reference-architecture/latest.html">Cloudera Reference Architecture</a></span></li>
<li style="font-family: Calibri, sans-serif; font-size: 14px;"><a href="https://hortonworks.com/blog/best-practices-for-selecting-apache-hadoop-hardware/">Hortonworks Hardware Best Practices</a></li>
<li><a href="http://www.cisco.com/en/US/docs/unified_computing/ucs/UCS_CVDs/CPA_for_BigData_with_Hortonworks.html" style="font-family: Calibri, sans-serif; font-size: 14px; text-indent: -0.38in;">Cisco Reference Architecture</a></li>
<li><a href="http://hortonworks.com/wp-content/uploads/2013/09/Hortonworks-Reference-Configuration-for-Dell.pdf" style="font-family: Calibri, sans-serif; font-size: 14px; text-indent: -0.38in;">Dell Reference Architecture</a></li>
<li><a href="http://hortonworks.com/wp-content/uploads/2013/10/HP_Reference_Architecture_Jun13.pdf" style="font-family: Calibri, sans-serif; font-size: 14px; text-indent: -0.38in;">HP Reference Architecture</a></li>
</ul>
</div>
<div style="-webkit-line-break: after-white-space; word-wrap: break-word;">
<div style="text-indent: -0.38in;">
<b>Data</b></div>
<div style="text-indent: -0.38in;">
</div>
<ul style="text-indent: -0.38in;">
<li style="font-family: Helvetica;"><a href="http://www.youtube.com/watch?v=y3nFfsTnY3M">Analyzing Sentiment Data</a></li>
</ul>
</div>
</div>
</div>
</div>
</div>
</div>
<div>
</div>
<br />
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: Times; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<div style="margin: 0px;">
Have fun and I look forward to any additional recommendations.</div>
</div>
Anonymoushttp://www.blogger.com/profile/00291214843973166978noreply@blogger.com0tag:blogger.com,1999:blog-8585595729814983645.post-33936747944153125712013-11-10T15:24:00.001-07:002013-11-10T16:13:31.079-07:00Thoughts on HDP2 and the Evolving Ecosystem around HadoopI've been working with in-memory databases and Hadoop since my days at VMware as a Tier One Specialist. I've spent the last year focusing 100% of my attention on the Hortonworks Data Platform (HDP) and NoSQL databases. In the last few months I've done a deep immersion in HDP2 and all the new features around Apache Hadoop 2 from the HDP perspective, as well as the changes in the ecosystem around Hadoop.<br />
<br />
The analogy I've given my Oracle friends is that HDP2 is as transformational to HDP1 as Oracle 8 was to Oracle 7. Oracle 8 opened up lots of new functionality and features that changed what Oracle could do for businesses. Oracle RAC, Streams, Data Guard and Partitioning were the beginning of lots of new features that changed the way companies could use database software. HDP2 will have the same type of transformational effect on Apache Hadoop 1 customers. It's not just that HDP2 has new features, scalability, performance enhancements and high availability. It's that HDP2 is going to change how customers use Hadoop. When I look at features like YARN, Knox (security), Tez (interactive queries), Falcon (data lifecycle management) and Accumulo, they completely change the potential of Hadoop and the way it will be used. HDP2 is definitely not your grandfather's version of Hadoop. :) Then you look at the growth of the Hadoop ecosystem, with new features and products from Spark, Storm, Kafka, Splunk, WanDisco, Rackspace, etc. Software products in the Hadoop ecosystem are transforming and evolving as fast as HDP itself. You also look at Microsoft (HDInsight) and Rackspace (OpenStack) and you see the needle moving on Hadoop being used in the cloud. Lastly, you look at the connectors, loaders and interfaces being written by the database vendors, as well as the products coming from Informatica, Ab Initio and Quest, and you see everyone is all in with Hadoop. <br />
<br />
I don't know what Hadoop will look like a year from now, but with the speed at which open source is changing the landscape, we know that a year from now Hadoop will be used in ways we haven't even imagined yet. An old quote: <i>"</i><span style="background-color: #e5e5dd; color: #330000; font-family: georgia, 'bookman old style', 'palatino linotype', 'book antiqua', palatino, 'trebuchet ms', helvetica, garamond, sans-serif, arial, verdana, 'avante garde', 'century gothic', 'comic sans ms', times, 'times new roman', serif;"><i>The race is not always to the swift, but to those who keep on running."</i> </span> For those that have jumped onto the Hadoop highway, you'd better keep running because things are not slowing down. :) Anonymoushttp://www.blogger.com/profile/00291214843973166978noreply@blogger.com0tag:blogger.com,1999:blog-8585595729814983645.post-47756661491642452472013-11-10T10:25:00.004-07:002013-11-10T14:55:15.908-07:00Understanding What is a NoSQL Database?<div class="MsoNormal">
<b><span style="color: blue;">Everyone is trying to Understand NoSQL Databases</span></b></div>
<div class="MsoNormal">
When I work with customers that are looking at Big Data and Hadoop solutions, I am often asked to define what NoSQL databases are. There is a lot of information being written about NoSQL databases because they are one of the hot technology areas of Big Data. I'm going to help explain NoSQL databases to make it easier to understand how NoSQL databases fit into Big Data ecosystems. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b><span style="color: blue;">Understanding Big Data</span></b></div>
<div class="MsoNormal">
Traditional systems (relational databases and data warehouses) have been the way most organizations store, manage and analyze data. These traditional systems are not going anywhere; what they do, they do well. However, today's data environment has changed significantly, and traditional systems have difficulty working with a large part of today's data (Big Data). Big data has been given a lot of different definitions, but what it really is, is a data environment that meets one or more of the following criteria:</div>
<div class="MsoNormal">
</div>
<ul>
<li><span style="text-indent: -0.25in;">A large amount of data to be stored, processed and analyzed.</span></li>
<li><span style="text-indent: -0.25in;">Data that often has large amounts of semi-structured or unstructured data.</span></li>
<li><span style="text-indent: -0.25in;">Data that can have large data ingestion rates.</span></li>
<li><span style="text-indent: -0.25in;">Large amounts of data that have to be processed quickly.</span><span style="text-indent: -0.25in;"> </span></li>
</ul>
<div class="MsoNormal">
<b><span style="color: blue;">Traditional Systems Were Not Designed For Big Data</span></b><o:p></o:p></div>
<div class="MsoNormal">
Traditional systems were not designed from their foundations to handle this type of data environment. Big data is what you have when it becomes too difficult or expensive for traditional systems to handle the data. Organizations are finding that data about you can be just as critical to business success as the data generated internally. Organizations are almost desperate to correlate internal data with data that is generated externally (social media, VoIP, machine data, RFID, geographical coordinates, video, sound, etc.). NoSQL systems are designed from the ground up to deal with this type of data environment cost effectively. Traditional database vendors do not want to miss out on the Big Data wave, but they are providing add-ons to their existing systems, whereas NoSQL databases are designed from the ground up to work with Big Data. The other challenge with traditional systems is that they sell very expensive hardware and software licenses compared to relatively inexpensive open source solutions.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b><span style="color: blue;">What is NoSQL?</span></b></div>
<div class="MsoNormal">
NoSQL is a database management system that has characteristics and capabilities that can address big data in ways that traditional databases were not designed for. NoSQL solutions usually have the following features or characteristics:</div>
<div class="MsoNormal">
</div>
<ul>
<li><span style="text-indent: -0.25in;">Scalability of big data (100s of TB to PBs). </span><span style="text-indent: -0.25in;"> </span><span style="text-indent: -0.25in;">Horizontal scalability with x86 commodity hardware.</span></li>
<li><span style="text-indent: -0.25in;">Schema-on-read (versus traditional databases' schema-on-write) makes it much easier to work with semi-structured and unstructured data. </span></li>
<li><span style="text-indent: -0.25in;">Data spread out using distributed file systems that use replicas for high availability.</span></li>
<li><span style="text-indent: -0.25in;">High availability and self-healing capability.</span></li>
<li><span style="text-indent: -0.25in;">Connectivity can include, but is not limited to, SQL, Thrift, REST, JavaScript and APIs.</span></li>
</ul>
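The schema-on-read point deserves a concrete illustration. In this minimal Python sketch (the field names are hypothetical, not from any particular product), no structure is enforced when records are written; the reader applies the schema, supplying defaults for fields that are missing:

```python
import json

# Raw, semi-structured records: no schema was enforced at write time,
# so fields vary from record to record (schema-on-read).
raw_lines = [
    '{"user": "amy", "clicks": 3}',
    '{"user": "raj", "clicks": 7, "referrer": "search"}',
    '{"user": "lee"}',  # missing "clicks" entirely
]

def read_with_schema(lines):
    """Apply a schema at read time: project the fields we care about,
    supplying defaults where the raw data is incomplete."""
    for line in lines:
        record = json.loads(line)
        yield {"user": record["user"], "clicks": record.get("clicks", 0)}

total_clicks = sum(r["clicks"] for r in read_with_schema(raw_lines))
print(total_clicks)  # 10
```

A schema-on-write system would have rejected the third record at load time; here the reader simply decides what the missing field means.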
<br />
<div class="MsoNormal">
Here is the Wikipedia definition of NoSQL.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<i><span style="font-family: Helvetica;">A <b>NoSQL</b> database provides a mechanism for <a href="http://en.wikipedia.org/wiki/Data_storage"><span style="color: #092f9d; text-decoration: none;">storage</span></a> and <a href="http://en.wikipedia.org/wiki/Data_retrieval"><span style="color: #092f9d; text-decoration: none;">retrieval</span></a> of data that employs less constrained <a href="http://en.wikipedia.org/wiki/Consistency_model"><span style="color: #092f9d; text-decoration: none;">consistency models</span></a> than traditional <a href="http://en.wikipedia.org/wiki/Relational_database"><span style="color: #092f9d; text-decoration: none;">relational databases</span></a>. Motivations for this approach include simplicity of design, <a href="http://en.wikipedia.org/wiki/Horizontal_scaling#Horizontal_and_vertical_scaling"><span style="color: #092f9d; text-decoration: none;">horizontal scaling</span></a> and finer control over availability. NoSQL databases are often highly optimized <span style="color: #092f9d;">key–value stores</span> intended for simple retrieval and appending operations, with the goal being significant performance benefits in terms of latency and <a href="http://en.wikipedia.org/wiki/Throughput"><span style="color: #092f9d; text-decoration: none;">throughput</span></a>. NoSQL databases are finding significant and growing industry use in <a href="http://en.wikipedia.org/wiki/Big_data"><span style="color: #092f9d; text-decoration: none;">big data</span></a> and <a href="http://en.wikipedia.org/wiki/Real-time_web"><span style="color: #092f9d; text-decoration: none;">real-time web</span></a> applications. NoSQL systems are also referred to as "Not only SQL" to emphasize that they may in fact allow <a href="http://en.wikipedia.org/wiki/SQL"><span style="color: #092f9d; text-decoration: none;">SQL</span></a>-like query languages to be used.</span><o:p></o:p></i></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The term NoSQL describes an approach to data management more than a rigid definition. There are different types of NoSQL databases; they often share certain characteristics but are optimized for specific types of data, which in turn requires different capabilities and features. NoSQL may mean "Not only SQL", or it may mean "No" SQL. A No SQL database may use APIs or JavaScript to access data instead of traditional SQL. NoSQL datastores may be optimized for key-value, columnar, document-oriented, XML, graph and object data structures. NoSQL databases are very scalable, have high availability and provide a high level of parallelization for processing large volumes of data quickly. NoSQL solutions are evolving constantly. </div>
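To make the data-model differences concrete, here is a small sketch (all names are illustrative, not any particular product's API) of how the same logical record looks under key-value, document-oriented and column-family layouts:

```python
# The same logical record, laid out the way different NoSQL families store it.

# Key-value: one opaque value per key; the store doesn't interpret it.
kv_store = {"user:1001": '{"name": "amy", "city": "Denver"}'}

# Document-oriented: the value is a structured, queryable document.
doc_store = {"user:1001": {"name": "amy", "city": "Denver"}}

# Column-family (columnar): row key -> family -> column -> value,
# so columns can be added per row without a fixed schema.
col_store = {"user:1001": {"info": {"name": "amy", "city": "Denver"}}}

print(doc_store["user:1001"]["city"])          # Denver
print(col_store["user:1001"]["info"]["name"])  # amy
```

The key-value store is fastest for simple get/put but cannot query inside the value; the document and column-family layouts trade some of that simplicity for addressable structure.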
<br />
<div class="MsoNormal">
A number of the NoSQL databases can point to <a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf">Google's BigTable</a> design as their parent source. Characteristics of Google BigTable include:<o:p></o:p></div>
<div class="MsoNormal">
</div>
<ul>
<li>Designed to support massive scalability of tens to hundreds of petabytes.</li>
<li>Move the programs to the data versus relational databases that move the data to the programs (memory).</li>
<li>Data is sorted using row keys.</li>
<li>Designed to be deployed in a clustered environment using x86 commodity hardware.</li>
<li>Supports compression algorithms.</li>
<li>Distributes data across local disk drives on commodity hardware supporting massive levels of IOPS.</li>
<li>Supports replicas of data for high availability. </li>
<li>Uses a parallel execution framework like MapReduce or something similar for extremely high parallelization capabilities.</li>
</ul>
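The "data is sorted using row keys" point above is what makes range scans cheap. Here is a toy Python stand-in (not BigTable's actual implementation) showing why: when rows are kept sorted by key, a scan is a binary search plus a contiguous slice.

```python
import bisect

class ToyTable:
    """Toy BigTable-style store: rows kept sorted by row key,
    so a range scan is a binary search plus a contiguous slice."""
    def __init__(self):
        self.keys = []   # row keys, kept in sorted order
        self.rows = {}   # row key -> row data

    def put(self, key, row):
        if key not in self.rows:
            bisect.insort(self.keys, key)  # maintain sort order on insert
        self.rows[key] = row

    def scan(self, start, stop):
        """Return all rows with start <= key < stop."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        return [(k, self.rows[k]) for k in self.keys[lo:hi]]

t = ToyTable()
t.put("user#002", {"name": "raj"})
t.put("user#001", {"name": "amy"})
t.put("user#003", {"name": "lee"})
print([k for k, _ in t.scan("user#001", "user#003")])  # ['user#001', 'user#002']
```

This is also why row-key design matters so much in HBase and Accumulo: keys that share a prefix end up adjacent, so related rows can be read in one scan.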
<br />
<div class="MsoNormal">
The two primary NoSQL databases supported by the <a href="http://hortonworks.com/products/hdp/">Hortonworks Data Platform</a> (HDP) are HBase and Accumulo. Here are some examples of NoSQL databases:<o:p></o:p></div>
<div class="MsoNormal">
</div>
<ul>
<li><b>HBase </b>(columnar) – designed for optimized scanning of column data</li>
<li><b>Accumulo</b> – Key-value datastore that can maintain data consistency at the petabyte level, read and write in near real-time and contains cell-level security. Accumulo was developed at the National Security Agency.</li>
<li><b>Cassandra </b>– A real-time datastore that is highly scalable. Uses a peer-to-peer distributed system. Key oriented using column families. Supports primary and secondary indexes. Uses CQL as its SQL-like language.</li>
<li><b>MongoDB </b>(document-oriented) – Highly scalable database runs MapReduce jobs using JavaScript</li>
<li><b>CouchDB</b> (document-oriented) – Highly scalable database that can survive just about anything except maybe a nuclear bomb. Uses JavaScript to access data.</li>
<li><b>Terracotta</b> – Uses a big memory approach to deliver fast high scalable systems.</li>
<li><b>Voldemort</b> – A key-value distributed storage system.</li>
<li><b>MarkLogic</b> – Highly scalable XML based database management system.</li>
<li><b>Neo4j </b>(graph oriented) – A graph database that allows you to access your data in the form of a graph, giving you fast access to information associated with nodes and relationships.</li>
<li><b>VMware vFabric GemFire</b> (object entries) – Uses key-value pairs for in-memory data management.</li>
<li><b>Redis </b>(key-value) – String oriented keys can be hashes, lists or sets. Entire data set is cached in memory with disk persistence. Highly scalable.</li>
<li><b>Riak</b> (key-value) – Text oriented, scalable system based on Amazon's Dynamo. </li>
</ul>
<br />
<div class="MsoNormal">
NoSQL databases are not designed to replace the traditional RDBMS. NoSQL databases are becoming part of the enterprise data platform for organizations, providing functionality that traditional systems do not handle well, whether due to the size or complexity of the data or the rate at which it is ingested.<o:p></o:p><br />
<br />
<b><span style="color: blue;">NoSQL and SQL Analogies</span></b><br />
Here is another way of looking at NoSQL and SQL from a coauthor and friend, Steven Jones.<br />
<div class="p1">
Think of SQL and No SQL in terms of distinctions. Here are some word pictures of distinctions:</div>
<div class="p1">
No SQL databases deliver fast answers from big, messy piles of data.</div>
<div class="p1">
SQL databases deliver deliberate, logically derived answers from data that is well organized and groomed to its essentials. </div>
<div class="p1">
Think of them as the odd couple: one is Felix and the other is Oscar.</div>
<div class="p1">
Or No SQL is detective Columbo and SQL is detective Monk.</div>
<div class="p1">
One is an answer from a hot mess; the other is an architect's blueprint where logical reasoning reduces truth to its essence.</div>
<div class="p1">
<br /></div>
<div class="p1">
No SQL is rap or dubstep, SQL is classical.</div>
<div class="p2">
<br /></div>
<div class="p1">
SQL assumes, by its order and structure, that you already know the questions to be asked.</div>
<div class="p1">
No SQL assumes no order until you can think of a question or a need in the moment.</div>
<div class="p2">
<br /></div>
<div class="p1">
SQL is mathematically derived.</div>
<br />
<div class="p1">
No SQL is merely reasonably ordered.</div>
</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b><span style="color: blue;">Rankings of Different Types of Databases from DB-Engines <o:p></o:p>(November 2013)</span></b></div>
<div class="MsoNormal">
DB-Engines ranks <a href="http://db-engines.com/en/ranking/wide+column+store">Wide Column Stores</a> <o:p></o:p></div>
<div class="MsoNormal">
</div>
<ol>
<li>Cassandra</li>
<li>HBase</li>
<li>Accumulo</li>
<li>Hypertable</li>
</ol>
<br />
<div class="MsoNormal">
DB-Engines ranks <a href="http://db-engines.com/en/ranking/document+store">Document Stores</a> <o:p></o:p></div>
<div class="MsoNormal">
</div>
<ol>
<li>MongoDB</li>
<li>CouchDB</li>
<li>Couchbase</li>
<li>RavenDB</li>
<li>Gemfire</li>
</ol>
<br />
<div class="MsoNormal">
DB-Engines ranks <a href="http://db-engines.com/en/ranking/graph+dbms">Graph DBMS</a> <o:p></o:p></div>
<div class="MsoNormal">
</div>
<ol>
<li>Neo4J</li>
<li>Titan</li>
<li>OrientDB</li>
<li>Dex</li>
</ol>
<br />
<div class="MsoNormal">
DB-Engines ranks <a href="http://db-engines.com/en/ranking/key-value+store">Key Value Stores </a> <o:p></o:p></div>
<div class="MsoNormal">
</div>
<ol>
<li>Redis</li>
<li>Memcached</li>
<li>Riak</li>
<li>Ehcache</li>
<li>DynamoDB</li>
</ol>
Note: Berkeley DB (7<sup>th</sup>), Coherence (8<sup>th</sup>), Oracle NoSQL (10<sup>th</sup>)<br />
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
DB-Engines ranks <a href="http://db-engines.com/en/ranking/object+oriented+dbms">Object Oriented DBMS</a> <o:p></o:p></div>
<div class="MsoNormal">
</div>
<ol>
<li>Cache</li>
<li>Db4o</li>
<li>Versant Object Database</li>
</ol>
<br />
<div class="MsoNormal">
DB-Engines ranks <a href="http://db-engines.com/en/ranking/relational+dbms">Relational DBMS</a> <o:p></o:p></div>
<div class="MsoNormal">
</div>
<ol>
<li>Oracle</li>
<li>MySQL</li>
<li>Microsoft SQL Server</li>
<li>PostgreSQL</li>
<li>DB2</li>
</ol>
<br />
<div class="MsoNormal">
Note: Teradata (9<sup>th</sup>), Hive (12<sup>th</sup>), SAP HANA (16<sup>th</sup>)<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<div>
<br /></div>
Anonymoushttp://www.blogger.com/profile/00291214843973166978noreply@blogger.com1tag:blogger.com,1999:blog-8585595729814983645.post-56272379890739446702013-11-04T13:27:00.001-07:002013-11-10T14:55:38.071-07:00Starting HDP2 Services in the Right OrderOne of the top ten issues new administrators have with Hadoop is starting and stopping Hadoop services in the right order. When starting and stopping a Hadoop 2 cluster (HDP2) use the following order to keep you between the yellow lines.<br />
<br />
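The right order boils down to dependencies: HDFS must be up before anything that stores data in it, and services stop in the reverse order they started. As a sketch (the service names and dependency map below are illustrative, not the definitive HDP2 list; check the Hortonworks documentation for your cluster), the ordering is a topological sort:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Illustrative dependency map: service -> services it depends on.
# The key idea: bring up coordination and storage layers first,
# then compute, then the services built on top of them.
deps = {
    "ZooKeeper": set(),
    "HDFS": set(),
    "YARN": {"HDFS"},
    "MapReduce JobHistory": {"YARN"},
    "HBase": {"HDFS", "ZooKeeper"},
    "Hive": {"HDFS", "YARN"},
}

start_order = list(TopologicalSorter(deps).static_order())
stop_order = list(reversed(start_order))  # stop in the reverse of start order

print("start:", start_order)
print("stop: ", stop_order)
```

Thinking of startup as a dependency graph, rather than a memorized checklist, also tells you what is safe to restart in isolation: anything with no running dependents.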
<!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
DefSemiHidden="true" DefQFormat="false" DefPriority="99"
LatentStyleCount="276">
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 5"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 5"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 5"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 5"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 5"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 5"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 5"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 5"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 5"/>
<w:LsdException Locked="false" Priority="60" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Shading Accent 6"/>
<w:LsdException Locked="false" Priority="61" SemiHidden="false"
UnhideWhenUsed="false" Name="Light List Accent 6"/>
<w:LsdException Locked="false" Priority="62" SemiHidden="false"
UnhideWhenUsed="false" Name="Light Grid Accent 6"/>
<w:LsdException Locked="false" Priority="63" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 1 Accent 6"/>
<w:LsdException Locked="false" Priority="64" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Shading 2 Accent 6"/>
<w:LsdException Locked="false" Priority="65" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 1 Accent 6"/>
<w:LsdException Locked="false" Priority="66" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium List 2 Accent 6"/>
<w:LsdException Locked="false" Priority="67" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 1 Accent 6"/>
<w:LsdException Locked="false" Priority="68" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 2 Accent 6"/>
<w:LsdException Locked="false" Priority="69" SemiHidden="false"
UnhideWhenUsed="false" Name="Medium Grid 3 Accent 6"/>
<w:LsdException Locked="false" Priority="70" SemiHidden="false"
UnhideWhenUsed="false" Name="Dark List Accent 6"/>
<w:LsdException Locked="false" Priority="71" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Shading Accent 6"/>
<w:LsdException Locked="false" Priority="72" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful List Accent 6"/>
<w:LsdException Locked="false" Priority="73" SemiHidden="false"
UnhideWhenUsed="false" Name="Colorful Grid Accent 6"/>
<w:LsdException Locked="false" Priority="19" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtle Emphasis"/>
<w:LsdException Locked="false" Priority="21" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Intense Emphasis"/>
<w:LsdException Locked="false" Priority="31" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Subtle Reference"/>
<w:LsdException Locked="false" Priority="32" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Intense Reference"/>
<w:LsdException Locked="false" Priority="33" SemiHidden="false"
UnhideWhenUsed="false" QFormat="true" Name="Book Title"/>
<w:LsdException Locked="false" Priority="37" Name="Bibliography"/>
<w:LsdException Locked="false" Priority="39" QFormat="true" Name="TOC Heading"/>
</w:LatentStyles>
</xml><![endif]-->
<br />
<table border="1" cellpadding="4" cellspacing="0" style="border-collapse: collapse;">
<tbody>
<tr>
<td valign="top" width="160"><b>Component</b></td>
<td valign="top" width="160"><b>Role</b></td>
</tr>
<tr>
<td valign="top" width="160">HDFS</td>
<td valign="top" width="160">Storage</td>
</tr>
<tr>
<td valign="top" width="160">YARN</td>
<td valign="top" width="160">Processing</td>
</tr>
<tr>
<td valign="top" width="160">ZooKeeper</td>
<td valign="top" width="160">Coordination Service</td>
</tr>
<tr>
<td valign="top" width="160">HBase</td>
<td valign="top" width="160">Columnar Database</td>
</tr>
<tr>
<td valign="top" width="160">Hive Metastore</td>
<td valign="top" width="160">Metadata</td>
</tr>
<tr>
<td valign="top" width="160">HiveServer2</td>
<td valign="top" width="160">JDBC Connectivity</td>
</tr>
<tr>
<td valign="top" width="160">WebHCat</td>
<td valign="top" width="160">Metadata Resources</td>
</tr>
<tr>
<td valign="top" width="160">Oozie</td>
<td valign="top" width="160">Workflow / Scheduler</td>
</tr>
</tbody>
</table>
<br />
Take care of your Hadoop cluster and it will take care of you. :)<br />
<br />Demystifying Hadoop for Data Architects, DBAs, Devs and BI Teams<div class="MsoNormal">
<b><span style="color: blue;">Introduction</span></b></div>
<div class="MsoNormal">
I have been writing Demystifying series on subjects such as database technologies, infrastructure and Java since back in the Oracle 8.0 days. The topics have ranged from Demystifying Oracle RAC and Demystifying Oracle Fusion to Demystifying MySQL. So I guess it's time to Demystify Hadoop. :)<br />
<br />
Whether you are talking Oracle RAC, Oracle Exadata, MySQL, SQL Server, DB2, Teradata or application servers, it's really all about the data. Companies are constantly striving to make faster business decisions with higher degrees of accuracy. Traditional systems such as Oracle, SQL Server, IBM and Teradata are scaling to store hundreds of terabytes and even petabytes, on hardware that keeps getting faster. However, these traditional systems were designed for transactional workloads and have a lot of difficulty working with big data. I'm going to explain why these traditional systems were not designed for big data, and how Hadoop is the right technology at the right time to address it.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b><span style="color: blue;">What's The Deal About Big Data</span></b></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Across the board, industry analyst firms consistently report almost unimaginable numbers on the growth of data. Traditional data in relational databases and data warehouses is growing at incredible rates, and that data is a challenge by itself (shown as Enterprise Data below).
<br />
The big news, though, is that VoIP, social media and machine data are growing at exponential rates and are completely dwarfing the data growth of traditional systems. Most organizations are learning that this data is just as critical to making business decisions as their traditional data. This non-traditional data is usually semi-structured or unstructured. Examples include web logs, mobile web, clickstream, spatial and GPS coordinates, sensor data, RFID, video, audio and image data. The chart below, sourced from IDC, shows the growth of non-traditional data (Machine Data, Social Media, VoIP) relative to traditional data (Enterprise Data).</div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRZo5U3r565X9tiWlvl4Ku7zmLmPDHdau91VIRG-8Lq0vZIcfegQ1ByGbgyV5sflJG3VulvPpCrVlRBHOjXKQhe6IbGIWCIvCiv9fojVMD-4IGb6NdCwahegyu5RNq33Q_Oo1h_lkBuv6x/s1600/Screen+Shot+2013-10-26+at+7.56.52+PM.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="237" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjRZo5U3r565X9tiWlvl4Ku7zmLmPDHdau91VIRG-8Lq0vZIcfegQ1ByGbgyV5sflJG3VulvPpCrVlRBHOjXKQhe6IbGIWCIvCiv9fojVMD-4IGb6NdCwahegyu5RNq33Q_Oo1h_lkBuv6x/s400/Screen+Shot+2013-10-26+at+7.56.52+PM.png" width="400" /></a></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br />
<br />
<br />
<br />
<br />
<br />
Data becomes big data when the volume, velocity and/or variety of data gets to the point where it is too difficult or too expensive for traditional systems to handle. Big data is not defined by reaching some particular volume, ingestion velocity or type of data; it is when traditional systems are no longer viable solutions because of that volume, velocity and/or variety. A good first book on big data to read is <a href="http://www.amazon.com/Disruptive-Possibilities-Data-Changes-Everything-ebook/dp/B00CLH387W">Disruptive Possibilities</a>.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b><span style="color: blue;">The Big Data Challenge</span></b></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Traditional systems have a problem with big data for a simple reason: they were not designed for it.</div>
<div class="MsoNormal">
</div>
<ul>
<li><b><span style="color: red;">Problem - Schema-On-Write: </span></b>Traditional systems are schema-on-write: data must be validated against a schema when it is written. This means a lot of work has to be done before a new data source can be analyzed. For example, when a company wants to start analyzing a new unstructured or semi-structured data source, it will usually spend months (often 3-6) designing schemas to store the data in a data warehouse. That is 3-6 months during which the data cannot be used to make business decisions, and by the time the design is complete the data has often changed again. Data structures from social media, for instance, change on a regular basis. The schema-on-write model is too slow and inflexible to deal with the dynamics of semi-structured and unstructured data. Traditional systems also usually resort to BLOBs for unstructured data, and anyone who has worked with BLOBs at big data scale would rather get their gums scraped than do it again. </li>
<li><span style="color: #6aa84f;"><b>Solution - Schema-On-Read:</b> </span> Hadoop systems are schema-on-read, which means any data can be written to the storage system immediately; it is not validated until it is read. This allows Hadoop systems to load any type of data and begin analyzing it quickly, giving them extremely short data latency compared to traditional systems. Data latency is the interval between data hitting the disk and that data being able to provide business value. Schema-on-read gives Hadoop a tremendous advantage in the area that matters most: analyzing the data faster to make business decisions. </li>
<li><b><span style="color: red;">Problem - Cost of Storage: </span></b>Traditional systems use SAN storage. As organizations start to ingest larger volumes of data, SAN storage is cost prohibitive.</li>
<li><b><span style="color: #6aa84f;">Solution - Local Storage: </span></b>Hadoop uses HDFS, a distributed file system that leverages local disks on commodity servers. SAN storage runs about $1.20/GB while local storage runs about $0.04/GB. HDFS creates three replicas of each block by default for high availability, so at roughly $0.12/GB it is still a fraction of the cost of traditional SAN storage. As organizations store much larger volumes of data, SAN storage is too expensive to be a viable solution. </li>
<li><span style="color: red;"><b>Problem - Cost of Proprietary Hardware: </b></span><br />Large proprietary hardware solutions can be cost prohibitive when deployed to process extremely large volumes of data. Organizations are spending millions of dollars in hardware and software licensing costs while supporting large data environments. Organizations are often growing their hardware in million dollar increments to handle the increasing data.</li>
<li><b><span style="color: #6aa84f;">Solution - Commodity Hardware: </span></b>People new to Hadoop do not realize that it is possible to build a high-performance supercomputing environment with Hadoop on commodity hardware. One customer was evaluating a proprietary hardware vendor whose solution came to $1.2 million in hardware and $3 million in software licensing. The Hadoop solution for the same processing power was $400,000 in hardware, with free software and support costs included. Since data volumes would keep increasing, the proprietary solution would grow in $500k to $1 million increments while the Hadoop solution would grow in $10,000 to $100,000 increments.</li>
<li><span style="color: #cc0000;"><b>Problem - Complexity: </b></span>When you look at any traditional proprietary solution it is full of extremely complex silos of system administrators, DBAs, application server teams, storage teams and network teams. Often there is one DBA for every 40 - 50 database servers. Anyone running traditional systems knows that complex systems fail in complex ways. </li>
<li><span style="color: #6aa84f;"><b>Solution - Simplicity: </b></span>Since Hadoop uses commodity hardware, it is a hardware stack that one person can understand. Numerous organizations running Hadoop have one administrator for every 1000 data nodes. </li>
<li><span style="color: red;"><b>Problem - Causation: </b></span> Because data is so expensive to store in traditional systems, data is filtered, aggregated and large volumes are thrown out due to the cost of storage. Minimizing the data to be analyzed reduces the accuracy and confidence of the results. </li>
<li><span style="color: #6aa84f;"><b>Solution - Correlation: </b></span> Because Hadoop storage is relatively cheap, the detailed records are kept in Hadoop's storage system, HDFS. Traditional data can then be analyzed alongside non-traditional data to find correlation points, providing much higher accuracy of analysis. We are moving to a world of correlation because the accuracy and confidence of the results are factors higher than with traditional systems. For example, the Centers for Disease Control (CDC) used to take 28-30 days to identify an outbreak, using data obtained from doctors and hospitals that was analyzed in large volumes and cross-referenced against other sources for validation. When that historical data was correlated, going back years, with data from social media sources such as Twitter and Facebook, the correlation results proved accurate. Now, using big data, the CDC can identify an outbreak in 5-6 hours. Organizations are seeing big data as transformational. </li>
<li><span style="color: red;"><b>Problem - Bringing Data to the Programs: </b></span> In relational databases and data warehouses, data is usually loaded into memory in 8K-16K blocks so programs can process it. When you need to process tens, hundreds or thousands of terabytes, this model completely breaks down or is extremely expensive to implement. </li>
<li><span style="color: #6aa84f;"><b>Solution - Bringing Programs to the Data: </b></span>With Hadoop, the programs are moved to the data. Data is spread across the local disks of the servers that make up the Hadoop cluster in 128MB (default) blocks. Individual tasks, one per block, run in parallel across the cluster, delivering a very high level of parallelization and IOPS. This means Hadoop systems can process extremely large volumes of data much faster than traditional systems, at a fraction of the cost, because of this architecture.</li>
</ul>
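The schema-on-read idea in the bullets above can be sketched in a few lines of Python. This is a toy illustration only, not actual Hadoop or Hive code: raw records land in storage untouched, and a schema is imposed only at read time.

```python
import json

# Schema-on-read, in miniature: raw records are stored as-is, and structure
# is applied only when the data is read.
raw_events = [
    '{"user": "alice", "action": "click", "page": "/home"}',
    '{"user": "bob", "action": "view", "page": "/cart", "device": "mobile"}',
    'a corrupt line that schema-on-write would have rejected at load time',
]

def read_with_schema(lines, fields):
    """Apply a schema while reading; malformed records surface here, not at load."""
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip records that do not parse under this reading
        yield {f: record.get(f) for f in fields}

# The same stored bytes can be read under different schemas as needs evolve,
# including fields (like "device") that did not exist when loading began.
print(list(read_with_schema(raw_events, ["user", "action"])))
print(list(read_with_schema(raw_events, ["user", "device"])))
```

Note that loading the corrupt line cost nothing up front; a schema-on-write system would have forced the design and validation work before any analysis could begin.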
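The storage-cost arithmetic in the bullets above can be checked with a quick back-of-the-envelope calculation. The per-GB prices, the 128MB block size and the 3x replication factor are the illustrative figures quoted in the text, not current market numbers.

```python
import math

# Back-of-the-envelope HDFS sizing, using the figures quoted in the text.
BLOCK_SIZE_MB = 128        # HDFS default block size (Hadoop 2)
REPLICATION = 3            # HDFS default replication factor
COST_PER_GB_LOCAL = 0.04   # local commodity disk, $/GB
COST_PER_GB_SAN = 1.20     # SAN storage, $/GB

def hdfs_footprint(dataset_gb):
    """Blocks, raw storage and cost for a dataset stored in HDFS."""
    blocks = math.ceil(dataset_gb * 1024 / BLOCK_SIZE_MB)
    raw_gb = dataset_gb * REPLICATION  # three copies of every block
    return {
        "blocks": blocks,
        "raw_gb": raw_gb,
        "hdfs_cost": raw_gb * COST_PER_GB_LOCAL,
        "san_cost": dataset_gb * COST_PER_GB_SAN,
    }

# A 1 TB dataset: 8000 blocks, 3 TB raw, $120 on HDFS vs $1200 on SAN.
print(hdfs_footprint(1000))
```

Even after paying for three replicas, local storage comes out roughly ten times cheaper per usable GB than the SAN figure, which is the whole economic argument of the bullet.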
<div>
Successfully leveraging big data is transforming how organizations analyze data and make business decisions. The "value" of big data results has most companies racing to build Hadoop solutions for data analysis. The diagram below shows how significant big data is. Customers often bring in Hortonworks and say, "we need you to make sure we out-Hadoop our competitors." Hadoop is not just a transformational technology; it's the strategic difference between success and failure.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMJwD0wUOHQAYMLEOZR1ZIxKQTmpM4vQI3Apw5zW7cnupOy7Dh8JMmQg4bZlZ1i_oIWgKbS9mtJ2Altk1OKFrfaXJpSdmQ3dFmJb-9jiQYnBAlnLj9MdBzKQo4nyHpWKdxWnUboJihHqFm/s1600/Screen+Shot+2013-10-26+at+10.39.44+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="388" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMJwD0wUOHQAYMLEOZR1ZIxKQTmpM4vQI3Apw5zW7cnupOy7Dh8JMmQg4bZlZ1i_oIWgKbS9mtJ2Altk1OKFrfaXJpSdmQ3dFmJb-9jiQYnBAlnLj9MdBzKQo4nyHpWKdxWnUboJihHqFm/s640/Screen+Shot+2013-10-26+at+10.39.44+PM.png" width="640" /></a></div>
<div>
<b><span style="color: blue;">Examples of New Types of Data</span></b></div>
<div>
<b><span style="color: blue;"><br /></span></b></div>
<div>
Hadoop is being used by every type of organization, ranging from Internet companies, telecommunications firms, banks and credit card companies to gaming companies, online retailers and more. Anyone that needs to analyze data is moving to Hadoop. Here are examples of data being processed by organizations.</div>
<div>
<b><span style="color: blue;"><br /></span></b></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9sGXTQWFXImlg2mW80a3L_fB1bf3mk_vSOhg1qLBF_qIyU8HaiCcepBRDi6g5r-PfYNUI03lsHbqEkHnP5vyrJP_j-0H1GeTX_zP0sTA74wjoAnw0fXdZ1jLg36Kjn2y92axxoZm1kEHi/s1600/Screen+Shot+2013-10-26+at+11.19.12+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="382" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9sGXTQWFXImlg2mW80a3L_fB1bf3mk_vSOhg1qLBF_qIyU8HaiCcepBRDi6g5r-PfYNUI03lsHbqEkHnP5vyrJP_j-0H1GeTX_zP0sTA74wjoAnw0fXdZ1jLg36Kjn2y92axxoZm1kEHi/s640/Screen+Shot+2013-10-26+at+11.19.12+PM.png" width="640" /></a></div>
<div>
<div>
<b><span style="color: blue;">Hadoop Distributions - The Hortonworks Data Platform (HDP)</span></b></div>
<br />
<div class="MsoNormal">
A Hadoop distribution is made up of a number of different open source frameworks. An organization could build its own distribution from the different versions of each framework, but anyone running a production system needs an enterprise version of a distribution. Since Hortonworks has key committers and project leaders on the different open source framework projects, we use that expertise to pick the latest version of each framework that works reliably with the others. Hortonworks then tests and builds an enterprise distribution of Hadoop. For example, Hadoop 2 went GA the week of October 15th, 2013, but it had been running on Hadoop clusters with thousands of nodes since January of 2013, tested by the large set of Hortonworks partners. <br />
<br />
The Hortonworks distribution is called the Hortonworks Data Platform. The new GA release of Hadoop 2 by Hortonworks is called <a href="http://hortonworks.com/products/hdp/">HDP 2.</a> Hortonworks runs on a true open source model: every line of code written by Hortonworks for Hadoop is given back to the Apache Software Foundation (ASF), which means every Hortonworks distribution is only a few patch sets off of the main Apache baseline. The result is that HDP 2 is extremely stable from a support perspective and protects an organization from vendor lock-in. Here is an example of the HDP 2 distribution and the key frameworks associated with it. Hortonworks builds its reputation on the "enterprise" quality of its distribution, and the industry is recognizing the platform expertise of Hortonworks. <br />
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
There are a number of different Hadoop distributions, some of which have been around longer than Hortonworks. In my expert opinion, the reasons to choose Hortonworks are: </div>
<ul>
<li style="font-weight: normal;"><b>Platform Expertise</b> - Hadoop is a platform that frameworks run on, and Hortonworks' entire focus is the enterprise Hadoop platform. Hortonworks is not trying to be everything for everybody, and it has demonstrated this in a number of ways. Hadoop is open source and developed as a community, and Hortonworks is by far the largest contributor of source lines of code to Hadoop. </li>
<li style="font-weight: normal;"><b>Defining the Roadmap:</b> More and more large vendors see Hortonworks as defining the roadmap for Hadoop. Working with the open source community, Hortonworks has been a key leader in the design and architecture of YARN, the foundational processing component of Hadoop 2. This platform expertise is moving a number of the largest vendors in the world to the Hortonworks Data Platform (HDP), and it is why you see major vendors such as Microsoft and Teradata choosing HDP as their Hadoop distribution of choice.</li>
<li style="font-weight: normal;"><b>Enterprise Version of Hadoop - </b>Hortonworks is focused on being the definitive enterprise distribution of Hadoop. </li>
<li style="font-weight: normal;"><b>Open Source </b>- Hortonworks is based on an open source model; every line of code created goes back to the Apache Software Foundation. Other distributions are proprietary, or open source with proprietary extensions. Proprietary solutions create the vendor lock-in that more and more companies are trying to avoid, while contributing all code back to the Apache Software Foundation minimizes support issues.</li>
<li><b><a href="http://hortonworks.com/about-us/news/hadoop-on-windows/">Windows</a> and Linux </b>- HDP is the only Hadoop distribution that runs on Linux and Windows.</li>
</ul>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgD_WB1HM5prklY1qjdCfvAMxtgF3VdIUCfQ6zD48MUQn_m55cZ9pHQO-IkzIGHoqNhOgs20an8shuZiNKF46-CZw46XaUbHzzjJdScjtdBKB4-LxJmgHfBrkUBcpzr0wimWDv9rQbOoYaB/s1600/Screen+Shot+2013-10-27+at+6.43.20+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="369" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgD_WB1HM5prklY1qjdCfvAMxtgF3VdIUCfQ6zD48MUQn_m55cZ9pHQO-IkzIGHoqNhOgs20an8shuZiNKF46-CZw46XaUbHzzjJdScjtdBKB4-LxJmgHfBrkUBcpzr0wimWDv9rQbOoYaB/s640/Screen+Shot+2013-10-27+at+6.43.20+AM.png" width="640" /></a></div>
The two main frameworks of Hadoop are the Hadoop Distributed File System (HDFS), which provides the storage and I/O, and YARN, which is a distributed parallel processing framework.<br />
<br />
<b>YARN</b><br />
<a href="http://hortonworks.com/hadoop/yarn/">YARN </a>(Yet Another Resource Negotiator) is the foundation for parallel processing in Hadoop. YARN is:<br />
<ul>
<li>Scalable to 10,000+ data node systems. </li>
<li>Supports different types of workloads such as batch, real-time queries (<a href="http://hortonworks.com/hadoop/tez/">Tez</a>), streaming, graph processing, in-memory processing, messaging systems, streaming video, etc. You can think of YARN as a highly scalable, parallel processing operating system that supports many different types of workloads. </li>
<li>Supports batch processing, providing high throughput with sequential read scans.</li>
<li>Supports real-time interactive queries with low latency and random reads.</li>
</ul>
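Since YARN is a resource negotiator, a toy model can make the idea concrete: applications request containers of memory, and a scheduler places them on nodes with free capacity. Here is a simplified sketch in Python (an illustration only; real YARN schedulers such as the Capacity Scheduler are far more sophisticated):

```python
# Toy model of YARN-style container allocation (illustration only).

def allocate(nodes, requests):
    """Greedily place container requests (in MB) onto nodes with free memory."""
    placements = []
    free = dict(nodes)  # node name -> free memory in MB
    for app, mem in requests:
        # Consider only nodes with enough free memory for this container.
        candidates = [n for n, f in free.items() if f >= mem]
        if not candidates:
            placements.append((app, None))  # request must wait for capacity
            continue
        # Place on the node with the most free memory.
        node = max(candidates, key=lambda n: free[n])
        free[node] -= mem
        placements.append((app, node))
    return placements

nodes = {"node1": 8192, "node2": 8192}
requests = [("batch-job", 4096), ("tez-query", 2048), ("batch-job", 8192)]
print(allocate(nodes, requests))
# [('batch-job', 'node1'), ('tez-query', 'node2'), ('batch-job', None)]
```

The last request cannot be placed because no single node has 8 GB free, which is exactly the kind of decision a real ResourceManager makes continuously across thousands of nodes.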
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi310JYVAYjgz8dCTesdSzVRvpdHFKC54QU9Evt1dD3TrSUYPYDVC4gMOo84WwFZXkVjU9xBCaYr4b8xVgBXPXH5GCP8mr-e2d5IEFeVbLcPRE0TW2EyTlZpcdtDC6cNi5DneVo76byQWRs/s1600/Screen+Shot+2013-10-27+at+6.49.27+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="228" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi310JYVAYjgz8dCTesdSzVRvpdHFKC54QU9Evt1dD3TrSUYPYDVC4gMOo84WwFZXkVjU9xBCaYr4b8xVgBXPXH5GCP8mr-e2d5IEFeVbLcPRE0TW2EyTlZpcdtDC6cNi5DneVo76byQWRs/s640/Screen+Shot+2013-10-27+at+6.49.27+AM.png" width="640" /></a></div>
<br />
<b>HDFS 2</b><br />
<a href="http://hadoop.apache.org/docs/current/">HDFS </a>uses NameNodes (master servers) and DataNodes (slave servers) to provide the I/O for Hadoop. The NameNodes manage the metadata. NameNodes can be federated (multiple NameNodes) for scalability, and each NameNode can have a standby NameNode for failover (active-passive). All the user data is stored on the DataNodes. Data is distributed across all the disks in 128 MB - 1 GB blocks, and each block has 3 replicas (the default) for high availability. HDFS provides a solution similar to striping and mirroring using local disks.<br />
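To make the block and replication numbers concrete, here is a back-of-the-envelope calculation in Python, assuming the defaults mentioned above (128 MB blocks, replication factor of 3):

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (block_count, raw_storage_mb) for a file stored in HDFS."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_storage = file_size_mb * replication  # every block is copied `replication` times
    return blocks, raw_storage

# A 1 TB file with the defaults: 8192 blocks, 3 TB of raw disk consumed.
blocks, raw = hdfs_footprint(1024 * 1024)
print(blocks, raw)  # 8192 3145728
```

The 3x raw-storage cost is the price of HDFS's software-level redundancy: lose any one disk or server and two replicas of every block still exist.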
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEik9Oqx0u8iCNqSMXs9XaeU-h63pdSJH2SdPK7w6u1JjEeAw0XsblYzy8HaDnzNBNpg6Klco2-MR7w-7rpbM4rmOoP2DEUAnJ9vzcBSLd3kOekStGC_k6JX-aktYCDx2hbdj0Xv9uzhVvh6/s1600/Screen+Shot+2013-10-27+at+6.59.34+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="386" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEik9Oqx0u8iCNqSMXs9XaeU-h63pdSJH2SdPK7w6u1JjEeAw0XsblYzy8HaDnzNBNpg6Klco2-MR7w-7rpbM4rmOoP2DEUAnJ9vzcBSLd3kOekStGC_k6JX-aktYCDx2hbdj0Xv9uzhVvh6/s640/Screen+Shot+2013-10-27+at+6.59.34+AM.png" width="640" /></a></div>
<br />
<b>Additional Frameworks</b><br />
Here is a summary of some of the key frameworks that make up HDP 2.<br />
<ul>
<li><a href="http://hive.apache.org/"><b>Hive</b> </a>- A data warehouse infrastructure that runs on top of Hadoop. Hive supports SQL queries, star schemas, partitioning, join optimizations, caching of data, etc. All the standard features you'd expect to have in a data warehouse. Hive lets you process Hadoop data using a SQL language.</li>
<li><a href="http://pig.apache.org/"><b>Pig</b> </a>- A scripting language for processing Hadoop data in parallel.</li>
<li><b><a href="http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html">MapReduce</a> </b>- Java applications that can process data in parallel.</li>
<li><b><a href="http://hortonworks.com/hadoop/ambari/">Ambari </a></b>- An open source management interface for installing, monitoring and managing a Hadoop cluster. Ambari has also been selected as the management interface for OpenStack.</li>
<li><b><a href="http://hbase.apache.org/">HBase </a></b>- A NoSQL columnar database providing extremely fast scanning of column data for analytics.</li>
<li><b><a href="http://sqoop.apache.org/">Sqoop</a>, <a href="http://flume.apache.org/">Flume</a> and <a href="http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/">WebHDFS</a></b> - Tools providing large-scale data ingestion for Hadoop using SQL, streaming and REST API interfaces.</li>
<li><b><a href="http://oozie.apache.org/">Oozie</a></b> - A workflow manager and scheduler.</li>
<li><b><a href="http://zookeeper.apache.org/">Zookeeper </a></b>- A coordination service for distributed systems.</li>
<li><a href="http://mahout.apache.org/">Mahout </a>- A machine learning library supporting recommendation, clustering, classification and frequent itemset mining. </li>
<li><a href="http://dev.hortonworks.com.s3.amazonaws.com/HDPDocuments/HDP2/HDP-2.0.0.2/bk_installing_manually_book/content/rpm-chap-hue-4.html">Hue </a>- A web interface that contains a file browser for HDFS, a job browser for YARN, an HBase browser, query editors for Hive, Pig and Sqoop, and a ZooKeeper browser.<div class="p1">
<br /></div>
</li>
</ul>
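To make the MapReduce model that several of these frameworks build on concrete, here is word count with the map, shuffle and reduce phases simulated in plain Python. This is an in-process sketch of the programming model, not code that runs on a cluster (a real job would use the Java MapReduce API, Hadoop Streaming, or a Hive/Pig equivalent):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Hadoop stores data", "YARN processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["data"])  # 2
```

On a real cluster the map tasks run on the DataNodes holding the input blocks, and the shuffle moves grouped data across the network to the reduce tasks; the logic, however, is exactly this simple.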
</div>
</div>
<div>
<b><span style="color: blue;">Hadoop - A Super Computing Platform</span></b></div>
<br />
<div class="MsoNormal">
Hadoop is a solution that leverages commodity hardware to build a high performance supercomputing environment. Hadoop contains master nodes and data nodes. HDFS is the distributed file system that provides high availability and high performance. HDFS is made up of a number of data nodes that break a file into multiple blocks, usually 128 MB - 1 GB in size. Each block is replicated for high availability. YARN is a distributed processing architecture that can distribute the workload across the data nodes in a Hadoop cluster. People new to Hadoop do not realize the massive amount of IOPS that commodity x86 servers can generate with local disks.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
In the diagram below:</div>
<div class="MsoNormal">
HDFS distributes data blocks across all the local disks in a cluster, letting the cluster leverage the IOPS those disks can generate across all the servers. When a job runs, its programs are distributed in parallel across all the data nodes, creating an extremely high performance parallel environment. Without going into the details, the main point is that this is a supercomputing environment: parallel processing combined with the massive IOPS that local disks generate when running across multiple data nodes as a single distributed file system. The diagram shows multiple parallel processes running across a large volume of local disks operating as one distributed file system. </div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGG4AB8K6iKOzGDVZaPeWipR_NAKzvwF3d-9BLCbF3FDuQG6idjQD3mfpQSbhiYNbu6-FDBanhs8I0kNeGcd-co1fkaAIxPcLdUqRYPRUjUiAEXALMWiCXiZMrojNJ-phf6VhQKnMY6nKZ/s1600/Screen+Shot+2013-10-26+at+10.56.19+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="378" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGG4AB8K6iKOzGDVZaPeWipR_NAKzvwF3d-9BLCbF3FDuQG6idjQD3mfpQSbhiYNbu6-FDBanhs8I0kNeGcd-co1fkaAIxPcLdUqRYPRUjUiAEXALMWiCXiZMrojNJ-phf6VhQKnMY6nKZ/s640/Screen+Shot+2013-10-26+at+10.56.19+PM.png" width="640" /></a></div>
<div class="MsoNormal">
Hadoop is linearly scalable with commodity hardware. If a Hadoop cluster cannot handle the workload, an administrator can add data node servers with local disks to increase processing power and IOPS, all at commodity hardware pricing. </div>
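The linear scalability claim is simple arithmetic: every data node adds its local disks' IOPS and throughput to the cluster total. A sketch in Python (the per-disk figures here are illustrative assumptions, not benchmarks):

```python
def cluster_capacity(nodes, disks_per_node=12, iops_per_disk=100, mbps_per_disk=100):
    """Rough aggregate I/O capacity of a Hadoop cluster using local disks."""
    disks = nodes * disks_per_node
    return {"iops": disks * iops_per_disk, "throughput_mbps": disks * mbps_per_disk}

small = cluster_capacity(10)   # 120 disks across 10 nodes
large = cluster_capacity(20)   # double the nodes, double the capacity
print(small["iops"], large["iops"])  # 12000 24000
```

Contrast this with a shared SAN, where every server added competes for the same array controllers; with local disks, capacity and I/O grow together with every node.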
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b><span style="color: blue;">Summary - Demystifying Hadoop</span></b></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Hadoop is not replacing anything. Hadoop has become another component in an organization's enterprise data platform. This diagram shows that Hadoop (the Big Data Refinery) can ingest data from all types of sources. Hadoop then interacts and exchanges data with traditional systems that provide transactions and interactions (relational databases) and business intelligence and analytics (data warehouses). </div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtK3Ylq5vLSaAXCmUjR9j41mAIYRE-BduduepuNfnt4Nue3Uhtn8JI-YBXktZQhq0p_F-gYwg4qDVfY0ELaykkf9wIWab8cmpJnv2MwgkoKdDLxRNsY509sBjItRmgvqvXS6mr2lAk8PDK/s1600/Screen+Shot+2013-10-27+at+7.14.42+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="382" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtK3Ylq5vLSaAXCmUjR9j41mAIYRE-BduduepuNfnt4Nue3Uhtn8JI-YBXktZQhq0p_F-gYwg4qDVfY0ELaykkf9wIWab8cmpJnv2MwgkoKdDLxRNsY509sBjItRmgvqvXS6mr2lAk8PDK/s640/Screen+Shot+2013-10-27+at+7.14.42+AM.png" width="640" /></a></div>
<br />
Anonymoushttp://www.blogger.com/profile/00291214843973166978noreply@blogger.com2tag:blogger.com,1999:blog-8585595729814983645.post-83212539090031854792013-09-24T15:59:00.000-06:002013-09-24T16:01:00.682-06:00Oracle 12c Creating Large Demand for New Technical Skills<b>Oracle 12c Features and Products will Create Largest Transformation in Technical Skills in Oracle's History</b><br />
The Oracle Open World conference is showing Oracle technical and business people that there is going to be a tremendous demand for new skill sets. It's not just the new announcements; some of Oracle's existing products have matured to a new level. So get ready: there is going to be tremendous demand for those who build their technical skills to match what is hitting the Oracle ecosystem this week. <br />
<br />
<b>Oracle Open World Highlights</b><br />
Here are some of the areas that are going to require new and improved skill sets.<br />
<ul>
<li>Big Data, Hadoop and Analytics</li>
<ul>
<li>Oracle is promoting the Oracle Big Data Appliance, highlighting an engineered, packaged system for Hadoop.</li>
</ul>
<li>Oracle and Microsoft: The Cloud OS</li>
<ul>
<li>This is going to create a big demand for the skills to deploy DBaaS (database as a service). </li>
<li>You can take your licenses and move them from on-premise to the Cloud on Windows Azure.</li>
<li>Java fully supported in Windows Azure.</li>
<li>Oracle license mobility to Windows Azure.</li>
<li>Oracle to offer Oracle Linux on Windows Azure.</li>
<li>Oracle and Microsoft building common cloud platform, the Cloud OS.</li>
<li>Oracle software on Windows Server Hyper-V and Windows Azure.</li>
<li>Message to ISVs: get going on this.</li>
</ul>
</ul>
<ul>
<li>Oracle Database as a Service</li>
<ul>
<li>Monthly subscription pricing, pay as you go.</li>
<li>Oracle manages the database for you.</li>
<ul>
<li>Quarterly patching and upgrades with SLAs.</li>
<li>Automated backup and point-in-time recovery.</li>
</ul>
</ul>
<li>Java as a Service in Oracle Cloud</li>
<ul>
<li>Dedicated WebLogic cluster(s) on the compute service.</li>
<li>Oracle backs up, patches, and manages WebLogic.</li>
<li>Full WLST, JMS, root access, EM WebLogic Control.</li>
<li>Monthly subscription pricing.</li>
<li>Runs any Java EE application.</li>
<li>Elastic compute and storage.</li>
</ul>
<li>Oracle Multitenant: 12c Pluggable Databases.</li>
<ul>
<li>Oracle is abstracting an Oracle database out as a software container within Oracle. This is a way of achieving the benefits of virtualization by virtualizing Oracle as a software container rather than within a VM. Benefits:</li>
<ul>
<li>High consolidation density.</li>
<li>Rapid provisioning and cloning using SQL. A pluggable database is a software container that can move to a different database server.</li>
<li>New paradigms for rapid patching and upgrades.</li>
<li>Can manage multiple databases as one.</li>
</ul>
</ul>
<li>Oracle Database Backup, Logging and Recovery Appliance </li>
<ul>
<li>This Oracle database appliance performs backups by taking an initial backup and then maintaining snapshots of changes. A new RMAN feature (incremental-forever) sends periodic snapshots to the storage-equipped appliance, where the backups are saved.</li>
</ul>
<li>Oracle Database 12c In-Memory Database</li>
<ul>
<li>M6-32 Big Memory Machine - 32 TB of DRAM.</li>
<li>32 SPARC M6 chips, 1,024 memory DIMMs.</li>
<li>12 cores per processor, 96 threads per processor.</li>
<li>CPUs communicate using a 384-port silicon switching network with 3 TB/s bandwidth.</li>
</ul>
</ul>
<ul><div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOQKc71396V2LzqgD_bI3Z_GJkxUpw1OXhfMZkvZmVu7qniu4ZL8G7_YF01ENgUe4JdZLYdhmJdeheTUrk2VT9GrlOF2suoLRKnfaoprgsQXrF2ylRf_VLW7XEiPBlgFr4nbHjIoeDVZM8/s1600/Screen+Shot+2013-09-24+at+12.00.51+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="249" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOQKc71396V2LzqgD_bI3Z_GJkxUpw1OXhfMZkvZmVu7qniu4ZL8G7_YF01ENgUe4JdZLYdhmJdeheTUrk2VT9GrlOF2suoLRKnfaoprgsQXrF2ylRf_VLW7XEiPBlgFr4nbHjIoeDVZM8/s320/Screen+Shot+2013-09-24+at+12.00.51+PM.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8AmFyw1gCqjqBrGP4KYvxkppWh0j1A7anwH3UTniQroqr6RVU7qjxGafdp6epJIhuwla-Wf1PFOULC2qHZ-d6Tn-JC-80428T5uOnmZ4fI4bEKNIcoZGSGqw0e5_gFNznjqKjMLyOYFAd/s1600/Screen+Shot+2013-09-24+at+12.10.07+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="242" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8AmFyw1gCqjqBrGP4KYvxkppWh0j1A7anwH3UTniQroqr6RVU7qjxGafdp6epJIhuwla-Wf1PFOULC2qHZ-d6Tn-JC-80428T5uOnmZ4fI4bEKNIcoZGSGqw0e5_gFNznjqKjMLyOYFAd/s320/Screen+Shot+2013-09-24+at+12.10.07+PM.png" width="320" /></a></div>
<ul>
<li>3 TB/second system bandwidth.</li>
<li>1.4 TB/second memory bandwidth.</li>
<li>1 TB/second I/O bandwidth.</li>
<li>100x faster queries.</li>
<li>2x increase in transaction processing rates.</li>
<li>Analytics run faster on column format: fast access to a few columns across many rows.</li>
<li>Transactions run faster with row format: faster processing of a few rows with many columns.</li>
<li>Oracle 12c stores data in both formats simultaneously: a dual-format in-memory database.</li>
<li>Scan billions of rows per second per CPU core.</li>
<li>The In-Memory Column Store replaces analytic indexes.</li>
<li>Configure memory capacity:</li>
<ul>
<li>INMEMORY_SIZE = XXX GB</li>
<li>ALTER TABLE | PARTITION ... INMEMORY;</li>
</ul>
</ul>
<li>Transparent to SQL and applications.</li>
<li>Scale-Out In-Memory Database to any size.</li>
</ul>
</ul>
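The row-versus-column trade-off behind the dual-format design can be shown with a toy example: the same table kept both ways, with an analytic query touching one column across all rows and a transactional lookup touching one whole row. This is a plain-Python illustration of the concept, not Oracle's implementation:

```python
# Toy dual-format store: the same data kept row-wise and column-wise.
rows = [
    {"id": 1, "region": "west", "amount": 40},
    {"id": 2, "region": "east", "amount": 25},
    {"id": 3, "region": "west", "amount": 35},
]

# Column format: contiguous values per column -- ideal for scanning one
# column across many rows (analytics).
columns = {key: [r[key] for r in rows] for key in rows[0]}
total = sum(columns["amount"])          # touches only the "amount" column

# Row format: all columns of one row stored together -- ideal for fetching
# or updating a single row (transactions).
row = next(r for r in rows if r["id"] == 2)

print(total, row["region"])  # 100 east
```

The dual-format idea is to keep both layouts in memory at once, so the optimizer can route analytic scans to the columnar copy and OLTP work to the row copy.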
<br />
<br />
<br />Anonymoushttp://www.blogger.com/profile/00291214843973166978noreply@blogger.com0tag:blogger.com,1999:blog-8585595729814983645.post-18091437677780132982013-09-20T00:12:00.003-06:002013-09-22T08:54:00.828-06:00Enterprise Data Movement at OOW<div class="p1">
Our Enterprise Data Movement presentation (UGF9722) on Sunday 12:30pm, September 22 at Oracle Open World 2013 is going to be a key presentation that anyone who works with Oracle data should attend.</div>
<div class="p1">
<br /></div>
<div class="p1">
Big Data is going to touch every Oracle DBA, BI, EDW and analytics team member. Top DBAs in the industry moved into Oracle RAC and then Oracle Exadata platforms; you are now seeing the top Oracle leaders in the industry moving into Hadoop. The reason is that they see the evolution of data and understand how strategic big data platforms have become for every customer. The skills top DBAs have, such as infrastructure knowledge, data architecture, bottleneck performance tuning, large-scale data ingestion, platform design, and working with large data systems, are all foundational skills for Hadoop. </div>
<div class="p2">
<br /></div>
<div class="p3">
<span class="s1">I am joined by Michael Schrader, an internationally recognized data architect and BI strategist. Our primary goal is to provide vision and insight around big data transport strategies. The key information in this presentation is not in books. We will be sharing </span>successful strategies and lessons learned in the field.</div>
<div class="p3">
<span class="s1"><br /></span></div>
<div class="p3">
<span class="s1">This detailed presentation will discuss important considerations for data movement across Oracle data platforms and Hadoop. Attendees will see </span> how data flows between Oracle Database instances, Oracle data warehouses, Oracle Exadata, and Hadoop. Attendees will learn about enterprise data movement, data ingestion tools, a metadata strategy that ensures you can keep using all your existing tools, working with heterogeneous sources and transports, and additional considerations for leveraging data in a logical data warehouse. We will discuss Hadoop, Oracle and third-party tools and the role they play in enterprise data movement. </div>
<div class="p3">
<br /></div>
<div class="p3">
The agenda is:</div>
<div class="p4">
</div>
<ul>
<li>The Role of Big Data and Hadoop</li>
<li>ETL Strategies and Reference Architecture</li>
<li>Common Processing Patterns</li>
<li>Logical Processing Tiers</li>
<li>Different data processing dimensions</li>
<li>Best Practices for Processing Data</li>
<li>Connectors for Hadoop</li>
</ul>
<br />
<div class="p4">
Hope to see you there.<br />
<br />
We've got very cool Hortonworks goodies for you.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguRNfDkpY8B3SaDO4g0ZXCl1tQ2oGV7M4nTuNT7eu4L36t4TStaBMGzWHO8ZWlxpngRC2EZg1v5ZHjrpmXMytC19c3HDXodUqt2TXGablSwAk4HD3AUucqKmhpuAztKwAYeApGa3hLIbU-/s1600/HWSandbox.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguRNfDkpY8B3SaDO4g0ZXCl1tQ2oGV7M4nTuNT7eu4L36t4TStaBMGzWHO8ZWlxpngRC2EZg1v5ZHjrpmXMytC19c3HDXodUqt2TXGablSwAk4HD3AUucqKmhpuAztKwAYeApGa3hLIbU-/s320/HWSandbox.jpg" width="240" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmgJPAGTwTS6wyUegq_i4gCc5S9FlYf7hGLPOVbh2wuFwx2Ii7zJiBAx5IOCAKuYcrLe0EvXGpBJqnhCQ6aqN1jXr1FVMVnczJJ6Si8V2bO1fZrgE2hJXjcTOSwLNvgq4UnPLyCP6kbNnF/s1600/HWWbottle.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmgJPAGTwTS6wyUegq_i4gCc5S9FlYf7hGLPOVbh2wuFwx2Ii7zJiBAx5IOCAKuYcrLe0EvXGpBJqnhCQ6aqN1jXr1FVMVnczJJ6Si8V2bO1fZrgE2hJXjcTOSwLNvgq4UnPLyCP6kbNnF/s320/HWWbottle.jpg" width="240" /></a></div>
<br /></div>
Anonymoushttp://www.blogger.com/profile/00291214843973166978noreply@blogger.com0tag:blogger.com,1999:blog-8585595729814983645.post-59180402852754945192013-08-28T09:50:00.001-06:002013-08-30T07:41:51.485-06:00VMWorld from a Hadoop Perspective<div class="MsoNoteLevel1" style="margin-left: 0in; mso-add-space: auto; mso-list: l0 level1 lfo1; text-indent: 0in;">
It's been an excellent week at VMWorld. I've been focusing on virtualizing Hadoop and business critical applications. My highlights:</div>
<ul>
<li>Announcing the book "<a href="http://vmovereasy.com/successfully-virtualize-business-critical-oracle-databasessuccessfully-virtualize-business-critical-oracle-databasessuccessfully-virtualize-business-critical-oracle-databasessuccessfully-virtualize/">Successfully Virtualizing Business Critical Oracle Databases on VMware</a>", which I am writing with Charles Kim (ViscosityNA), Darryl Smith (EMC) and Steven Jones (VMware).</li>
<li>Will soon be announcing a new Hadoop book I will be writing.</li>
<li>Spent a lot of time this week with VMware big data engineers and experts. Enjoyed the vExperts reception last night. Had some great conversations around virtualizing Hadoop.</li>
<li>Presenting best practices on virtualizing Hadoop, Oracle and business critical applications.</li>
<li>Presenting on Virtualizing Mission Critical Oracle RAC with vSphere and vCOPs. This presentation shows VMware admins how to deploy Oracle Database as a Service without DBAs.</li>
</ul>
VMware's goal with the Software Defined Data Center (SDDC) is to take customers to 100% virtualization. VMware's vSAN and NSX extend virtualization from compute to networks and storage. VMware used to be about moving customers from 70% virtualized to 75% or 80%; now it is focused on moving customers to 100% virtualized, which means aggressively virtualizing business critical applications (Oracle, SAP, etc.) and Hadoop. The SDDC is still a goal, and more software pieces have to be put in place to accomplish it.<br />
<ul>
<li>VMware's vSAN supports virtual storage directly in the hypervisor. </li>
<li>VMware's NSX is one of the biggest areas of interest. NSX is very strategic for VMware because it allows the virtualization of the entire network stack. NSX is to networking what ESXi is to virtualizing hardware resources. With NSX, switching, routing, bridging and firewalls are all part of the hypervisor. Here is the design pattern (source VMware). </li>
</ul>
<br />
<br />
<div class="MsoNoteLevel1CxSpMiddle" style="margin-left: 0in; mso-add-space: auto; mso-list: l0 level1 lfo1; text-indent: 0in;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8mefx_QoEWctNfBOec5YM0Do0s1MuBmRufD8QELcObRpoKFErhjJPYthYUCtLUi6Yo-b0NVB2_RmF883IKR-pmwnyANgoUfJBrJYpMPb77DNjhQpnDfoooEpcPvgHcdqDsq5y1qo52qE6/s1600/NSXNetwork.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="186" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8mefx_QoEWctNfBOec5YM0Do0s1MuBmRufD8QELcObRpoKFErhjJPYthYUCtLUi6Yo-b0NVB2_RmF883IKR-pmwnyANgoUfJBrJYpMPb77DNjhQpnDfoooEpcPvgHcdqDsq5y1qo52qE6/s320/NSXNetwork.png" width="320" /></a></div>
<br />
<div class="MsoNoteLevel1CxSpMiddle" style="margin-left: 0in; mso-add-space: auto; mso-list: l0 level1 lfo1; text-indent: 0in;">
Additional Highlights<br />
<br />
<ul>
<li>VMWorld has gotten a lot bigger, with an estimated 22,500 attendees.</li>
<li>The large vendor area had a lot of energy. Splunk had the coolest t-shirts by far.</li>
<li>The Sharkk case seemed to be the coolest case for the iPad mini. Saw a lot more iPad minis than iPads among the VMware jet set.</li>
</ul>
</div>
</div>
<div class="MsoNoteLevel1CxSpMiddle" style="margin-left: 0in; mso-add-space: auto; mso-list: l0 level1 lfo1; text-indent: 0in;">
<br />
<b>Hadoop at VMWorld</b><br />
Hadoop presentations all start out discussing the benefits of Hadoop and its use cases, and then VMware's strategy around virtualizing Hadoop (Serengeti and Big Data Extensions). Lots of use cases center on the volume of semi-structured and unstructured data. Example: a single GE jet engine produces 10 TB of data in an hour, or 90 petabytes per year.</div>
<ul>
<li>GE is looking at having all their fridges, and all the machines they develop, call home for repairs and maintenance. </li>
<li>GE is focusing on early detection of faults, common model failures and product engineering support. </li>
</ul>
<div class="MsoNoteLevel1CxSpMiddle" style="margin-left: 0in; mso-add-space: auto; mso-list: l0 level1 lfo1; text-indent: 0in;">
As you would expect, polling of the audiences showed almost no knowledge of Hadoop.</div>
<div class="MsoNoteLevel1CxSpMiddle" style="margin-left: 0in; mso-add-space: auto; mso-list: l0 level1 lfo1; text-indent: 0in;">
<br /></div>
<div class="MsoNoteLevel1CxSpLast" style="mso-list: none; tab-stops: .5in;">
Key benefits of virtualizing Hadoop, from the presentations:</div>
<ul>
<li>Fast provisioning of data nodes.</li>
<li>Workload consolidation.</li>
<li>High availability.</li>
<li>Auto elasticity and high resource utilization.</li>
<li>True multi-tenancy.</li>
<li>Promoting elastic compute by virtualizing data nodes.</li>
<li>Leveraging virtual networks.</li>
<li>Leveraging the ability to control noisy neighbors with features like storage and network I/O control.</li>
</ul>
<div class="MsoNoteLevel1CxSpMiddle" style="margin-left: 0in; mso-add-space: auto; mso-list: l0 level1 lfo1; text-indent: 0in;">
VMware vCenter now has a plugin for Hadoop (called Big Data Extensions).</div>
<div class="MsoNoteLevel1CxSpMiddle" style="margin-left: 0in; mso-add-space: auto; mso-list: l0 level1 lfo1; text-indent: 0in;">
Presenters emphasized that if you virtualize Hadoop, you should use Serengeti for deployments, because Serengeti understands VMware vCenter.</div>
<div class="MsoNoteLevel1CxSpMiddle" style="margin-left: 0in; mso-add-space: auto; mso-list: l0 level1 lfo1; text-indent: 0in;">
<br /></div>
<div class="MsoNoteLevel1CxSpMiddle" style="margin-left: 0in; mso-add-space: auto; mso-list: l0 level1 lfo1; text-indent: 0in;">
VMware is calling the Hadoop Virtual Extensions (HVE) their Big Data Extensions.</div>
<div class="MsoNoteLevel1CxSpMiddle" style="margin-left: 0in; mso-add-space: auto; mso-list: l0 level1 lfo1; text-indent: 0in;">
<br /></div>
<div class="MsoNoteLevel1CxSpLast" style="margin-left: 0in; mso-add-space: auto; mso-list: l0 level1 lfo1; text-indent: 0in;">
FedEx showed how they are using scale-out NAS with Hadoop.</div>
<div class="MsoNoteLevel2CxSpFirst" style="mso-list: l0 level2 lfo1;">
<ul>
<li>Talked about how, from their perspective, if the network is fast enough, data locality is not really that important.</li>
<li>Talked about using Isilon storage with Hadoop.</li>
</ul>
</div>
<div class="MsoNoteLevel1CxSpLast" style="margin-left: 0in; mso-add-space: auto; mso-list: l0 level1 lfo1; text-indent: 0in;">
Identified Inc. did an overview of their Hadoop experience.</div>
<ul>
<li>Started out using AWS. They were using 200 VMs and it was costing them about $40k a month. When they started running 24/7 the cost went up another $20-40k per month. They found that performance from AWS was very spiky.</li>
<li>Moved on premise with Serengeti to reduce costs, moving into the SuperNAP in Las Vegas. They saved $20k a month by moving off of AWS.</li>
<li>They got their ROI within two months of getting off of AWS, from a cost perspective.</li>
<li>They decided to mix physical and virtual: they virtualized all master servers and stayed physical with data nodes.</li>
<li>They used a FatTwin platform.</li>
<li>They use anti-affinity rules for master servers, especially ZooKeeper and JournalNodes.</li>
<li>They used mixed storage: flash for the OS on the nodes, and local storage for the data nodes.</li>
<li>They are now exploring virtualizing their data nodes and separating compute and data, so TaskTrackers will be separate from DataNodes. They want elastic compute.</li>
<li>They do not have anyone who is a Hadoop administrator. Different developers rotate into the infrastructure team for 3-6 months. When developers rotate back to their development teams they keep their permissions and can manage the Hadoop clusters within the individual developer teams.</li>
</ul>
<div class="MsoNoteLevel1CxSpLast" style="margin-left: 0in; mso-add-space: auto; mso-list: l0 level1 lfo1; text-indent: 0in;">
<b>VMware announced vSphere 5.5; here are a few highlights:</b></div>
<ul>
<li>With ESXi 5.5, the hypervisor supports up to 320 logical cores (5.1 supports 160 logical cores).</li>
<li>Up to 4 TB of memory for an ESXi host (5.1 supports 2 TB of memory).</li>
<li>Fault tolerance can now support up to four vCPUs. This means VMware will be pushing this method for achieving HA with Hadoop master servers, versus the new HA features in Hadoop 2.0.</li>
<li>NUMA nodes per host: 16 (was 8).</li>
<li>Things coming: auto elastic Hadoop, support for YARN in the future, support for HBase in the future.</li>
<li><span style="font-family: Symbol;"><span style="font-family: 'Times New Roman'; font-size: 7pt;"> </span></span>vSphere 5.5 now supports application high availability. This supports application recovery within a VM.</li>
<li><span style="font-family: Symbol;"><span style="font-family: 'Times New Roman'; font-size: 7pt;"> </span></span>Project Serengeti" tools support Hadoop deployments. Not sure if this will be in the first release of 5.5 or not.</li>
<li>VMDK maximum size increased to 62 TB.</li>
<li>Misc:</li>
<ul>
<li>No change in the pricing of vSphere editions.</li>
<li>Four new features (click the links for many details): <a href="http://up2v.nl/2013/08/26/an-introduction-to-vmware-appha/">AppHA</a>, Reliable Memory, <a href="http://up2v.nl/2013/08/26/introduction-of-vmware-vsphere-flash-read-cache/">Flash Read Cache</a> and <a href="http://up2v.nl/2013/08/26/an-intro-to-vmware-vsphere-big-data-extensions/">Big Data Extensions</a>.</li>
<li>A latency-sensitivity feature for applications such as high-performance computing and stock-trading apps.</li>
<li>The free vSphere Hypervisor no longer has a physical memory limit (was 32 GB).</li>
<li>PCI hotplug support for SSD</li>
<li>VMFS heap size improvements</li>
<li>16 Gb end-to-end Fibre Channel: 16 Gb from host to switch and 16 Gb from switch to SAN.</li>
<li>Support for 40 Gbps NICs.</li>
<li>Enhanced IPv6 support.</li>
<li>Enhancements to CPU C-states, which reduce power consumption.</li>
<li>Expanded vGPU support (vSphere 5.1 supported only NVIDIA GPUs).</li>
<li>Support for the Ivy Bridge-EP Xeon E5 v2 processors (Intel) and the Opteron 3300,4300 and 6300 processors (Advanced Micro Devices). </li>
<li>vMotion of a virtual machine between different GPU vendors is also supported. If hardware mode is enabled on the source host and no GPU exists on the destination host, the vMotion will fail rather than be attempted.</li>
<li>Added Microsoft Windows Server 2012 guest clustering support.</li>
<li>AHCI controller support, which enables Mac OS guests to use IDE CD-ROM drives (AHCI is an operating mode for SATA).</li>
</ul>
</ul>
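The raised per-host limits are easiest to appreciate with a quick capacity check. A minimal sketch, assuming only the core and memory maximums quoted above:

```python
# Per-host maximums quoted above (logical cores, memory in TB).
HOST_LIMITS = {
    "5.1": {"logical_cores": 160, "memory_tb": 2},
    "5.5": {"logical_cores": 320, "memory_tb": 4},
}

def fits_on_host(version, vcpus, memory_tb):
    """True if a workload's total vCPU and memory ask is within the host maximums."""
    lim = HOST_LIMITS[version]
    return vcpus <= lim["logical_cores"] and memory_tb <= lim["memory_tb"]

# A 200-vCPU, 3 TB consolidation target exceeds a 5.1 host but fits on 5.5.
print(fits_on_host("5.1", vcpus=200, memory_tb=3))  # False
print(fits_on_host("5.5", vcpus=200, memory_tb=3))  # True
```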
Anonymoushttp://www.blogger.com/profile/00291214843973166978noreply@blogger.com0tag:blogger.com,1999:blog-8585595729814983645.post-142261724013142582013-08-16T13:06:00.000-06:002013-08-16T13:08:54.244-06:00Introducing the BigDataOverEasy LinkedIn GroupI've been getting a lot of pings on the #BigDataOverEasy group that just started, and I wanted to introduce it to you. I've been involved with leading-edge companies (MySQL, Sun Microsystems, Oracle, VMware, Hortonworks) and have been part of their user communities. I've also held leadership roles on strategic councils and beta leadership programs, and participated in recognized industry expert programs such as Oracle ACE, Sun Ambassadorship and VMware vExpert.<br />
<br />
As Big Data and Hadoop generate more momentum in the industry, I want to be involved in a group focused on skill development, knowledge sharing and exchange, reference architectures, best practices, industry directions and mind share. I see tremendous value in a group that can span companies, user communities and regions for this knowledge sharing, and in being able to communicate with a broader audience.<br />
<br />
The name BigDataOverEasy explains the goal of the group: to focus on Big Data and related technologies such as Hadoop, and to make it easier to learn and grow without having to go into the coal mines to extract knowledge or reinvent the wheel. I'd like this group to be for data architects, platform architects, administrators, business intelligence and analytics teams, and developers who want to share knowledge and experiences with a group of peers.<br />
<br />
This group will not support any recruiters, sales or spamming. Here are some of the ecosystems I will reach out to in order to find like-minded big data and Hadoop enthusiasts.<br />
<ul>
<li>Hortonworks Data Platform</li>
<li>Rackspace</li>
<li>Microsoft</li>
<li>Teradata</li>
<li>Oracle</li>
<li>MySQL</li>
<li>Red Hat</li>
</ul>
We decided to call this Special Interest Group BigDataOverEasy even though it will focus on Hadoop because we want to emphasize that Hadoop is about data.<br />
<br />Anonymoushttp://www.blogger.com/profile/00291214843973166978noreply@blogger.com0tag:blogger.com,1999:blog-8585595729814983645.post-17981069332012284262013-08-14T08:18:00.000-06:002013-08-14T13:40:44.865-06:00Hadoop Reference ArchitecturesA key initial factor for success when building a Hadoop cluster is a solid foundation. This includes:<br />
<div>
<ul>
<li><b>Selecting the right hardware. </b>When working with a hardware vendor, make sure you are working from their hardware compatibility list (HCL) and making the right decisions for your cluster. Commodity hardware does not have to be generic; you can select commodity hardware that is customized for running Hadoop.</li>
<li><b>Build an enterprise OS platform.</b> Whether you are using Linux or Windows, customize and tune your operating system using enterprise best practices and standards for Hadoop.</li>
<li><b>Design your Hadoop clusters using reference architectures. </b> Don't reinvent the wheel unless you have to. Vendors are publishing Hadoop reference architectures that give you a great starting place.</li>
</ul>
</div>
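As a sketch of what "tune your operating system for Hadoop" can mean in practice, here is a toy audit of a node against a few commonly cited Linux settings. The setting names and thresholds here are illustrative assumptions, not an official vendor checklist:

```python
# Commonly cited Linux tuning targets for Hadoop worker nodes (illustrative).
RECOMMENDED = {
    "vm.swappiness": lambda v: v <= 10,     # keep JVM heaps out of swap
    "nofile_ulimit": lambda v: v >= 32768,  # DataNodes hold many open files
    "mount_noatime": lambda v: v is True,   # skip access-time writes on data disks
}

def audit_node(settings):
    """Return the recommended settings this node fails to meet."""
    return [name for name, ok in RECOMMENDED.items() if not ok(settings.get(name))]

node = {"vm.swappiness": 60, "nofile_ulimit": 65536, "mount_noatime": False}
print(audit_node(node))  # ['vm.swappiness', 'mount_noatime']
```

Running a check like this across every node before loading data is a cheap way to turn "enterprise best practices" into something enforceable.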
<div>
I've included a few HDP reference architectures to give you a feel for what a Hadoop platform may look like.</div>
<div>
<ul>
<li><a href="http://h20195.www2.hp.com/v2/GetPDF.aspx%2F4AA4-7057ENW.pdf">HP reference architecture</a></li>
<li><a href="http://www.rackspace.com/knowledge_center/article/apache-hadoop-on-rackspace-private-cloud">Rackspace reference architecture</a></li>
<li><a href="http://hortonworks.com/wp-content/uploads/2012/06/StackIQ_Ref_Architecture_WPP_v2.pdf">StackIQ reference architecture</a></li>
<li><a href="http://www.cisco.com/en/US/docs/unified_computing/ucs/UCS_CVDs/flexpod_hadoop_hortonworks.html">Cisco UCS reference architecture</a></li>
</ul>
</div>
<div>
I'd like to keep building this reference architecture list.</div>
Anonymoushttp://www.blogger.com/profile/00291214843973166978noreply@blogger.com0tag:blogger.com,1999:blog-8585595729814983645.post-2787970825209677842013-08-12T09:21:00.001-06:002013-08-13T13:13:07.780-06:00A Changing Era for Oracle DBAsThroughout my career in IT I've always tried to stay on the leading edge of technology. During my journeys I have seen three eras of Oracle DBAs, and I now see we are about to enter a fourth. The eras up to this point have been:<br />
<ol>
<li><b>In The Land of The Blind, The One Eyed Man is King </b>- This was during the early releases of Oracle from Version 4 through Version 6. Relational databases and Oracle were relatively new to the industry so if you had any common sense about IT, Oracle technology or relational database concepts you were worth your weight in silver because Oracle was a growing technology and market.</li>
<li><b>The Speeds and Feeds DBA </b>- This was the time between Oracle 7 and Oracle 10g. Very technical DBAs who understood the internals of Oracle, infrastructure, tuning, backup and recovery, RAC, Streams and Data Guard were not only hard to find but worth their weight in gold. This in-depth knowledge came from reading the source code and/or working endlessly to learn the internals of how to maximize Oracle in the infrastructure. These very technical DBAs made great careers out of their knowledge. I think of this as the golden age of DBAs. </li>
<li><b>The G DBAs </b>- This is the time of the Google and GUI DBAs. These DBAs are a product of the evolving Oracle environment and the successes of the Oracle product. Oracle software can now identify problems, fix them and perform very detailed operations at the click of a button. There are also tremendous numbers of books, whitepapers, blog posts and tutorials that can teach someone in a short time what used to take years of experience and effort to acquire. So you now have a large percentage of GUI and Google DBAs who can perform work that previously required highly skilled experts with years of experience. One thing about the GUI and Google DBAs is that they can be much easier to replace and outsource. Don't get me wrong, the Speeds and Feeds DBAs are still needed, but not as much, and there are fewer of them around every year.</li>
<li><b>The Platform DBA</b> - If the Speeds and Feeds era was the golden age, then this is the platinum age. The top DBAs in the world today are not only Oracle experts but also infrastructure experts. They are also recognized experts in areas such as architecture, design, storage, networking, applications and business. When you look at the top Oracle RAC, Exadata, ASM, GoldenGate and MAA DBAs, they are experts not only in a specific Oracle domain but also in the environment surrounding Oracle. So who are the Oracle DBAs that are going to dominate the marketplace and be worth their weight in platinum in the next few years? It's going to be the Platform DBAs: Oracle experts who also understand areas such as the Cloud, Enterprise Virtualization Platforms, Big Data, Enterprise Data Management and the business.</li>
</ol>
<div>
In every company I meet with, I see that structured data is going to be by far the smallest percentage of the data companies need to manage. Unstructured and semi-structured data is going to dwarf structured data (traditional databases and warehouses). This is a consistent industry perspective and is reiterated by industry analysts. DBAs need to be data experts, not just Oracle experts. In the next few years, we will see:</div>
<div>
<ul>
<li>Oracle as a service will increase and we will see more movement into the cloud.</li>
<li>Oracle environments will be interacting more with Big Data environments.</li>
<li>Oracle tier one environments will increasingly be virtualized.</li>
<li>Oracle business applications continue to dominate the market, and application DBAs who understand the business applications are increasingly in demand.</li>
</ul>
<div>
Every new era in Oracle and the IT industry creates tremendous opportunity for those with the drive and energy to see and seize those opportunities. I look forward to interacting and sharing knowledge, experiences and wisdom as we move into this new era.</div>
</div>
Anonymoushttp://www.blogger.com/profile/00291214843973166978noreply@blogger.com0tag:blogger.com,1999:blog-8585595729814983645.post-51157576631753044992013-06-29T01:28:00.001-06:002013-06-29T02:31:38.902-06:00Weaknesses in Traditional Data PlatformsEveryone understands that Hadoop brings high-performance commercial computing to organizations using relatively low-cost commodity storage. What is accelerating the move to Hadoop are weaknesses in traditional relational and data warehouse platforms in meeting today's business needs. Some key weaknesses of traditional platforms include:<br />
<br />
<ul>
<li>Rigid, up-front schema binding greatly increases the latency between receiving new data sources and deriving business value from that data. </li>
<li>The high cost and complexity of SAN storage. This cost forces organizations to aggregate away and remove a lot of data that contains high business value. Important details and information get thrown out or hidden in aggregated data.</li>
<li>The complexity of working with semi-structured and unstructured data.</li>
<li>The incredible cost, complexity and ramifications of maintaining database administration, storage and networking teams on traditional platforms. There are lots of silos of expertise and software required in traditional environments, and they have dramatic effects on agility and cost. It's gotten to the point that vendors are now delivering extremely expensive engineered systems to deal with the complexity of these silos. These engineered systems require even more specialized expertise to maintain and make customers ever more dependent on the vendors. What's funny is you hear the old phrase "one throat to choke," but it's the customer who's choking on the cost. With Hadoop's self-healing and fault tolerance, a small team can manage thousands of servers; a single Hadoop administrator can manage 1,000 - 3,000 nodes, all on relatively inexpensive commodity hardware.</li>
</ul>
<div>
While the above highlights the need for Hadoop, it's also important to understand that traditional relational databases and data warehouses still have an important role. A relational database provides a completely different function than a Hadoop cluster. Also, a company is not going to throw out all its existing data warehouses or the expertise and reporting it has built around them. Hadoop today is usually used to add new capabilities to an enterprise data environment, not to replace existing platforms. <br />
<br />
The old line that no one ever gets fired for buying IBM is a thing of the past with Hadoop. An entire organization may go under if its competition is using big data effectively and it is not. Hadoop is the most disruptive technology since the dot-com days. </div>
Anonymoushttp://www.blogger.com/profile/00291214843973166978noreply@blogger.com0tag:blogger.com,1999:blog-8585595729814983645.post-51149669386492266762013-06-28T08:24:00.000-06:002013-06-29T10:38:43.362-06:00Hadoop Summit 2013 in San JoseIt has been a privilege to present at the Hadoop Summits this year in Amsterdam and San Jose. This week was one of the best networking weeks I've ever had at a conference. Great seeing all my Oracle, VMware, Rackspace and MySQL friends as well as meeting a lot of new friends in the Hadoop ecosystem. <br />
<br />
Key takeaways:<br />
<ul>
<li>Hadoop's disruption of the IT industry is accelerating.</li>
<li>Hadoop 2.0 will significantly increase enterprise adoption.</li>
<li>YARN is the distributed operating system of the future.</li>
<li>Incredible success stories of the ROI around Hadoop.</li>
<li>Open source community is about innovation, community and sharing.</li>
<li>Lots of analytics software competing to run on Hadoop. This will be the big battleground.</li>
<li>Hortonworks reinforced its innovation and leadership in defining the roadmap for Hadoop.</li>
<li>Hortonworks constantly demonstrated their platform expertise with Hadoop. </li>
<li>Hadoop is a high performance commercial computing environment.</li>
</ul>
Two coolest things I liked at the conference:<br />
<br />
<ul>
<li>An 8-node Raspberry Pi Hadoop cluster.</li>
<li>Creating a multi-node (VM) Hadoop cluster on your laptop using the Hortonworks Sandbox.</li>
</ul>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifLNCmdWXL_J8S8FN3pFu6X_qSOWLvVHEpGYueBWBwIcZCrSfi6pVf0LPYRJj5MDBEeW0hV7kV8qYe2y0zxHI7_dhaFmppvnTOHX3SxZI0vedtmhwnuwaEu9JBtaYWVRf8Q2qD-EW3CW4A/s1600/IMG_1047.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifLNCmdWXL_J8S8FN3pFu6X_qSOWLvVHEpGYueBWBwIcZCrSfi6pVf0LPYRJj5MDBEeW0hV7kV8qYe2y0zxHI7_dhaFmppvnTOHX3SxZI0vedtmhwnuwaEu9JBtaYWVRf8Q2qD-EW3CW4A/s400/IMG_1047.jpg" width="400" /></a></div>
<br />
I was also able to barter for a Yahoo soccer ball (all it cost me was a Hortonworks water bottle). :)<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhF82HdnwLRsZBLRsXQHIsTVhkRTcEW1A8fQABr07d3Jd2Mp2DTf1_r8jqIaABhnc9BFGiiQvvhXsEPU3Hv7enwyEBKoVPAj6KxLLuBCn77leNg8Cren6vZYahELbeP6G8HiYxrV-fk7DLY/s1600/IMG_1048.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhF82HdnwLRsZBLRsXQHIsTVhkRTcEW1A8fQABr07d3Jd2Mp2DTf1_r8jqIaABhnc9BFGiiQvvhXsEPU3Hv7enwyEBKoVPAj6KxLLuBCn77leNg8Cren6vZYahELbeP6G8HiYxrV-fk7DLY/s320/IMG_1048.jpg" width="240" /></a></div>
<br />Anonymoushttp://www.blogger.com/profile/00291214843973166978noreply@blogger.com0tag:blogger.com,1999:blog-8585595729814983645.post-19210024519175378772013-06-26T11:12:00.002-06:002013-06-29T10:46:02.941-06:00Hadoop - It's All About The DataA key point to understand about Hadoop is that it's all about the data. Don't lose focus. It's easy to get hung up on Hive, Pig, HBase, HCatalog and lose sight of designing the right data architecture. Also, if you have a strong background in data warehouse design, BI, analytics, etc. all those skills are transferable to Hadoop. Hadoop just takes data warehousing to new levels of scalability and agility with reduction of business latency while working with data sets ranging from structured to unstructured data. Hadoop 2.0 and YARN are going to move Hadoop deep into the enterprise and allow organizations to make faster and more accurate business decisions. The ROI of Hadoop is multiple factors higher than the traditional data warehouse. Companies should be extremely nervous about being out Hadooped by their competition.<br />
<br />
Newbies often look at Hadoop with wide eyes instead of recognizing that it is built from components they already understand, such as clustering, distributed file systems, parallel processing, and batch and stream processing.<br />
<br />
A few key success factors for a Hadoop project are:<br />
<ul>
<li>Start with a good data design using a scalable reference architecture.</li>
<li>Build successful analytical models that provide business value.</li>
<li>Be aggressive in reducing the latency between data hitting the disk and leveraging business value from that data. </li>
</ul>
The ETL strategies and data set generation you use in your data warehouse are similar to what you are going to want to do in your Hadoop cluster. It's important to look at your data warehouse and understand how your enterprise data strategy is going to evolve with Hadoop now part of the ecosystem.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjU9UO9MCqxKigIr77xQSB12wb7cI8tmN6I1Zibrmcf74lg09IXYosexQeUZmsyBkJ9cAc67-q8EGYq3YXHr1iMJg_hP793Yp1MiPiKTKHlE7wHbtdGvYViB7U9cK4oRq-fQeqkdfQQCObR/s1524/Screen+Shot+2013-06-28+at+6.26.06+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="295" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjU9UO9MCqxKigIr77xQSB12wb7cI8tmN6I1Zibrmcf74lg09IXYosexQeUZmsyBkJ9cAc67-q8EGYq3YXHr1iMJg_hP793Yp1MiPiKTKHlE7wHbtdGvYViB7U9cK4oRq-fQeqkdfQQCObR/s400/Screen+Shot+2013-06-28+at+6.26.06+AM.png" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<div class="p1">
"Hadoop cannot be an island, it must integrate with Enterprise Data Architecture". - HadoopSummit</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj59gomNIsQpYCd64J5mXJdYV7Of2E1BZYKisiKKZqHYRs431yEfK1qjexUURR_9suMX0AfnQEfrKKfBlQIB3OK3daadg8E6EFBnkk5E2tqGRZ_Ygqz5_rcN7RGgURkIN9pbAB6upLuvVmP/s1588/Screen+Shot+2013-06-28+at+6.26.22+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="272" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj59gomNIsQpYCd64J5mXJdYV7Of2E1BZYKisiKKZqHYRs431yEfK1qjexUURR_9suMX0AfnQEfrKKfBlQIB3OK3daadg8E6EFBnkk5E2tqGRZ_Ygqz5_rcN7RGgURkIN9pbAB6upLuvVmP/s400/Screen+Shot+2013-06-28+at+6.26.22+AM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSxK9812_WUZtKqL4dJAr3kWO_GSNbwz25oWV6ZO97t9eax95ed0bPcXmTk8a7j87u2YnegUEftQDLRSSrDpYkp-29pqYP4KXhaTHGrHc_Ge3Eu1s7Sl9QyKdiTtv_lmRWpkJktFnocLoT/s1600/Screen+Shot+2013-06-26+at+10.01.56+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="290" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSxK9812_WUZtKqL4dJAr3kWO_GSNbwz25oWV6ZO97t9eax95ed0bPcXmTk8a7j87u2YnegUEftQDLRSSrDpYkp-29pqYP4KXhaTHGrHc_Ge3Eu1s7Sl9QyKdiTtv_lmRWpkJktFnocLoT/s400/Screen+Shot+2013-06-26+at+10.01.56+AM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
"Apache Hadoop is a <b>set</b> of standard open-source software projects that provide a framework for using massive amounts of data across a distributed network." - Merv Adrian at Gartner Research</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkXu5kOJ4EY7xcj3Nr4rBjFqsVHjyZJRscrIVBYlzA3qdnAtrmmrttk6PHeHUr9gTAXRgZ2hcouRXMhVgMNknCw7RFlHANwoEe45Ahpxcn-wXnphF0H3zT_1foQlUUTSZeUAecCqKtc0D_/s1600/Screen+Shot+2013-06-26+at+10.02.40+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="262" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkXu5kOJ4EY7xcj3Nr4rBjFqsVHjyZJRscrIVBYlzA3qdnAtrmmrttk6PHeHUr9gTAXRgZ2hcouRXMhVgMNknCw7RFlHANwoEe45Ahpxcn-wXnphF0H3zT_1foQlUUTSZeUAecCqKtc0D_/s400/Screen+Shot+2013-06-26+at+10.02.40+AM.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
This is a sample Hadoop 1.x cluster so you can see the key software processes that make up Hadoop. The good point of this diagram is that if you understand it you are probably worth another $20-30k. :) </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBqXub-Wi7ua79ALU9E2bqIYoCoBBYUCmkceoJKiA_f1MUNIAPRyXvAU7yWUggop6wn9ulWQpPCpPK9oO49m3-MVHTA0cE10PY_og-WQoaZudamnOXNxf84OT-0vXK8Don1pWqqzmXrhyphenhyphenh/s1352/Screen+Shot+2013-06-28+at+6.18.55+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="478" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBqXub-Wi7ua79ALU9E2bqIYoCoBBYUCmkceoJKiA_f1MUNIAPRyXvAU7yWUggop6wn9ulWQpPCpPK9oO49m3-MVHTA0cE10PY_og-WQoaZudamnOXNxf84OT-0vXK8Don1pWqqzmXrhyphenhyphenh/s640/Screen+Shot+2013-06-28+at+6.18.55+AM.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhycGy7FvVbJuS_9kWkZu-0oTYKTKBTwtjVnaKp9F7yRuqOpYai2dvDYfg-YNyCEV2L94YXqlRc_nYWgjp_s-inkZgi7aD-J2Jo46-kZU0FoV3E22mk1jQCUhDF9WA1hKaLbEEyXWzXjIIo/s1600/Screen+Shot+2013-06-26+at+10.55.26+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="452" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhycGy7FvVbJuS_9kWkZu-0oTYKTKBTwtjVnaKp9F7yRuqOpYai2dvDYfg-YNyCEV2L94YXqlRc_nYWgjp_s-inkZgi7aD-J2Jo46-kZU0FoV3E22mk1jQCUhDF9WA1hKaLbEEyXWzXjIIo/s640/Screen+Shot+2013-06-26+at+10.55.26+AM.png" width="640" /></a></div>
<br />
<span style="background-color: white; color: #454545; font-family: 'Helvetica Neue', Helvetica, Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px;">YARN (Hadoop 2.0) is the distributed operating system of the future. YARN allows you to run multiple applications in Hadoop, all sharing common resource management. YARN is going to disrupt the data industry to a level not seen since the dot-com days. </span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbdmnJzBoDWg9gnO_Kr1Qf5w8jZgtNdArDiOaijLvsTEec1_gSzXxuSM-XHxrWweKoGVLfE7MjV7TSrc7zTDwLeiPDch_ih4Khwg3ba8AsC39o2-NOO4plgjJZ2XTiLJ36qmDUVMIGLkOv/s1324/Screen+Shot+2013-06-28+at+5.03.50+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="235" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbdmnJzBoDWg9gnO_Kr1Qf5w8jZgtNdArDiOaijLvsTEec1_gSzXxuSM-XHxrWweKoGVLfE7MjV7TSrc7zTDwLeiPDch_ih4Khwg3ba8AsC39o2-NOO4plgjJZ2XTiLJ36qmDUVMIGLkOv/s640/Screen+Shot+2013-06-28+at+5.03.50+AM.png" width="640" /></a></div>
<br />
A Hadoop cluster will usually have multiple data layers. <br />
<br />
<ul>
<li>Batch Layer: Raw data is loaded into a data set that is immutable so it becomes your source of truth. Data scientists and analysts can start working with this data as soon as it hits the disk. </li>
<li>Serving Layer: Just as in a traditional data warehouse, this data is often massaged, filtered and transformed into a data set that is easier to do analytics on. Unstructured and semi-structured data will be put into a data set that is easier to work with. Metadata is then attached to this data layer using HCatalog so users can access the data in the HDFS files using abstract table definitions. </li>
<li>Speed Layer: To optimize the data access and performance often additional data sets (views) are calculated to create a speed layer. HBase can be used for this layer dependent on the requirements.</li>
</ul>
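These layers can be sketched without any Hadoop at all. The toy pipeline below is purely illustrative (the event strings, field names and counts are made up), but it shows the shape of the idea: an immutable batch layer, a queryable serving layer, and precomputed speed-layer views:

```python
# Toy sketch of the three data layers; a tuple of raw CSV-style event
# strings stands in for files landing in HDFS.

# Batch layer: immutable source of truth -- append only, never rewritten.
raw_events = (
    "2013-06-26,page_view,alice",
    "2013-06-26,page_view,bob",
    "2013-06-27,purchase,alice",
)

# Serving layer: parsed into a structure that is easy to query, the way
# HCatalog table definitions sit on top of HDFS files.
serving = [dict(zip(("date", "event", "user"), line.split(","))) for line in raw_events]

# Speed layer: precomputed views for fast access, e.g. events per user.
speed_view = {}
for row in serving:
    speed_view[row["user"]] = speed_view.get(row["user"], 0) + 1

print(speed_view)  # {'alice': 2, 'bob': 1}
```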
<br />
<br />
This diagram emphasizes two key points:<br />
<ul>
<li>The different data layers you will have in your Hadoop cluster.</li>
<li>The importance of building your metadata layer (HCatalog).</li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2Xmo_bVOxKWRjikmyNJomOWZgbihV1QZt_QAjFJNm9DrMbSLQRoQq8c-sfK1BWXh5ep13RX-TQFVKjrTOX6jS-X0cM2HqUc4fINGeVON1XFkhn4XhZT0Dgc72KqKl927QyQmiYOfZ3OpI/s1600/Screen+Shot+2013-06-28+at+5.22.50+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2Xmo_bVOxKWRjikmyNJomOWZgbihV1QZt_QAjFJNm9DrMbSLQRoQq8c-sfK1BWXh5ep13RX-TQFVKjrTOX6jS-X0cM2HqUc4fINGeVON1XFkhn4XhZT0Dgc72KqKl927QyQmiYOfZ3OpI/s640/Screen+Shot+2013-06-28+at+5.22.50+AM.png" width="640" /></a></div>
With the massive scalability of Hadoop, you need to be able to automate as much as possible and manage the data in your cluster. This is where Falcon is going to play a key role. Falcon is a data lifecycle management framework that provides the data orchestration, disaster recovery as well as data retention you need to manage your data.<br />
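At its core, a retention policy of the kind Falcon automates is an eviction sweep over dated dataset instances. A toy sketch follows; the paths, dates and 90-day window are all made up for illustration:

```python
from datetime import date, timedelta

# Toy retention sweep: keep daily dataset instances newer than N days,
# the kind of eviction policy a data lifecycle tool automates.
RETENTION_DAYS = 90
today = date(2013, 6, 28)

# Hypothetical dataset instances, keyed by path, valued by instance date.
instances = {
    "/data/raw/2013-01-15": date(2013, 1, 15),
    "/data/raw/2013-06-01": date(2013, 6, 1),
    "/data/raw/2013-06-27": date(2013, 6, 27),
}

cutoff = today - timedelta(days=RETENTION_DAYS)
kept = sorted(path for path, d in instances.items() if d >= cutoff)
print(kept)  # ['/data/raw/2013-06-01', '/data/raw/2013-06-27']
```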
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6AeCDJOWEykyG1FLi43uqcNwe6JKNTm5gIyJ4CBMzlb4ah-ie-ocPAm5MSCoTSYfF2Y9Sq3zB7JpjKjdclTKkjClFtKgnHYjUv2dfH_ADuKX6uRa9D9R31-RWjduujiCDorI0hRyd1f54/s1346/Screen+Shot+2013-06-28+at+5.29.39+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="424" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6AeCDJOWEykyG1FLi43uqcNwe6JKNTm5gIyJ4CBMzlb4ah-ie-ocPAm5MSCoTSYfF2Y9Sq3zB7JpjKjdclTKkjClFtKgnHYjUv2dfH_ADuKX6uRa9D9R31-RWjduujiCDorI0hRyd1f54/s640/Screen+Shot+2013-06-28+at+5.29.39+AM.png" width="640" /></a></div>
<span style="background-color: white; color: #454545; font-family: 'Helvetica Neue', Helvetica, Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px;"><br /></span>Anonymoushttp://www.blogger.com/profile/00291214843973166978noreply@blogger.com0tag:blogger.com,1999:blog-8585595729814983645.post-19493967366365299592013-06-26T10:31:00.004-06:002013-06-26T11:18:12.424-06:00Hadoop Summit Keynote - San Jose, 2013I wanted to share key thoughts from the keynote at Hadoop Summit 2013 in San Jose.<br />
<br />
<span style="color: blue;"><b>Merv Adrian at Gartner Research</b></span><br />
"Apache Hadoop is a <b>set</b> of standard open-source software projects that provide a framework for using massive amounts of data across a distributed network."<br />
<br />
<div class="p1">
Traditional IM</div>
<div class="p1">
</div>
<ul>
<li>Requirements based</li>
<li>Top-down design</li>
<li>Integration and reuse</li>
<li>Technology/consolidation</li>
<li>World of DW and ECM</li>
<li>Competence centers</li>
<li>Better decisions</li>
<li>commercial software</li>
</ul>
<div class="p1">
Big Data Style</div>
<div class="p1">
</div>
<ul>
<li>Opportunity Oriented</li>
<li>Bottom-up experimentation</li>
<li>Immediate use</li>
<li>Tool proliferation</li>
<li>World of Hadoop</li>
<li>Hackathons</li>
<li>Better business</li>
<li>Open source</li>
</ul>
<div class="p1">
<br /></div>
<div class="p1">
<br /></div>
<div class="p1">
Which Projects Are "Hadoop"? Minimum set from Apache website:</div>
<div class="p1">
</div>
<ul>
<li>Apache HDFS</li>
<li>Apache MapReduce</li>
<li>Apache YARN</li>
<li>Other independent Apache projects: Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, ZooKeeper</li>
</ul>
<div class="p1">
Rich, Complex Set of Functional Choices</div>
<div class="p1">
</div>
<ul>
<li>Ingest/Propagate</li>
<li>Describe, Develop</li>
<li>Compute, Search</li>
<li>Persist</li>
</ul>
<div class="p1">
<b>Ingest/Propagate:</b><br />
Apache Flume, Apache Kafka, Apache Sqoop, HDFS, NFS, Informatica HParser, DBMS vendor utilities, Talend, WebHDFS</div>
<div class="p1">
<br /></div>
<div class="p1">
<b>Describe, Develop: </b></div>
<div class="p1">
Apache Crunch, Apache Hive, Apache Pig, Apache Tika, Cascading, Cloudera Hue, DataFu, Dataguise, IBM Jaql</div>
<div class="p1">
<br /></div>
<div class="p1">
<b>Compute, Search:</b></div>
<div class="p1">
Apache Blur, Apache Drill, Apache Giraph, Apache Hama, Apache Lucene, Apache MapReduce, Apache Solr, Cloudera Impala, HP HAVEn, IBM Big SQL, IBM InfoSphere Streams, HStreaming, Pivotal HAWQ, SQLstream, Storm, Teradata SQL-H</div>
<div class="p1">
<br /></div>
<div class="p1">
<b>Persist:</b></div>
<div class="p1">
Apache HDFS, IBM GPFS, Lustre, MapR Data Platform</div>
<div class="p1">
<b>Serialization: </b></div>
<div class="p1">
Apache Avro, RCFile (and ORCFile), SequenceFile, Text, Trevni</div>
<div class="p1">
DBMS: Apache Accumulo, Apache Cassandra, Apache HBase, Google Dremel</div>
<div class="p1">
<b>Monitor, Administer:</b></div>
<div class="p1">
Apache Ambari, Apache Chukwa, Apache Falcon, Apache Oozie, Apache Whirr, Apache ZooKeeper, Cloudera Manager</div>
<div class="p1">
<b>Analytics, Machine Learning: </b><br />
Apache Drill, Apache Hive, Apache Mahout, Datameer, IBM BigSheets, IBM Big SQL, Karmasphere, Microsoft Excel, Platfora, Revolution Analytics RHadoop, SAS, Skytree</div>
<div class="p1">
<br /></div>
<div class="p1">
Leading pure plays: Cloudera, Hortonworks, MapR</div>
<div class="p1">
Others: DataStax, LucidWorks, RainStor, Sqrrl, WANdisco, Zettaset</div>
<div class="p1">
<br /></div>
<div class="p1">
Hadoop has moved to the next state with Apache Hadoop 2.0.</div>
<div class="p1">
<br /></div>
<div class="p1">
What's Next for Hadoop?</div>
<div class="p1">
</div>
<ul>
<li>Search</li>
<li>Advanced prebuilt analytic functions</li>
<li>Cluster, appliance or cloud?</li>
<li>Virtualization</li>
<li>Graph processing</li>
</ul>
<div class="p1">
What's Still Needed in Hadoop Ecosystem?</div>
<div class="p1">
</div>
<ul>
<li>Security</li>
<li>Data Warehousing Tools</li>
<li>Governance</li>
<li>Skills</li>
<li>Subproject Optimization</li>
<li>Distributed Optimization</li>
</ul>
<div class="p1">
Recommendations</div>
<div class="p1">
<ul>
<li>Audit your data - find "dark data" and map it to business opportunities to identify pilot projects</li>
<li>Familiarize yourself with the capabilities of available Hadoop distributions. </li>
<li>Build skill and recruit it.</li>
</ul>
Is Hadoop starting to happen in the cloud? Amazon has created 5.5 million elastic Hadoop instances in the last year.<br />
<br />
Ganesha - God of Success<br />
<br />
<b><span style="color: blue;">Shaun Connolly - Hortonworks</span></b></div>
<div class="p1">
Key requirement of a "Data Lake": store ALL DATA in one place and interact with that data in multiple ways.<br />
<br />
YARN is a distributed operating system for processing.<br />
<br />
YARN Takes Hadoop Beyond Batch<br />
Applications run IN Hadoop versus ON Hadoop, with predictable performance and quality of service.<br />
<br />
<span style="color: blue;"><b>Mohit Saxena - InMobi</b></span><br />
Good insights in managing and processing data at scale across data centers.<br />
InMobi contributed Apache Falcon to address Hadoop data lifecycle management.<br />
<br />
<span style="color: blue;"><b>Scott Gnau - President, Teradata Labs</b></span><br />
Good discussion on mission critical process and applications.<br />
<br />
<b><span style="color: blue;">Bruno Fernandez - Yahoo</span></b><br />
Search was their first use case for Hadoop. One of their key drivers is the growth of devices connected to the cloud.<br />
<br />
<br />
<br /></div>
Anonymoushttp://www.blogger.com/profile/00291214843973166978noreply@blogger.com0tag:blogger.com,1999:blog-8585595729814983645.post-90939586489322318172013-06-23T09:28:00.000-06:002013-09-17T15:44:45.034-06:00Using the HDP Sandbox to Learn SqoopOnce you have your HDP Sandbox up and running, you can use Sqoop to move data between your Hadoop cluster and your relational database. Your Hadoop Hive/HCatalog environment uses a MySQL database server for storing metadata, so you can use the built-in MySQL database server to play with Sqoop. In real life you would not use this specific MySQL database server to play, but I'm going to for this demo. Credit for this demo goes to Tom Hanlon (a longtime friend and great resource in the Hadoop space).<br />
<br />
Be aware that Sqoop is not atomic. After a data load, it is good practice to do a record count on both sides and make sure they match.<br />
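On a real cluster that check might compare `SELECT COUNT(*) FROM mytab` against `hadoop fs -cat mytab/part-* | wc -l`. The sketch below shows the comparison logic in Python, with stand-in values in place of the real database and HDFS calls:

```python
# Sketch of a post-import sanity check. In practice source_count would come
# from "SELECT COUNT(*) FROM mytab" and hdfs_lines from
# "hadoop fs -cat mytab/part-* | wc -l"; both values here are stand-ins.
source_count = 4  # rows reported by the source database

# Contents an import of the demo table might write to part-m-00000.
hdfs_part_file = "1,Tom\n2,George\n3,Barry\n4,Mark\n"
hdfs_lines = hdfs_part_file.count("\n")

if source_count != hdfs_lines:
    raise RuntimeError(f"count mismatch: db={source_count} hdfs={hdfs_lines}")
print("counts match:", source_count)
```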
<br />
Log into your HDP Sandbox as root to bring up a terminal window (instructions are provided in the Sandbox). The loopback address 127.0.0.1 is a non-routable IP address that refers to the local host.<br />
<br />
<b>Demo One: </b> Move data from a relational database into your Hadoop cluster. Then use HDFS commands to verify that the files reside in your Hadoop cluster on HDFS.<br />
Connect to MySQL using the mysql client, create a database and build a simple table.<br />
# mysql<br />
mysql> CREATE DATABASE sqoopdb;<br />
mysql> USE sqoopdb;<br />
mysql> CREATE TABLE mytab (id int not null auto_increment primary key, name varchar(20));<br />
mysql> INSERT INTO mytab VALUES (null, 'Tom');<br />
mysql> INSERT INTO mytab VALUES (null, 'George');<br />
mysql> INSERT INTO mytab VALUES (null, 'Barry');<br />
mysql> INSERT INTO mytab VALUES (null, 'Mark');<br />
mysql> GRANT ALL ON sqoopdb.* to root@localhost;<br />
mysql> GRANT ALL ON sqoopdb.* to root@'%';<br />
mysql> exit;<br />
<br />
-- Sqoop command requires permission to access the database as well as HDFS.<br />
# su - hdfs<br />
$ sqoop import --connect jdbc:mysql://127.0.0.1/sqoopdb --username root --direct --table mytab --m 1<br />
<br />
$ hadoop fs -lsr mytab<br />
$ hadoop fs -cat mytab/part-m-00000<br />
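The `-cat` output is plain comma-separated text, one row per line (Sqoop's default delimiters for text imports). A quick sketch of turning that text back into records; the literal below mirrors the four rows inserted above:

```python
# Parse the text Sqoop writes to part-m-00000 back into (id, name) tuples.
# The literal mirrors the four demo rows; a real script would read the
# actual file contents instead.
part_contents = "1,Tom\n2,George\n3,Barry\n4,Mark\n"

rows = [
    (int(id_str), name)
    for id_str, name in (line.split(",", 1) for line in part_contents.splitlines())
]
print(rows)  # [(1, 'Tom'), (2, 'George'), (3, 'Barry'), (4, 'Mark')]
```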
<br />
<b>-- Demo Two: </b> Load data from a relational database into Hive. Then query the data using Hive.<br />
# mysql<br />
mysql> USE sqoopdb;<br />
mysql> CREATE TABLE newtab (id int not null auto_increment primary key, name varchar(20));<br />
mysql> INSERT INTO newtab VALUES (null, 'Tom');<br />
mysql> INSERT INTO newtab VALUES (null, 'George');<br />
mysql> INSERT INTO newtab VALUES (null, 'Barry');<br />
mysql> INSERT INTO newtab VALUES (null, 'Mark');<br />
mysql> exit;<br />
<br />
# su - hdfs<br />
$ sqoop import --connect jdbc:mysql://127.0.0.1/sqoopdb --username root --table newtab \<br />
--direct --m 1 --hive-import<br />
<br />
-- Hive has a command line interface for working with the data. Using the Hive metadata, Hive users<br />
-- can access the data through a SQL interface. The person running the hive command must have read<br />
-- access in HDFS.<br />
$ hive<br />
hive> show tables;<br />
hive> SELECT * FROM newtab;<br />
hive> exit;<br />
$<br />
<br />
<br />
-- The physical files will be stored in the HDFS directory location defined by the following property in the /etc/hive/conf/hive-site.xml file.<br />
-- <property><br />
--   <name>hive.metastore.warehouse.dir</name><br />
--   <value>/apps/hive/warehouse</value><br />
-- </property><br />
<br />
-- Look at the data location in HDFS.<br />
$ hadoop fs -lsr /apps/hive/warehouse/newtab<br />
<br />
-- Look at the data contents.<br />
$ hadoop fs -cat /apps/hive/warehouse/newtab/part-m-00000<br />
<br />
<br />
-- You can use the following help commands along with the documentation to do a lot of examples moving data between your Hadoop cluster and a relational database using Sqoop.<br />
$ sqoop help<br />
$ sqoop help import<br />
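One thing `sqoop help import` will point you to is the `--m` and `--split-by` options: with `--m 1` a single mapper does the whole import, while with `--m N` Sqoop divides the work into ranges of the split column (the primary key by default). A rough sketch of that range arithmetic follows; this is an illustration of the idea, not Sqoop's actual implementation:

```python
# Rough sketch of splitting an import across N mappers by primary-key
# range, the way --m N with a numeric split-by column divides the work.
def split_ranges(min_id, max_id, num_mappers):
    """Return [lo, hi) id ranges, one per mapper, covering min_id..max_id."""
    span = max_id - min_id + 1
    step = -(-span // num_mappers)  # ceiling division
    ranges = []
    lo = min_id
    while lo <= max_id:
        hi = min(lo + step, max_id + 1)
        ranges.append((lo, hi))
        lo = hi
    return ranges

print(split_ranges(1, 10, 3))  # [(1, 5), (5, 9), (9, 11)]
```

Each mapper then runs its own query restricted to one range, which is why a numeric, evenly distributed split column matters for balanced imports.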
<br />
<br />
<div class="O0" style="direction: ltr; margin-bottom: 0pt; margin-left: .38in; margin-top: 6.72pt; text-align: left; text-indent: -.38in; unicode-bidi: embed; vertical-align: baseline;">
<br /></div>
<!--StartFragment-->
<!--EndFragment--><br />
<div class="O0" style="direction: ltr; margin-bottom: 0pt; margin-left: .38in; margin-top: 6.72pt; text-align: left; text-indent: -.38in; unicode-bidi: embed; vertical-align: baseline;">
<br /></div>
<br />
<!--StartFragment-->
<!--EndFragment-->Anonymoushttp://www.blogger.com/profile/00291214843973166978noreply@blogger.com0tag:blogger.com,1999:blog-8585595729814983645.post-76892865509740951652013-06-23T00:45:00.001-06:002013-11-11T22:58:22.913-07:00The HDP Sandbox is a Great Way to Start Learning Hadoop<b>Use the HDP Sandbox to Develop Your Hadoop Admin and Development Skills</b><br />
Unless you have your own Hadoop cluster to play with, I strongly recommend you get the HDP Sandbox up and running on your laptop. What's nice about the HDP Sandbox is that it is 100% open source. The features and frameworks are free; you're not learning from some vendor's proprietary Hadoop version with features they will charge you for. With the Sandbox and HDP you are learning Hadoop from a true open source perspective. <br />
<br />
<b>The Sandbox contains:</b><br />
<ul>
<li>A fully functional Hadoop cluster running Ambari to play with. You can run examples and sample code. Being able to use the HDP Sandbox is a great way to get hands on practice as you are learning.</li>
<li>Your choice of Type 2 Hypervisors (VMware, VirtualBox or Hyper-V) to install Hadoop on.</li>
<li>Hadoop running on CentOS 6.4 with Java 1.6.0_24 (in the VMware VM).</li>
<li>MySQL and Postgres database servers for the Hadoop cluster.</li>
<li>Ability to log in as root in the CentOS OS and have command line access to your Hadoop cluster.</li>
<li>Ambari, the management and monitoring tool for Apache Hadoop and OpenStack.</li>
<li>Hue is included in the HDP Sandbox. Hue is a GUI containing:</li>
</ul>
<ul><ul>
<li>Query editors for Hive, Pig and HCatalog</li>
<li>File Browser for HDFS</li>
<li>Job Designer/Browser for MapReduce</li>
<li>Oozie editor/dashboard</li>
<li>Pig, HBase and Bash shells</li>
<li>A collection of Hadoop APIs.</li>
</ul>
</ul>
<b>With the Hadoop Sandbox you can:</b><br />
<ul>
<li>Point and click and run through the tutorials and videos. Hit the Update button to get the latest tutorials.</li>
<li>Use Ambari to manage and monitor your Hadoop cluster.</li>
<li>Use the Linux bash shell to log into CentOS as root and get command line access to your Hadoop environment.</li>
<ul>
<li>Run a jps command and see all the master servers, data nodes and HBase processes running in your Hadoop cluster. </li>
<li>At the Linux prompt get access to your configuration files and administration scripts. </li>
</ul>
<li>Use the Hue GUI to run pig, hive, hcatalog commands.</li>
<li>Download tools like Datameer and Talend and access your Hadoop cluster from popular tools in the ecosystem.</li>
<li>Download data from the Internet and practice data ingestion into your Hadoop cluster.</li>
<li>Use Sqoop and the MySQL database that is running to practice moving data between a relational database and a Hadoop cluster. (Reminder: this MySQL database holds your Hadoop cluster's metadata, so be careful playing with it. In real life you would not use a metadata database to play with; you'd create a separate MySQL database server.)</li>
<li>If using VMware Fusion you can create snapshots of your VM, so you can always roll back.</li>
</ul>
<div>
<b>Downloading the HDP Sandbox and Working with an OVA File</b><br />
<div class="p1">
<a href="http://hortonworks.com/products/hortonworks-sandbox/">http://hortonworks.com/products/hortonworks-sandbox/</a></div>
<div class="p2">
<br /></div>
<div class="p1">
The number one gotcha when installing the HDP Sandbox on a laptop is that virtualization is not enabled in the BIOS. If you have problems, this is the first thing to check.<br />
<br />
I chose the VMware VM, which downloads the Hortonworks+Sandbox+1.3+VMware+RC6.ova file. An OVA (Open Virtual Appliance) is a single-file distribution of an OVF package stored in TAR format. OVF (Open Virtualization Format) is a portable package format created to standardize the deployment of virtual appliances. An OVF package contains a number of files: a descriptor file, optional manifest and certificate files, optional disk images, and optional resource files (e.g. ISOs). The optional disk image files can be VMware vmdks or any other supported disk image format. </div>
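Because an OVA is just a TAR archive, you can list its contents before importing it (for example with `tar tvf` on the .ova file). The sketch below builds and lists a toy OVA with Python's tarfile module; the member names and contents are illustrative, not the actual Sandbox files:

```python
import io
import tarfile

# Build a toy OVA in memory: an OVA is a TAR archive whose members are the
# OVF package files (XML descriptor, optional manifest, disk images).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, data in (
        ("sandbox.ovf", b"<Envelope/>"),            # XML descriptor
        ("sandbox.mf", b"SHA1(sandbox.ovf)= ..."),  # optional manifest
        ("sandbox-disk1.vmdk", b"\x00" * 16),       # disk image
    ):
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Listing the archive is the equivalent of "tar tvf sandbox.ova".
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    members = tar.getnames()
print(members)  # ['sandbox.ovf', 'sandbox.mf', 'sandbox-disk1.vmdk']
```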
<div class="p2">
<br /></div>
<div class="p3">
VMware Fusion converts the virtual machine from OVF format to VMware runtime (<span class="s1">.vmx</span>) format.</div>
<div class="p3">
I went to the VMware Fusion menu bar and selected <b>File</b> - <b>Import </b>and imported the OVA file. Fusion performs OVF specification conformance and virtual hardware compliance checks. Once complete you can start the VM.</div>
<br />
<div class="p1">
When you start the VM, if you are asked to upgrade it, choose yes. You'll then be prompted to initiate your Hortonworks Sandbox session and to open a browser and enter a URL like:</div>
<div class="p3">
http://172.16.168.128. This will take you to a registration page. When you finish registration it brings up the Sandbox.</div>
<div class="p3">
</div>
<ul>
<li>Instructions are provided for how to start Ambari (management tool), how to login to the VM as root and how to set up your hosts file.</li>
<li>Instructions are provided on how to get your cursor back from the VM.</li>
</ul>
In summary, you download the Sandbox VM file, import it, start the VM, and the instructions will lead you down the Hadoop yellow brick road. When you start the VM, the initial screen will show you the URL for bringing up the management interface and also how to log in as root in a terminal window. To access the Ambari management interface:<br />
<ul>
<li>Browse to http://172.16.168.128 (yours may be different) to get to the videos, tutorials, Sandbox and Ambari setup instructions.</li>
<li>Running on Mac OS X, hit Ctrl-Alt-F5 to get a root terminal window. Log in as root/hadoop.</li>
<li>Make sure you know how to get out of the VM window. On the Mac it is Tab-Shift-Command.</li>
<li>Get access to the Ambari interface on port 8080, i.e. http://172.16.168.128:8080.</li>
</ul>
<br />
<br />
<b>Getting Started with the HDP Sandbox</b><br />
Start with the following steps:</div>
<div>
<ul>
<li>Get Ambari up and running. Follow all the instructions.</li>
<li>Bring up Hue. Look at all the interfaces and shells you have access to.</li>
<li>Log in as root using a terminal interface. In Sandbox 1.3 the logins are root/hadoop (superuser) and hue/hadoop (regular user).</li>
<li>Watch the videos.</li>
<li>Run through the tutorials. </li>
</ul>
<div>
Here is the Sandbox welcome screen. You are now walking into the light of Big Data and Hadoop. :) </div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQ6_69WXSEQCR6B6Bw_wnK7kkbtAJSKg_0IDmH6VB32yFRHW9b_mhcM6P5DF9Szoxyxv4CjHhhLt3iUDPM30b1tuomVroVs1Ch35clCR0xrImkgc-G_ay5YFrxLcwxfW3tQ4ZKYZhPxiNe/s1600/Screen+Shot+2013-06-22+at+11.16.28+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="393" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQ6_69WXSEQCR6B6Bw_wnK7kkbtAJSKg_0IDmH6VB32yFRHW9b_mhcM6P5DF9Szoxyxv4CjHhhLt3iUDPM30b1tuomVroVs1Ch35clCR0xrImkgc-G_ay5YFrxLcwxfW3tQ4ZKYZhPxiNe/s640/Screen+Shot+2013-06-22+at+11.16.28+PM.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
A few commands to get you familiar with the Sandbox environment:</div>
<div class="separator" style="clear: both; text-align: left;">
# java -version</div>
<div class="separator" style="clear: both; text-align: left;">
# ifconfig</div>
<div class="separator" style="clear: both; text-align: left;">
# uname -a</div>
<div class="separator" style="clear: both; text-align: left;">
# tail /etc/redhat-release</div>
<div class="separator" style="clear: both; text-align: left;">
# ps -ef | grep mysqld</div>
<div class="separator" style="clear: both; text-align: left;">
# ps -ef | grep postgres</div>
<div class="p1">
# PATH=$PATH:$JAVA_HOME/bin</div>
<div class="p2">
# jps</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
You can run the jps command to see key Hadoop processes running, such as the NameNode, Secondary NameNode, JobTracker, DataNode, TaskTracker, HMaster, RegionServer and AmbariServer.</div>
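As a sketch, you can loop over the expected daemon names and check each one against the jps output. This is meant to run inside the Sandbox VM; on a machine without the Sandbox, every daemon simply reports "not found". The daemon names are taken from my session; jps may print slight variants (e.g. HRegionServer), which the substring match tolerates.

```shell
#!/bin/sh
# Check which of the Sandbox's key Hadoop daemons are running.
# Run inside the VM; elsewhere each daemon reports "not found".
for d in NameNode SecondaryNameNode JobTracker DataNode TaskTracker \
         HMaster RegionServer AmbariServer; do
  # grep -q matches substrings, so "RegionServer" also matches "HRegionServer".
  if jps 2>/dev/null | grep -q "$d"; then
    echo "$d: running"
  else
    echo "$d: not found"
  fi
done
```

If a daemon you expect is missing, Ambari is the place to look at its logs and restart it.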
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJi9A4_MbgHmQky0yhSTwwYsRbkqS1A3oIRnHhjzIMjfyfHxhzGRM5F_OGjt1zzWckpxuBObd5Gd138XIfkRKWO7w4gFhHF_KwikPSKNUkuUn001wy4wUNDzJxAEZymwnOm1KwIuVDgcep/s1600/Screen+Shot+2013-06-22+at+11.45.36+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="387" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjJi9A4_MbgHmQky0yhSTwwYsRbkqS1A3oIRnHhjzIMjfyfHxhzGRM5F_OGjt1zzWckpxuBObd5Gd138XIfkRKWO7w4gFhHF_KwikPSKNUkuUn001wy4wUNDzJxAEZymwnOm1KwIuVDgcep/s640/Screen+Shot+2013-06-22+at+11.45.36+PM.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
If you cd to the /etc/hadoop/conf directory, you can see the Hadoop configuration files. Hint: core-site.xml, mapred-site.xml and hdfs-site.xml are good files to learn for the HDP admin certification test. :) </div>
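A small sketch for poking at those config files. It is meant to run inside the Sandbox VM (elsewhere it just reports the directory is missing); fs.default.name is the Hadoop 1.x property holding the NameNode URI, which is my assumption about what you'd want to look up first.

```shell
#!/bin/sh
# List the main Hadoop config files and show the NameNode URI.
# Run inside the Sandbox VM; elsewhere this reports the directory is missing.
CONF_DIR="/etc/hadoop/conf"

ls "$CONF_DIR"/*-site.xml 2>/dev/null || echo "$CONF_DIR not found (not on the sandbox?)"

# fs.default.name is the Hadoop 1.x property for the NameNode URI.
grep -A 1 "fs.default.name" "$CONF_DIR/core-site.xml" 2>/dev/null || true
```

Skimming core-site.xml, mapred-site.xml and hdfs-site.xml this way is a quick start on the properties the admin certification expects you to know.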
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiA4_SyJo4JRhwGfzTQZeeJFxKilovYLXNVTL8D5gnKYBz8MR1t3E6W_yJ4MYR1FEIKoXRB2SKYLlj6npxZq0hi4MWwNaO6ooLrQDxKDWgLJ-f7brR5zhyggofHu1axZ5Z_a5_nbAEr0HFI/s1600/Screen+Shot+2013-06-23+at+7.20.52+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="226" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiA4_SyJo4JRhwGfzTQZeeJFxKilovYLXNVTL8D5gnKYBz8MR1t3E6W_yJ4MYR1FEIKoXRB2SKYLlj6npxZq0hi4MWwNaO6ooLrQDxKDWgLJ-f7brR5zhyggofHu1axZ5Z_a5_nbAEr0HFI/s640/Screen+Shot+2013-06-23+at+7.20.52+AM.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
If you cd to the /usr/lib/hadoop/bin directory, you can see a number of the Hadoop admin scripts.</div>
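A similar sketch for the admin scripts directory. The path is the Sandbox 1.3 layout from my session; the HDFS commands in the comments are standard Hadoop 1.x CLI calls, left commented because they only do anything useful inside the VM.

```shell
#!/bin/sh
# Look at the Hadoop admin scripts shipped with the Sandbox.
# Run inside the VM; elsewhere this reports the directory is missing.
BIN_DIR="/usr/lib/hadoop/bin"

ls "$BIN_DIR" 2>/dev/null || echo "$BIN_DIR not found (not on the sandbox?)"

# Safe first commands once you are in the VM (Hadoop 1.x CLI):
#   hadoop fs -ls /           # browse the HDFS root
#   hadoop dfsadmin -report   # datanode and capacity summary
```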
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEy3MLxogPDa3Jt_u1DGv4oUczJFqZctHElrr7Tkzs5F-2XSPKNqLODub9X1LuT1MiRFRBGwUgIxS-XRaB8CBxS_gBJnmabj7fkC-XIwpYH-LJmCVuOQY-zLaDNJVu0rI7p50mqfIcMqeo/s1600/Screen+Shot+2013-06-23+at+8.08.58+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEy3MLxogPDa3Jt_u1DGv4oUczJFqZctHElrr7Tkzs5F-2XSPKNqLODub9X1LuT1MiRFRBGwUgIxS-XRaB8CBxS_gBJnmabj7fkC-XIwpYH-LJmCVuOQY-zLaDNJVu0rI7p50mqfIcMqeo/s640/Screen+Shot+2013-06-23+at+8.08.58+AM.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Most importantly, Have FUN! :)</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="p1">
<br /></div>
</div>
<b>Rocking Hadoop Summit 2013 in San Jose</b> (posted 2013-06-22)<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhr5HpuauTFwf59QckBHa9oceGqPfEDXXp3GO0HgJ_ZLqn3P9Gkz5qfTuWMdFGTNGnU8xmSahmIl4aHN0ounsdFMsjANe9bBFbknFOttyNtX-3XI03UE6dF21bP_S2lIq_hRM5iz52WaNmf/s1600/HadoopSummitIcon.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="156" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhr5HpuauTFwf59QckBHa9oceGqPfEDXXp3GO0HgJ_ZLqn3P9Gkz5qfTuWMdFGTNGnU8xmSahmIl4aHN0ounsdFMsjANe9bBFbknFOttyNtX-3XI03UE6dF21bP_S2lIq_hRM5iz52WaNmf/s320/HadoopSummitIcon.jpg" width="320" /></a></div>
<div style="text-align: justify;">
I'm really looking forward to presenting at Hadoop Summit again. Presenting at Hadoop Summit in Amsterdam was awesome, and San Jose is looking like the best ever. I'll be helping get Summit off to a great start with "Apache Hadoop Essentials: A Technical Understanding for Business Users" and then closing the conference with "A Reference Architecture for ETL 2.0". You may even see me at the Dev Cafe giving tours around the Hadoop Sandbox and Savanna. Here are two of the sessions I will be presenting:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b><span style="color: blue;">A Technical Understanding for Business Users</span></b> - Joining me will be Manish Gupta("The Wizard of Hadoop", or affectionally known as "Manish Hadoopta" because he can play Hadoop like a piano and can make Hadoop magic. </div>
<div style="text-align: justify;">
<br /></div>
<div class="p2">
<b>Abstract:</b></div>
<div class="p3">
This fast-paced one-day course will provide attendees with a technical overview of Apache Hadoop. Discussions will include understanding Hadoop from a data perspective, design strategies, data architecture, core Hadoop fundamentals, data ingestion options and an introduction to Hadoop 2.0. Hands-on labs will give business users a deeper understanding of Apache Hadoop using real world use cases to help provide the understanding of the power of Hadoop. We will be using the new Hortonworks Sandbox 1.3. The Hortonworks Sandbox is one of the best ways for enthusiasts new to Hadoop to get started. The <a href="http://hortonworks.com/products/hortonworks-sandbox/?utm_source=google&utm_medium=ppc&utm_campaign=Hortonworks_Sandbox_Search&utm_content=textad&utm_keyword=hortonworks%20sandbox">Hortonworks Sandbox</a>:</div>
<div class="p3">
</div>
<ul>
<li>Uses the Hortonworks Data Platform 1.3</li>
<li>See SQL "IN" Hadoop with Apache Hive 0.11, offering 50x improvement in performance for queries.</li>
<li>Learn Ambari the management interface of choice for HDP and OpenStack (Savanna).</li>
<li>Available with a VMware, Virtualbox or Hyper-V virtual machine. </li>
<li>A great way for someone to start learning how to work with a Hadoop cluster.</li>
<li>Lots of excellent tutorials, including:</li>
<ul>
<li>Hello Hadoop World</li>
<li>HCatalog, basic Pig and Hive commands</li>
<li>Using Excel 2013 to Analyze Hadoop Data</li>
<li>Data Processing with Hive</li>
<li>Loading Data into Hadoop</li>
<li>Visualize Website Clickstream Data</li>
</ul>
</ul>
<br />
<div style="text-align: justify;">
<b><span style="color: blue;">A Reference Architecture for ETL 2.0</span></b> - Presenting with George Vetticaden (Hortonworks Solution Architect), we will be bringing the "Power of George" to Hadoop Summit. :) ETL is such a big part of successful Hadoop implementations, George and I thought we'd help wrap the conference with some best practices, words of wisdom and reference architectures around Hadoop ETL. </div>
<br />
<b>Abstract:</b><br />
<br />
<div class="p1" style="text-align: justify;">
More and more organizations are moving their ETL workloads to a Hadoop-based ELT grid architecture. Hadoop's inherent capabilities, especially its ability to do late binding, address some of the key challenges of traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations and lessons around ETL for Hadoop. We will cover the pros and cons of different extract and load strategies, the best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, the advantages of different ways of exchanging data, and leveraging Hadoop as a data integration layer. This is an extremely popular topic around ETL and Hadoop.</div>