Tuesday, September 24, 2013

Oracle 12c Creating Large Demand for New Technical Skills

Oracle 12c Features and Products Will Create the Largest Transformation in Technical Skills in Oracle's History
The Oracle Open World conference is showing Oracle technical and business people that there is going to be a tremendous demand for new skill sets. It's not just the new announcements; some of Oracle's existing products have matured to a new level. So get ready: there is going to be tremendous opportunity for those who build their technical skills to match the demand hitting the Oracle ecosystem this week.

Oracle Open World Highlights
Here are some of the areas that are going to require new and improved skill sets.
  • Big Data, Hadoop and Analytics
    • Oracle is promoting the Oracle Big Data Appliance, highlighting an engineered, packaged system for Hadoop.
  • Oracle and Microsoft: The Cloud OS
    • This is going to create a big demand for the skills to deploy DBaaS (database as a service). 
    • You can take your licenses and move them from on-premise to the Cloud on Windows Azure.
    • Java fully supported in Windows Azure.
    • Oracle license mobility to Windows Azure.
    • Oracle to offer Oracle Linux on Windows Azure.
    • Oracle and Microsoft building common cloud platform, the Cloud OS.
    • Oracle software on Windows Server Hyper-V and Windows Azure.
    • Message to ISVs: get going on this.
  • Oracle Database as a Service
    • Monthly subscription pricing, pay as you go.
    • Oracle manages the database for you.
      • Quarterly patching and upgrades with SLAs.
      • Automated backup and point-in-time recovery
  • Java as a Service in Oracle Cloud
    • Dedicated WebLogic cluster(s) on the compute service.
    • Oracle backs up, patches and manages WebLogic.
    • Full WLST, JMS, root access, and EM WebLogic Control.
    • Monthly subscription pricing.
    • Runs any Java EE application.
    • Elastic compute and storage.
  • Oracle Multitenant: 12c Pluggable Databases.
    • Oracle is abstracting an Oracle database into a software container within the database server itself. This achieves many of the benefits of virtualization by virtualizing Oracle as a software container rather than inside a VM (a short command sketch follows this list). Benefits:
      • High consolidation density.
      • Rapid provisioning and cloning using SQL.  A pluggable database is a software container that can move to a different database server.
      • New paradigms for rapid patching and upgrades.
      • Can manage multiple databases as one.
  • Oracle Database Backup, Logging and Recovery Appliance 
    • This Oracle database appliance performs backups by creating an initial backup and then maintaining snapshots of the changes. A new RMAN feature (incremental-forever) takes periodic incremental snapshots and ships them to the storage-equipped appliance, where the backups are saved.
  • Oracle Database 12c In-Memory Database
    • M6-32 Big Memory Machine with 32 TB of DRAM.
    • 32 SPARC M6 chips, 1,024 memory DIMMs.
    • 12 cores per processor, 96 threads per processor.
    • CPUs communicate over a 384-port silicon switching network with 3 TB/s of bandwidth.
    • 3 TB/second system bandwidth.
    • 1.4 TB/second memory bandwidth.
    • 1 TB/second I/O bandwidth.
    • 100x faster queries.
    • 2x increase in transaction processing rates.
    • Analytics run faster on the column format, which is fast at accessing a few columns across many rows.
    • Transactions run faster on the row format, which is faster at processing a few rows with many columns.
    • Oracle 12c stores data in both formats simultaneously: a dual-format in-memory database.
    • Scan billions of rows per second per CPU core.
    • The In-Memory column store replaces analytic indexes.
    • Configure memory capacity (see the sketch after this list):
      • inmemory_size = XXX GB
      • Alter table | partition ... inmemory;
    • Transparent to SQL and applications.
    • Scale-Out In-Memory Database to any size.
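To make the items above more concrete, here is a minimal, hedged command sketch covering three of the 12c capabilities listed: an incremental-forever style RMAN cycle, pluggable database provisioning and cloning, and the in-memory column store settings. The names (salespdb, devpdb, sales, p2013), the tag and the 64G size are placeholders I made up for illustration, and the in-memory syntax follows what Oracle described at OpenWorld; check the documentation for your release before running any of it.

-- Incrementally updated backups: the classic RMAN pattern that the appliance's incremental-forever model resembles.
RMAN> BACKUP INCREMENTAL LEVEL 1 FOR RECOVER OF COPY WITH TAG 'incr_fwd' DATABASE;
RMAN> RECOVER COPY OF DATABASE WITH TAG 'incr_fwd';

-- Multitenant: create, clone and unplug a pluggable database with plain SQL (assumes Oracle Managed Files).
SQL> CREATE PLUGGABLE DATABASE salespdb ADMIN USER pdbadmin IDENTIFIED BY secret;
SQL> ALTER PLUGGABLE DATABASE salespdb OPEN READ ONLY;
SQL> CREATE PLUGGABLE DATABASE devpdb FROM salespdb;
SQL> ALTER PLUGGABLE DATABASE salespdb CLOSE IMMEDIATE;
SQL> ALTER PLUGGABLE DATABASE salespdb UNPLUG INTO '/tmp/salespdb.xml';

-- In-memory column store: size the pool (takes effect after a restart), then mark objects in-memory.
SQL> ALTER SYSTEM SET inmemory_size = 64G SCOPE=SPFILE;
SQL> ALTER TABLE sales INMEMORY;
SQL> ALTER TABLE sales MODIFY PARTITION p2013 INMEMORY;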



Friday, September 20, 2013

Enterprise Data Movement at OOW

Our Enterprise Data Movement presentation (UGF9722) on Sunday, September 22 at 12:30 p.m. at Oracle Open World 2013 is going to be a key presentation that anyone who works with Oracle data should attend.

Big Data is going to touch every Oracle DBA, BI, EDW and analytics team member. Top DBAs in the industry moved into Oracle RAC and then Oracle Exadata platforms. You are now seeing the top Oracle leaders in the industry moving into Hadoop. The reason is that they see the evolution of data and understand how strategic Big Data platforms have become for every customer. The skills top DBAs already have, such as infrastructure knowledge, data architecture, bottleneck performance tuning, large-scale data ingestion, platform design, and working with large data systems, are all foundational skills for Hadoop.

I am joined by Michael Schrader, an internationally recognized data architect and BI strategist. Our primary goal is to provide vision and insight around Big Data transport strategies. The key information in this presentation is not in books. We will be sharing successful strategies and lessons learned in the field.

This detailed presentation will discuss important considerations for data movement across Oracle data platforms and Hadoop. Attendees will see how data flows between Oracle Database instances, Oracle data warehouses, Oracle Exadata, and Hadoop. Attendees will learn about enterprise data movement, data ingestion tools, a metadata strategy that lets you keep using all your existing tools, working with heterogeneous sources and transports, and additional considerations for leveraging data in a logical data warehouse. We will discuss Hadoop, Oracle and third-party tools and the role they play in enterprise data movement.

The agenda is:
  • The Role of Big Data and Hadoop
  • ETL Strategies and Reference Architecture
  • Common Processing Patterns
  • Logical Processing Tiers
  • Different Data Processing Dimensions
  • Best Practices for Processing Data
  • Connectors for Hadoop

Hope to see you there.

We've got very cool Hortonworks goodies for you.



Wednesday, August 28, 2013

VMWorld from a Hadoop Perspective

It's been an excellent week at VMWorld.  I've been focusing on virtualizing Hadoop and business critical applications.  My highlights:
  •  Announcing the book "Successfully Virtualizing Business Critical Oracle Databases on VMware", which I am writing with Charles Kim (ViscosityNA), Darryl Smith (EMC) and Steven Jones (VMware).
  • Will soon be announcing a new Hadoop book I will be writing.
  • Spent a lot of time this week with VMware big data engineers and experts.  Enjoyed the vExperts reception last night.  Had some great conversations around virtualizing Hadoop.
  • Presenting best practices on virtualizing Hadoop, Oracle and business critical applications.
  • Presenting on Virtualizing Mission Critical Oracle RAC with vSphere and vCOPs.  This presentation is going to show VMware admins how to deploy Oracle Database as a Service without DBAs.
VMware's goal with the Software Defined Data Center (SDDC) is to take customers to 100% virtualization. VMware's vSAN and NSX extend virtualization beyond compute to include the virtualization of networks and storage. VMware used to focus on taking customers from 70% virtualized to 75% or 80%; now the focus is on moving customers to 100% virtualized, which means aggressively virtualizing business critical applications (Oracle, SAP, etc.) and Hadoop. The SDDC is still a goal, and more software pieces have to be put in place to accomplish it.
  • VMware's vSAN supports virtual storage directly in the hypervisor. 
  • VMware's NSX is one of the biggest areas of interest.  NSX is very strategic for VMware because it allows the virtualization of the entire network stack.  NSX is to networking what ESXi is to virtualizing hardware resources.  With NSX, switching, routing, bridging and firewalls are all part of the hypervisor.  Here is the design pattern (source VMware). 



Additional Highlights

  • VMWorld has gotten a lot bigger with an estimated 22,500 attendees.
  • The large vendor area had a lot of energy. Splunk had the coolest t-shirts by far.
  • The Sharkk case seemed to be the coolest case for the iPad mini. I've seen a lot more iPad minis than iPads among the VMware jet set.

Hadoop at VMWorld
Hadoop presentations all start out discussing the benefits of Hadoop and use cases, and then VMware's strategy around virtualizing Hadoop (Serengeti and Big Data Extensions). Lots of use cases center on the volume of semi-structured and unstructured data. Example: a single GE jet engine produces 10 TB of data in an hour, roughly 90 petabytes per year.
  • GE is looking at having all their fridges, and all the machines they develop, call home for repairs and maintenance.
  • GE is focusing on early detection of faults, common model failures and product engineering support.
As you would expect, polling of the audiences showed almost no knowledge of Hadoop.

Key benefits of virtualizing Hadoop from presentations:
  •  Fast provisioning of data nodes.
  • Workload consolidation
  • High Availability
  • Auto elasticity and high resource utilization
  • True multi-tenancy
  • Promoting elastic compute by virtualizing data nodes.
  • Leveraging virtual networks.
  • Leveraging the ability to control noisy neighbors with features like storage and network I/O control.
VMware vCenter now has a plugin for Hadoop (called Big Data Extensions).
Presenters emphasized that if you virtualize Hadoop, you should use Serengeti for deployments because Serengeti understands VMware vCenter.

VMware is calling the Hadoop Virtualization Extensions (HVE) their Big Data Extensions.

FedEx showed how they are using scale-out NAS with Hadoop.
  • Talked about how from their perspective if the network is fast enough, data locality is not really that important. 
  • Talked about using Isilon storage with Hadoop.
Identified Inc. did an overview of their Hadoop experience.
  • They started out using AWS. They were running 200 VMs and it was costing them about $40k a month. When they started running 24/7 the cost went up another $20-40k per month. They found that performance from AWS was very spiky.
  • They moved on-premises with Serengeti to reduce costs, into the SuperNAP in Las Vegas. They saved $20k a month by moving off of AWS.
  • They got their ROI within two months of getting off of AWS from a cost perspective.
  •  They decided to mix physical and virtual.  Virtualized all master servers and stayed physical with data nodes.
  •  They used a Fat Twin platform.
  • They use anti-affinity rules for master servers, especially the ZooKeeper and JournalNode VMs.
  • They used mixed storage: flash for the OS on the nodes, and local storage for the data nodes.
  • They are now exploring virtualizing their data nodes and separating compute from data, so TaskTrackers will be separate from DataNodes. They want elastic compute.
  • They do not have anyone that is a Hadoop administrator.  They have different developers rotate into the infrastructure team for 3-6 months.  Then when developers rotate back to development teams they keep their permissions and can manage their Hadoop clusters within the individual developer teams.
VMware announced vSphere 5.5, here are a few highlights:
  • With ESXi 5.5, the hypervisor supports up to 320 logical cores (5.1 supports 160 logical cores).
  • Up to 4TB of memory for an ESXi host (5.1 supports 2TB of memory).
  • Fault tolerance can now support up to four vCPUs. This means VMware will be pushing this method to achieve HA for Hadoop master servers, versus the new HA features in Hadoop 2.0.
  • NUMA nodes per host: 16 (was 8)
  • Things coming: auto-elastic Hadoop, future support for YARN, and future support for HBase.
  •  vSphere 5.5 now supports application high availability.  This supports application recovery within a VM.
  • Project Serengeti tools support Hadoop deployments. Not sure if this will be in the first release of 5.5 or not.
  • VMDK maximum size increased to 62 TB.
  •  Misc:
    • No change in the pricing of vSphere editions.
    • 4 new features: AppHA, Reliable Memory, Flash Read Cache and Big Data Extensions
    • Latency-sensitivity feature for applications like very high performance computing and stock trading apps
    • vSphere Hypervisor (free) has no physical memory limit anymore (was 32 GB)
    • PCI hotplug support for SSD
    • VMFS heap size improvements
    • 16 Gb end-to-end Fibre Channel: 16 Gb from host to switch and 16 Gb from switch to SAN
    • Support for 40 Gbps NICs
    • Enhanced IPv6 support
    • Enhancements for CPU C-states. This reduces power consumption.
    • Expanded vGPU support: in vSphere 5.1, VMware only supported NVIDIA GPUs.
    • Support for the Ivy Bridge-EP Xeon E5 v2 processors (Intel) and the Opteron 3300, 4300 and 6300 processors (Advanced Micro Devices).
    • The ability to vMotion a virtual machine between different GPU vendors is also supported. If hardware mode is enabled on the source host and a GPU does not exist on the destination host, vMotion will fail rather than attempt the migration.
    • Added Microsoft Windows Server 2012 guest clustering support.
    • AHCI Controller Support which enables Mac OS guests to use IDE CDROM drives. AHCI is an operating mode for SATA.


Friday, August 16, 2013

Introducing the BigDataOverEasy LinkedIn Group

I've been getting a lot of pings about the #BigDataOverEasy group that just started, and I wanted to introduce it to you. I've been involved with leading-edge companies such as MySQL, Sun Microsystems, Oracle, VMware and Hortonworks, as well as being a part of their user communities. I've also been directly involved in leadership around strategic councils and beta leadership programs, as well as recognized industry expert programs such as Oracle ACE, the Sun Ambassadorship and VMware vExpert.

As Big Data and Hadoop generate more momentum in the industry I want to be involved in a group that is focused on skill development, sharing and exchange, reference architectures, best practices, industry directions and  mind share.   So I see tremendous value around having a group that can span companies, user communities and regions for this knowledge sharing and being able to communicate with a broader audience.

The name BigDataOverEasy explains the goal of the group: to be a group focused on Big Data and related technologies such as Hadoop, one that makes it easier to learn and grow without having to go into the coal mines to extract knowledge or reinvent the wheel. I'd like this group to be for data architects, platform architects, administrators, business intelligence and analytics teams, and developers who want to share knowledge and experiences and have a group of peers they can exchange ideas with.

This group will not support any recruiters, sales or spamming. Here are some of the ecosystems I will reach out to in order to find like-minded big data and Hadoop enthusiasts.
  • Hortonworks Data Platform
  • Rackspace
  • Microsoft
  • Teradata
  • Oracle
  • MySQL
  • Red Hat
We decided to call this Special Interest Group BigDataOverEasy even though it will focus on Hadoop because we want to emphasize that Hadoop is about data.

Wednesday, August 14, 2013

Hadoop Reference Architectures

Some of the initial key factors for success when building a Hadoop cluster come down to building a solid foundation. This includes:
  • Selecting the right hardware. In working with a hardware vendor, make sure you are working from their hardware compatibility list (HCL) and making the right decisions for your cluster. Commodity hardware does not have to be generic. You can select commodity hardware that is customized for running Hadoop.
  • Build an enterprise OS platform.  Whether you are using Linux or Windows, customize and tune your operating system using enterprise best practices and standards for Hadoop.
  • Design your Hadoop clusters using reference architectures.  Don't reinvent the wheel unless you have to.  Vendors are publishing Hadoop reference architectures that give you a great starting place.
I've included a few HDP reference architectures to give you a feel for what a Hadoop platform may look like.
I'd like to keep building this reference architecture list.

Monday, August 12, 2013

A Changing Era for Oracle DBAs

Throughout my career in IT I've always tried to stay on the leading edge of technology.  During my journeys I have seen three evolutions of Oracle DBAs and I now see we are about to enter a fourth era for Oracle DBAs.   The eras up to this point have been:
  1. In The Land of The Blind, The One Eyed Man is King - This was during the early releases of Oracle from Version 4 through Version 6.  Relational databases and Oracle were relatively new to the industry so if you had any common sense about IT,  Oracle technology or relational database concepts you were worth your weight in silver because Oracle was a growing technology and market.
  2. The Speeds and Feeds DBA - This was the time between Oracle 7 and Oracle 10g.  Very technical DBAs that understood the internals of Oracle, infrastructure, tuning, backup and recovery, RAC, Streams, Data Guard were not only hard to find but were worth their weight in gold.  This in-depth level of knowledge came from reading the source code and/or working endlessly to learn the internals of how to maximize Oracle in the infrastructure.  These very technical DBAs made great careers out of their knowledge.  I think of this as the golden age of DBAs.  
  3. The G DBAs - This is the time of the Google and GUI DBAs. These DBAs are a product of the new evolving Oracle environment and the successes of the Oracle product. Oracle software can now identify problems, fix those problems and perform very detailed operations at the click of a button. There are also tremendous numbers of books, whitepapers, blog posts and tutorials that can teach someone a tremendous amount about Oracle in a short time that used to take years of experience and effort to acquire. So now you have a large percentage of these GUI and Google DBAs that can perform work that previously required highly skilled experts with years of experience. One thing about the GUI and Google DBAs is they are much easier to replace and outsource. Don't get me wrong, the Speeds and Feeds DBAs are still needed, but not as much, and there are fewer of them around every year.
  4. The Platform DBA -  If the Speeds and Feeds era was the golden age, then this is the platinum age.  The top DBAs in the world today are not only Oracle experts but also infrastructure experts.   They are also recognized experts in areas such as architecture, design, storage, networking, applications and business.  When you look at the top Oracle RAC, Exadata, ASM, GoldenGate, MAA  DBAs they are not only experts in a specific Oracle domain but also in the environment surrounding Oracle.  So who are the Oracle DBAs that are going to dominate the market place and be worth their weight in platinum in the next few years?  It's going to be the platform DBAs.  The Platform DBAs are Oracle experts who also understand areas such as the Cloud, Enterprise Virtualization Platforms, Big Data, Enterprise Data Management and the business.
In every company I meet with, I see that structured data is going to be by far the smallest percentage of the data that companies need to manage. Unstructured and semi-structured data is going to dwarf structured data (traditional databases and warehouses). This is a consistent industry perspective and is reiterated by industry analysts. DBAs need to be data experts, not just Oracle experts. In the next few years, we will see:
  • Oracle as a service will increase and we will see more movement into the cloud.
  • Oracle environments will be interacting more with Big Data environments.
  • Oracle tier one environments will increasingly be virtualized.
  • Oracle business applications will continue to dominate the market, and application DBAs who understand the business applications will continue to increase in demand.
Every new era in Oracle and the IT industry creates tremendous opportunity for those with the drive and energy to see and seize those opportunities. I look forward to interacting and sharing knowledge, experiences and wisdom as we move into this new era.

Saturday, June 29, 2013

Weaknesses in Traditional Data Platforms

Everyone understands that Hadoop brings high performance commercial computing to organizations using relatively low cost commodity storage. What is accelerating the move to Hadoop is the weakness of traditional relational and data warehouse platforms in meeting today's business needs. Some key weaknesses of traditional platforms include:

  • The lack of late binding: schemas must be defined before data can be used, which greatly increases the latency between receiving new data sources and deriving business value from that data.
  • The significantly high cost and complexity of SAN storage.  This high cost forces organizations to aggregate and remove a lot of data that contains high business value.  Important details and information are getting thrown out or hidden in aggregated data.
  • The complexity of working with semi-structured and unstructured data.
  • The incredible cost, complexity and ramifications of maintaining database administrator, storage and networking teams in traditional platforms. There are lots of silos of expertise and software required in traditional environments that have dramatic effects on agility and cost. It's gotten to the point that vendors are now delivering extremely expensive engineered systems to deal with the complexity of these silos. These expensive engineered systems require even more specialized expertise to maintain and make customers ever more dependent on the vendors. What's funny is you hear the old phrase "one throat to choke," but it's the customer who's choking on the cost. With Hadoop's self-healing and fault tolerance, a small team can manage thousands of servers. A single Hadoop administrator can manage 1,000 - 3,000 nodes, all on relatively inexpensive commodity hardware.
While the above highlights the need for Hadoop, it's also important to understand that traditional relational databases and data warehouses still have the same role and are still needed. A relational database provides a completely different function than a Hadoop cluster. Also, a company is not going to throw out all their existing data warehouses or the expertise and reporting they've built around them. Hadoop today is usually used to add new capabilities to an enterprise data environment, not to replace existing platforms.

The old line of no one ever gets fired for buying IBM is a thing of the past with Hadoop.  An entire organization may go under if your competition is effectively using big data and you are not.   Hadoop is  the most disruptive technology since the .com days.  

Friday, June 28, 2013

Hadoop Summit 2013 in San Jose

It has been a privilege to present at the Hadoop Summits this year in Amsterdam and San Jose.   This week was one of the best networking weeks I've ever had at a conference.  Great seeing all my Oracle, VMware, Rackspace and MySQL friends as well as meeting a lot of new friends in the Hadoop ecosystem.

Key takeaways:
  • Hadoop's disruption of the IT industry is accelerating.
  • Hadoop 2.0 will significantly increase enterprise adoption.
  • YARN is the distributing operating system of the future.
  • Incredible success stories of the ROI around Hadoop.
  • Open source community is about innovation, community and sharing.
  • Lots of analytics software competing to run on Hadoop.  This will be the big battleground.
  • Hortonworks reinforces its innovation and leadership in defining the roadmap for Hadoop.
  • Hortonworks constantly demonstrated their platform expertise with Hadoop. 
  • Hadoop is a high performance commercial computing environment.
Two coolest things I liked at the conference:

  • An 8-node Raspberry Pi Hadoop cluster.
  • Creating a multi-node (VM) Hadoop cluster on your laptop using the Hortonworks Sandbox.


I was also able to barter for a Yahoo soccer ball (all it cost me was a Hortonworks water bottle). :)

Wednesday, June 26, 2013

Hadoop - It's All About The Data

A key point to understand about Hadoop is that it's all about the data.  Don't lose focus.  It's easy to get hung up on Hive, Pig, HBase, HCatalog and lose sight of designing the right data architecture.  Also, if you have a strong background in data warehouse design, BI, analytics, etc.  all those skills are transferable to Hadoop.  Hadoop just takes data warehousing to new levels of scalability and agility with reduction of business latency while working with data sets ranging from structured to unstructured data.  Hadoop 2.0 and YARN are going to move Hadoop deep into the enterprise and allow organizations to make faster and more accurate business decisions.  The ROI of Hadoop is multiple factors higher than the traditional data warehouse.  Companies should be extremely nervous about being out Hadooped by their competition.

Newbies often look at Hadoop with wide eyes instead of realizing that Hadoop has a lot of components they already understand, such as clustering, distributed file systems, parallel processing, and batch and stream processing.

A few key success factors for a Hadoop project are:
  • Start with a good data design using a scalable reference architecture.
  • Build successful analytical models that provide business value.
  • Be aggressive in reducing the latency between data hitting the disk and leveraging business value from that data. 
The ETL strategies and data set generation you use in your existing data warehouse are similar to what you are going to want to do in your Hadoop cluster. It's important to look at your data warehouse and understand how your enterprise data strategy is going to evolve with Hadoop now a part of the ecosystem.



"Hadoop cannot be an island, it must integrate with Enterprise Data Architecture". - HadoopSummit





"Apache Hadoop is a set of standard open-source software projects that provide a framework for using massive amounts of data across a distributed network." - Merv Adrian at Gartner Research


This is a sample Hadoop 1.x cluster so you can see the key software processes that make up Hadoop.  The good point of this diagram is that if you understand it you are probably worth another $20-30k.  :) 




YARN (Hadoop 2.0) is the distributed operating system of the future. YARN allows you to run multiple applications in Hadoop, all sharing common resource management. YARN is going to disrupt the data industry to a level not seen since the .com days.

A Hadoop cluster will usually have multiple data layers.

  • Batch Layer: Raw data is loaded into a data set that is immutable so it becomes your source of truth. Data scientists and analysts can start working with this data as soon as it hits the disk.  
  • Serving Layer: Just as in a traditional data warehouse, this data is often massaged, filtered and transformed into a data set that is easier to run analytics on. Unstructured and semi-structured data is put into a data set that is easier to work with. Metadata is then attached to this data layer using HCatalog so users can access the data in the HDFS files through abstract table definitions (see the sketch after this list).
  • Speed Layer: To optimize the data access and performance often additional data sets (views) are calculated to create a speed layer.  HBase can be used for this layer dependent on the requirements.
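As a minimal sketch of how the serving layer gets its abstract table definitions, here is a hypothetical Hive external table over a curated HDFS directory; the table name, columns and path are invented for illustration, and the HCatLoader class shown is the one used by HCatalog releases of this era (start Pig with pig -useHCatalog).

hive> CREATE EXTERNAL TABLE web_clicks (click_time STRING, user_id STRING, url STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION '/data/serving/web_clicks';
hive> SELECT url, COUNT(*) FROM web_clicks GROUP BY url;

-- Because the definition lives in HCatalog, Pig can read the same data without knowing the path or format.
grunt> clicks = LOAD 'web_clicks' USING org.apache.hcatalog.pig.HCatLoader();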


This diagram emphasizes two key points:
  • The different data layers you will have in your Hadoop cluster.
  • The importance of building your metadata layer (HCatalog).
With the massive scalability of Hadoop, you need to be able to automate as much as possible and manage the data in your cluster. This is where Falcon is going to play a key role. Falcon is a data lifecycle management framework that provides the data orchestration, disaster recovery and data retention you need to manage your data.

Hadoop Summit Keynote - San Jose, 2013

I wanted to share key thoughts from the keynote at Hadoop Summit 2013 in San Jose.

Merv Adrian at Gartner Research
"Apache Hadoop is a set of standard open-source software projects that provide a framework for using massive amounts of data across a distributed network."

Traditional IM vs. Big Data Style
  • Requirements based vs. opportunity oriented
  • Top-down design vs. bottom-up experimentation
  • Integration and reuse vs. immediate use
  • Technology consolidation vs. tool proliferation
  • World of DW and ECM vs. world of Hadoop
  • Competence centers vs. hackathons
  • Better decisions vs. better business
  • Commercial software vs. open source


Which Projects Are "Hadoop"? Minimum set from Apache website:
  • Apache HDFS
  • Apache MapReduce
  • Apache YARN
  • Other independent Apache projects: Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, ZooKeeper
Rich, Complex Set of Functional Choices
  • Ingest/Propagate
  • Describe, Develop
  • Compute, Search
  • Persist
Ingest/Propagate:
Apache Flume, Apache Kafka, Apache Sqoop, HDFS, NFS, Informatica HParser, DBMS vendor utilities, Talend, WebHDFS

Describe, Develop: 
Apache Crunch, Apache Hive, Apache Pig, Apache Tika, Cascading, Cloudera Hue, DataFu, Dataguise, IBM Jaql

Compute, Search:
Apache Blur, Apache Drill, Apache Giraph, Apache Hama, Apache Lucene, Apache MapReduce, Apache Solr, Cloudera Impala, HP HAVEn, IBM BigSQL, IBM InfoSphere Streams, HStreaming, Pivotal HAWQ, SQLstream, Storm, Teradata SQL-H

Persist:
Apache HDFS, IBM GPFS, Lustre, MapR Data Platform
Serialization:
Apache Avro, RCFile (and ORCFile), SequenceFile, Text, Trevni
DBMS: Apache Accumulo, Apache Cassandra, Apache HBase, Google Dremel
Monitor, Administer:
Apache Ambari, Apache Chukwa, Apache Falcon, Apache Oozie, Apache Whirr, Apache ZooKeeper, Cloudera Manager
Analytics, Machine Learning:
Apache Drill, Apache Hive, Apache Mahout, Datameer, IBM BigSheets, IBM BigSQL, Karmasphere, Microsoft Excel, Platfora, Revolution Analytics RHadoop, SAS, Skytree

Leading pure plays: Cloudera, Hortonworks, MapR
Others: DataStax, LucidWorks, RainStor, Sqrrl, WANdisco, Zettaset

Hadoop has moved to the next stage with Apache Hadoop 2.0.

What's Next for Hadoop?
  • Search
  • Advanced prebuilt analytic functions
  • Cluster, appliance or cloud?
  • Virtualization
  • Graph processing
What's Still Needed in Hadoop Ecosystem?
  • Security
  • Data Warehousing Tools
  • Governance
  • Skills
  • Subproject Optimization
  • Distributed Optimization
Recommendations
  • Audit your data - find "dark data" and map it to business opportunities to identify pilot projects
  • Familiarize yourself with the capabilities of available Hadoop distributions. 
  • Build skill and recruit it.
Is Hadoop starting to happen in the cloud? Amazon has created 5.5 million elastic Hadoop instances in the last year.

Ganesha -  God of Success

Shaun Connolly - Hortonworks
Key requirement of a "Data Lake": store ALL DATA in one place and interact with that data in multiple ways.

YARN is a distributed operating system for processing.

YARN Takes Hadoop Beyond Batch
Applications run IN Hadoop versus On Hadoop with predictable performance and Quality of Service

Mohit Saxena - InMobi
Good insights into managing and processing data at scale across data centers.
InMobi contributed Apache Falcon to address Hadoop data lifecycle management.

Scott Gnau - President, Teradata Labs
Good discussion on mission critical process and applications.

Bruno Fernandez - Yahoo
Search was their first use case for Hadoop. One of the key drivers for them is devices connected to the cloud.



Sunday, June 23, 2013

Using the HDP Sandbox to Learn Sqoop

Once you have your HDP Sandbox up and running, you can use Sqoop to move data between your Hadoop cluster and your relational database.  Your Hadoop Hive/HCatalog environment uses a MySQL database server for storing metadata, so you can use the built-in MySQL database server to play with Sqoop.  In real life you would not use this specific MySQL database server to play, but I'm going to for this demo.  Credit for this demo goes to Tom Hanlon (a longtime friend and great resource in the Hadoop space).

Be aware that Sqoop is not atomic. After a data load, it is a good practice to do a record count on both sides and make sure they match.

Log into your HDP Sandbox as root to bring up a terminal window (instructions are provided in the Sandbox). The loopback address 127.0.0.1 is a non-routable IP address that refers to the local host.

Demo One:  Move data from a relational database into your Hadoop cluster.  Then use HDFS commands to verify the files reside in your Hadoop cluster on HDFS.
Connect to MySQL using the mysql client, create a database and build a simple table.
# mysql
mysql>   CREATE  DATABASE  sqoopdb;
mysql>   USE  sqoopdb;
mysql>   CREATE  TABLE  mytab (id int not null auto_increment primary key, name varchar(20));
mysql>   INSERT  INTO  mytab  VALUES (null, 'Tom');
mysql>  INSERT  INTO  mytab  VALUES (null, 'George');
mysql>   INSERT  INTO  mytab  VALUES (null, 'Barry');
mysql>   INSERT  INTO  mytab  VALUES (null, 'Mark');
mysql>   GRANT  ALL  ON  sqoopdb.* to root@localhost;
mysql>   GRANT  ALL  ON  sqoopdb.* to root@'%';
mysql>  exit;

-- Sqoop command requires permission to access the database as well as HDFS.
# su - hdfs
$ sqoop import --connect jdbc:mysql://127.0.0.1/sqoopdb --username root --direct --table mytab --m 1

$ hadoop fs -lsr mytab
$ hadoop fs -cat mytab/part-m-00000

-- Demo Two:  Load data from a relational database into Hive.  Then query the data using Hive.
# mysql
mysql>   USE  sqoopdb;
mysql>   CREATE  TABLE newtab (id int not null auto_increment primary key, name varchar(20));
mysql>   INSERT  INTO newtab VALUES (null, 'Tom');
mysql>   INSERT  INTO newtab VALUES (null, 'George');
mysql>  INSERT  INTO newtab VALUES (null, 'Barry');
mysql>   INSERT  INTO newtab VALUES (null, 'Mark');
mysql>   exit;

# su - hdfs
$ sqoop  import   --connect  jdbc:mysql://127.0.0.1/sqoopdb   --username  root   --table  newtab \
 --direct   --m  1 --hive-import

-- Hive has a command line interface for working with the data. Using the Hive metadata, Hive users
-- can access the data using a SQL interface. The person running the hive command must have read
-- access in HDFS.
$ hive
hive>  show tables;
hive>  SELECT   *   FROM  newtab;
hive>  exit;
$


-- The physical files will be stored in the HDFS directory location defined by the following property
-- in the /etc/hive/conf/hive-site.xml file:
--   <property>
--     <name>hive.metastore.warehouse.dir</name>
--     <value>/apps/hive/warehouse</value>
--   </property>

-- Look at the data location in HDFS.
$ hadoop fs -lsr /apps/hive/warehouse/newtab

-- Look at the data contents.
$ hadoop fs -cat /apps/hive/warehouse/newtab/part-m-00000


-- You can use the following help commands along with the documentation to do a lot of examples moving data between your Hadoop cluster and a relational database using Sqoop.
$  sqoop help
$  sqoop help import
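As one more hedged example, you can push results from HDFS back into MySQL with sqoop export. The target table (mytab_copy here, a made-up name) must already exist in MySQL with a matching structure, and the export directory is the one the earlier import created under the hdfs user's home directory.

$ sqoop export --connect jdbc:mysql://127.0.0.1/sqoopdb --username root \
    --table mytab_copy --export-dir /user/hdfs/mytab --m 1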






The HDP Sandbox is a Great Way to Start Learning Hadoop

Use the HDP Sandbox to Develop Your Hadoop Admin and Development Skills
Unless you have your own Hadoop cluster to play with, I strongly recommend you get the HDP Sandbox up and running on your laptop. What's nice about the HDP Sandbox is that it is 100% open source. The features and frameworks are free; you're not learning from some vendor's proprietary Hadoop version that has features they will charge you for. With the Sandbox and HDP you are learning Hadoop from a true open source perspective.

The Sandbox contains:
  • A fully functional Hadoop cluster running Ambari to play with.  You can run examples and sample code. Being able to use the HDP Sandbox is a great way to get hands on practice as you are learning.
  • Your choice of Type 2 Hypervisors (VMware, VirtualBox or Hyper-V) to install Hadoop on.
  • Hadoop is running on Centos 6.4 and using Java 1.6.0_24 (in VMware VM).
  • MySQL and Postgres database servers for the Hadoop cluster.
  • Ability to log in as root in the Centos OS and have command line access to your Hadoop cluster.
  • Ambari, the management and monitoring tool for Apache Hadoop and OpenStack.
  • Hue is included in the HDP Sandbox.  Hue is a GUI containing:
    • Query editors for Hive, Pig and HCatalog
    • File Browser for HDFS,
    • Job Designer/Browser for MapReduce
    • Oozie editor/dashboard
    • Pig, HBase and Bash shells
    • A collection of Hadoop APIs.
With the Hadoop Sandbox you can:
  • Point and click and run through the tutorials and videos.  Hit the Update button to get the latest tutorials.
  • Use Ambari to manage and monitor your Hadoop cluster.
  • Use the Linux bash shell to log into Centos as root and get command line access to your Hadoop environment.
    • Run a jps command and see all the master servers, data nodes and HBase processes running in your Hadoop cluster.  
    • At the Linux prompt get access to your configuration files and administration scripts. 
  • Use the Hue GUI to run pig, hive, hcatalog commands.
  • Download tools like Datameer and Talend and access your Hadoop cluster from popular tools in the ecosystem.
  • Download data from the Internet and practice data ingestion into your Hadoop cluster.
  • Use Sqoop and the MySQL database that is running to practice moving data between a relational database and a Hadoop cluster. (Reminder: This MySQL database is a meta-database for your Hadoop cluster, so be careful playing with it. In real life you would not use a meta-database to play with; you'd create a separate MySQL database server.)
  • If using VMware Fusion you can create snapshots of your VM, so you can always roll back.
Downloading the HDP Sandbox and Working with an OVA File

The number one gotcha when installing the HDP Sandbox on a laptop is that virtualization is not turned on in the BIOS. If you have problems, this is the first thing to check.

I chose the VMware VM, which downloads the Hortonworks+Sandbox+1.3+VMware+RC6.ova file. An OVA (open virtual appliance) is a single-file distribution of an OVF package stored in the TAR format. OVF (Open Virtualization Format) is a portable package format created to standardize the deployment of virtual appliances. An OVF package contains a number of files: a descriptor file, optional manifest and certificate files, optional disk images, and optional resource files (e.g. ISOs). The optional disk image files can be VMware VMDKs or any other supported disk image format.
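Since an OVA is just a TAR archive, you can list its contents before importing it; a quick sketch (the member file names will vary):

$ tar -tvf Hortonworks+Sandbox+1.3+VMware+RC6.ova

You should see the .ovf descriptor, an optional .mf manifest and one or more .vmdk disk images.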

VMware Fusion converts the virtual machine from OVF format to VMware runtime (.vmx) format.
I went to the VMware Fusion menu bar and selected File - Import and imported the OVA file. Fusion performs OVF specification conformance and virtual hardware compliance checks.  Once complete you can start the VM.

When you start the VM, if you are asked to upgrade the VM, choose yes. You'll then be prompted to initiate your Hortonworks Sandbox session and to open a browser and enter a URL address like
http://172.16.168.128. This will take you to a registration page. When you finish registration, it brings up the Sandbox.
  • Instructions are provided for how to start Ambari (management tool), how to login to the VM as root and how to set up your hosts file.
  • Instructions are provided on how to get your cursor back from the VM.
In summary, you download the Sandbox VM file, import it, start the VM, and the instructions will lead you down the Hadoop yellow brick road. When you start the VM, the initial screen will show you the URL for bringing up the management interface and also how to log in as root in a terminal window.
Accessing the Ambari management interface:
  • The browser URL was http://172.16.168.128 (yours may be different) to get to Videos, Tutorials, Sandbox and Ambari setup instructions.
  • Running on Mac OS X, hit   Ctrl-Alt-F5 to get a root terminal window. Log in as root/hadoop.
  • Make sure you know how to get out of the VM window.  On Mac it is Ctrl-Alt-F5.
  • Get access to Ambari interface with port 8080, i.e. http://172.16.168.128:8080.


Getting Started with the HDP Sandbox
Start with the following steps:
  • Get Ambari up and running.  Follow all the instructions.
  • Bring up Hue.  Look at all the interfaces and shells you have access to.
  • Log in as root using a terminal interface. In Sandbox 1.3 service accounts are root/hadoop for superuser and hue/hadoop for ordinary user.
  • Watch the videos.
  • Run through the tutorials. 
Here is the Sandbox welcome screen.  You are now walking into the light of Big Data and Hadoop.  :) 


A few commands to get you familiar with the Sandbox environment:
# java -version
# ifconfig
# uname -a
# tail /etc/redhat-release
# ps -ef | grep mysqld
# ps -ef | grep postgres
# PATH=$PATH:$JAVA_HOME/bin
# jps

You can run a jps command and see key Hadoop processes running such as the NameNode, Secondary NameNode, JobTracker, DataNode, TaskTracker, HMaster, RegionServer and AmbariServer.


If you cd to the /etc/hadoop/conf directory, you can see the Hadoop configuration files.  Hint: core-site.xml, mapred-site.xml and hdfs-site.xml are good files to learn for the HDP admin certification test.  :)  
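For example, you can pull a property straight out of a config file from the shell. A small sketch: fs.default.name is the Hadoop 1.x property that holds the NameNode URI (your value will differ).

# cd /etc/hadoop/conf
# ls *.xml
# grep -A 1 "fs.default.name" core-site.xml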


If you cd to the /usr/lib/hadoop/bin  directory, you can see a number of the Hadoop admin scripts.




Most importantly, Have FUN!  :)


Saturday, June 22, 2013

Rocking Hadoop Summit 2013 in San Jose

I'm really looking forward to presenting at Hadoop Summit again. Presenting at Hadoop Summit in Amsterdam was awesome, and San Jose is looking like the best ever. I'll be helping get Summit off to a great start with "Apache Hadoop Essentials: A Technical Understanding for Business Users" and then closing the conference with "A Reference Architecture for ETL 2.0". You may even see me at the Dev Cafe giving tours of the Hadoop Sandbox and Savanna. Here are two of the workshops/presentations I will be giving:

A Technical Understanding for Business Users - Joining me will be Manish Gupta ("The Wizard of Hadoop," affectionately known as "Manish Hadoopta" because he can play Hadoop like a piano and make Hadoop magic).

Abstract:
This fast-paced one-day course will provide attendees with a technical overview of Apache Hadoop. Discussions will include understanding Hadoop from a data perspective, design strategies, data architecture, core Hadoop fundamentals, data ingestion options and an introduction to Hadoop 2.0. Hands-on labs will give business users a deeper understanding of Apache Hadoop, using real-world use cases to convey the power of Hadoop. We will be using the new Hortonworks Sandbox 1.3. The Hortonworks Sandbox is one of the best ways for enthusiasts new to Hadoop to get started. The Hortonworks Sandbox:
  • Uses the Hortonworks Data Platform 1.3
  • See SQL "IN" Hadoop with Apache Hive 0.11, offering 50x improvement in performance for queries.
  • Learn Ambari the management interface of choice for HDP and OpenStack (Savanna).
  • Available with a VMware, Virtualbox or Hyper-V virtual machine. 
  • A great way for someone to start learning how to work with a Hadoop cluster.
  • Lots of excellent tutorials, including:
    • Hello Hadoop World
    • HCatalog, basic Pig and Hive commands
    • Using Excel 2013 to Analyze Hadoop Data
    • Data Processing with Hive
    • Loading Data into Hadoop
    • Visualize Website Clickstream Data

A Reference Architecture for ETL 2.0 - Presenting with George Vetticaden (Hortonworks Solution Architect), we will be bringing the "Power of George" to Hadoop Summit. :) ETL is such a big part of successful Hadoop implementations that George and I thought we'd help wrap the conference with some best practices, words of wisdom and reference architectures around Hadoop ETL.

Abstract:

More and more organizations are moving their ETL workloads to a Hadoop-based ELT grid architecture. Hadoop's inherent capabilities, especially its ability to do late binding, address some of the key challenges with traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations and lessons around ETL for Hadoop. Areas covered include the pros and cons of different extract and load strategies, the best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, the advantages of different ways of exchanging data, and leveraging Hadoop as a data integration layer. This is an extremely popular presentation around ETL and Hadoop.

Tuesday, April 9, 2013

Key Links for Virtualizing Oracle


Key Links for Best Practices
A lot of the best practices for virtualizing Oracle apply to hypervisors in general, whether it's VMware or Oracle VM. Obviously, there are specific best practices when it comes to features that are specific to either of the products. For example, we need to create separate interfaces on the VM host (ESXi host or Oracle VM Server) to segment off management-related network traffic (i.e. the traffic to maintain a network heartbeat or to perform live migrations (vMotion in VMware)). At a minimum, each physical host needs to have 4 physical network interface cards; 6 network interface cards are highly recommended. We will create bonded network interfaces for the following network workloads:
1.      2 NICs bonded for the public network for all Oracle database related traffic
2.      2 NICs bonded for the Oracle private network between the RAC cluster nodes
3.      2 NICs bonded for communication between the ESXi or Oracle VM Server host machines

All the best practices that are applicable at the VM guest level apply to both VMware and Oracle VM. For example, we want to enable jumbo frames on the guest VM. We also want to set up hugepages and disable NUMA at the guest VM level (a sketch of these guest settings follows below).
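As a rough sketch of those guest-level settings on a Linux VM (the hugepages count and the interface name eth0 are placeholders; size hugepages to your SGA and verify everything against your OS and Oracle documentation):

# Reserve huge pages for the Oracle SGA (the value shown is only a placeholder).
echo "vm.nr_hugepages = 25000" >> /etc/sysctl.conf
sysctl -p

# Enable jumbo frames on the database interface (the vSwitch and physical switches must match).
ip link set dev eth0 mtu 9000

# Disable NUMA inside the guest by adding the numa=off kernel boot parameter in the grub configuration.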

In general, we also do not want to over-commit memory or CPUs for production environments.  For databases that fit well for consolidation, we can consider over-committing memory or CPUs.

For additional information on best practices for VMware, please read the following articles.

Four key documents for virtualizing Oracle
DBA Best Practices



High Availability Guide


vCloud Suite and vCloud Networking and Security
vCloud Editions

 vCloud Networking and Security


vCenter Operations


VMware Tech Resource Center (Videos, Whitepapers, Docs)

Miscellaneous
A high level whitepaper on virtualizing Business Critical Apps on VMware

Deployment Guide, Reference Architecture, Customer case studies and white papers



VMware Network I/O Control: Architecture, Performance and Best Practices http://www.vmware.com/files/pdf/techpaper/VMW_Netioc_BestPractices.pdf


Esxtop and vscsiStats

Memory Management vSphere 5

Resource Mgmt vSphere 5 
  
Achieving a Million IOPS in a single VM with vSphere5

VMXNET3 was designed with improving performance in mind. See, VMware KB 1001805: http://kb.vmware.com/selfservice/documentLinkInt.do?micrositeID=null&externalID=1001805


Performance Evaluation of VMXNET3 Virtual Network Device can be found at: http://www.vmware.com/pdf/vsp_4_vmxnet3_perf.pdf



Preferred BIOS settings (always double-check with your hardware vendor): http://www.vmware.com/pdf/Perf_Best_Practices_vSphere4.1.pdf




SCSI Queue Depth - Controlling LUN queue depth throttling in VMware ESX/ESXi


Monitor disk latency at three distinct layers: the device or HBA, the kernel or ESX hypervisor, and the guest or virtual machine.


PVSCSI Storage Performance 


Snapshot limitations and best practices to minimize problems http://kb.vmware.com/kb/1025279

Jumbo frames VMXNET3 



The vSphere 4 CPU scheduler

   
Some excellent storage links from Chad Sakac (EMC) and Vaughn Stewart (NetApp)
VNX and vSphere Techbook

VMAX and vSphere Techbook

Isilon and vSphere Best Practices Guide