A DBA's Journey into the Cloud and Big Data: Hadoop Summit Keynote

I wanted to share key thoughts from the keynote at Hadoop Summit 2013 in San Jose.

Merv Adrian at Gartner Research
"Apache Hadoop is a set of standard open-source software projects that provide a framework for using massive amounts of data across a distributed network."

Traditional IM

Requirements based
Top-down design
Integration and reuse
Technology consolidation
World of DW and ECM
Competence centers
Better decisions
Commercial software

Big Data Style

Opportunity Oriented
Bottom-up experimentation
Immediate use
Tool proliferation
World of Hadoop
Hackathons
Better business
Open source

Which Projects Are "Hadoop"? Minimum set from Apache website:

Apache HDFS
Apache MapReduce
Apache YARN
Other independent Apache projects: Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, ZooKeeper
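For anyone coming from the DBA side who has not seen the MapReduce model before, here is a minimal sketch of the idea in plain Python. It is not the Hadoop API and was not shown in the keynote; the function names (map_phase, shuffle, reduce_phase, word_count) are my own, and everything runs in a single process purely to show the map, group-by-key, and reduce steps.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group values by key, the step a framework performs between map and reduce."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

In a real cluster the map and reduce functions would run on many nodes against blocks stored in HDFS; this sketch only illustrates the shape of the computation.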

Rich, Complex Set of Functional Choices

Ingest/Propagate
Describe, Develop
Compute, Search
Persist

Ingest/Propagate:
Apache Flume, Apache Kafka, Apache Sqoop, HDFS, NFS, Informatica HParser, DBMS vendor utilities, Talend, WebHDFS

Describe, Develop:

Apache Crunch, Apache Hive, Apache Pig, Apache Tika, Cascading, Cloudera Hue, DataFu, Dataguise, IBM Jaql

Compute, Search:

Apache Blur, Apache Drill, Apache Giraph, Apache Hama, Apache Lucene, Apache MapReduce, Apache Solr, Cloudera Impala, HP HAVEn, IBM Big SQL, IBM InfoSphere Streams, HStreaming, Pivotal HAWQ, SQLstream, Storm, Teradata SQL-H

Persist:

Apache HDFS, IBM GPFS, Lustre, MapR Data Platform

Serialization:

Apache Avro, RCFile (and ORCFile), SequenceFile, Text, Trevni

DBMS: Apache Accumulo, Apache Cassandra, Apache HBase, Google Dremel

Monitor, Administer:

Apache Ambari, Apache Chukwa, Apache Falcon, Apache Oozie, Apache Whirr, Apache ZooKeeper, Cloudera Manager

Analytics, Machine Learning:
Apache Drill, Apache Hive, Apache Mahout, Datameer, IBM BigSheets, IBM Big SQL, Karmasphere, Microsoft Excel, Platfora, Revolution Analytics RHadoop, SAS, Skytree

Leading pure plays: Cloudera, Hortonworks, MapR

Others: DataStax, LucidWorks, RainStor, Sqrrl, WANdisco, Zettaset

Hadoop has moved to the next stage with Apache Hadoop 2.0.

What's Next for Hadoop?

Search
Advanced prebuilt analytic functions
Cluster, appliance or cloud?
Virtualization
Graph processing

What's Still Needed in Hadoop Ecosystem?

Security
Data Warehousing Tools
Governance
Skills
Subproject Optimization
Distributed Optimization

Recommendations

Audit your data: find "dark data" and map it to business opportunities to identify pilot projects.
Familiarize yourself with the capabilities of available Hadoop distributions.
Build skills in-house and recruit for them.

Is Hadoop starting to happen in the cloud? Amazon created 5.5 million Elastic Hadoop instances in the last year.

Ganesha - God of Success

Shaun Connolly - Hortonworks

Key requirement of a "Data Lake": store ALL data in one place and interact with that data in multiple ways.

YARN is a distributed operating system for processing.

YARN takes Hadoop beyond batch.
Applications run IN Hadoop rather than ON Hadoop, with predictable performance and quality of service.

Mohit Saxena - InMobi
Good insights into managing and processing data at scale across data centers.
InMobi contributed Apache Falcon to address Hadoop data lifecycle management.

Scott Gnau - President, Teradata Labs
Good discussion of mission-critical processes and applications.

Bruno Fernandez - Yahoo
Search was their first use case for Hadoop. One of their key drivers now is devices that are connected to the cloud.

Wednesday, June 26, 2013

Hadoop Summit Keynote - San Jose, 2013