Wednesday, June 26, 2013

Hadoop Summit Keynote - San Jose, 2013

I wanted to share key thoughts from the keynote sessions at Hadoop Summit 2013 in San Jose.

Merv Adrian - Gartner Research
"Apache Hadoop is a set of standard open-source software projects that provide a framework for using massive amounts of data across a distributed network."

Traditional IM
  • Requirements based
  • Top-down design
  • Integration and reuse
  • Technology consolidation
  • World of DW and ECM
  • Competence centers
  • Better decisions
  • Commercial software
Big Data Style
  • Opportunity oriented
  • Bottom-up experimentation
  • Immediate use
  • Tool proliferation
  • World of Hadoop
  • Hackathons
  • Better business
  • Open source


Which Projects Are "Hadoop"? Minimum set from the Apache website:
  • Apache HDFS
  • Apache MapReduce
  • Apache YARN
  • Other independent Apache projects: Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, ZooKeeper
Rich, Complex Set of Functional Choices
  • Ingest/Propagate
  • Describe, Develop
  • Compute, Search
  • Persist
  • Monitor, Administer
  • Analytics, Machine Learning
Ingest/Propagate:
Apache Flume, Apache Kafka, Apache Sqoop, HDFS, NFS, Informatica HParser, DBMS vendor utilities, Talend, WebHDFS
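
As a small, hedged illustration of the ingest path, here is a sketch of copying a local file into HDFS with Hadoop's standard FileSystem API. The paths are placeholders, and cluster settings are assumed to come from core-site.xml/hdfs-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsIngest {
        public static void main(String[] args) throws Exception {
            // Cluster settings (fs.defaultFS, etc.) are picked up from
            // the Hadoop config files on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Copy a local file into HDFS; both paths are placeholders.
            fs.copyFromLocalFile(new Path("/tmp/events.log"),
                                 new Path("/data/raw/events.log"));
            fs.close();
        }
    }

The same kind of copy can also be done over WebHDFS or with the hdfs command-line client; the tools in the list above mainly differ in how they schedule and scale this movement.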

Describe, Develop: 
Apache Crunch, Apache Hive, Apache Pig, Apache Tika, Cascading, Cloudera Hue, DataFu, Dataguise, IBM Jaql
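
Hive is usually the first stop on this list. Below is a sketch of querying HiveServer2 over JDBC from Java; the host, port, table, and column names are invented for illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; ships with the Hive client libraries.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Host, port, database, table, and columns are hypothetical.
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")) {
                while (rs.next()) {
                    System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }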

Compute, Search:
Apache Blur, Apache Drill, Apache Giraph, Apache Hama, Apache Lucene, Apache MapReduce, Apache Solr, Cloudera Impala, HP HAVEn, IBM Big SQL, IBM InfoSphere Streams, HStreaming, Pivotal HAWQ, SQLstream, Storm, Teradata SQL-H
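
MapReduce remains the baseline compute engine in this list. As a well-worn but concrete illustration, here is a minimal word-count mapper using the org.apache.hadoop.mapreduce API; the summing reducer and the job driver are omitted.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (word, 1) for every whitespace-separated token in its input split;
    // a summing reducer and a job driver (not shown) complete the word count.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

The higher-level engines above (Hive, Pig, Impala, Drill, and the rest) exist largely so that jobs like this can be expressed in a few lines of SQL or scripting instead.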

Persist:
Apache HDFS, IBM GPFS, Lustre, MapR Data Platform

Serialization:
Apache Avro, RCFile (and ORCFile), SequenceFile, Text, Trevni (see the Avro sketch after this list)

DBMS:
Apache Accumulo, Apache Cassandra, Apache HBase, Google Dremel

Monitor, Administer:
Apache Ambari, Apache Chukwa, Apache Falcon, Apache Oozie, Apache Whirr, Apache ZooKeeper, Cloudera Manager

Analytics, Machine Learning:
Apache Drill, Apache Hive, Apache Mahout, Datameer, IBM BigSheets, IBM Big SQL, Karmasphere, Microsoft Excel, Platfora, Revolution Analytics RHadoop, SAS, Skytree
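
To make the serialization options a bit more concrete, here is a minimal sketch of writing an Avro container file with the Avro Java API. The schema and record values are made up for the example.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DatumWriter;

    public class AvroWriteExample {
        public static void main(String[] args) throws Exception {
            // A made-up two-field schema, defined inline for the example.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"long\"},"
                + "{\"name\":\"action\",\"type\":\"string\"}]}");

            GenericRecord event = new GenericData.Record(schema);
            event.put("id", 1L);
            event.put("action", "click");

            // Write a self-describing Avro container file.
            DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
            try (DataFileWriter<GenericRecord> writer = new DataFileWriter<>(datumWriter)) {
                writer.create(schema, new File("events.avro"));
                writer.append(event);
            }
        }
    }

Because the schema travels with the file, the same data can later be read by Hive, Pig, or MapReduce without separate metadata, which is the main reason these formats keep showing up across the stack.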

Leading pure plays: Cloudera, Hortonworks, MapR
Others: DataStax, LucidWorks, RainStor, Sqrrl, WANdisco, Zettaset

Hadoop has moved to the next stage with Apache Hadoop 2.0.

What's Next for Hadoop?
  • Search
  • Advanced prebuilt analytic functions
  • Cluster, appliance or cloud?
  • Virtualization
  • Graph processing
What's Still Needed in the Hadoop Ecosystem?
  • Security
  • Data Warehousing Tools
  • Governance
  • Skills
  • Subproject Optimization
  • Distributed Optimization
Recommendations
  • Audit your data - find "dark data" and map it to business opportunities to identify pilot projects
  • Familiarize yourself with the capabilities of available Hadoop distributions. 
  • Build skill and recruit it.
Is Hadoop starting to happen in the cloud?  Amazon has created 5.5 million Elastic Hadoop instances in the last year.

Ganesha - God of Success

Shaun Connolly - Hortonworks
Key requirement of a "Data Lake": store ALL data in one place and interact with that data in multiple ways.

YARN is a distributed operating system for processing.

YARN Takes Hadoop Beyond Batch
Applications run IN Hadoop versus ON Hadoop, with predictable performance and quality of service.
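
As a rough sketch of what running IN Hadoop looks like from the client side, YARN exposes a client API through which any application can negotiate resources from the ResourceManager. This minimal example only asks for a new application id; a real client would go on to define an ApplicationMaster and submit it.

    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnHello {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new YarnConfiguration());
            yarnClient.start();

            // Ask the ResourceManager for a new application id. A real client
            // would then fill in an ApplicationSubmissionContext (ApplicationMaster
            // command, resources, queue) and call yarnClient.submitApplication(...).
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationId appId = app.getApplicationSubmissionContext().getApplicationId();
            System.out.println("Got application id: " + appId);

            yarnClient.stop();
        }
    }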

Mohit Saxena - InMobi
Good insights into managing and processing data at scale across data centers.
InMobi contributed Apache Falcon to address Hadoop data lifecycle management.

Scott Gnau - President, Teradata Labs
Good discussion on mission-critical processes and applications.

Bruno Fernandez - Yahoo
Search was their first use case for Hadoop.  One of the key drivers for them is the growth of connected devices tied to the cloud.


