Merv Adrian at Gartner Research
"Apache Hadoop is a set of standard open-source software projects that provide a framework for using massive amounts of data across a distributed network."
Traditional IM
- Requirements based
- Top-down design
- Integration and reuse
- Technology/consolidation
- World of DW and ECM
- Competence centers
- Better decisions
- Commercial software
Big Data Style
- Opportunity Oriented
- Bottom-up experimentation
- Immediate use
- Tool proliferation
- World of Hadoop
- Hackathons
- Better business
- Open source
Which Projects Are "Hadoop"? Minimum set from Apache website:
- Apache HDFS
- Apache MapReduce
- Apache YARN
- Other independent Apache projects: Ambari, Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, ZooKeeper
Rich, Complex Set of Functional Choices
- Ingest/Propagate
- Describe, Develop
- Compute, Search
- Persist
Ingest/Propagate:
Apache Flume, Apache Kafka, Apache Sqoop, HDFS, NFS, Informatica HParser, DBMS vendor utilities, Talend, WebHDFS
Describe, Develop:
Apache Crunch, Apache Hive, Apache Pig, Apache Tika, Cascading, Cloudera Hue, DataFu, Dataguise, IBM Jaql
Compute, Search:
Apache Blur, Apache Drill, Apache Giraph, Apache Hama, Apache Lucene, Apache MapReduce, Apache Solr, Cloudera Impala, HP HAVEn, IBM Big SQL, IBM InfoSphere Streams, HStreaming, Pivotal HAWQ, SQLstream, Storm, Teradata SQL-H
Persist:
Apache HDFS, IBM GPFS, Lustre, MapR Data Platform
Serialization:
Apache Avro, RCFile (and ORCFile), SequenceFile, Text, Trevni
DBMS:
Apache Accumulo, Apache Cassandra, Apache HBase, Google Dremel
Monitor, Administer:
Apache Ambari, Apache Chukwa, Apache Falcon, Apache Oozie, Apache Whirr, Apache ZooKeeper, Cloudera Manager
Analytics, Machine Learning:
Apache Drill, Apache Hive, Apache Mahout, Datameer, IBM BigSheets, IBM Big SQL, Karmasphere, Microsoft Excel, Platfora, Revolution Analytics RHadoop, SAS, Skytree
Leading pure plays: Cloudera, Hortonworks, MapR
Others: DataStax, LucidWorks, RainStor, Sqrrl, WANdisco, Zettaset
Hadoop has moved to the next stage with Apache Hadoop 2.0.
What's Next for Hadoop?
- Search
- Advanced prebuilt analytic functions
- Cluster, appliance or cloud?
- Virtualization
- Graph processing
What's Still Needed in Hadoop Ecosystem?
- Security
- Data Warehousing Tools
- Governance
- Skills
- Subproject Optimization
- Distributed Optimization
Recommendations
- Audit your data: find "dark data" and map it to business opportunities to identify pilot projects.
- Familiarize yourself with the capabilities of available Hadoop distributions.
- Build skills in-house and recruit for them.
Ganesha - God of Success
Shaun Connolly - Hortonworks
Key requirement of a "Data Lake": store ALL DATA in one place and interact with that data in multiple ways.
YARN is a distributed operating system for processing.
YARN Takes Hadoop Beyond Batch
Applications run IN Hadoop versus ON Hadoop, with predictable performance and quality of service.
Mohit Saxena - InMobi
Good insights in managing and processing data at scale across data centers.
InMobi contributed Apache Falcon to address Hadoop data lifecycle management.
Scott Gnau - President, Teradata Labs
Good discussion of mission-critical processes and applications.
Bruno Fernandez - Yahoo
Search was their first use case for Hadoop. One of their key drivers now is devices connected to the cloud.