Sunday, November 11, 2012

Learn the Lambda Architecture to Understand Big Data

In learning any new paradigm, it's necessary to see how its architecture and processes differ from the old (comfortable) ways of doing things.  It's also important to understand how the new paradigm can better meet business needs and future challenges.  To learn big data, it's important to invest time in really understanding how big data works and how it differs from traditional systems like relational databases.  However, in learning any system that processes data, the key is always to "know your data".  This is a universal constant that holds true in big data the same way it holds true in traditional data systems.

One of the first steps toward learning big data and how it works is to understand the Lambda architecture.  The Lambda architecture shows how it is possible to process large volumes of data in a scalable way, without the computing complexity increasing as the system grows larger.  With the Lambda architecture:
  • You have a general method for processing an arbitrary function on an arbitrary data set and returning the result with low latency.
  • Batch processes are written as if they were single-threaded applications, and the infrastructure handles the parallelization across a cluster of machines.  This allows both the data sets and the number of machines in the cluster to grow to ever-increasing sizes.
  • Processing is broken into three layers: the batch layer, the serving layer, and the speed layer.
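The interaction of the three layers can be sketched in a few lines.  In this minimal, hypothetical example (the event names and word-count function are illustrative, not from any specific system), the batch layer recomputes a view from the full master dataset, the speed layer counts only the recent events the last batch run has not yet absorbed, and the serving layer answers queries by merging the two views:

```python
from collections import Counter

# Hypothetical immutable master dataset (input to the batch layer).
master_dataset = ["login", "click", "click", "purchase"]

# Recent events not yet absorbed by the latest batch run (input to the speed layer).
recent_events = ["click", "login"]

def batch_view(events):
    """Batch layer: recompute the view from scratch over the entire dataset."""
    return Counter(events)

def realtime_view(events):
    """Speed layer: incrementally count only events since the last batch run."""
    return Counter(events)

def query(batch, realtime):
    """Serving layer: merge the precomputed batch view with the realtime view."""
    merged = Counter(batch)
    merged.update(realtime)
    return merged

result = query(batch_view(master_dataset), realtime_view(recent_events))
# e.g. result["click"] == 3 (two clicks from the batch view plus one recent click)
```

Note that the batch function is written as straightforward single-threaded code; in a real deployment the same recomputation would be distributed across a cluster by a framework such as Hadoop, while the merge-at-query-time logic stays the same.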
Big data systems built this way have properties that give the technology its strength.  Some of these include:
  • Extensibility and scalability.
  • Human fault tolerance due to data immutability.  Runtime failures are handled automatically.
  • Management of all the concurrency, scheduling, and merging of data.
  • Low latency reads and updates.
  • More predictable performance.
  • A system that can address a wide variety of applications that need to access large volumes of data.
  • Reduced maintenance.
  • A number of robust open source tools and frameworks for solving problems.
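The human fault tolerance property deserves a closer look, because it follows directly from data immutability.  The sketch below is a hypothetical append-only fact store (the field names and helper functions are my own illustration): "updates" are recorded as new timestamped facts rather than in-place overwrites, so a mistaken write never destroys earlier data and the correct views can always be recomputed:

```python
# Append-only fact store: a minimal sketch of data immutability.
facts = []

def record_fact(entity, field, value, ts):
    """Never update in place; append a new timestamped fact instead."""
    facts.append({"entity": entity, "field": field, "value": value, "ts": ts})

def current_value(entity, field):
    """Derive current state by taking the most recent fact for this field."""
    matching = [f for f in facts if f["entity"] == entity and f["field"] == field]
    return max(matching, key=lambda f: f["ts"])["value"] if matching else None

record_fact("user-1", "city", "Boston", ts=1)
record_fact("user-1", "city", "Chicago", ts=2)  # a "change" is just a newer fact

# current_value("user-1", "city") is "Chicago", but the Boston fact survives:
# if the Chicago write turned out to be a human error, it could be removed
# and every derived view recomputed from the untouched history.
```

Contrast this with a mutable relational row, where an erroneous UPDATE permanently destroys the previous value.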

I strongly recommend that as you learn big data, you focus on fundamentals: baseline frameworks, reference architectures, and best practices.  Some examples where mistakes were made in the past:
  • When programmers wanted to learn object-oriented programming, they focused on learning C++ or Java, when the emphasis should have been on learning object-oriented analysis and design first.  Many programming projects have paid the price for this approach.
  • When learning virtualization and the cloud, people focus on learning products.  Pointing and clicking is not the way to build production platforms.  Learn infrastructures, managing virtual workloads, reference architectures, best practices, etc.
Big data is an exciting new paradigm that is disrupting the IT industry today.  I look forward to seeing you all in the big data space.
