One of the first steps toward learning big data and how it works is to understand the Lambda architecture. The Lambda architecture shows how to process large volumes of data in a scalable way, without the computing complexity growing as the system grows larger. With the Lambda architecture:
- You have a general method for computing an arbitrary function over an arbitrary data set and returning the result with low latency.
- Batch processes are written as if they were single-threaded applications, and the infrastructure handles parallelization across a cluster of machines. This lets data sets, and the number of machines in the cluster, grow to ever-increasing sizes.
- Processing is broken into three layers: the batch layer, the serving layer, and the speed layer.
- The system is extensible and scalable.
- Data immutability provides human fault tolerance, and runtime failures are handled automatically.
- The framework manages concurrency, scheduling, and the merging of data.
- Reads and updates are low latency.
- Performance is more predictable.
- A single system can address a wide variety of applications that need to access large volumes of data.
- Maintenance is reduced.
- A number of robust open source tools and frameworks are available for solving problems.
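To make the three layers concrete, here is a minimal sketch in Python of how a query merges the batch and speed layers. It assumes a simple click-count view; the function names and the in-memory lists standing in for the master dataset and the stream are illustrative, not part of any particular framework.

```python
from collections import defaultdict

# Hypothetical event stream: (user_id, clicks) pairs. In a real system the
# batch layer would recompute over an immutable master dataset (e.g. in HDFS)
# and the speed layer would aggregate a live stream; here both are plain lists.

def batch_view(master_dataset):
    """Batch layer: recompute the view from the full, immutable dataset."""
    view = defaultdict(int)
    for user, clicks in master_dataset:
        view[user] += clicks
    return dict(view)

def speed_view(recent_events):
    """Speed layer: aggregate recent events not yet absorbed by a batch run."""
    view = defaultdict(int)
    for user, clicks in recent_events:
        view[user] += clicks
    return dict(view)

def query(batch, speed, user):
    """Serving layer: answer a query by merging batch and real-time views."""
    return batch.get(user, 0) + speed.get(user, 0)

master = [("alice", 3), ("bob", 1), ("alice", 2)]   # already batch-processed
recent = [("alice", 1)]                             # arrived since last batch run
print(query(batch_view(master), speed_view(recent), "alice"))  # 5 + 1 = 6
```

The key design point the sketch illustrates: the batch layer can stay simple and recompute from scratch, because the speed layer compensates for the data the batch run has not seen yet, and the serving layer merges the two at read time.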
I strongly recommend that, as you learn big data, you focus on fundamentals: learn baseline frameworks, reference architectures, and best practices. Some examples where mistakes were made in the past:
- When programmers wanted to learn object-oriented programming, they focused on learning C++ or Java, when the emphasis should have been on learning object-oriented analysis and design first. Many programming projects have paid the price for this approach.
- When learning virtualization and the cloud, people focus on learning products. Pointing and clicking is not the way to build production platforms. Learn infrastructure, managing virtual workloads, reference architectures, best practices, and so on.