Lambda Architecture


Contents:

  • What is Lambda Architecture?

  • Architecture Diagram.

  • Overview and different layers of Lambda Architecture.

  • Working of the Lambda Architecture.

  • Merits and Demerits of Lambda Architecture.

  • Conclusion

What is Lambda Architecture?

Lambda Architecture is a data processing architecture designed to handle massive amounts of data by combining batch processing and stream processing. The idea originated in a 2011 blog post by Nathan Marz.

The Lambda Architecture is a good fit for applications that need to process both historical and real-time data, especially when the volume of data is large.

The name refers to the Greek letter ‘λ’ (lambda). It is usually explained by the architecture's combination of two distinct data processing approaches: batch processing and stream (real-time) processing. The ‘λ’ symbol can be seen as representing this dual approach.

Architecture Diagram

Overview and different layers of Lambda Architecture

In simple terms, Lambda Architecture is a design pattern for large-scale data processing. It handles massive amounts of data by combining batch processing with speed (real-time) processing.

In the diagram above, you can see the main components of the Lambda Architecture. Let’s talk about the core components of the Lambda Architecture:

  • Data Sources: The data sources in the Lambda architecture can be any type of data that needs to be processed. This can include structured data, such as data from a database, and unstructured data, such as data from social media or sensors.

  • Batch Layer: The batch layer is the traditional way of processing data. It takes a large amount of data and processes it all at once. This can be done using a variety of tools, such as Hadoop and Hive. The batch layer is typically used to generate historical views of the data, such as aggregated reports and trend analysis.

  • Speed Layer: The speed layer is used to process data in real time. This is done using streaming data processing tools, such as Apache Kafka and Spark Streaming. The speed layer is typically used to process data that needs to be analyzed or acted upon quickly, such as fraud detection and customer churn.

  • Serving Layer: The serving layer combines the results of the batch layer and the speed layer and makes them available for queries.
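
To make the difference between the batch layer and the speed layer more concrete, here is a minimal sketch in plain Python. It is not tied to Hadoop, Kafka, or any other tool mentioned in this article. The event format, `compute_batch_view`, and `SpeedLayer` are hypothetical names chosen for illustration only.

```python
from collections import Counter

# Hypothetical raw events: (timestamp, page) pairs kept in an append-only log.
master_dataset = [(1, "home"), (2, "pricing"), (3, "home")]


def compute_batch_view(events):
    """Batch layer: recompute the whole view from all stored events at once."""
    return Counter(page for _, page in events)


class SpeedLayer:
    """Speed layer: update a small real-time view one event at a time."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, event):
        _, page = event
        self.realtime_view[page] += 1


# The batch layer runs at an interval over everything stored so far.
batch_view = compute_batch_view(master_dataset)

# The speed layer handles events that arrived after that batch run.
speed = SpeedLayer()
for event in [(4, "home"), (5, "docs")]:
    speed.on_event(event)
```

The batch function looks at all the data every time it runs, while the speed layer only ever touches the newest event. That is the trade-off between the two layers: completeness on one side, low latency on the other.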

Here is a list of tools used in the Lambda architecture:

  • Apache Hadoop is used to store data and create distributed clusters.

  • Apache Kafka is used for data streaming in the speed layer.

  • Hadoop Distributed File System (HDFS) is used for managing immutable data in the batch layer.

  • Apache Spark is used for data streaming, graph processing, and batch processing.

  • Apache Cassandra is used to store real-time views.

  • Apache Storm is used for the speed layer tasks.

  • Apache HBase is used for the serving layer tasks.

Working of the Lambda Architecture

Data arrives from the Data Sources, which can be as diverse as logs, streams, or databases. The Batch Layer processes this data in batches, typically at regular intervals. Unlike the Speed Layer, the Batch Layer accumulates data and then processes it all at once at each interval.

The Speed Layer, on the other hand, is designed to process data streams as they come in, almost instantaneously. Simply put, the Speed Layer handles data in real time.

The Serving Layer combines the outputs from both the Batch and Speed Layers, making them available for queries. Importantly, the Serving Layer is where the precomputed views of the data are stored for efficient querying. “Precomputed Views” of the data means that instead of storing raw data and computing the results every time a query is made, the results are computed in advance and stored in a format optimized for quick retrieval.
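
As a rough illustration of how the Serving Layer can answer a query from precomputed views, here is a small Python sketch. The view contents and the `query` function are hypothetical. The sketch assumes the batch view covers all data up to the last batch run and the real-time view covers only data that arrived after it, so the two can simply be added together.

```python
# Hypothetical precomputed views (page -> view count).
batch_view = {"home": 1200, "pricing": 310}   # data up to the last batch run
realtime_view = {"home": 7, "docs": 2}        # data since the last batch run


def query(page, batch_view, realtime_view):
    """Serving layer: answer from the precomputed views, never from raw data."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)


print(query("home", batch_view, realtime_view))     # 1207
print(query("docs", batch_view, realtime_view))     # 2
print(query("pricing", batch_view, realtime_view))  # 310
```

Each query is just two dictionary lookups and an addition. No raw events are scanned at query time, which is the point of keeping precomputed views.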

Merits of Lambda Architecture

  1. Latency Reduction: Raw data is indexed in the serving layer so that historical data can be queried, but because of batch indexing there is a delay before new data becomes available for analysis. The speed layer uses stream processing to index recent data, which shrinks that time window and minimizes latency.

  2. Data Consistency: The architecture is designed to keep data consistent across distributed systems. By processing data sequentially, it avoids the inconsistencies that can arise when data is not updated uniformly across all replicas.

  3. Scalability: The architecture is built on scale-out technologies, so each layer can be expanded by adding more nodes. This lets it handle vast amounts of data.

  4. Fault Tolerance: Based on distributed systems, the architecture is resilient to hardware failures. Any indexing failures can be addressed by rerunning the indexing job, ensuring continuous data availability.

  5. Human Fault Tolerance: All raw data is stored and serves as the record from which analyzable data is derived. If there are bugs in the indexing process, the data can be reindexed after the bugs are fixed.
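
The human fault tolerance point can be sketched in a few lines of Python. The example below is hypothetical: the event fields and both `build_view_*` functions are made up for illustration. It assumes the raw events are never modified, so a view built by buggy code can be thrown away and rebuilt.

```python
# Hypothetical raw events. They are stored as-is and never changed.
raw_events = [
    {"user": "a", "amount_cents": 1050},
    {"user": "b", "amount_cents": 200},
    {"user": "a", "amount_cents": 999},
]


def build_view_buggy(events):
    """First version: drops the cents of every event before summing (a bug)."""
    view = {}
    for e in events:
        view[e["user"]] = view.get(e["user"], 0) + e["amount_cents"] // 100
    return view


def build_view_fixed(events):
    """Fixed version: sums the cents first and converts once at the end."""
    totals = {}
    for e in events:
        totals[e["user"]] = totals.get(e["user"], 0) + e["amount_cents"]
    return {user: cents / 100 for user, cents in totals.items()}


view = build_view_buggy(raw_events)   # {'a': 19, 'b': 2}, which is wrong
view = build_view_fixed(raw_events)   # {'a': 20.49, 'b': 2.0}, rebuilt from raw data
```

Because the view is derived entirely from the raw events, fixing the bug and rerunning the job is enough to correct it. Nothing has to be patched by hand.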

Demerits of Lambda Architecture

The Lambda Architecture can be complex and expensive to implement because it requires two separate processing layers: the batch layer and the speed layer. It also needs a lot of resources to run, since it processes the same data both in real time and in batch.

The main problem is that it requires two codebases, one for the batch layer and one for the speed layer. This adds complexity to the design and implementation of the application, and the two codebases are harder to maintain and debug than one.

Another software architecture pattern, the Kappa Architecture, is a good alternative to Lambda Architecture. It is worth reading more about how it compares with Lambda Architecture.

Conclusion

Overall, the Lambda Architecture is a powerful and versatile architecture that can handle a wide variety of data processing needs. It offers advantages such as reduced latency, scalability, and fault tolerance, but its dual-layered approach introduces complexity in both the implementation and the maintenance of the codebase.

Alternatives like the Kappa Architecture offer a more streamlined approach, so it is important to carefully consider the specific needs of the application before implementing the Lambda Architecture.