Reading Notes: Designing Data-Intensive Applications, Chapter 1

seanhu93
11 min read · Mar 19, 2022


[Photo: Rainier Cherry, June 2019, Brentwood, CA]

Introduction

Early in my career as a software engineer, I often heard fancy words when people described software systems: large scale, distributed, highly reliable, highly performant, highly available, and highly consistent. I could not really tell what exactly they meant, how they differed from one another, or why they sounded so appealing. I sometimes even used these words in presentations and on my resume. Deep down, however, I knew I would have zero confidence if someone asked me why I thought my application was “highly available”.

These puzzling words surrounded me for years, and I did pick up bits and pieces here and there. But only after spending months last year reading Martin Kleppmann’s book Designing Data-Intensive Applications three times did I gain a systematic view of what these properties are, what it takes to achieve them, and why it is so challenging and exciting to make a distributed system highly performant, highly scalable, and highly available.

This year, one of my goals is to organize and share my reading notes. Not only does this help me review what I have learned from the book and dive deeper into the details, but I also hope it helps more people appreciate the beauty of distributed systems.

Disclaimer: These are my personal learning notes. My interpretation of the book may not be 100% correct, and some of my wording differs from the original book, which could be improper or incorrect. Please do not treat this as a condensed version of the book. Feel free to let me know if you find any mistakes or have any comments or suggestions. If you are interested, I highly recommend reading the book yourself.

This blog covers:

  1. Preface
  2. Part I
  3. Chapter 1

Preface

There are many interesting developments in the database area.

The driving forces include:

  1. Internet companies need to handle large amounts of data.
  2. Business intelligence needs to be agile and respond quickly to market insights.
  3. Open source software has become successful.
  4. Multi-core CPUs have become standard, networks are getting faster, and parallel processing is becoming popular.
  5. Even small companies now build systems distributed across multiple machines and geographic locations, thanks to IaaS offerings like AWS.
  6. Many services are expected to be highly available (HA); extended downtime is not acceptable.

Data-intensive means: the main challenges are the amount of data, the complexity of data, and the speed at which it changes.

The book tries to find useful ways of thinking about data systems — not just how they work, but also why they work that way.

You will be in a better position to decide which technologies are proper for which purpose, and how to combine them to form the foundation of a good application.

Who Should Read This Book?

Software engineers, software architects, and technical managers who want to learn how to make their applications:

  1. highly scalable
  2. highly available and operationally robust
  3. highly maintainable, even as they grow and as requirements and technologies change

The scope of the book

  1. discuss the various principles and trade-offs that are fundamental to data systems
  2. explore the different design decisions taken by different products

Part I Foundations of Data Systems

  • Chapter 1: introduce the terminology and approach, such as reliability, scalability, and maintainability
  • Chapter 2: introduce different data models and query languages, how to choose them for different situations.
  • Chapter 3: introduce the internals of different storage engines, how to choose them for different workloads.
  • Chapter 4: introduce different data encodings (serialization), how they fare in an environment where application requirements and data schemas change over time.

Chapter 1: Reliable, Scalable, and Maintainable Applications

C1S1: What is a data-intensive system?

Many applications today are data-intensive, as opposed to compute-intensive. Raw computing power (such as CPU) is rarely the limiting factor for these applications; the real problems are the amount, complexity, and speed of change of the data.

A data-intensive application is typically built from standard building blocks that provide commonly needed functionality:

  1. store and query data (database)
  2. store the results of expensive operations so they can be reused (cache)
  3. search/filter (search indexes)
  4. send messages that will be handled asynchronously (stream processing)
  5. periodically process data (batch processing)

Today, the boundaries between the categories of data systems are becoming blurred. For example, Redis can also be used as a message queue, and Kafka can also provide data persistence.

Today, application developers often need to stitch multiple data systems together to fulfill their requirements, in effect designing a new special-purpose data system out of smaller general-purpose components.
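As a toy illustration of this stitching, the sketch below composes a hypothetical database, cache, and message queue behind one application-level service. None of the class names correspond to any real product's API; they are stand-ins for general-purpose components.

```python
class Database:
    """Stand-in for a durable store (the source of truth)."""
    def __init__(self):
        self._rows = {}
    def write(self, key, value):
        self._rows[key] = value
    def read(self, key):
        return self._rows.get(key)

class Cache:
    """Stand-in for a cache that remembers results of expensive reads."""
    def __init__(self):
        self._entries = {}
    def get(self, key):
        return self._entries.get(key)
    def set(self, key, value):
        self._entries[key] = value
    def invalidate(self, key):
        self._entries.pop(key, None)

class MessageQueue:
    """Stand-in for a queue whose messages are handled asynchronously."""
    def __init__(self):
        self.messages = []
    def publish(self, message):
        self.messages.append(message)

class UserProfileService:
    """The application code is itself a composite data system: it keeps the
    cache consistent with the database and notifies downstream consumers
    (e.g. a search indexer) via the queue."""
    def __init__(self, db, cache, queue):
        self.db, self.cache, self.queue = db, cache, queue

    def update_profile(self, user_id, profile):
        self.db.write(user_id, profile)                   # source of truth
        self.cache.invalidate(user_id)                    # avoid stale reads
        self.queue.publish(("profile_updated", user_id))  # async follow-up work

    def get_profile(self, user_id):
        cached = self.cache.get(user_id)
        if cached is not None:
            return cached
        profile = self.db.read(user_id)
        if profile is not None:
            self.cache.set(user_id, profile)              # fill cache on miss
        return profile
```

The interesting (and hard) part in real systems is exactly what this sketch glosses over: keeping the database, cache, and downstream consumers consistent when things fail partway through.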

When designing data systems, you need to consider:

  1. Highly reliable: ensure data remains correct and complete, even if something goes wrong internally.
  2. Highly performant: provide consistently good performance
  3. Highly scalable: be able to scale to handle an increase in load
  4. Other factors such as skills and experience of developers, legacy systems, time to delivery, tolerance of different kinds of risk, regulatory constraints (compliance)

In this book, we focus on the following three concerns, which are most important in most software systems:

  1. Reliability — continue to work correctly (perform correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error)
  2. Scalability — be able to deal with system growth (data volume, traffic, complexity)
  3. Maintainability — be able to productively maintain the system’s current behavior and adapt it to new use cases

C1S2: What is Reliability?

Glossary

  • Reliability: the system should continue to work correctly, even when things go wrong.
  • Fault: a thing that can go wrong.
  • Fault-tolerant/reliable: a system that anticipates faults and can cope with them.
  • Failure: when the system as a whole stops providing the required service to users. It is different from a fault.
  • Chaos engineering: deliberately increasing the rate of faults to ensure that the fault-tolerance machinery is continuously exercised and tested, which increases your confidence that the system is reliable (e.g., Netflix’s Chaos Monkey).

Note 1: No system is able to handle every possible kind of fault. When talking about fault-tolerant/reliable systems, we need to define which kinds of faults the system is able to handle.

Note 2: Some faults cannot be cured (e.g., if an attacker compromises sensitive user data such as passwords); in such cases, we need to prevent the fault instead. (This is out of the scope of this book.)

Hardware Faults

For example: hard disks crash, RAM becomes faulty, the power grid has a blackout, or someone unplugs the wrong network cable.

The first response is to add redundancy to individual hardware components. For example: RAID configurations for disks, dual power supplies, hot-swappable CPUs, and backup power for data centers.

However, as data volumes and applications’ computing demands have increased, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults.

Hence, there is a move toward tolerating the loss of entire machines by using software fault-tolerance techniques. These techniques also bring operational advantages: they allow planned downtime and rolling updates without taking down the entire system.
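A minimal sketch of one such software technique, assuming hypothetical replica clients (plain callables here, standing in for network clients to identical servers): retry the same request against other replicas, so the loss of any single machine is tolerated as long as one replica is still up.

```python
import random

class ReplicaUnavailable(Exception):
    """Raised when a replica (machine) cannot serve the request."""
    pass

def call_with_failover(replicas, request, max_attempts=3):
    """Try the request against up to max_attempts randomly chosen replicas;
    a single dead machine no longer fails the whole request."""
    last_error = None
    for replica in random.sample(replicas, min(max_attempts, len(replicas))):
        try:
            return replica(request)
        except ReplicaUnavailable as e:
            last_error = e  # this machine is down; try another one
    raise last_error

# Usage: two dead replicas and one healthy one; the request still succeeds.
def dead(request):
    raise ReplicaUnavailable("connection refused")

def healthy(request):
    return f"ok: {request}"

print(call_with_failover([dead, dead, healthy], "read key=42"))
```

Real systems layer much more on top (health checks, timeouts, replication of writes), but the core idea is the same: redundancy in software rather than in individual hardware components.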

Software Errors

Software errors are usually harder to anticipate than hardware faults because they are correlated across nodes.

They can therefore cause bigger damage to the system. For example:

  1. A software bug that causes the entire system to crash when given bad input.
  2. A runaway process that uses up shared resources (CPU, memory, storage, network bandwidth).
  3. A service that becomes unresponsive.

There is no quick solution, but the following can help:

  1. Thoroughly think through the assumptions and interactions in the system.
  2. Thorough testing
  3. Isolation of processes and the network
  4. Define, monitor, and analyze system metrics such as CPU, memory, storage, and network usage.
  5. Chaos engineering
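The last item, deliberate fault injection, can be sketched in a few lines: wrap a function so that calls fail at a deliberate, configurable rate, and check that the caller's retry logic copes. The names here are made up; real tools such as Netflix's Chaos Monkey operate on whole instances, not individual function calls.

```python
import random

def chaos_wrap(func, fault_rate=0.2, seed=None):
    """Return a version of func whose calls fail with probability fault_rate.
    Injecting faults continuously keeps the fault-handling code paths
    exercised and tested, instead of rotting unused until a real outage."""
    rng = random.Random(seed)  # seeded for reproducible experiments
    def wrapped(*args, **kwargs):
        if rng.random() < fault_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)
    return wrapped

# Usage: verify that a caller's retry logic survives the injected faults.
flaky_fetch = chaos_wrap(lambda key: f"value-of-{key}", fault_rate=0.2, seed=1)

def fetch_with_retry(key, attempts=5):
    for _ in range(attempts):
        try:
            return flaky_fetch(key)
        except ConnectionError:
            continue  # tolerate the injected fault and retry
    raise ConnectionError("all retries failed")

print(fetch_with_retry("user:1"))
```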

Human Errors

Humans are known to be unreliable. For example, one study found that configuration errors by operators were the leading cause of outages, whereas hardware faults played a role in only 10–25% of outages.

How to protect the system from human errors?

  1. Design systems in a way that minimizes the opportunities for human error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.”
  2. Decouple the places where people make the most mistakes from the places where people can cause failures. For example: provide sandbox/pre-production environments where people can experiment and explore.
  3. Test thoroughly at all levels. Have automated testing.
  4. Allow quick and easy recovery. For example, allow fast and easy rollback, gradually roll out new code, and provide tools to recompute data (in case an old computation was incorrect).
  5. Set up detailed and clear monitoring.
  6. Implement good management practices and training (out of the scope of this book)
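The gradual rollout mentioned in point 4 is often implemented with deterministic user bucketing. A hedged sketch, with made-up names rather than any real feature-flag library's API:

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Decide whether a user sees a feature that is being rolled out
    gradually (e.g. 1% -> 10% -> 100%). Hashing keeps each user's
    assignment stable across requests, and rollback is just dialing
    `percent` back down."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    # Map the first 32 bits of the hash to a bucket in [0, 100).
    bucket = int(digest[:8], 16) / 0x100000000 * 100
    return bucket < percent

# Usage: the same user always gets the same answer for the same percentage.
print(in_rollout("alice", "new-editor", 5))
print(in_rollout("alice", "new-editor", 5))
```

Because the assignment is a pure function of user and feature, a bad release only ever reaches the configured fraction of users, and turning it off is an instant configuration change rather than a redeploy.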

C1S3: What is Scalability?

Scalability is a term we use to describe a system’s ability to cope with increased load.

Before discussing scalability, we need to succinctly describe the current load on the system. Load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of your system. For example, it may be QPS to a web server, read/write ratio to a database, the number of simultaneously active users in a chat room, the hit rate on a cache.

What happens when the load increases? We can look at it in two ways:

  1. When a load parameter increases and the system resources (CPU, memory, network bandwidth, etc.) are unchanged, how is the performance of your system affected?
  2. When a load parameter increases, how much do you need to increase the system resources if you want to keep performance unchanged?

Both questions require us to describe performance quantitatively. There are two main performance numbers, and the best choice between them is also application specific.

  1. Response time: the time between a client sending a request and receiving a response, which is usually more important in online systems.
  2. Throughput: the number of queries/operations/records that can be processed per second, which is usually more important in a batch processing system such as Hadoop.

In practice, response times vary a lot as a system handles a variety of requests, so we usually measure response time not as a single number but as a distribution of values.

To describe the distribution, the average response time and percentiles are usually used. The average is common but not a very good metric, as it doesn’t tell you how many users actually experienced delays. Percentiles, especially higher percentiles such as p95, p99, and p999, are more useful because they directly reflect users’ experience of the service.

Percentiles are often used in SLOs (service level objectives) and SLAs (service level agreements). For example, a service may be defined as up if it has a median response time below 200 ms and a p99 below 1 s, and it may be required to be up at least 99.9% of the time.
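These ideas can be made concrete with a small stdlib-only sketch. The response times below are made-up numbers, chosen so that a single outlier drags the mean far above the median:

```python
import math

def percentile(samples, p):
    """Nearest-rank p-th percentile: the smallest value such that at least
    p% of the samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Made-up response times (ms) for ten requests; one request was very slow.
response_times_ms = [30, 42, 55, 61, 75, 80, 110, 150, 420, 3100]

mean = sum(response_times_ms) / len(response_times_ms)  # 412.3 ms, misleading
p50 = percentile(response_times_ms, 50)                 # 75 ms: the typical user
p99 = percentile(response_times_ms, 99)                 # 3100 ms: the worst-off user

# The example SLO: median < 200 ms and p99 < 1 s.
slo_met = p50 < 200 and p99 < 1000
print(mean, p50, p99, slo_met)
```

Note how the mean (412.3 ms) suggests a slow service while the median (75 ms) shows most users are fine; it is the p99, not the mean, that reveals the tail and fails the SLO here.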

Head-of-line blocking: when the system has limited resources, a small number of slow requests can hold up the processing of subsequent requests, and all clients experience slowness.

Tail latency amplification: when serving one request requires multiple backend calls, even if only a small portion of those backend calls are slow, the overall request will be slow.
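A back-of-envelope calculation shows how quickly the tail amplifies: if each backend call independently exceeds its p99 latency 1% of the time, a request that must wait for n backend calls is slow with probability 1 - 0.99^n.

```python
def prob_request_slow(n_backends, p_slow_call=0.01):
    """Probability that a request is slow, if it must wait for n_backends
    independent backend calls and each call is slow (e.g. above its p99
    latency) with probability p_slow_call."""
    return 1 - (1 - p_slow_call) ** n_backends

for n in (1, 10, 100):
    print(f"{n:3d} backend calls -> P(slow) = {prob_request_slow(n):.3f}")
# prints:
#   1 backend calls -> P(slow) = 0.010
#  10 backend calls -> P(slow) = 0.096
# 100 backend calls -> P(slow) = 0.634
```

So with a fan-out of 100, nearly two thirds of user requests hit at least one slow backend call, even though each individual backend looks healthy at the 99th percentile.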

How do we maintain good performance even when our load parameters increase by some amount? There are mainly two approaches: scaling up vs scaling out:

  1. Scaling up (vertical scaling, moving to more powerful machines). The system can often be simpler, but high-end machines are very expensive and cannot scale infinitely due to hardware limitations.
  2. Scaling out (horizontal scaling, adding more machines), which has no hard upper bound but is often more complicated for a stateful data system.

Elastic vs manual

  1. An elastic system can automatically add resources when it detects a load increase, which is useful if the load is highly unpredictable.
  2. A manually scaled system relies on humans to analyze capacity and add more resources when needed, which is simpler and may have fewer operational surprises.

The common practice is to scale up (on a single node) until the cost of scaling up becomes too high or high availability becomes a concern. However, this practice is changing as distributed systems become the default for many applications.

There is no one-size-fits-all scalable architecture; you need to design for your system’s load characteristics.

An architecture that is appropriate for one level of load is unlikely to cope with 10x that load. If you are working on a fast-growing service, it is therefore likely that you will need to rethink your architecture on every order of magnitude load increase — or perhaps even more often than that.

C1S4: Maintainability

It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance:

  1. investigating failures, fixing bugs
  2. keeping its systems operational
  3. adapting it to new platforms
  4. modifying it for new use cases and adding new features
  5. repaying technical debt

Maintainability shouldn’t be an afterthought. We can and we should design software in such a way that minimizes pain during maintenance, and thus avoid creating legacy software ourselves.

There are three aspects of maintainability:

  1. Operability: easy for the operation team to keep the system running smoothly.
  2. Simplicity: easy for new engineers to understand the system.
  3. Evolvability: easy for engineers to adapt the system to new or changing requirements and use cases.

Operability

Good operability means making routine tasks easy, allowing the operations team to focus their efforts on high-value activities. Data systems can do various things to make routine tasks easy, including:

  1. Good monitoring, which provides good visibility to the system
  2. Good automation & integration, which make it easy to update the system.
  3. Good documentation and an easy-to-understand operational model. Exhibiting predictable behavior, minimizing surprises.
  4. Good default behavior and self-healing, while giving administrators the freedom to override and take manual control when needed.

Simplicity

Reducing complexity improves software maintainability. This does not necessarily mean reducing functionality; it can also mean removing accidental complexity.

Complexity is accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.

Abstraction is one of the best tools to deal with accidental complexity.

  1. It can hide implementation details behind clean, simple-to-understand facades.
  2. Moreover, an abstraction can be reused more efficiently than reimplementing a similar thing multiple times, and it leads to higher-quality software, as quality improvements in the abstracted component benefit all applications that use it.
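As a toy example of such a facade (not how any real storage engine works): the application code below depends only on the KeyValueStore interface, so the two very different implementations are interchangeable without touching the callers.

```python
from abc import ABC, abstractmethod
from typing import Optional

class KeyValueStore(ABC):
    """The facade: callers see only put/get, never the implementation details."""
    @abstractmethod
    def put(self, key: str, value: str) -> None: ...
    @abstractmethod
    def get(self, key: str) -> Optional[str]: ...

class InMemoryStore(KeyValueStore):
    """One implementation: a plain dictionary."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

class AppendOnlyFileStore(KeyValueStore):
    """Same interface, entirely different implementation: an append-only log
    file. Swapping it in requires no change to application code."""
    def __init__(self, path):
        self._path = path
    def put(self, key, value):
        with open(self._path, "a") as f:
            f.write(f"{key}\t{value}\n")
    def get(self, key):
        value = None
        try:
            with open(self._path) as f:
                for line in f:
                    k, v = line.rstrip("\n").split("\t", 1)
                    if k == key:
                        value = v  # last write wins
        except FileNotFoundError:
            pass
        return value
```

Any quality improvement behind the facade (say, adding an index to the file-backed store) automatically benefits every caller, which is exactly the reuse argument in point 2 above.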

Evolvability

System requirements change constantly and we must ensure that we’re able to deal with those changes.

Agile working patterns provide a framework for adapting to change. The agile community developed technical tools and patterns such as test-driven development (TDD) and refactoring.

Simplicity and abstractions will also help: simple and easy-to-understand systems are usually easier to modify than complex ones.

More:

What is Agile?

Incremental: divide the project into small but consumable increments, and deliver them one by one.

Iterative: deliver the highest-value item first, then iteratively add new features and robustness.

The benefit of agile is that it responds to change quickly: because requirements, plans, and results are evaluated continuously, issues can be identified and fixed earlier, and testing is ongoing with each iteration.

What is Scrum?

In Scrum, the product is built in a series of fixed-length iterations called sprints, giving the agile team a framework to ship product at a regular cadence.

C1S5: Summary

Reliability means making systems work correctly, even when faults occur. Faults can come from hardware, software, or humans. Faults cannot be avoided entirely; the key is to apply fault-tolerance techniques that hide them from end users.

Scalability means having strategies for keeping performance good, even when load increases. The key is to measure system load and performance characteristics quantitatively, and make design decisions based on them.

Maintainability has many facets, but in essence it’s about making life better for the engineering and operations teams who need to work with the system. The key is abstraction, monitoring, and automation.
