Chapter II

Databases and DBMS

What is a Database?

A database is an organized collection of structured information, or data, typically stored electronically in a computer system. A database is usually controlled by a Database Management System (DBMS). Together, the data and the DBMS, along with the applications that are associated with them, are referred to as a database system, often shortened to just database.

What is DBMS?

A database typically requires a comprehensive database software program known as a Database Management System (DBMS). A DBMS serves as an interface between the database and its end-users or programs, allowing users to retrieve, update, and manage how the information is organized and optimized. A DBMS also facilitates oversight and control of databases, enabling a variety of administrative operations such as performance monitoring, tuning, and backup and recovery.

Components

Here are some common components found across different databases:

Schema

The role of a schema is to define the shape of a data structure, and specify what kinds of data can go where. Schemas can be strictly enforced across the entire database, loosely enforced on part of the database, or they might not exist at all.

Table

Each table contains various columns just like in a spreadsheet. A table can have as meager as two columns and upwards of a hundred or more columns, depending upon the kind of information being put in the table.

Column

A column contains a set of data values of a particular type, one value for each row of the database. A column may contain text values, numbers, enums, timestamps, etc.

Row

Data in a table is recorded in rows. There can be thousands or millions of rows in a table having any particular information.

Types

Below are different types of databases:

SQL
NoSQL
- Document
- Key-value
- Graph
- Timeseries
- Wide column
- Multi-model

SQL and NoSQL databases are broad topics and will be discussed separately in SQL databases and NoSQL databases. Learn how they compare to each other in SQL vs NoSQL databases.

Challenges

Some common challenges faced while running databases at scale:

Absorbing significant increases in data volume: The explosion of data coming in from sensors, connected machines, and dozens of other sources.
Ensuring data security: Data breaches are happening everywhere these days, it’s more important than ever to ensure that data is secure but also easily accessible to users.
Keeping up with demand: Companies need real-time access to their data to support timely decision-making and to take advantage of new opportunities.
Managing and maintaining the database and infrastructure: As databases become more complex and data volumes grow, companies are faced with the expense of hiring additional talent to manage their databases.
Removing limits on scalability: A business needs to grow if it’s going to survive, and its data management must grow along with it. But it’s very difficult to predict how much capacity the company will need, particularly with on-premises databases.
Ensuring data residency, data sovereignty, or latency requirements: Some organizations have use cases that are better suited to run on-premises. In those cases, engineered systems that are pre-configured and pre-optimized for running the database are ideal.

SQL Databases

A SQL (or relational) database is a collection of data items with pre-defined relationships between them. These items are organized as a set of tables with columns and rows. Tables are used to hold information about the objects to be represented in the database. Each column in a table holds a certain kind of data and a field stores the actual value of an attribute. The rows in the table represent a collection of related values of one object or entity.

Each row in a table could be marked with a unique identifier called a primary key, and rows among multiple tables can be made related using foreign keys. This data can be accessed in many different ways without re-organizing the database tables themselves. SQL databases usually follow the ACID consistency model.

Materialized Views

A materialized view is a pre-computed data set derived from a query specification and stored for later use. Because the data is pre-computed, querying a materialized view is faster than executing a query against the base table of the view. This performance difference can be significant when a query is run frequently or is sufficiently complex.

It also enables data subsetting and improves the performance of complex queries that run on large data sets which reduces network loads. There are other uses of materialized views, but they are mostly used for performance and replication.

N+1 Query Problem

The N+1 query problem happens when the data access layer executes N additional SQL statements to fetch the same data that could have been retrieved when executing the primary SQL query. The larger the value of N, the more queries will be executed, the larger the performance impact.

This is commonly seen in GraphQL and ORM (Object-Relational Mapping) tools and can be addressed by optimizing the SQL query or using a dataloader that batches consecutive requests and makes a single data request under the hood.

Advantages

Let’s look at some advantages of using relational databases:

Simple and accurate
Accessibility
Data consistency
Flexibility

Disadvantages

Below are the disadvantages of relational databases:

Expensive to maintain
Difficult schema evolution
Performance hits (join, denormalization, etc.)
Difficult to scale due to poor horizontal scalability

Examples

Here are some commonly used relational databases:

NoSQL Databases

NoSQL is a broad category that includes any database that doesn’t use SQL as its primary data access language. These types of databases are also sometimes referred to as non-relational databases. Unlike in relational databases, data in a NoSQL database doesn’t have to conform to a pre-defined schema. NoSQL databases follow BASE consistency model.

Below are different types of NoSQL databases:

Document

A document database (also known as a document-oriented database or a document store) is a database that stores information in documents. They are general-purpose databases that serve a variety of use cases for both transactional and analytical applications.

Advantages

Intuitive and flexible
Easy horizontal scaling
Schemaless

Disadvantages

Schemaless
Non-relational

Examples

Key-value

One of the simplest types of NoSQL databases, key-value databases save data as a group of key-value pairs made up of two data items each. They’re also sometimes referred to as a key-value store.

Advantages

Simple and performant
Highly scalable for high volumes of traffic
Session management
Optimized lookups

Disadvantages

Basic CRUD
Values can’t be filtered
Lacks indexing and scanning capabilities
Not optimized for complex queries

Examples

Graph

A graph database is a NoSQL database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data instead of tables or documents.

The graph relates the data items in the store to a collection of nodes and edges, the edges representing the relationships between the nodes. The relationships allow data in the store to be linked together directly and, in many cases, retrieved with one operation.

Advantages

Query speed
Agile and flexible
Explicit data representation

Disadvantages

Complex
No standardized query language

Use cases

Fraud detection
Recommendation engines
Social networks
Network mapping

Examples

Time Series

A time-series database is a database optimized for time-stamped, or time series, data.

Advantages

Fast insertion and retrieval
Efficient data storage

Use cases

IoT data
Metrics analysis
Application monitoring
Understand financial trends

Examples

Wide Column

Wide column databases, also known as wide column stores, are schema-agnostic. Data is stored in column families, rather than in rows and columns.

Advantages

Highly scalable, can handle petabytes of data
Ideal for real-time big data applications

Disadvantages

Expensive
Increased write time

Use cases

Business analytics
Attribute-based data storage

Examples

Multi-model

Multi-model databases combine different database models (i.e. relational, graph, key-value, document, etc.) into a single, integrated backend. This means they can accommodate various data types, indexes, queries, and store data in more than one model.

Advantages

Flexibility
Suitable for complex projects
Data consistent

Disadvantages

Complex
Less mature

Examples

SQL Vs NoSQL Databases

In the world of databases, there are two main types of solutions, SQL (relational) and NoSQL (non-relational) databases. Both of them differ in the way they were built, the kind of information they store, and how they store it. Relational databases are structured and have predefined schemas while non-relational databases are unstructured, distributed, and have a dynamic schema.

High-level Differences

Here are some high-level differences between SQL and NoSQL:

Storage

SQL stores data in tables, where each row represents an entity and each column represents a data point about that entity.

NoSQL databases have different data storage models such as key-value, graph, document, etc.

Schema

In SQL, each record conforms to a fixed schema, meaning the columns must be decided and chosen before data entry and each row must have data for each column. The schema can be altered later, but it involves modifying the database using migrations.

Whereas in NoSQL, schemas are dynamic. Fields can be added on the fly, and each record (or equivalent) doesn’t have to contain data for each field.

Querying

SQL databases use SQL (structured query language) for defining and manipulating the data, which is very powerful.

In a NoSQL database, queries are focused on a collection of documents. Different databases have different syntax for querying.

Scalability

In most common situations, SQL databases are vertically scalable, which can get very expensive. It is possible to scale a relational database across multiple servers, but this is a challenging and time-consuming process.

On the other hand, NoSQL databases are horizontally scalable, meaning we can add more servers easily to our NoSQL database infrastructure to handle large traffic. Any cheap commodity hardware or cloud instances can host NoSQL databases, thus making it a lot more cost-effective than vertical scaling. A lot of NoSQL technologies also distribute data across servers automatically.

Reliability

The vast majority of relational databases are ACID compliant. So, when it comes to data reliability and a safe guarantee of performing transactions, SQL databases are still the better bet.

Most of the NoSQL solutions sacrifice ACID compliance for performance and scalability.

Reasons

As always we should always pick the technology that fits the requirements better. So, let’s look at some reasons for picking SQL or NoSQL based database:

For SQL

Structured data with strict schema
Relational data
Need for complex joins
Transactions
Lookups by index are very fast

For NoSQL

Dynamic or flexible schema
Non-relational data
No need for complex joins
Very data-intensive workload
Very high throughput for IOPS

Database Replication

Replication is a process that involves sharing information to ensure consistency between redundant resources such as multiple databases, to improve reliability, fault-tolerance, or accessibility.

Master-Slave Replication

The master serves reads and writes, replicating writes to one or more slaves, which serve only reads. Slaves can also replicate additional slaves in a tree-like fashion. If the master goes offline, the system can continue to operate in read-only mode until a slave is promoted to a master or a new master is provisioned.

Advantages

Backups of the entire database of relatively no impact on the master.
Applications can read from the slave(s) without impacting the master.
Slaves can be taken offline and synced back to the master without any downtime.

Disadvantages

Replication adds more hardware and additional complexity.
Downtime and possibly loss of data when a master fails.
All writes also have to be made to the master in a master-slave architecture.
The more read slaves, the more we have to replicate, which will increase replication lag.

Master-Master Replication

Both masters serve reads/writes and coordinate with each other. If either master goes down, the system can continue to operate with both reads and writes.

Advantages

Applications can read from both masters.
Distributes write load across both master nodes.
Simple, automatic, and quick failover.

Disadvantages

Not as simple as master-slave to configure and deploy.
Either loosely consistent or have increased write latency due to synchronization.
Conflict resolution comes into play as more write nodes are added and as latency increases.

Synchronous Vs Asynchronous Replication

The primary difference between synchronous and asynchronous replication is how the data is written to the replica. In synchronous replication, data is written to primary storage and the replica simultaneously. As such, the primary copy and the replica should always remain synchronized.

In contrast, asynchronous replication copies the data to the replica after the data is already written to the primary storage. Although the replication process may occur in near-real-time, it is more common for replication to occur on a scheduled basis and it is more cost-effective.

Indexes

Indexes are well known when it comes to databases, they are used to improve the speed of data retrieval operations on the data store. An index makes the trade-offs of increased storage overhead, and slower writes (since we not only have to write the data but also have to update the index) for the benefit of faster reads. Indexes are used to quickly locate data without having to examine every row in a database table. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access to ordered records.

An index is a data structure that can be perceived as a table of contents that points us to the location where actual data lives. So when we create an index on a column of a table, we store that column and a pointer to the whole row in the index. Indexes are also used to create different views of the same data. For large data sets, this is an excellent way to specify different filters or sorting schemes without resorting to creating multiple additional copies of the data.

One quality that database indexes can have is that they can be dense or sparse. Each of these index qualities comes with its own trade-offs. Let’s look at how each index type would work:

Dense Index

In a dense index, an index record is created for every row of the table. Records can be located directly as each record of the index holds the search key value and the pointer to the actual record.

Dense indexes require more maintenance than sparse indexes at write-time. Since every row must have an entry, the database must maintain the index on inserts, updates, and deletes. Having an entry for every row also means that dense indexes will require more memory. The benefit of a dense index is that values can be quickly found with just a binary search. Dense indexes also do not impose any ordering requirements on the data.

Sparse Index

In a sparse index, index records are created only for some of the records.

Sparse indexes require less maintenance than dense indexes at write-time since they only contain a subset of the values. This lighter maintenance burden means that inserts, updates, and deletes will be faster. Having fewer entries also means that the index will use less memory. Finding data is slower since a scan across the page typically follows the binary search. Sparse indexes are also optional when working with ordered data.

Normalization and Denormalization

Terms

Before we go any further, let’s look at some commonly used terms in normalization and denormalization.

Keys

Primary key: Column or group of columns that can be used to uniquely identify every row of the table.

Composite key: A primary key made up of multiple columns.

Super key: Set of all keys that can uniquely identify all the rows present in a table.

Candidate key: Attributes that identify rows uniquely in a table.

Foreign key: It is a reference to a primary key of another table.

Alternate key: Keys that are not primary keys are known as alternate keys.

Surrogate key: A system-generated value that uniquely identifies each entry in a table when no other column was able to hold properties of a primary key.

Dependencies

Partial dependency: Occurs when the primary key determines some other attributes.

Functional dependency: It is a relationship that exists between two attributes, typically between the primary key and non-key attribute within a table.

Transitive functional dependency: Occurs when some non-key attribute determines some other attribute.

Anomalies

Database anomaly happens when there is a flaw in the database due to incorrect planning or storing everything in a flat database. This is generally addressed by the process of normalization.

There are three types of database anomalies:

Insertion anomaly: Occurs when we are not able to insert certain attributes in the database without the presence of other attributes.

Update anomaly: Occurs in case of data redundancy and partial update. In other words, a correct update of the database needs other actions such as addition, deletion, or both.

Deletion anomaly: Occurs where deletion of some data requires deletion of other data.

Example

Let’s consider the following table which is not normalized:

ID	Name	Role	Team
1	Peter	Software Engineer	A
2	Brian	DevOps Engineer	B
3	Hailey	Product Manager	C
4	Hailey	Product Manager	C
5	Steve	Frontend Engineer	D

Let’s imagine, we hired a new person “John” but they might not be assigned a team immediately. This will cause an insertion anomaly as the team attribute is not yet present.

Next, let’s say Hailey from Team C got promoted, to reflect that change in the database, we will need to update 2 rows to maintain consistency which can cause an update anomaly.

Finally, we would like to remove Team B but to do that we will also need to remove additional information such as name and role, this is an example of a deletion anomaly.

Normalization

Normalization is the process of organizing data in a database. This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.

Why Do We Need Normalization?

The goal of normalization is to eliminate redundant data and ensure data is consistent. A fully normalized database allows its structure to be extended to accommodate new types of data without changing the existing structure too much. As a result, applications interacting with the database are minimally affected.

Normal Forms

Normal forms are a series of guidelines to ensure that the database is normalized. Let’s discuss some essential normal forms:

1NF

For a table to be in the first normal form (1NF), it should follow the following rules:

Repeating groups are not permitted.
Identify each set of related data with a primary key.
Set of related data should have a separate table.
Mixing data types in the same column is not permitted.

2NF

For a table to be in the second normal form (2NF), it should follow the following rules:

Satisfies the first normal form (1NF).
Should not have any partial dependency.

3NF

For a table to be in the third normal form (3NF), it should follow the following rules:

Satisfies the second normal form (2NF).
Transitive functional dependencies are not permitted.

BCNF

Boyce-Codd normal form (or BCNF) is a slightly stronger version of the third normal form (3NF) used to address certain types of anomalies not dealt with by 3NF as originally defined. Sometimes it is also known as the 3.5 normal form (3.5NF).

For a table to be in the Boyce-Codd normal form (BCNF), it should follow the following rules:

Satisfied the third normal form (3NF).
For every functional dependency X → Y, X should be the super key.

There are more normal forms such as 4NF, 5NF, and 6NF but we won’t discuss them here. Check out this amazing video that goes into detail.

In a relational database, a relation is often described as “normalized” if it meets the third normal form. Most 3NF relations are free of insertion, update, and deletion anomalies.

As with many formal rules and specifications, real-world scenarios do not always allow for perfect compliance. If you decide to violate one of the first three rules of normalization, make sure that your application anticipates any problems that could occur, such as redundant data and inconsistent dependencies.

Advantages

Here are some advantages of normalization:

Reduces data redundancy.
Better data design.
Increases data consistency.
Enforces referential integrity.

Disadvantages

Let’s look at some disadvantages of normalization:

Data design is complex.
Slower performance.
Maintenance overhead.
Require more joins.

Denormalization

Denormalization is a database optimization technique in which we add redundant data to one or more tables. This can help us avoid costly joins in a relational database. It attempts to improve read performance at the expense of some write performance. Redundant copies of the data are written in multiple tables to avoid expensive joins.

Once data becomes distributed with techniques such as federation and sharding, managing joins across the network further increases complexity. Denormalization might circumvent the need for such complex joins.

Note: Denormalization does not mean reversing normalization.

Advantages

Let’s look at some advantages of denormalization:

Retrieving data is faster.
Writing queries is easier.
Reduction in number of tables.
Convenient to manage.

Disadvantages

Below are some disadvantages of denormalization:

Expensive inserts and updates.
Increases complexity of database design.
Increases data redundancy.
More chances of data inconsistency.

ACID and BASE Consistency Models

Let’s discuss the ACID and BASE consistency models.

ACID

The term ACID stands for Atomicity, Consistency, Isolation, and Durability. ACID properties are used for maintaining data integrity during transaction processing.

In order to maintain consistency before and after a transaction relational databases follow ACID properties. Let us understand these terms:

Atomic

All operations in a transaction succeed or every operation is rolled back.

Consistent

On the completion of a transaction, the database is structurally sound.

Isolated

Transactions do not contend with one another. Contentious access to data is moderated by the database so that transactions appear to run sequentially.

Durable

Once the transaction has been completed and the writes and updates have been written to the disk, it will remain in the system even if a system failure occurs.

BASE

With the increasing amount of data and high availability requirements, the approach to database design has also changed dramatically. To increase the ability to scale and at the same time be highly available, we move the logic from the database to separate servers. In this way, the database becomes more independent and focused on the actual process of storing data.

In the NoSQL database world, ACID transactions are less common as some databases have loosened the requirements for immediate consistency, data freshness, and accuracy in order to gain other benefits, like scale and resilience.

BASE properties are much looser than ACID guarantees, but there isn’t a direct one-for-one mapping between the two consistency models. Let us understand these terms:

Basic Availability

The database appears to work most of the time.

Soft-state

Stores don’t have to be write-consistent, nor do different replicas have to be mutually consistent all the time.

Eventual Consistency

The data might not be consistent immediately but eventually, it becomes consistent. Reads in the system are still possible even though they may not give the correct response due to inconsistency.

ACID Vs BASE Trade-offs

There’s no right answer to whether our application needs an ACID or a BASE consistency model. Both the models have been designed to satisfy different requirements. While choosing a database we need to keep the properties of both the models and the requirements of our application in mind.

Given BASE’s loose consistency, developers need to be more knowledgeable and rigorous about consistent data if they choose a BASE store for their application. It’s essential to be familiar with the BASE behavior of the chosen database and work within those constraints.

On the other hand, planning around BASE limitations can sometimes be a major disadvantage when compared to the simplicity of ACID transactions. A fully ACID database is the perfect fit for use cases where data reliability and consistency are essential.

CAP Theorem

CAP theorem states that a distributed system can deliver only two of the three desired characteristics Consistency, Availability, and Partition tolerance (CAP).

Let’s take a detailed look at the three distributed system characteristics to which the CAP theorem refers.

Consistency

Consistency means that all clients see the same data at the same time, no matter which node they connect to. For this to happen, whenever data is written to one node, it must be instantly forwarded or replicated across all the nodes in the system before the write is deemed “successful”.

Availability

Availability means that any client making a request for data gets a response, even if one or more nodes are down.

Partition Tolerance

Partition tolerance means the system continues to work despite message loss or partial failure. A system that is partition-tolerant can sustain any amount of network failure that doesn’t result in a failure of the entire network. Data is sufficiently replicated across combinations of nodes and networks to keep the system up through intermittent outages.

Consistency-Availability Tradeoff

We live in a physical world and can’t guarantee the stability of a network, so distributed databases must choose Partition Tolerance (P). This implies a tradeoff between Consistency (C) and Availability (A).

CA Database

A CA database delivers consistency and availability across all nodes. It can’t do this if there is a partition between any two nodes in the system, and therefore can’t deliver fault tolerance.

Example: PostgreSQL, MariaDB.

CP Database

A CP database delivers consistency and partition tolerance at the expense of availability. When a partition occurs between any two nodes, the system has to shut down the non-consistent node until the partition is resolved.

Example: MongoDB, Apache HBase.

AP Database

An AP database delivers availability and partition tolerance at the expense of consistency. When a partition occurs, all nodes remain available but those at the wrong end of a partition might return an older version of data than others. When the partition is resolved, the AP databases typically re-syncs the nodes to repair all inconsistencies in the system.

Example: Apache Cassandra, CouchDB.

PACELC Theorem

The PACELC theorem is an extension of the CAP theorem. The CAP theorem states that in the case of network partitioning (P) in a distributed system, one has to choose between Availability (A) and Consistency (C).

PACELC extends the CAP theorem by introducing latency (L) as an additional attribute of a distributed system. The theorem states that else (E), even when the system is running normally in the absence of partitions, one has to choose between latency (L) and consistency (C).

The PACELC theorem was first described by Daniel J. Abadi.

PACELC theorem was developed to address a key limitation of the CAP theorem as it makes no provision for performance or latency.

For example, according to the CAP theorem, a database can be considered available if a query returns a response after 30 days. Obviously, such latency would be unacceptable for any real-world application.

Transactions

A transaction is a series of database operations that are considered to be a “single unit of work”. The operations in a transaction either all succeed, or they all fail. In this way, the notion of a transaction supports data integrity when part of a system fails. Not all databases choose to support ACID transactions, usually because they are prioritizing other optimizations that are hard or theoretically impossible to implement together.

Usually, relational databases support ACID transactions, and non-relational databases don’t (there are exceptions).

States

A transaction in a database can be in one of the following states:

Active

In this state, the transaction is being executed. This is the initial state of every transaction.

Partially Committed

When a transaction executes its final operation, it is said to be in a partially committed state.

Committed

If a transaction executes all its operations successfully, it is said to be committed. All its effects are now permanently established on the database system.

Failed

The transaction is said to be in a failed state if any of the checks made by the database recovery system fails. A failed transaction can no longer proceed further.

Aborted

If any of the checks fail and the transaction has reached a failed state, then the recovery manager rolls back all its write operations on the database to bring the database back to its original state where it was prior to the execution of the transaction. Transactions in this state are aborted.

The database recovery module can select one of the two operations after a transaction aborts:

Restart the transaction
Kill the transaction

Terminated

If there isn’t any roll-back or the transaction comes from the committed state, then the system is consistent and ready for a new transaction and the old transaction is terminated.

Distributed Transactions

A distributed transaction is a set of operations on data that is performed across two or more databases. It is typically coordinated across separate nodes connected by a network, but may also span multiple databases on a single server.

Why Do We Need Distributed Transactions?

Unlike an ACID transaction on a single database, a distributed transaction involves altering data on multiple databases. Consequently, distributed transaction processing is more complicated, because the database must coordinate the committing or rollback of the changes in a transaction as a self-contained unit.

In other words, all the nodes must commit, or all must abort and the entire transaction rolls back. This is why we need distributed transactions.

Now, let’s look at some popular solutions for distributed transactions:

Two-Phase Commit

The two-phase commit (2PC) protocol is a distributed algorithm that coordinates all the processes that participate in a distributed transaction on whether to commit or abort (roll back) the transaction.

This protocol achieves its goal even in many cases of temporary system failure and is thus widely used. However, it is not resilient to all possible failure configurations, and in rare cases, manual intervention is needed to remedy an outcome.

This protocol requires a coordinator node, which basically coordinates and oversees the transaction across different nodes. The coordinator tries to establish the consensus among a set of processes in two phases, hence the name.

Phases

Two-phase commit consists of the following phases:

Prepare phase

The prepare phase involves the coordinator node collecting consensus from each of the participant nodes. The transaction will be aborted unless each of the nodes responds that they’re prepared.

Commit phase

If all participants respond to the coordinator that they are prepared, then the coordinator asks all the nodes to commit the transaction. If a failure occurs, the transaction will be rolled back.

Problems

Following problems may arise in the two-phase commit protocol:

What if one of the nodes crashes?
What if the coordinator itself crashes?
It is a blocking protocol.

Three-phase Commit

Three-phase commit (3PC) is an extension of the two-phase commit where the commit phase is split into two phases. This helps with the blocking problem that occurs in the two-phase commit protocol.

Phases

Three-phase commit consists of the following phases:

Prepare phase

This phase is the same as the two-phase commit.

Pre-commit phase

Coordinator issues the pre-commit message and all the participating nodes must acknowledge it. If a participant fails to receive this message in time, then the transaction is aborted.

Commit phase

This step is also similar to the two-phase commit protocol.

Why is the Pre-commit Phase Helpful?

The pre-commit phase accomplishes the following:

If the participant nodes are found in this phase, that means that every participant has completed the first phase. The completion of prepare phase is guaranteed.
Every phase can now time out and avoid indefinite waits.

Sagas

A saga is a sequence of local transactions. Each local transaction updates the database and publishes a message or event to trigger the next local transaction in the saga. If a local transaction fails because it violates a business rule then the saga executes a series of compensating transactions that undo the changes that were made by the preceding local transactions.

Coordination

There are two common implementation approaches:

Choreography: Each local transaction publishes domain events that trigger local transactions in other services.
Orchestration: An orchestrator tells the participants what local transactions to execute.

Problems

The Saga pattern is particularly hard to debug.
There’s a risk of cyclic dependency between saga participants.
Lack of participant data isolation imposes durability challenges.
Testing is difficult because all services must be running to simulate a transaction.

Sharding

Before we discuss sharding, let’s talk about data partitioning:

Data Partitioning

Data partitioning is a technique to break up a database into many smaller parts. It is the process of splitting up a database or a table across multiple machines to improve the manageability, performance, and availability of a database.

Methods

There are many different ways one could use to decide how to break up an application database into multiple smaller DBs. Below are two of the most popular methods used by various large-scale applications:

Horizontal Partitioning (or Sharding)

In this strategy, we split the table data horizontally based on the range of values defined by the partition key. It is also referred to as database sharding.

Vertical Partitioning

In vertical partitioning, we partition the data vertically based on columns. We divide tables into relatively smaller tables with few elements, and each part is present in a separate partition.

In this tutorial, we will specifically focus on sharding.

What is Sharding?

Sharding is a database architecture pattern related to horizontal partitioning, which is the practice of separating one table’s rows into multiple different tables, known as partitions or shards. Each partition has the same schema and columns, but also a subset of the shared data. Likewise, the data held in each is unique and independent of the data held in other partitions.

The justification for data sharding is that, after a certain point, it is cheaper and more feasible to scale horizontally by adding more machines than to scale it vertically by adding powerful servers. Sharding can be implemented at both application or the database level.

Partitioning Criteria

There are a large number of criteria available for data partitioning. Some most commonly used criteria are:

Hash-Based

This strategy divides the rows into different partitions based on a hashing algorithm rather than grouping database rows based on continuous indexes.

The disadvantage of this method is that dynamically adding/removing database servers becomes expensive.

List-Based

In list-based partitioning, each partition is defined and selected based on the list of values on a column rather than a set of contiguous ranges of values.

Range Based

Range partitioning maps data to various partitions based on ranges of values of the partitioning key. In other words, we partition the table in such a way that each partition contains rows within a given range defined by the partition key.

Ranges should be contiguous but not overlapping, where each range specifies a non-inclusive lower and upper bound for a partition. Any partitioning key values equal to or higher than the upper bound of the range are added to the next partition.

Composite

As the name suggests, composite partitioning partitions the data based on two or more partitioning techniques. Here we first partition the data using one technique, and then each partition is further subdivided into sub-partitions using the same or some other method.

Advantages

But why do we need sharding? Here are some advantages:

Availability: Provides logical independence to the partitioned database, ensuring the high availability of our application. Here individual partitions can be managed independently.
Scalability: Proves to increase scalability by distributing the data across multiple partitions.
Security: Helps improve the system’s security by storing sensitive and non-sensitive data in different partitions. This could provide better manageability and security to sensitive data.
Query Performance: Improves the performance of the system. Instead of querying the whole database, now the system has to query only a smaller partition.
Data Manageability: Divides tables and indexes into smaller and more manageable units.

Disadvantages

Complexity: Sharding increases the complexity of the system in general.
Joins across shards: Once a database is partitioned and spread across multiple machines it is often not feasible to perform joins that span multiple database shards. Such joins will not be performance efficient since data has to be retrieved from multiple servers.
Rebalancing: If the data distribution is not uniform or there is a lot of load on a single shard, in such cases, we have to rebalance our shards so that the requests are as equally distributed among the shards as possible.

When to Use Sharding?

Here are some reasons why sharding might be the right choice:

Leveraging existing hardware instead of high-end machines.
Maintain data in distinct geographic regions.
Quickly scale by adding more shards.
Better performance as each machine is under less load.
When more concurrent connections are required.

Consistent Hashing

Let’s first understand the problem we’re trying to solve.

Why Do We Need This?

In traditional hashing-based distribution methods, we use a hash function to hash our partition keys (i.e. request ID or IP). Then if we use the modulo against the total number of nodes (server or databases). This will give us the node where we want to route our request.

\[ \begin{aligned} & Hash(key_1) \to H_1 \bmod N = Node_0 \\ & Hash(key_2) \to H_2 \bmod N = Node_1 \\ & Hash(key_3) \to H_3 \bmod N = Node_2 \\ & ... \\ & Hash(key_n) \to H_n \bmod N = Node_{n-1} \end{aligned} \]

Where,

key: Request ID or IP.

H: Hash function result.

N: Total number of nodes.

Node: The node where the request will be routed.

The problem with this is if we add or remove a node, it will cause N to change, meaning our mapping strategy will break as the same requests will now map to a different server. As a consequence, the majority of requests will need to be redistributed which is very inefficient.

We want to uniformly distribute requests among different nodes such that we should be able to add or remove nodes with minimal effort. Hence, we need a distribution scheme that does not depend directly on the number of nodes (or servers), so that, when adding or removing nodes, the number of keys that need to be relocated is minimized.

Consistent hashing solves this horizontal scalability problem by ensuring that every time we scale up or down, we do not have to re-arrange all the keys or touch all the servers.

Now that we understand the problem, let’s discuss consistent hashing in detail.

How Does it Work

Consistent Hashing is a distributed hashing scheme that operates independently of the number of nodes in a distributed hash table by assigning them a position on an abstract circle, or hash ring. This allows servers and objects to scale without affecting the overall system.

Using consistent hashing, only K/N data would require re-distributing.

\[ R = K/N \]

Where,

R: Data that would require re-distribution.

K: Number of partition keys.

N: Number of nodes.

The output of the hash function is a range let’s say 0…m-1 which we can represent on our hash ring. We hash the requests and distribute them on the ring depending on what the output was. Similarly, we also hash the node and distribute them on the same ring as well.

\[ \begin{aligned} & Hash(key_1) = P_1 \\ & Hash(key_2) = P_2 \\ & Hash(key_3) = P_3 \\ & ... \\ & Hash(key_n) = P_{m-1} \end{aligned} \]

Where,

key: Request/Node ID or IP.

P: Position on the hash ring.

m: Total range of the hash ring.

Now, when the request comes in we can simply route it to the closest node in a clockwise (can be counterclockwise as well) manner. This means that if a new node is added or removed, we can use the nearest node and only a fraction of the requests need to be re-routed.

In theory, consistent hashing should distribute the load evenly however it doesn’t happen in practice. Usually, the load distribution is uneven and one server may end up handling the majority of the request becoming a hotspot, essentially a bottleneck for the system. We can fix this by adding extra nodes but that can be expensive.

Let’s see how we can address these issues.

Virtual Nodes

In order to ensure a more evenly distributed load, we can introduce the idea of a virtual node, sometimes also referred to as a VNode.

Instead of assigning a single position to a node, the hash range is divided into multiple smaller ranges, and each physical node is assigned several of these smaller ranges. Each of these subranges is considered a VNode. Hence, virtual nodes are basically existing physical nodes mapped multiple times across the hash ring to minimize changes to a node’s assigned range.

For this, we can use k number of hash functions.

\[ \begin{aligned} & Hash_1(key_1) = P_1 \\ & Hash_2(key_2) = P_2 \\ & Hash_3(key_3) = P_3 \\ & . . . \\ & Hash_k(key_n) = P_{m-1} \end{aligned} \]

Where,

key: Request/Node ID or IP.

k: Number of hash functions.

P: Position on the hash ring.

m: Total range of the hash ring.

As VNodes help spread the load more evenly across the physical nodes on the cluster by diving the hash ranges into smaller subranges, this speeds up the re-balancing process after adding or removing nodes. This also helps us reduce the probability of hotspots.

Data Replication

To ensure high availability and durability, consistent hashing replicates each data item on multiple N nodes in the system where the value N is equivalent to the replication factor.

The replication factor is the number of nodes that will receive the copy of the same data. In eventually consistent systems, this is done asynchronously.

Advantages

Let’s look at some advantages of consistent hashing:

Makes rapid scaling up and down more predictable.
Facilitates partitioning and replication across nodes.
Enables scalability and availability.
Reduces hotspots.

Disadvantages

Below are some disadvantages of consistent hashing:

Increases complexity.
Cascading failures.
Load distribution can still be uneven.
Key management can be expensive when nodes transiently fail.

Examples

Let’s look at some examples where consistent hashing is used:

Data partitioning in Apache Cassandra.
Load distribution across multiple storage hosts in Amazon DynamoDB.

Database Federation

Federation (or functional partitioning) splits up databases by function. The federation architecture makes several distinct physical databases appear as one logical database to end-users.

All of the components in a federation are tied together by one or more federal schemas that express the commonality of data throughout the federation. These federated schemas are used to specify the information that can be shared by the federation components and to provide a common basis for communication among them.

Federation also provides a cohesive, unified view of data derived from multiple sources. The data sources for federated systems can include databases and various other forms of structured and unstructured data.

Characteristics

Let’s look at some key characteristics of a federated database:

Transparency: Federated database masks user differences and implementations of underlying data sources. Therefore, the users do not need to be aware of where the data is stored.
Heterogeneity: Data sources can differ in many ways. A federated database system can handle different hardware, network protocols, data models, etc.
Extensibility: New sources may be needed to meet the changing needs of the business. A good federated database system needs to make it easy to add new sources.
Autonomy: A Federated database does not change existing data sources, interfaces should remain the same.
Data integration: A federated database can integrate data from different protocols, database management systems, etc.

Advantages

Here are some advantages of federated databases:

Flexible data sharing.
Autonomy among the database components.
Access heterogeneous data in a unified way.
No tight coupling of applications with legacy databases.

Disadvantages

Below are some disadvantages of federated databases:

Adds more hardware and additional complexity.
Joining data from two databases is complex.
Dependence on autonomous data sources.
Query performance and scalability.