System design is a very extensive topic, and system design interviews are designed to evaluate your ability to produce technical solutions to abstract problems; as such, they are not designed around a single correct answer. The unique aspect of system design interviews is the two-way conversation between the candidate and the interviewer.
Expectations are quite different at different engineering levels as well, because someone with a lot of practical experience will approach the problem quite differently from someone who's new to the industry. As a result, it's hard to come up with a single strategy that will help us stay organized during the interview.
Let’s look at some common strategies for system design interviews:
System design interview questions, by nature, are vague or abstract. Asking questions about the exact scope of the problem, and clarifying functional requirements early in the interview is essential. Usually, requirements are divided into three parts:
These are the requirements that the end user specifically demands as basic functionalities that the system should offer. All these functionalities need to be necessarily incorporated into the system as part of the contract.
For example:
These are the quality constraints that the system must satisfy according to the project contract. The priority or extent to which these factors are implemented varies from one project to another. They are also called non-behavioral requirements. For example, portability, maintainability, reliability, scalability, security, etc.
For example:
These are basically “nice to have” requirements that might be out of the scope of the system.
For example:
Estimate the scale of the system we’re going to design. It is important to ask questions such as:
These questions will help us scale our design later.
Once we have the estimations, we can start with defining the database schema. Doing so in the early stages of the interview helps us understand the data flow, which is the core of every system. In this step, we basically define all the entities and the relationships between them.
Next, we can start designing APIs for the system. These APIs will help us define the expectations from the system explicitly. We don’t have to write any code, just a simple interface defining the API requirements such as parameters, functions, classes, types, entities, etc.
For example:
createUser(name: string, email: string): User
It is advised to keep the interface as simple as possible and come back to it later when covering extended requirements.
Now that we have established our data model and API design, it's time to identify the system components (such as Load Balancers, API Gateway, etc.) that are needed to solve our problem and draft the first design of our system.
Once we have a basic diagram, we can start discussing with the interviewer how the system will work from the client’s perspective.
Now it's time to go into detail about the major components of the system we designed. As always, discuss with the interviewer which components may need further improvement.
Here is a good opportunity to demonstrate your experience in the areas of your expertise. Present different approaches, advantages, and disadvantages. Explain your design decisions, and back them up with examples. This is also a good time to discuss any additional features the system might be able to support, though this is optional.
Also, try not to be too opinionated about certain technologies, statements like “I believe that NoSQL databases are just better, SQL databases are not scalable” reflect poorly. As someone who has interviewed a lot of people over the years, my two cents here would be to be humble about what you know and what you do not. Use your existing knowledge with examples to navigate this part of the interview.
Finally, it’s time to discuss bottlenecks and approaches to mitigate them. Here are some important questions to ask:
Make sure to read the engineering blog of the company you’re interviewing with. This will help you get a sense of what technology stack they’re using and which problems are important to them.
Let’s design a URL shortener, similar to services like Bitly, TinyURL.
A URL shortener service creates an alias or a short URL for a long URL. Users are redirected to the original URL when they visit these short links.
For example, the following long URL can be changed to a shorter URL.
Long URL: https://karanpratapsingh.com/courses/system-design/url-shortener
Short URL: https://bit.ly/3I71d3o
A URL shortener saves space when sharing URLs, and users are also less likely to mistype shorter URLs. Moreover, we can optimize links across devices, which allows us to track individual links.
Our URL shortening system should meet the following requirements:
Let’s start with the estimation and constraints.
Note: Make sure to check any scale or traffic related assumptions with your interviewer.
This will be a read-heavy system, so let's assume a 100:1 read/write ratio with 100 million links generated per month.
Reads/Writes Per month
For reads per month:
\[ 100 \times 100 \space million = 10 \space billion/month \]
Similarly for writes:
\[ 1 \times 100 \space million = 100 \space million/month \]
What would be Requests Per Second (RPS) for our system?
100 million requests per month translate into 40 requests per second.
\[ \frac{100 \space million}{(30 \space days \times 24 \space hrs \times 3600 \space seconds)} = \sim 40 \space URLs/second \]
And with a 100:1 read/write ratio, the number of redirections will be:
\[ 100 \times 40 \space URLs/second = 4000 \space requests/second \]
Since we expect about 40 URLs every second, if we assume each request is 500 bytes, then the total incoming data for write requests would be:
\[ 40 \times 500 \space bytes = 20 \space KB/second \]
Similarly, for the read requests, since we expect about 4K redirections, the total outgoing data would be:
\[ 4000 \space URLs/second \times 500 \space bytes = \sim 2 \space MB/second \]
For storage, we will assume we store each link or record in our database for 10 years. Since we expect around 100M new requests every month, the total number of records we will need to store would be:
\[ 100 \space million \times 10\space years \times 12 \space months = 12 \space billion \]
As earlier, if we assume each stored record will be approximately 500 bytes, we will need around 6 TB of storage:
\[ 12 \space billion \times 500 \space bytes = 6 \space TB \]
For caching, we will follow the classic Pareto principle also known as the 80/20 rule. This means that 80% of the requests are for 20% of the data, so we can cache around 20% of our requests.
Since we get around 4K read or redirection requests each second, this translates into 350M requests per day.
\[ 4000 \space URLs/second \times 24 \space hours \times 3600 \space seconds = \sim 350 \space million \space requests/day \]
Hence, we will need around 35GB of memory per day.
\[ 20 \space percent \times 350 \space million \times 500 \space bytes = 35 \space GB/day \]
Here is our high-level estimate:
Type | Estimate |
---|---|
Writes (New URLs) | 40/s |
Reads (Redirection) | 4K/s |
Bandwidth (Incoming) | 20 KB/s |
Bandwidth (Outgoing) | 2 MB/s |
Storage (10 years) | 6 TB |
Memory (Caching) | ~35 GB/day |
Next, we will focus on the data model design. Here is our database schema:
Initially, we can get started with just two tables:
users

Stores the user's details such as name, email, createdAt, etc.

urls

Contains the new short URL's properties such as expiration, hash, originalURL, and the userID of the user who created the short URL. We can also use the hash column as an index to improve query performance.
Since the data is not strongly relational, NoSQL databases such as Amazon DynamoDB, Apache Cassandra, or MongoDB will be a better choice here. If we do decide to use an SQL database, we can use something like Azure SQL Database or Amazon RDS.
For more details, refer to SQL vs NoSQL.
Let us do a basic API design for our services:
This API should create a new short URL in our system given an original URL.
createURL(apiKey: string, originalURL: string, expiration?: Date): string
Parameters

API Key (string): API key provided by the user.

Original URL (string): Original URL to be shortened.

Expiration (Date): Expiration date of the new URL (optional).

Returns

Short URL (string): New shortened URL.
This API should retrieve the original URL from a given short URL.
getURL(apiKey: string, shortURL: string): string
Parameters

API Key (string): API key provided by the user.

Short URL (string): Short URL mapped to the original URL.

Returns

Original URL (string): Original URL to be retrieved.
This API should delete a given shortURL from our system.
deleteURL(apiKey: string, shortURL: string): boolean
Parameters

API Key (string): API key provided by the user.

Short URL (string): Short URL to be deleted.

Returns

Result (boolean): Represents whether the operation was successful or not.
As you must’ve noticed, we’re using an API key to prevent abuse of our services. Using this API key we can limit the users to a certain number of requests per second or minute. This is quite a standard practice for developer APIs and should cover our extended requirement.
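As a rough illustration of per-key rate limiting, here is a minimal fixed-window counter sketch; the window size, request limit, and in-memory store are assumptions (in practice the counters would live in a shared store such as Redis, or the API gateway's built-in limits would be used).

```ts
// Minimal fixed-window rate limiter keyed by API key (illustrative sketch).
// In production this state would live in a shared store such as Redis.
interface Window {
  windowStart: number; // epoch ms of the current window
  count: number;       // requests seen in this window
}

const WINDOW_MS = 60_000;  // 1-minute window (assumed)
const MAX_REQUESTS = 100;  // allowed requests per window per key (assumed)

const windows = new Map<string, Window>();

function isAllowed(apiKey: string, now: number = Date.now()): boolean {
  const current = windows.get(apiKey);

  // Start a fresh window if none exists or the old one has expired.
  if (!current || now - current.windowStart >= WINDOW_MS) {
    windows.set(apiKey, { windowStart: now, count: 1 });
    return true;
  }

  if (current.count >= MAX_REQUESTS) {
    return false; // over the limit, reject the request (e.g. HTTP 429)
  }

  current.count += 1;
  return true;
}
```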
Now let us do a high-level design of our system.
Our system’s primary goal is to shorten a given URL, let’s look at different approaches:
Base62 Approach
In this approach, we can encode the original URL using Base62 which consists of the capital letters A-Z, the lower case letters a-z, and the numbers 0-9.
\[ Number \space of \space URLs = 62^N \]
Where N is the number of characters in the generated URL.
So, if we want a URL that is 7 characters long, we can generate ~3.5 trillion different URLs.
\[ \begin{gathered} 62^5 = \sim 916 \space million \space URLs \\ 62^6 = \sim 56.8 \space billion \space URLs \\ 62^7 = \sim 3.5 \space trillion \space URLs \end{gathered} \]
This is the simplest solution here, but it does not guarantee non-duplicate or collision-resistant keys.
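For illustration, here is a minimal Base62 encoder sketch; encoding an auto-incrementing numeric ID (or counter value) into the 62-character alphabet is the assumption here.

```ts
// Base62 alphabet: 0-9, A-Z, a-z (62 characters total).
const ALPHABET = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

// Encode a non-negative integer (e.g. a database ID or counter value) into Base62.
function base62Encode(num: bigint): string {
  if (num === 0n) return ALPHABET[0];

  let encoded = '';
  while (num > 0n) {
    encoded = ALPHABET[Number(num % 62n)] + encoded;
    num = num / 62n;
  }
  return encoded;
}

// Example: a counter value in the trillions still fits in a 7-character key.
console.log(base62Encode(3_000_000_000_000n).length); // 7
```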
MD5 Approach
The MD5 message-digest algorithm is a widely used hash function producing a 128-bit hash value (or 32 hexadecimal digits). We can use these 32 hexadecimal digits to generate a 7-character-long URL.
\[ MD5(original\_url) \rightarrow base62encode \rightarrow hash \]
However, this creates a new issue for us, which is duplication and collision. We can try to re-compute the hash until we find a unique one but that will increase the overhead of our systems. It’s better to look for more scalable approaches.
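Here is a minimal sketch of the MD5 approach using Node.js's built-in crypto module; keeping only the first 7 characters of the Base62-encoded digest is an assumption, and as discussed it does not eliminate collisions.

```ts
import { createHash } from 'crypto';

const ALPHABET = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

// Base62-encode a non-negative bigint.
function base62Encode(num: bigint): string {
  let out = '';
  do {
    out = ALPHABET[Number(num % 62n)] + out;
    num /= 62n;
  } while (num > 0n);
  return out;
}

// Hash the original URL with MD5, Base62-encode the 128-bit digest,
// and keep the first 7 characters as the short key.
function shortKeyFromURL(originalURL: string): string {
  const digestHex = createHash('md5').update(originalURL).digest('hex'); // 32 hex digits
  return base62Encode(BigInt('0x' + digestHex)).slice(0, 7);
}

// Identical URLs always map to the same key; different URLs can still collide,
// so a uniqueness check against the database is needed before inserting.
```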
Counter Approach
In this approach, we will start with a single server which will maintain the count of the keys generated. Once our service receives a request, it can reach out to the counter which returns a unique number and increments the counter. When the next request comes the counter again returns the unique number and this goes on.
\[ Counter(0-3.5 \space trillion) \rightarrow base62encode \rightarrow hash \]
The problem with this approach is that it can quickly become a single point of failure. And if we run multiple instances of the counter, we can get collisions, as it's essentially a distributed system.
To solve this issue we can use a distributed system manager such as Zookeeper which can provide distributed synchronization. Zookeeper can maintain multiple ranges for our servers.
\[ \begin{aligned} & Range \space 1: \space 1 \rightarrow 1,000,000 \\ & Range \space 2: \space 1,000,001 \rightarrow 2,000,000 \\ & Range \space 3: \space 2,000,001 \rightarrow 3,000,000 \\ & ... \end{aligned} \]
Once a server exhausts its range, Zookeeper will assign it a new, unused counter range. This approach can guarantee non-duplicate and collision-resistant URLs. Also, we can run multiple instances of Zookeeper to remove the single point of failure.
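To make the idea concrete, here is a minimal sketch of a range-based key allocator; the range size and the coordinator interface are assumptions standing in for what Zookeeper (or a similar coordination service) would provide.

```ts
// A coordinator (e.g. Zookeeper) hands out disjoint counter ranges to servers.
// This interface is a stand-in for that coordination service.
interface RangeCoordinator {
  nextRange(size: number): { start: number; end: number }; // [start, end) exclusive
}

// Each application server draws IDs from its local range and only talks to the
// coordinator when the range is exhausted.
class KeyAllocator {
  private current = 0;
  private end = 0;

  constructor(private coordinator: RangeCoordinator, private rangeSize = 1_000_000) {}

  nextId(): number {
    if (this.current >= this.end) {
      const range = this.coordinator.nextRange(this.rangeSize);
      this.current = range.start;
      this.end = range.end;
    }
    return this.current++; // unique across servers because ranges never overlap
  }
}

// The returned ID would then be Base62-encoded to produce the short key.
```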
As we discussed, generating a unique key at scale without duplication and collisions can be a bit of a challenge. To solve this problem, we can create a standalone Key Generation Service (KGS) that generates a unique key ahead of time and stores it in a separate database for later use. This approach can make things simple for us.
How to handle concurrent access?
Once the key is used, we can mark it in the database to make sure we don't reuse it. However, if multiple server instances read keys concurrently, two or more servers might try to use the same key.
The easiest way to solve this would be to store keys in two tables. As soon as a key is used, we move it to a separate table with appropriate locking in place. Also, to improve reads, we can keep some of the keys in memory.
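A minimal sketch of how servers could consume keys from the KGS; the KeyStore interface below is hypothetical and stands in for the two-table scheme where claiming a key atomically moves it into the "used" table.

```ts
// Hypothetical storage interface: claiming a key atomically moves it from the
// "unused" table to the "used" table inside a single transaction.
interface KeyStore {
  claimKey(): Promise<string | null>; // null when the key space is exhausted
}

class KeyGenerationService {
  private buffer: string[] = []; // small in-memory pool to speed up reads

  constructor(private store: KeyStore, private bufferSize = 1000) {}

  async getKey(): Promise<string> {
    const key = this.buffer.pop() ?? (await this.store.claimKey());
    if (!key) throw new Error('key space exhausted');
    return key;
  }

  // Refill the in-memory buffer ahead of time, e.g. from a background job.
  async refill(): Promise<void> {
    while (this.buffer.length < this.bufferSize) {
      const key = await this.store.claimKey();
      if (!key) break;
      this.buffer.push(key);
    }
  }
}
```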
KGS database estimations
As per our discussion, we can generate up to ~56.8 billion unique 6-character keys, which will require about 340 GB of storage.

\[ 6 \space characters \times 56.8 \space billion = \sim 340 \space GB \]

While 340 GB seems like a lot for this simple use case, it is important to remember this is for the entire lifetime of our service, and the keys database will not grow like our main database.
Now, let’s talk about caching. As per our estimations, we will require around ~35 GB of memory per day to cache 20% of the incoming requests to our services. For this use case, we can use Redis or Memcached servers alongside our API server.
For more details, refer to caching.
Now that we have identified some core components, let’s do the first draft of our system design.
Here’s how it works:
Creating a new URL
Accessing a URL
It’s time to discuss the finer details of our design.
To scale out our databases, we will need to partition our data. Horizontal partitioning (aka Sharding) can be a good first step. We can use partitioning schemes such as hash-based or range-based partitioning.

These approaches can still cause uneven data and load distribution; we can solve this using consistent hashing.
For more details, refer to Sharding and Consistent Hashing.
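For illustration, here is a minimal consistent hash ring sketch for routing short URL keys to database shards; the use of MD5, the number of virtual nodes, and the shard names are assumptions (in practice the datastore's own partitioner or a library would be used).

```ts
import { createHash } from 'crypto';

// Map a string to a point on the ring using the first 8 hex chars of its MD5 hash.
function hashToRing(value: string): number {
  return parseInt(createHash('md5').update(value).digest('hex').slice(0, 8), 16);
}

class ConsistentHashRing {
  private ring: { point: number; node: string }[] = [];

  // Each physical node is placed on the ring multiple times (virtual nodes)
  // to smooth out the key distribution.
  addNode(node: string, virtualNodes = 100): void {
    for (let i = 0; i < virtualNodes; i++) {
      this.ring.push({ point: hashToRing(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  // A key is owned by the first node clockwise from its hash point.
  getNode(key: string): string {
    const point = hashToRing(key);
    const entry = this.ring.find((e) => e.point >= point) ?? this.ring[0];
    return entry.node;
  }
}

// Usage: route a short URL hash to a database shard.
const ring = new ConsistentHashRing();
['shard-1', 'shard-2', 'shard-3'].forEach((s) => ring.addNode(s));
console.log(ring.getNode('3I71d3o')); // one of shard-1..3
```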
This is more of a maintenance step for our services and depends on whether we keep the expired entries or remove them. If we do decide to remove expired entries, we can approach this in two different ways:
Active cleanup
In active cleanup, we will run a separate cleanup service which will periodically remove expired links from our storage and cache. This will be a very lightweight service like a cron job.
Passive cleanup
For passive cleanup, we can remove the entry when a user tries to access an expired link. This can ensure a lazy cleanup of our database and cache.
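A minimal sketch of passive cleanup on access, assuming hypothetical repository and cache interfaces for the urls table:

```ts
// Hypothetical storage interfaces for the urls table and the cache.
interface URLRecord { hash: string; originalURL: string; expiration?: Date; }
interface URLRepository {
  findByHash(hash: string): Promise<URLRecord | null>;
  deleteByHash(hash: string): Promise<void>;
}
interface Cache {
  del(key: string): Promise<void>;
}

// On redirect, lazily remove the entry if it has expired.
async function resolveURL(hash: string, db: URLRepository, cache: Cache): Promise<string | null> {
  const record = await db.findByHash(hash);
  if (!record) return null;

  if (record.expiration && record.expiration.getTime() < Date.now()) {
    await Promise.all([db.deleteByHash(hash), cache.del(hash)]); // passive cleanup
    return null; // caller responds with HTTP 404 or 410
  }
  return record.originalURL;
}
```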
Now let us talk about caching.
Which cache eviction policy to use?
As we discussed before, we can use solutions like Redis or Memcached and cache 20% of the daily traffic but what kind of cache eviction policy would best fit our needs?
Least Recently Used (LRU) can be a good policy for our system. In this policy, we discard the least recently used key first.
How to handle cache miss?
Whenever there is a cache miss, our servers can hit the database directly and update the cache with the new entries.
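For illustration, here is a tiny in-process LRU cache sketch built on Map's insertion order; in practice, Redis can simply be configured with an LRU eviction policy (such as allkeys-lru) instead of hand-rolling this.

```ts
// Minimal LRU cache using Map's insertion order: the first key in the Map is
// always the least recently used one.
class LRUCache<K, V> {
  private entries = new Map<K, V>();

  constructor(private capacity: number) {}

  get(key: K): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert to mark the key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.entries.has(key)) this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // Evict the least recently used entry (the oldest insertion).
      const oldestKey = this.entries.keys().next().value as K;
      this.entries.delete(oldestKey);
    }
  }
}

// Usage: cache short URL -> original URL lookups.
const cache = new LRUCache<string, string>(10_000);
cache.set('3I71d3o', 'https://karanpratapsingh.com/courses/system-design/url-shortener');
```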
Recording analytics and metrics is one of our extended requirements. We can store and update metadata like visitor’s country, platform, the number of views, etc alongside the URL entry in our database.
For security, we can introduce private URLs and authorization. A separate table can be used to store user ids that have permission to access a specific URL. If a user does not have proper permissions, we can return an HTTP 401 (Unauthorized) error.
We can also use an API Gateway as they can support capabilities like authorization, rate limiting, and load balancing out of the box.
Let us identify and resolve bottlenecks such as single points of failure in our design:
To make our system more resilient we can do the following:
Let’s design a WhatsApp like instant messaging service, similar to services like Facebook Messenger, and WeChat.
WhatsApp is a chat application that provides instant messaging services to its users. It is one of the most used mobile applications on the planet, connecting over 2 billion users in 180+ countries. WhatsApp is also available on the web.
Our system should meet the following requirements:
Let’s start with the estimation and constraints.
Note: Make sure to check any scale or traffic-related assumptions with your interviewer.
Let us assume we have 50 million daily active users (DAU) and on average each user sends at least 10 messages to 4 different people every day. This gives us 2 billion messages per day.
\[ 50 \space million \times 40 \space messages = 2 \space billion/day \]
Messages can also contain media such as images, videos, or other files. We can assume that 5 percent of messages are media files shared by the users, which gives us an additional 100 million files we would need to store.
\[ 5 \space percent \times 2 \space billion = 100 \space million/day \]
What would be Requests Per Second (RPS) for our system?
2 billion requests per day translate into 24K requests per second.
\[ \frac{2 \space billion}{(24 \space hrs \times 3600 \space seconds)} = \sim 24K \space requests/second \]
If we assume each message on average is 100 bytes, we will require about 200 GB of database storage every day.
\[ 2 \space billion \times 100 \space bytes = \sim 200 \space GB/day \]
As per our requirements, we also know that around 5 percent of our daily messages (100 million) are media files. If we assume each file is 100 KB on average, we will require 10 TB of storage every day.
\[ 100 \space million \times 100 \space KB = 10 \space TB/day \]
And for 10 years, we will require about 38 PB of storage.
\[ (10 \space TB + 0.2 \space TB) \times 10 \space years \times 365 \space days = \sim 38 \space PB \]
As our system is handling 10.2 TB of ingress every day, we will require a minimum bandwidth of around 120 MB per second.
\[ \frac{10.2 \space TB}{(24 \space hrs \times 3600 \space seconds)} = \sim 120 \space MB/second \]
Here is our high-level estimate:
Type | Estimate |
---|---|
Daily active users (DAU) | 50 million |
Requests per second (RPS) | 24K/s |
Storage (per day) | ~10.2 TB |
Storage (10 years) | ~38 PB |
Bandwidth | ~120 MB/s |
This is the general data model which reflects our requirements.
We have the following tables:
users

This table will contain a user's information such as name, phoneNumber, and other details.

messages

As the name suggests, this table will store messages with properties such as type (text, image, video, etc.), content, and timestamps for message delivery. The message will also have a corresponding chatID or groupID.
chats
This table basically represents a private chat between two users and can contain multiple messages.
users_chats
This table maps users and chats as multiple users can have multiple chats (N:M relationship) and vice versa.
groups
This table represents a group made up of multiple users.
users_groups
This table maps users and groups as multiple users can be a part of multiple groups (N:M relationship) and vice versa.
While our data model seems quite relational, we don’t necessarily need to store everything in a single database, as this can limit our scalability and quickly become a bottleneck.
We will split the data between different services each having ownership over a particular table. Then we can use a relational database such as PostgreSQL or a distributed NoSQL database such as Apache Cassandra for our use case.
Let us do a basic API design for our services:
This API will get all chats or groups for a given userID.
getAll(userID: UUID): Chat[] | Group[]
Parameters

User ID (UUID): ID of the current user.

Returns

Result (Chat[] | Group[]): All the chats and groups the user is a part of.
Get all messages for a user given the channelID (chat or group ID).
getMessages(userID: UUID, channelID: UUID): Message[]
Parameters

User ID (UUID): ID of the current user.

Channel ID (UUID): ID of the channel (chat or group) from which messages need to be retrieved.

Returns

Messages (Message[]): All the messages in a given chat or group.
Send a message from a user to a channel (chat or group).
sendMessage(userID: UUID, channelID: UUID, message: Message): boolean
Parameters

User ID (UUID): ID of the current user.

Channel ID (UUID): ID of the channel (chat or group) the user wants to send a message to.

Message (Message): The message (text, image, video, etc.) that the user wants to send.

Returns

Result (boolean): Represents whether the operation was successful or not.
Allows the user to join or leave a channel (chat or group).
joinGroup(userID: UUID, channelID: UUID): boolean
leaveGroup(userID: UUID, channelID: UUID): boolean
Parameters

User ID (UUID): ID of the current user.

Channel ID (UUID): ID of the channel (chat or group) the user wants to join or leave.

Returns

Result (boolean): Represents whether the operation was successful or not.
Now let us do a high-level design of our system.
We will be using microservices architecture since it will make it easier to horizontally scale and decouple our services. Each service will have ownership of its own data model. Let’s try to divide our system into some core services.
User Service
This is an HTTP-based service that handles user-related concerns such as authentication and user information.
Chat Service
The chat service will use WebSockets to establish connections with the client to handle chat and group message-related functionality. We can also use a cache to keep track of all the active connections (similar to sessions), which will help us determine whether a user is online.
Notification Service
This service will simply send push notifications to the users. It will be discussed in detail separately.
Presence Service
The presence service will keep track of the last seen status of all users. It will be discussed in detail separately.
Media service
This service will handle the media (images, videos, files, etc.) uploads. It will be discussed in detail separately.
What about inter-service communication and service discovery?
Since our architecture is microservices-based, services will be communicating with each other as well. Generally, REST or HTTP performs well but we can further improve the performance using gRPC which is more lightweight and efficient.
Service discovery is another thing we will have to take into account. We can also use a service mesh that enables managed, observable, and secure communication between individual services.
Note: Learn more about REST, GraphQL, gRPC and how they compare with each other.
How do we efficiently send and receive messages? We have two different options:
Pull model
The client can periodically send an HTTP request to servers to check if there are any new messages. This can be achieved via something like Long polling.
Push model
The client opens a long-lived connection with the server and once new data is available it will be pushed to the client. We can use WebSockets for this.
The pull model approach is not scalable as it will create unnecessary request overhead on our servers, and most of the time the response will be empty, thus wasting our resources. To minimize latency, the push model with WebSockets is a better choice, as the server can push new messages to the client over a persistent, full-duplex connection as soon as they are available.
Note: Learn more about Long polling, WebSockets, Server-Sent Events (SSE).
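As a rough sketch of the push model, here is a minimal WebSocket server using the ws package; the per-user connection registry, the query-parameter authentication, and the message shape are assumptions.

```ts
import { WebSocketServer, WebSocket } from 'ws';

// Track active connections per user so we can push messages to online recipients.
const connections = new Map<string, WebSocket>();

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (socket, request) => {
  // Assumed: the client identifies itself via a query parameter after authentication.
  const userID = new URL(request.url ?? '/', 'http://localhost').searchParams.get('userID');
  if (!userID) {
    socket.close();
    return;
  }

  connections.set(userID, socket);
  socket.on('close', () => connections.delete(userID));
});

// Called by the chat service when a new message is persisted.
function pushMessage(recipientID: string, payload: { channelID: string; content: string }): boolean {
  const socket = connections.get(recipientID);
  if (!socket || socket.readyState !== WebSocket.OPEN) return false; // offline -> notification service
  socket.send(JSON.stringify(payload));
  return true;
}
```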
To implement the last seen functionality, we can use a heartbeat mechanism, where the client can periodically ping the servers indicating its liveness. Since this needs to be as low overhead as possible, we can store the last active timestamp in the cache as follows:
Key | Value |
---|---|
User A | 2022-07-01T14:32:50 |
User B | 2022-07-05T05:10:35 |
User C | 2022-07-10T04:33:25 |
This will give us the last time the user was active. This functionality will be handled by the presence service combined with Redis or Memcached as our cache.
Another way to implement this is to track the latest action of the user; once the last activity crosses a certain threshold, such as "user hasn't performed any action in the last 30 seconds", we can show the user as offline with the last seen set to the last recorded timestamp. This is more of a lazy update approach and might benefit us over the heartbeat mechanism in certain cases.
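A minimal sketch of heartbeat-based last seen tracking, with an in-memory map standing in for Redis/Memcached and an assumed 30-second online threshold:

```ts
// In-memory stand-in for the presence cache (Redis/Memcached in production).
const lastSeen = new Map<string, number>(); // userID -> epoch ms of last heartbeat

const ONLINE_THRESHOLD_MS = 30_000; // assumed: offline after 30s without a heartbeat

// Called whenever the client pings the presence service.
function heartbeat(userID: string, now: number = Date.now()): void {
  lastSeen.set(userID, now);
}

// Returns either "online" or the last seen timestamp for display.
function presence(userID: string, now: number = Date.now()): { online: boolean; lastSeenAt?: Date } {
  const timestamp = lastSeen.get(userID);
  if (timestamp === undefined) return { online: false };
  if (now - timestamp <= ONLINE_THRESHOLD_MS) return { online: true };
  return { online: false, lastSeenAt: new Date(timestamp) };
}
```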
Once a message is sent in a chat or a group, we will first check whether the recipient is active. We can get this information by taking the user's active connection and last seen into consideration.
If the recipient is not active, the chat service will add an event to a message queue with additional metadata such as the client’s device platform which will be used to route the notification to the correct platform later on.
The notification service will then consume the event from the message queue and forward the request to Firebase Cloud Messaging (FCM) or Apple Push Notification Service (APNS) based on the client’s device platform (Android, iOS, web, etc). We can also add support for email and SMS.
Why are we using a message queue?
Most message queues provide best-effort ordering, which ensures that messages are generally delivered in the same order as they're sent, and at-least-once delivery, which is an important part of our service's functionality.
While this seems like a classic publish-subscribe use case, it is actually not as mobile devices and browsers each have their own way of handling push notifications. Usually, notifications are handled externally via Firebase Cloud Messaging (FCM) or Apple Push Notification Service (APNS) unlike message fan-out which we commonly see in backend services. We can use something like Amazon SQS or RabbitMQ to support this functionality.
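To illustrate the flow, here is a minimal sketch of the event the chat service could publish and how the notification service might route it; the event shape and the queue/provider interfaces are assumptions standing in for SQS/RabbitMQ and the FCM/APNS clients.

```ts
// Event published by the chat service when the recipient is offline.
interface NotificationEvent {
  recipientID: string;
  platform: 'android' | 'ios' | 'web'; // used to pick FCM vs. APNS
  title: string;
  body: string;
}

// Hypothetical interfaces standing in for SQS/RabbitMQ and the FCM/APNS clients.
interface Queue {
  publish(event: NotificationEvent): Promise<void>;
  consume(handler: (event: NotificationEvent) => Promise<void>): void;
}
interface PushProvider {
  send(recipientID: string, title: string, body: string): Promise<void>;
}

// Notification service: consume events and route them to the right provider.
function startNotificationService(queue: Queue, fcm: PushProvider, apns: PushProvider): void {
  queue.consume(async (event) => {
    const provider = event.platform === 'ios' ? apns : fcm; // web/android via FCM (assumed)
    await provider.send(event.recipientID, event.title, event.body);
  });
}
```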
Handling read receipts can be tricky. For this use case, we can wait for some sort of acknowledgment (ACK) from the client to determine if the message was delivered and update the corresponding deliveredAt field. Similarly, we will mark the message as seen once the user opens the chat and update the corresponding seenAt timestamp field.
Now that we have identified some core components, let’s do the first draft of our system design.
It’s time to discuss our design decisions in detail.
To scale out our databases, we will need to partition our data. Horizontal partitioning (aka Sharding) can be a good first step. We can use partitioning schemes such as hash-based or range-based partitioning.

These approaches can still cause uneven data and load distribution; we can solve this using consistent hashing.
For more details, refer to Sharding and Consistent Hashing.
In a messaging application, we have to be careful about using a cache, as our users expect the latest data, but many users will be requesting the same messages, especially in a group chat. So, to prevent usage spikes on our resources, we can cache older messages.
Some group chats can have thousands of messages, and sending all of them over the network would be really inefficient. To improve efficiency, we can add pagination to our system APIs. This decision will be helpful for users with limited network bandwidth as they won't have to retrieve old messages unless requested.
Which cache eviction policy to use?
We can use solutions like Redis or Memcached and cache 20% of the daily traffic but what kind of cache eviction policy would best fit our needs?
Least Recently Used (LRU) can be a good policy for our system. In this policy, we discard the least recently used key first.
How to handle cache miss?
Whenever there is a cache miss, our servers can hit the database directly and update the cache with the new entries.
For more details, refer to Caching.
As we know, most of our storage space will be used for storing media files such as images, videos, or other files. Our media service will be handling both access and storage of the user media files.
But where can we store files at scale? Object storage, or a distributed file system such as GlusterFS, is a good fit here.
Fun fact: WhatsApp deletes media on its servers once it has been downloaded by the user.
We can use object stores like Amazon S3, Azure Blob Storage, or Google Cloud Storage for this use case.
Content Delivery Network (CDN) increases content availability and redundancy while reducing bandwidth costs. Generally, static files such as images, and videos are served from CDN. We can use services like Amazon CloudFront or Cloudflare CDN for this use case.
Since we will be using multiple protocols like HTTP, WebSocket, TCP/IP, deploying multiple L4 (transport layer) or L7 (application layer) type load balancers separately for each protocol will be expensive. Instead, we can use an API Gateway that supports multiple protocols without any issues.
API Gateway can also offer other features such as authentication, authorization, rate limiting, throttling, and API versioning which will improve the quality of our services.
We can use services like Amazon API Gateway or Azure API Gateway for this use case.
Let us identify and resolve bottlenecks such as single points of failure in our design:
To make our system more resilient we can do the following:
Let’s design a Twitter like social media service, similar to services like Facebook, Instagram, etc.
Twitter is a social media service where users can read or post short messages (up to 280 characters) called tweets. It is available on the web and mobile platforms such as Android and iOS.
Our system should meet the following requirements:
Let’s start with the estimation and constraints.
Note: Make sure to check any scale or traffic-related assumptions with your interviewer.
This will be a read-heavy system, let us assume we have 1 billion total users with 200 million daily active users (DAU), and on average each user tweets 5 times a day. This gives us 1 billion tweets per day.
\[ 200 \space million \times 5 \space tweets = 1 \space billion/day \]
Tweets can also contain media such as images or videos. We can assume that 10 percent of tweets are media files shared by the users, which gives us an additional 100 million files we would need to store.
\[ 10 \space percent \times 1 \space billion = 100 \space million/day \]
What would be Requests Per Second (RPS) for our system?
1 billion requests per day translate into 12K requests per second.
\[ \frac{1 \space billion}{(24 \space hrs \times 3600 \space seconds)} = \sim 12K \space requests/second \]
If we assume each tweet on average is 100 bytes, we will require about 100 GB of database storage every day.
\[ 1 \space billion \times 100 \space bytes = \sim 100 \space GB/day \]
We also know that around 10 percent of our daily tweets (100 million) are media files, per our requirements. If we assume each file is 50 KB on average, we will require 5 TB of storage every day.
\[ 100 \space million \times 50 \space KB = 5 \space TB/day \]
And for 10 years, we will require about 19 PB of storage.
\[ (5 \space TB + 0.1 \space TB) \times 365 \space days \times 10 \space years = \sim 19 \space PB \]
As our system is handling 5.1 TB of ingress every day, we will require a minimum bandwidth of around 60 MB per second.
\[ \frac{5.1 \space TB}{(24 \space hrs \times 3600 \space seconds)} = \sim 60 \space MB/second \]
Here is our high-level estimate:
Type | Estimate |
---|---|
Daily active users (DAU) | 200 million |
Requests per second (RPS) | 12K/s |
Storage (per day) | ~5.1 TB |
Storage (10 years) | ~19 PB |
Bandwidth | ~60 MB/s |
This is the general data model which reflects our requirements.
We have the following tables:
users

This table will contain a user's information such as name, email, dob, and other details.

tweets

As the name suggests, this table will store tweets and their properties such as type (text, image, video, etc.), content, etc. We will also store the corresponding userID.
favorites
This table maps tweets with users for the favorite tweets functionality in our application.
followers
This table maps the followers and followees as users can follow each other (N:M relationship).
feeds
This table stores feed properties with the corresponding userID.
feeds_tweets
This table maps tweets and feeds (N:M relationship).
While our data model seems quite relational, we don’t necessarily need to store everything in a single database, as this can limit our scalability and quickly become a bottleneck.
We will split the data between different services each having ownership over a particular table. Then we can use a relational database such as PostgreSQL or a distributed NoSQL database such as Apache Cassandra for our use case.
Let us do a basic API design for our services:
This API will allow the user to post a tweet on the platform.
postTweet(userID: UUID, content: string, mediaURL?: string): boolean
Parameters
User ID (UUID): ID of the user.

Content (string): Contents of the tweet.

Media URL (string): URL of the attached media (optional).

Returns

Result (boolean): Represents whether the operation was successful or not.
This API will allow the user to follow or unfollow another user.
follow(followerID: UUID, followeeID: UUID): boolean
unfollow(followerID: UUID, followeeID: UUID): boolean
Parameters

Follower ID (UUID): ID of the current user.

Followee ID (UUID): ID of the user we want to follow or unfollow.

Returns

Result (boolean): Represents whether the operation was successful or not.
This API will return all the tweets to be shown within a given newsfeed.
getNewsfeed(userID: UUID): Tweet[]
Parameters

User ID (UUID): ID of the user.

Returns

Tweets (Tweet[]): All the tweets to be shown within a given newsfeed.
Now let us do a high-level design of our system.
We will be using microservices architecture since it will make it easier to horizontally scale and decouple our services. Each service will have ownership of its own data model. Let’s try to divide our system into some core services.
User Service
This service handles user-related concerns such as authentication and user information.
Newsfeed Service
This service will handle the generation and publishing of user newsfeeds. It will be discussed in detail separately.
Tweet Service
The tweet service will handle tweet-related use cases such as posting a tweet, favorites, etc.
Search Service
The service is responsible for handling search-related functionality. It will be discussed in detail separately.
Media service
This service will handle the media (images, videos, files, etc.) uploads. It will be discussed in detail separately.
Notification Service
This service will simply send push notifications to the users.
Analytics Service
This service will be used for metrics and analytics use cases.
What about inter-service communication and service discovery?
Since our architecture is microservices-based, services will be communicating with each other as well. Generally, REST or HTTP performs well but we can further improve the performance using gRPC which is more lightweight and efficient.
Service discovery is another thing we will have to take into account. We can also use a service mesh that enables managed, observable, and secure communication between individual services.
Note: Learn more about REST, GraphQL, gRPC and how they compare with each other.
When it comes to the newsfeed, it seems easy enough to implement, but there are a lot of things that can make or break this feature. So, let’s divide our problem into two parts:
Generation
Let’s assume we want to generate the feed for user A, we will perform the following steps:
Feed generation is an intensive process and can take quite a lot of time, especially for users following a lot of people. To improve the performance, the feed can be pre-generated and stored in the cache, then we can have a mechanism to periodically update the feed and apply our ranking algorithm to the new tweets.
Publishing
Publishing is the step where the feed data is pushed according to each specific user. This can be a quite heavy operation, as a user may have millions of friends or followers. To deal with this, we have three different approaches:
When a user creates a tweet, and a follower reloads their newsfeed, the feed is created and stored in memory. The most recent feed is only loaded when the user requests it. This approach reduces the number of write operations on our database.
The downside of this approach is that the users will not be able to view recent feeds unless they “pull” the data from the server, which will increase the number of read operations on the server.
In this model, once a user creates a tweet, it is “pushed” to all the follower’s feeds immediately. This prevents the system from having to go through a user’s entire followers list to check for updates.
However, the downside of this approach is that it would increase the number of write operations on the database.
A third approach is a hybrid model between the pull and push model. It combines the beneficial features of the above two models and tries to provide a balanced approach between the two.
The hybrid model allows only users with a lesser number of followers to use the push model. For users with a higher number of followers such as celebrities, the pull model is used.
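To make the hybrid approach concrete, here is a minimal sketch of the fan-out decision at publish time; the follower threshold and the storage interfaces are assumptions.

```ts
// Hypothetical interfaces for the data we need at publish time.
interface FollowerStore {
  countFollowers(userID: string): Promise<number>;
  listFollowers(userID: string): Promise<string[]>;
}
interface FeedCache {
  prepend(userID: string, tweetID: string): Promise<void>;
}

const PUSH_THRESHOLD = 10_000; // assumed cutoff between "normal" users and celebrities

// Fan-out on write for most users; celebrities fall back to fan-out on read.
async function publishTweet(
  authorID: string,
  tweetID: string,
  followers: FollowerStore,
  feeds: FeedCache
): Promise<void> {
  const count = await followers.countFollowers(authorID);

  if (count > PUSH_THRESHOLD) {
    // Pull model: followers' feeds fetch this author's recent tweets at read time.
    return;
  }

  // Push model: write the tweet ID into every follower's pre-computed feed.
  const ids = await followers.listFollowers(authorID);
  await Promise.all(ids.map((followerID) => feeds.prepend(followerID, tweetID)));
}
```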
As we discussed, we will need a ranking algorithm to rank each tweet according to its relevance to each specific user.
For example, Facebook used to utilize an EdgeRank algorithm. Here, the rank of each feed item is described by:
\[ Rank = Affinity \times Weight \times Decay \]
Where:

Affinity: the "closeness" of the user to the creator of the edge. If a user frequently likes, comments on, or messages the edge creator, the affinity value will be higher, resulting in a higher rank for the post.

Weight: the value assigned to each edge type. A comment can have a higher weight than a like, so a post with more comments is more likely to get a higher rank.

Decay: a measure of the age of the edge. The older the edge, the lower the decay value and, eventually, the rank.
Nowadays, algorithms are much more complex and ranking is done using machine learning models which can take thousands of factors into consideration.
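As a worked example, here is a tiny function computing an EdgeRank-style score from the formula above; the exponential decay and the factor values are illustrative assumptions.

```ts
// Rank = Affinity x Weight x Decay, per the formula above.
function edgeRank(affinity: number, weight: number, ageInHours: number, halfLifeHours = 24): number {
  // Assumed decay: exponential in the age of the edge, halving every `halfLifeHours`.
  const decay = Math.pow(0.5, ageInHours / halfLifeHours);
  return affinity * weight * decay;
}

// Example: a 6-hour-old post from a close friend (high affinity) with comment-level weight.
console.log(edgeRank(0.9, 2.0, 6)); // ~1.51
```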
Retweets are one of our extended requirements. To implement this feature, we can simply create a new tweet with the user ID of the user retweeting the original tweet and then modify the type enum and content property of the new tweet to link it with the original tweet.

For example, the type enum property can be of type tweet, similar to text, video, etc., and content can be the ID of the original tweet. Here the first row indicates the original tweet while the second row is how we can represent a retweet.
id | userID | type | content | createdAt |
---|---|---|---|---|
ad34-291a-45f6-b36c | 7a2c-62c4-4dc8-b1bb | text | Hey, this is my first tweet… | 1658905644054 |
f064-49ad-9aa2-84a6 | 6aa2-2bc9-4331-879f | tweet | ad34-291a-45f6-b36c | 1658906165427 |
This is a very basic implementation. To improve this we can create a separate table itself to store retweets.
Sometimes traditional DBMS are not performant enough; we need something that allows us to store, search, and analyze huge volumes of data quickly and in near real-time, giving results within milliseconds. Elasticsearch can help us with this use case.
Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. It is built on top of Apache Lucene.
How do we identify trending topics?
Trending functionality will be built on top of the search functionality. We can cache the most frequently searched queries, hashtags, and topics in the last N seconds and update them every M seconds using some sort of batch job mechanism. Our ranking algorithm can also be applied to the trending topics to give them more weight and personalize them for the user.
Push notifications are an integral part of any social media platform. We can use a message queue or a message broker such as Apache Kafka with the notification service to dispatch requests to Firebase Cloud Messaging (FCM) or Apple Push Notification Service (APNS) which will handle the delivery of the push notifications to user devices.
For more details, refer to the WhatsApp system design where we discuss push notifications in detail.
It’s time to discuss our design decisions in detail.
To scale out our databases, we will need to partition our data. Horizontal partitioning (aka Sharding) can be a good first step. We can use partitioning schemes such as hash-based or range-based partitioning.

These approaches can still cause uneven data and load distribution; we can solve this using consistent hashing.
For more details, refer to Sharding and Consistent Hashing.
For mutual friends, we can build a social graph for every user. Each node in the graph will represent a user and a directional edge will represent followers and followees. After that, we can traverse the followers of a user to find and suggest a mutual friend. This would require a graph database such as Neo4j or ArangoDB.
This is a pretty simple algorithm, to improve our suggestion accuracy, we will need to incorporate a recommendation model which uses machine learning as part of our algorithm.
Recording analytics and metrics is one of our extended requirements. As we will be using Apache Kafka to publish all sorts of events, we can process these events and run analytics on the data using Apache Spark which is an open-source unified analytics engine for large-scale data processing.
In a social media application, we have to be careful about using a cache, as our users expect the latest data. So, to prevent usage spikes on our resources, we can cache the top 20% of the tweets.
To further improve efficiency we can add pagination to our system APIs. This decision will be helpful for users with limited network bandwidth as they won’t have to retrieve old messages unless requested.
Which cache eviction policy to use?
We can use solutions like Redis or Memcached and cache 20% of the daily traffic but what kind of cache eviction policy would best fit our needs?
Least Recently Used (LRU) can be a good policy for our system. In this policy, we discard the least recently used key first.
How to handle cache miss?
Whenever there is a cache miss, our servers can hit the database directly and update the cache with the new entries.
For more details, refer to Caching.
As we know, most of our storage space will be used for storing media files such as images, videos, or other files. Our media service will be handling both access and storage of the user media files.
But where can we store files at scale? Object storage, or a distributed file system such as GlusterFS, is a good fit here.
Content Delivery Network (CDN) increases content availability and redundancy while reducing bandwidth costs. Generally, static files such as images, and videos are served from CDN. We can use services like Amazon CloudFront or Cloudflare CDN for this use case.
Let us identify and resolve bottlenecks such as single points of failure in our design:
To make our system more resilient we can do the following:
Let’s design a Netflix like video streaming service, similar to services like Amazon Prime Video, Disney Plus, Hulu, Youtube, Vimeo, etc.
Netflix is a subscription-based streaming service that allows its members to watch TV shows and movies on an internet-connected device. It is available on platforms such as the Web, iOS, Android, TV, etc.
Our system should meet the following requirements:
Let’s start with the estimation and constraints.
Note: Make sure to check any scale or traffic-related assumptions with your interviewer.
This will be a read-heavy system, let us assume we have 1 billion total users with 200 million daily active users (DAU), and on average each user watches 5 videos a day. This gives us 1 billion videos watched per day.
\[ 200 \space million \times 5 \space videos = 1 \space billion/day \]
Assuming a 200:1 read/write ratio, about 5 million videos will be uploaded every day.
\[ \frac{1}{200} \times 1 \space billion = 5 \space million/day \]
What would be Requests Per Second (RPS) for our system?
1 billion requests per day translate into 12K requests per second.
\[ \frac{1 \space billion}{(24 \space hrs \times 3600 \space seconds)} = \sim 12K \space requests/second \]
If we assume each video is 100 MB on average, we will require about 500 TB of storage every day.
\[ 5 \space million \times 100 \space MB = 500 \space TB/day \]
And for 10 years, we will require an astounding 1,825 PB of storage.
\[ 500 \space TB \times 365 \space days \times 10 \space years = \sim 1,825 \space PB \]
As our system is handling 500 TB of ingress every day, we will require a minimum bandwidth of around 5.8 GB per second.
\[ \frac{500 \space TB}{(24 \space hrs \times 3600 \space seconds)} = \sim 5.8 \space GB/second \]
Here is our high-level estimate:
Type | Estimate |
---|---|
Daily active users (DAU) | 200 million |
Requests per second (RPS) | 12K/s |
Storage (per day) | ~500 TB |
Storage (10 years) | ~1,825 PB |
Bandwidth | ~5.8 GB/s |
This is the general data model which reflects our requirements.
We have the following tables:
users

This table will contain a user's information such as name, email, dob, and other details.

videos

As the name suggests, this table will store videos and their properties such as title, streamURL, tags, etc. We will also store the corresponding userID.
tags
This table will simply store tags associated with a video.
views
This table helps us to store all the views received on a video.
comments
This table stores all the comments received on a video (like YouTube).
While our data model seems quite relational, we don’t necessarily need to store everything in a single database, as this can limit our scalability and quickly become a bottleneck.
We will split the data between different services each having ownership over a particular table. Then we can use a relational database such as PostgreSQL or a distributed NoSQL database such as Apache Cassandra for our use case.
Let us do a basic API design for our services:
Given a byte stream, this API enables video to be uploaded to our service.
uploadVideo(title: string, description: string, data: Stream<byte>, tags?: string[]): boolean
Parameters
Title (string): Title of the new video.

Description (string): Description of the new video.

Data (Stream<byte>): Byte stream of the video data.

Tags (string[]): Tags for the video (optional).

Returns

Result (boolean): Represents whether the operation was successful or not.
This API allows our users to stream a video with the preferred codec and resolution.
streamVideo(videoID: UUID, codec: Enum<string>, resolution: Tuple<int>, offset?: int): VideoStream
Parameters

Video ID (UUID): ID of the video that needs to be streamed.

Codec (Enum<string>): Required codec of the requested video, such as h.265, h.264, VP9, etc.

Resolution (Tuple<int>): Resolution of the requested video.

Offset (int): Offset of the video stream in seconds to stream data from any point in the video (optional).

Returns

Stream (VideoStream): Data stream of the requested video.
This API will enable our users to search for a video based on its title or tags.
searchVideo(query: string, nextPage?: string): Video[]
Parameters

Query (string): Search query from the user.

Next Page (string): Token for the next page, which can be used for pagination (optional).

Returns

Videos (Video[]): All the videos available for a particular search query.
This API will allow our users to post a comment on a video (like YouTube).
comment(videoID: UUID, comment: string): boolean
Parameters

Video ID (UUID): ID of the video the user wants to comment on.

Comment (string): The text content of the comment.

Returns

Result (boolean): Represents whether the operation was successful or not.
Now let us do a high-level design of our system.
We will be using microservices architecture since it will make it easier to horizontally scale and decouple our services. Each service will have ownership of its own data model. Let’s try to divide our system into some core services.
User Service
This service handles user-related concerns such as authentication and user information.
Stream Service
The stream service will handle video streaming-related functionality.
Search Service
The service is responsible for handling search-related functionality. It will be discussed in detail separately.
Media service
This service will handle the video uploads and processing. It will be discussed in detail separately.
Analytics Service
This service will be used for metrics and analytics use cases.
What about inter-service communication and service discovery?
Since our architecture is microservices-based, services will be communicating with each other as well. Generally, REST or HTTP performs well but we can further improve the performance using gRPC which is more lightweight and efficient.
Service discovery is another thing we will have to take into account. We can also use a service mesh that enables managed, observable, and secure communication between individual services.
Note: Learn more about REST, GraphQL, gRPC and how they compare with each other.
There are so many variables in play when it comes to processing a video. For example, the average size of two hours of raw 8K footage from a high-end camera can easily be up to 4 TB, thus we need some kind of processing to reduce both storage and delivery costs.
Here’s how we can process videos once they’re uploaded by the content team (or users in YouTube’s case) and are queued for processing in our message queue.
Let’s discuss how this works:
This is the first step of our processing pipeline. File chunking is the process of splitting a file into smaller pieces called chunks. It can help us eliminate duplicate copies of repeating data on storage, and reduces the amount of data sent over the network by only selecting changed chunks.
Usually, a video file can be split into equal size chunks based on timestamps but Netflix instead splits chunks based on scenes. This slight variation becomes a huge factor for a better user experience since whenever the client requests a chunk from the server, there is a lower chance of interruption as a complete scene will be retrieved.
This step checks if the video adheres to the content policy of the platform. This can be pre-approved, as in the case of Netflix (based on the content rating of the media), or strictly enforced, as on YouTube.
This entire process is done by a machine learning model which performs copyright, piracy, and NSFW checks. If issues are found, we can push the task to a dead-letter queue (DLQ) and someone from the moderation team can do further inspection.
Transcoding is a process in which the original data is decoded to an intermediate uncompressed format, which is then encoded into the target format. This process uses different codecs to perform bitrate adjustment, image downsampling, or re-encoding the media.
This results in a smaller size file and a much more optimized format for the target devices. Standalone solutions such as FFmpeg or cloud-based solutions like AWS Elemental MediaConvert can be used to implement this step of the pipeline.
This is the last step of the processing pipeline and as the name suggests, this step handles the conversion of the transcoded media from the previous step into different resolutions such as 4K, 1440p, 1080p, 720p, etc.
It allows us to fetch the desired quality of the video as per the user's request, and once the media file finishes processing, it gets uploaded to distributed file storage such as HDFS or object storage such as Amazon S3 for later retrieval during streaming.
Note: We can add additional steps such as subtitles and thumbnails generation as part of our pipeline.
Why are we using a message queue?
Processing videos as a long-running task and using a message queue makes much more sense. It also decouples our video processing pipeline from the upload functionality. We can use something like Amazon SQS or RabbitMQ to support this.
Video streaming is a challenging task from both the client and server perspectives. Moreover, internet connection speeds vary quite a lot between different users. To make sure users don’t re-fetch the same content, we can use a Content Delivery Network (CDN).
Netflix takes this a step further with its Open Connect program. In this approach, they partner with thousands of Internet Service Providers (ISPs) to localize their traffic and deliver their content more efficiently.
What is the difference between Netflix’s Open Connect and a traditional Content Delivery Network (CDN)?
Netflix Open Connect is a purpose-built Content Delivery Network (CDN) responsible for serving Netflix’s video traffic. Around 95% of the traffic globally is delivered via direct connections between Open Connect and the ISPs their customers use to access the internet.
Currently, they have Open Connect Appliances (OCAs) in over 1000 separate locations around the world. In case of issues, Open Connect Appliances (OCAs) can failover, and the traffic can be re-routed to Netflix servers.
Additionally, we can use adaptive bitrate streaming protocols such as HTTP Live Streaming (HLS), which is designed for reliability and dynamically adapts to network conditions by optimizing playback for the available connection speed.
Lastly, for playing the video from where the user left off (part of our extended requirements), we can simply use the offset property we stored in the views table to retrieve the scene chunk at that particular timestamp and resume playback for the user.
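A minimal sketch of resuming playback from the stored offset, assuming each video's scene-based chunks are described by start/end timestamps:

```ts
// Scene-based chunk metadata produced by the processing pipeline (assumed shape).
interface Chunk { index: number; startSec: number; endSec: number; url: string; }

// Find the chunk containing the viewer's saved offset (from the views table)
// so the player can resume from that scene.
function chunkForOffset(chunks: Chunk[], offsetSec: number): Chunk {
  const match = chunks.find((c) => offsetSec >= c.startSec && offsetSec < c.endSec);
  return match ?? chunks[chunks.length - 1]; // past the end -> last chunk
}

// Usage: resume a viewer at 1,250 seconds into the video.
const chunks: Chunk[] = [
  { index: 0, startSec: 0, endSec: 600, url: '/v/abc/chunk-0' },
  { index: 1, startSec: 600, endSec: 1300, url: '/v/abc/chunk-1' },
];
console.log(chunkForOffset(chunks, 1250).index); // 1
```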
Sometimes traditional DBMS are not performant enough; we need something that allows us to store, search, and analyze huge volumes of data quickly and in near real-time, giving results within milliseconds. Elasticsearch can help us with this use case.
Elasticsearch is a distributed, free and open search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. It is built on top of Apache Lucene.
How do we identify trending content?
Trending functionality will be built on top of the search functionality. We can cache the most frequently searched queries in the last N seconds and update them every M seconds using some sort of batch job mechanism.
Sharing content is an important part of any platform, for this, we can have some sort of URL shortener service in place that can generate short URLs for the users to share.
For more details, refer to the URL Shortener system design.
It’s time to discuss our design decisions in detail.
To scale out our databases, we will need to partition our data. Horizontal partitioning (aka Sharding) can be a good first step. We can use partitioning schemes such as hash-based or range-based partitioning.

These approaches can still cause uneven data and load distribution; we can solve this using consistent hashing.
For more details, refer to Sharding and Consistent Hashing.
Platforms like Netflix and YouTube use Geo-blocking to restrict content in certain geographical areas or countries. This is primarily done due to legal distribution laws that Netflix has to adhere to when they make a deal with the production and distribution companies. In the case of YouTube, this will be controlled by the user during the publishing of the content.
We can determine the user's location either from their IP address or from the region settings in their profile, then use services like Amazon CloudFront, which supports a geographic restrictions feature, or a geolocation routing policy with Amazon Route 53 to restrict the content and re-route the user to an error page if the content is not available in that particular region or country.
Netflix uses a machine learning model that uses the user's viewing history to predict what the user might like to watch next; an algorithm like collaborative filtering can be used.
However, Netflix (like YouTube) uses its own algorithm called Netflix Recommendation Engine which can track several data points such as:
For more detail, refer to Netflix recommendation research.
Recording analytics and metrics is one of our extended requirements. We can capture the data from different services and run analytics on the data using Apache Spark, an open-source unified analytics engine for large-scale data processing. Additionally, we can store critical metadata in the views table to enrich the data points available for analysis.
In a streaming platform, caching is important. We have to be able to cache as much static media content as possible to improve user experience. We can use solutions like Redis or Memcached but what kind of cache eviction policy would best fit our needs?
Which cache eviction policy to use?
Least Recently Used (LRU) can be a good policy for our system. In this policy, we discard the least recently used key first.
How to handle cache miss?
Whenever there is a cache miss, our servers can hit the database directly and update the cache with the new entries.
For more details, refer to Caching.
Most of our storage space will be used for storing media files such as thumbnails and videos. Per our discussion earlier, the media service will handle both the upload and processing of media files.

We will use distributed file storage such as HDFS or object storage such as Amazon S3 for storage and streaming of the content.
Content Delivery Network (CDN) increases content availability and redundancy while reducing bandwidth costs. Generally, static files such as images, and videos are served from CDN. We can use services like Amazon CloudFront or Cloudflare CDN for this use case.
Let us identify and resolve bottlenecks such as single points of failure in our design:
To make our system more resilient we can do the following:
Let’s design an Uber like ride-hailing service, similar to services like Lyft, OLA Cabs, etc.
Uber is a mobility service provider, allowing users to book rides and a driver to transport them in a way similar to a taxi. It is available on the web and mobile platforms such as Android and iOS.
Our system should meet the following requirements:
We will design our system for two types of users: Customers and Drivers.
Customers
Drivers
Let’s start with the estimation and constraints.
Note: Make sure to check any scale or traffic-related assumptions with your interviewer.
Let us assume we have 100 million daily active users (DAU) with 1 million drivers and on average our platform enables 10 million rides daily.
If on average each user performs 10 actions (such as checking available rides, requesting fares, booking rides, etc.) we will have to handle 1 billion requests daily.
\[ 100 \space million \times 10 \space actions = 1 \space billion/day \]
What would be Requests Per Second (RPS) for our system?
1 billion requests per day translate into 12K requests per second.
\[ \frac{1 \space billion}{(24 \space hrs \times 3600 \space seconds)} = \sim 12K \space requests/second \]
If we assume each request on average is 400 bytes, we will require about 400 GB of database storage every day.
\[ 1 \space billion \times 400 \space bytes = \sim 400 \space GB/day \]
And for 10 years, we will require about 1.4 PB of storage.
\[ 400 \space GB \times 10 \space years \times 365 \space days = \sim 1.4 \space PB \]
As our system is handling 400 GB of ingress every day, we will require a minimum bandwidth of around 5 MB per second.
\[ \frac{400 \space GB}{(24 \space hrs \times 3600 \space seconds)} = \sim 5 \space MB/second \]
Here is our high-level estimate:
| Type | Estimate |
| --- | --- |
| Daily active users (DAU) | 100 million |
| Requests per second (RPS) | 12K/s |
| Storage (per day) | ~400 GB |
| Storage (10 years) | ~1.4 PB |
| Bandwidth | ~5 MB/s |
This is the general data model which reflects our requirements.
We have the following tables:
customers
This table will contain a customer's information such as name, email, and other details.
drivers
This table will contain a driver's information such as name, email, dob, and other details.
trips
This table represents the trip taken by the customer and stores data such as source, destination, and status of the trip.
cabs
This table stores data such as the registration number and type (like Uber Go, Uber XL, etc.) of the cab that the driver will be driving.
ratings
As the name suggests, this table stores the rating and feedback for the trip.
payments
The payments table contains the payment-related data with the corresponding tripID.
While our data model seems quite relational, we don’t necessarily need to store everything in a single database, as this can limit our scalability and quickly become a bottleneck.
We will split the data between different services each having ownership over a particular table. Then we can use a relational database such as PostgreSQL or a distributed NoSQL database such as Apache Cassandra for our use case.
Let us do a basic API design for our services:
Through this API, customers will be able to request a ride.
requestRide(customerID: UUID, source: Tuple<float>, destination: Tuple<float>, cabType: Enum<string>, paymentMethod: Enum<string>): Ride
Parameters
Customer ID (UUID): ID of the customer.
Source (Tuple<float>): Tuple containing the latitude and longitude of the trip's starting location.
Destination (Tuple<float>): Tuple containing the latitude and longitude of the trip's destination.
Cab Type (Enum<string>): Type of cab requested for the ride.
Payment Method (Enum<string>): Payment method selected for the ride.
Returns
Result (Ride): Associated ride information of the trip.
This API will allow customers to cancel the ride.
cancelRide(customerID: UUID, reason?: string): boolean
Parameters
Customer ID (UUID): ID of the customer.
Reason (string): Reason for canceling the ride (optional).
Returns
Result (boolean): Represents whether the operation was successful or not.
This API will allow the driver to accept or deny the trip.
acceptRide(driverID: UUID, rideID: UUID): boolean
denyRide(driverID: UUID, rideID: UUID): boolean
Parameters
Driver ID (UUID): ID of the driver.
Ride ID (UUID): ID of the ride requested by the customer.
Returns
Result (boolean): Represents whether the operation was successful or not.
Using this API, a driver will be able to start and end the trip.
startTrip(driverID: UUID, tripID: UUID): boolean
endTrip(driverID: UUID, tripID: UUID): boolean
Parameters
Driver ID (UUID): ID of the driver.
Trip ID (UUID): ID of the requested trip.
Returns
Result (boolean): Represents whether the operation was successful or not.
This API will enable customers to rate the trip.
rateTrip(customerID: UUID, tripID: UUID, rating: int, feedback?: string): boolean
Parameters
Customer ID (UUID): ID of the customer.
Trip ID (UUID): ID of the completed trip.
Rating (int): Rating of the trip.
Feedback (string): Feedback about the trip by the customer (optional).
Returns
Result (boolean): Represents whether the operation was successful or not.
Now let us do a high-level design of our system.
We will be using microservices architecture since it will make it easier to horizontally scale and decouple our services. Each service will have ownership of its own data model. Let’s try to divide our system into some core services.
Customer Service
This service handles customer-related concerns such as authentication and customer information.
Driver Service
This service handles driver-related concerns such as authentication and driver information.
Ride Service
This service will be responsible for ride matching and quadtree aggregation. It will be discussed in detail separately.
Trip Service
This service handles trip-related functionality in our system.
Payment Service
This service will be responsible for handling payments in our system.
Notification Service
This service will simply send push notifications to the users. It will be discussed in detail separately.
Analytics Service
This service will be used for metrics and analytics use cases.
What about inter-service communication and service discovery?
Since our architecture is microservices-based, services will be communicating with each other as well. Generally, REST over HTTP performs well, but we can further improve performance using gRPC, which is more lightweight and efficient.
Service discovery is another thing we will have to take into account. We can also use a service mesh that enables managed, observable, and secure communication between individual services.
Note: Learn more about REST, GraphQL, gRPC and how they compare with each other.
Here’s how our service is expected to work:
How do we efficiently send and receive live location data from the client (customers and drivers) to our backend? We have two different options:
Pull model
The client can periodically send an HTTP request to servers to report its current location and receive ETA and pricing information. This can be achieved via something like Long polling.
Push model
The client opens a long-lived connection with the server and once new data is available it will be pushed to the client. We can use WebSockets for this.
The pull model approach is not scalable as it will create unnecessary request overhead on our servers, and most of the time the response will be empty, thus wasting our resources. To minimize latency, using the push model with WebSockets is a better choice, as the server can push updates to the client as soon as they are available over the open connection.
Additionally, the client application should have some sort of background job mechanism to ping GPS location while the application is in the background.
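As an illustration, a minimal client-side sketch using the standard WebSocket and Geolocation APIs might look like this (the endpoint URL, message shape, and driver ID are assumptions):

```ts
// Minimal client-side sketch: stream the driver's live location over a
// long-lived WebSocket connection.
const socket = new WebSocket("wss://example.com/location");

socket.addEventListener("open", () => {
  // Report the device position whenever it changes (subject to OS background limits).
  navigator.geolocation.watchPosition(
    (pos) => {
      socket.send(
        JSON.stringify({
          driverID: "driver-123", // hypothetical identifier
          lat: pos.coords.latitude,
          lng: pos.coords.longitude,
          ts: Date.now(),
        })
      );
    },
    (err) => console.error("geolocation error", err),
    { enableHighAccuracy: true }
  );
});

// The server can push ETA/pricing updates back over the same connection.
socket.addEventListener("message", (event) => {
  const update = JSON.parse(event.data);
  console.log("server push:", update);
});
```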
Note: Learn more about Long polling, WebSockets, Server-Sent Events (SSE).
We need a way to efficiently store and query nearby drivers. Let’s explore different solutions we can incorporate into our design.
SQL
We already have access to the latitude and longitude of our customers, and with databases like PostgreSQL and MySQL we can perform a query to find nearby driver locations given a latitude and longitude (X, Y) within a radius (R).
SELECT * FROM locations WHERE lat BETWEEN X-R AND X+R AND long BETWEEN Y-R AND Y+R
However, this is not scalable, and performing this query on large datasets will be quite slow.
Geohashing
Geohashing is a geocoding method used to encode geographic coordinates such as latitude and longitude into short alphanumeric strings. It was created by Gustavo Niemeyer in 2008.
Geohash is a hierarchical spatial index that uses Base-32 alphabet encoding; the first character in a geohash identifies the initial location as one of 32 cells, and each of those cells is further divided into 32 cells of its own. This means that to represent a point, the world is recursively divided into smaller and smaller cells with each additional bit until the desired precision is attained. The precision factor also determines the size of the cell.
For example, San Francisco with coordinates 37.7564, -122.4016 can be represented in geohash as 9q8yy9mf.
Now, using the customer’s geohash we can determine the nearest available driver by simply comparing it with the driver’s geohash. For better performance, we will index and store the geohash of the driver in memory for faster retrieval.
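As a sketch, assuming each driver's geohash has already been computed at a fixed precision (for example with a geohash library), finding nearby drivers reduces to a prefix comparison:

```ts
// Minimal sketch: find nearby drivers by geohash prefix. Assumes precomputed
// geohashes kept in an in-memory index keyed by driver ID.
const driverGeohash = new Map<string, string>(); // driverID -> geohash

export function updateDriverLocation(driverID: string, geohash: string): void {
  driverGeohash.set(driverID, geohash);
}

// A shared prefix of length `precision` means the driver falls in the same
// geohash cell as the customer; longer prefixes mean smaller (closer) cells.
// A production system would also check the 8 neighboring cells to catch
// drivers just across a cell boundary.
export function findNearbyDrivers(customerGeohash: string, precision = 5): string[] {
  const prefix = customerGeohash.slice(0, precision);
  return [...driverGeohash.entries()]
    .filter(([, hash]) => hash.startsWith(prefix))
    .map(([driverID]) => driverID);
}
```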
Quadtrees
A Quadtree is a tree data structure in which each internal node has exactly four children. They are often used to partition a two-dimensional space by recursively subdividing it into four quadrants or regions. Each child or leaf node stores spatial information. Quadtrees are the two-dimensional analog of Octrees which are used to partition three-dimensional space.
Quadtrees enable us to search points within a two-dimensional range efficiently, where those points are defined as latitude/longitude coordinates or as cartesian (x, y) coordinates.
We can save further computation by only subdividing a node after a certain threshold.
A Quadtree seems perfect for our use case; we can update it every time we receive a new location update from a driver. To reduce the load on the quadtree servers, we can use an in-memory datastore such as Redis to cache the latest updates. And with the application of mapping algorithms such as the Hilbert curve, we can perform efficient range queries to find nearby drivers for the customer.
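For illustration, here is a minimal point-region Quadtree sketch with insert and range-query operations (the names and subdivision threshold are assumptions). A root node covering the whole latitude/longitude space, e.g. new QuadtreeNode({ x: 0, y: 0, w: 180, h: 90 }), would let us query a small box around the customer's location.

```ts
// Minimal point-region Quadtree sketch for driver locations.
type Point = { x: number; y: number; driverID: string }; // x = longitude, y = latitude
type Box = { x: number; y: number; w: number; h: number }; // center + half-extent

const CAPACITY = 4; // subdivide a node only after this threshold

class QuadtreeNode {
  private points: Point[] = [];
  private children: QuadtreeNode[] | null = null;

  constructor(private boundary: Box) {}

  private contains(b: Box, p: Point): boolean {
    return Math.abs(p.x - b.x) <= b.w && Math.abs(p.y - b.y) <= b.h;
  }

  private intersects(a: Box, b: Box): boolean {
    return Math.abs(a.x - b.x) <= a.w + b.w && Math.abs(a.y - b.y) <= a.h + b.h;
  }

  insert(p: Point): boolean {
    if (!this.contains(this.boundary, p)) return false;
    if (this.children === null && this.points.length < CAPACITY) {
      this.points.push(p);
      return true;
    }
    if (this.children === null) this.subdivide();
    return this.children!.some((child) => child.insert(p));
  }

  private subdivide(): void {
    const { x, y, w, h } = this.boundary;
    const hw = w / 2, hh = h / 2;
    this.children = [
      new QuadtreeNode({ x: x - hw, y: y - hh, w: hw, h: hh }),
      new QuadtreeNode({ x: x + hw, y: y - hh, w: hw, h: hh }),
      new QuadtreeNode({ x: x - hw, y: y + hh, w: hw, h: hh }),
      new QuadtreeNode({ x: x + hw, y: y + hh, w: hw, h: hh }),
    ];
    for (const p of this.points) this.children.some((c) => c.insert(p)); // redistribute
    this.points = [];
  }

  // Collect all points inside the query rectangle (e.g. around the customer).
  query(range: Box, found: Point[] = []): Point[] {
    if (!this.intersects(this.boundary, range)) return found;
    for (const p of this.points) if (this.contains(range, p)) found.push(p);
    if (this.children) for (const c of this.children) c.query(range, found);
    return found;
  }
}
```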
What about race conditions?
Race conditions can easily occur when a large number of customers request rides simultaneously. To avoid this, we can wrap our ride matching logic in a Mutex to avoid any race conditions. Furthermore, every action should be transactional in nature.
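As a sketch of the idea (hypothetical names; in a real deployment this would be a distributed lock or a transactional database update rather than an in-process lock):

```ts
// Minimal sketch of guarding ride matching with a per-ride lock.
// persistRide stands in for a transactional database write.
const lockedRides = new Set<string>();
const rideAssignments = new Map<string, string>(); // rideID -> driverID

export async function acceptRide(
  rideID: string,
  driverID: string,
  persistRide: (rideID: string, driverID: string) => Promise<void>
): Promise<boolean> {
  if (lockedRides.has(rideID) || rideAssignments.has(rideID)) {
    return false; // ride is being matched or already taken
  }
  lockedRides.add(rideID); // acquire the per-ride lock before any await
  try {
    await persistRide(rideID, driverID); // e.g. transactional write to the trips table
    rideAssignments.set(rideID, driverID);
    return true;
  } finally {
    lockedRides.delete(rideID); // always release the lock
  }
}
```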
For more details, refer to Transactions and Distributed Transactions.
How to find the best drivers nearby?
Once we have a list of nearby drivers from the Quadtree servers, we can perform some sort of ranking based on parameters like average ratings, relevance, past customer feedback, etc. This will allow us to broadcast notifications to the best available drivers first.
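A hedged sketch of such a ranking step, with assumed weights and fields, might look like this:

```ts
// Minimal sketch of ranking nearby drivers returned by the Quadtree servers.
// The weights and fields are assumptions; a real ranker would use more signals.
type NearbyDriver = {
  driverID: string;
  avgRating: number;      // 1-5
  distanceKm: number;     // distance from the pickup point
  acceptanceRate: number; // 0-1, derived from past behaviour
};

export function rankDrivers(drivers: NearbyDriver[]): NearbyDriver[] {
  const score = (d: NearbyDriver) =>
    d.avgRating * 2 + d.acceptanceRate * 3 - d.distanceKm; // higher is better
  return [...drivers].sort((a, b) => score(b) - score(a));
}

// Notify the top-ranked drivers first, e.g. rankDrivers(candidates).slice(0, 5)
```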
Dealing with high demand
In cases of high demand, we can use the concept of Surge Pricing. Surge pricing is a dynamic pricing method where prices are temporarily increased as a reaction to increased demand and mostly limited supply. This surge price can be added to the base price of the trip.
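As an illustration only (the exact formula Uber uses is not public), a surge multiplier could be derived from the demand-to-supply ratio:

```ts
// Minimal surge pricing sketch. The multiplier formula and cap are assumptions;
// the idea is simply that the price scales with the demand-to-supply ratio.
export function surgeMultiplier(activeRequests: number, availableDrivers: number): number {
  if (availableDrivers === 0) return 3.0; // hard cap when no supply is available
  const ratio = activeRequests / availableDrivers;
  return Math.min(Math.max(1.0, ratio), 3.0); // never below base price, capped at 3x
}

export function tripPrice(basePrice: number, activeRequests: number, availableDrivers: number): number {
  return basePrice * surgeMultiplier(activeRequests, availableDrivers);
}

// Example: 500 pending requests vs 200 free drivers -> 2.5x the base fare.
```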
For more details, learn how surge pricing works with Uber.
Handling payments at scale is challenging, to simplify our system we can use a third-party payment processor like Stripe or PayPal. Once the payment is complete, the payment processor will redirect the user back to our application and we can set up a webhook to capture all the payment-related data.
Push notifications will be an integral part of our platform. We can use a message queue or a message broker such as Apache Kafka with the notification service to dispatch requests to Firebase Cloud Messaging (FCM) or Apple Push Notification Service (APNS) which will handle the delivery of the push notifications to user devices.
For more details, refer to the WhatsApp system design where we discuss push notifications in detail.
It’s time to discuss our design decisions in detail.
To scale out our databases we will need to partition our data. Horizontal partitioning (aka Sharding) can be a good first step; we can shard our data based on existing partition schemes or by regions. If we divide the locations into regions using, let's say, zip codes, we can effectively store all the data for a given region on a fixed node. But this can still cause uneven data and load distribution, which we can solve using Consistent Hashing.
For more details, refer to Sharding and Consistent Hashing.
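For illustration, a minimal consistent hashing ring might look like the sketch below (node names and the number of virtual nodes are assumptions):

```ts
// Minimal consistent hashing ring sketch. Keys (e.g. a region or zip code) map
// to the first node clockwise on the ring; virtual nodes smooth distribution.
import { createHash } from "crypto";

class ConsistentHashRing {
  private ring: { position: number; node: string }[] = [];

  constructor(nodes: string[], private vnodes = 100) {
    for (const node of nodes) this.addNode(node);
  }

  private hash(key: string): number {
    return parseInt(createHash("md5").update(key).digest("hex").slice(0, 8), 16);
  }

  addNode(node: string): void {
    for (let i = 0; i < this.vnodes; i++) {
      this.ring.push({ position: this.hash(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.position - b.position);
  }

  getNode(key: string): string {
    const h = this.hash(key);
    const entry = this.ring.find((e) => e.position >= h) ?? this.ring[0]; // wrap around
    return entry.node;
  }
}

// Usage: route a trip record to a database shard by its region/zip code.
// const ring = new ConsistentHashRing(["shard-1", "shard-2", "shard-3"]);
// ring.getNode("94103"); // -> e.g. "shard-2"
```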
Recording analytics and metrics is one of our extended requirements. We can capture the data from different services and run analytics on the data using Apache Spark which is an open-source unified analytics engine for large-scale data processing. Additionally, we can store critical metadata in the views table to increase data points within our data.
In a location services-based platform, caching is important. We have to be able to cache the recent locations of the customers and drivers for fast retrieval. We can use solutions like Redis or Memcached but what kind of cache eviction policy would best fit our needs?
Which cache eviction policy to use?
Least Recently Used (LRU) can be a good policy for our system. In this policy, we discard the least recently used key first.
How to handle cache miss?
Whenever there is a cache miss, our servers can hit the database directly and update the cache with the new entries.
For more details, refer to Caching.
Let us identify and resolve bottlenecks such as single points of failure in our design:
To make our system more resilient we can do the following: