Pastebin System Design
Overview
Let’s design a Pastebin-like web service where users can store plain text. A user submits a piece of text and gets back a randomly generated URL for accessing it. Similar services: pastebin.com, pasted.co, chopapp.com.
The design of Pastebin is similar to the Tiny URL system design. If you have not gone through that, please have a look first.
Requirements
Functional
- A user should be able to paste/upload text and get a URL corresponding to that text.
- A user should be able to fetch the text by going to the given URL.
- A user should be able to optionally specify an expiry time.
- A user should be able to specify a custom URL for their text.
Non Functional
- Since a URL is usually not used immediately after it is generated, we are fine with eventual consistency.
- Our system will be read heavy, as there will be more calls to fetch the text for a URL than to create pastes.
- Fetching the text for a URL should have low latency (<100 ms).
Capacity Estimation
Let’s assume there will be 1 million new paste requests every day. With a read-to-write ratio of 10:1, there will be 10 million read requests every day.
Traffic estimates:
- Write requests/sec = 1M / 86,400 s = 11.57 ~ 12 requests/sec.
- Read requests/sec = 10M / 86,400 s ~ 116, so roughly 120 requests/sec.
Storage estimates:
Let’s limit the maximum size of a single paste to 1MB, which lets a user store about 1 million characters of text in one paste. Let’s assume an average paste is 10KB.
So, total storage required per day = 1M * 10KB ~ 10GB
Let’s assume we want to store a paste for 5 years. Then
Total storage required = 10GB * 365 * 5 ~ 18TB.
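The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it uses decimal units (1GB = 1,000,000KB), and the variable names are illustrative rather than part of the design.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Inputs taken from the assumptions in the text.
writes_per_day = 1_000_000
read_write_ratio = 10
avg_paste_kb = 10
retention_years = 5

# Traffic estimates.
writes_per_sec = writes_per_day / SECONDS_PER_DAY
reads_per_sec = writes_per_day * read_write_ratio / SECONDS_PER_DAY

# Storage estimates, using decimal units (1 GB = 1,000,000 KB).
storage_per_day_gb = writes_per_day * avg_paste_kb / 1_000_000
total_storage_tb = storage_per_day_gb * 365 * retention_years / 1_000

# Number of paste records accumulated over the retention period.
total_pastes = writes_per_day * 365 * retention_years

print(f"writes/sec: {writes_per_sec:.2f}")
print(f"reads/sec: {reads_per_sec:.2f}")
print(f"storage/day: {storage_per_day_gb} GB")
print(f"total storage: {total_storage_tb} TB")
print(f"total pastes: {total_pastes:,}")
```

The total number of pastes (about 1.8 billion) matters later, when choosing how long a paste key has to be.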
APIs
Based on the requirements, we will define the APIs we need. We will go with REST APIs, as they are lightweight, flexible and easy to implement.
1. Create a new paste
POST: /paste
Request {
data: String
expiryTimeInMs: long
pasteName: String
customUrl: String}
Response {
pasteUrl: String}
2. Get the existing paste using the url provided.
GET: /paste/{id}
Response {
data: String
creationDate: Date
expiryDate: Date}
3. Delete an existing paste by passing the pasteId.
DELETE: /paste/{id}
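As a rough illustration of the create request, the sketch below models the POST body and checks it before anything is stored. It assumes that every field except the text itself is optional and that the 1MB size limit from the capacity estimate is enforced at the API layer; the `validate` helper and its error messages are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_PASTE_BYTES = 1_000_000  # the 1MB-per-paste limit from the capacity estimate


@dataclass
class CreatePasteRequest:
    """Body of POST /paste. Fields other than `data` are assumed optional."""
    data: str
    expiry_time_in_ms: Optional[int] = None  # expiryTimeInMs in the API
    paste_name: Optional[str] = None         # pasteName in the API
    custom_url: Optional[str] = None         # customUrl in the API


def validate(request: CreatePasteRequest) -> List[str]:
    """Return a list of validation errors; an empty list means the request is OK."""
    errors = []
    if not request.data:
        errors.append("data must not be empty")
    elif len(request.data.encode("utf-8")) > MAX_PASTE_BYTES:
        errors.append("data exceeds the 1MB limit")
    if request.expiry_time_in_ms is not None and request.expiry_time_in_ms <= 0:
        errors.append("expiryTimeInMs must be positive")
    return errors


print(validate(CreatePasteRequest(data="hello world")))
print(validate(CreatePasteRequest(data="", expiry_time_in_ms=-5)))
```

Measuring the size in encoded bytes rather than characters keeps the limit meaningful for non-ASCII text.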
Design Considerations
Where do we store the text uploaded by the user? We can store the user’s text in a file and keep that file in object storage, for example AWS S3.
Where do we store the mapping from a pasteId to its file in object storage, along with other paste metadata? We can keep this data in persistent storage that is highly available and scalable. Since we don’t need relations between database entities, we can go with a NoSQL database. For our case we can choose a key-value database like AWS DynamoDB.
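The split between the two stores can be sketched with in-memory stand-ins. Nothing here talks to S3 or DynamoDB: the two dictionaries, the `InMemoryPasteStore` class and the `pastes/<id>.txt` key layout are assumptions made for illustration only.

```python
from typing import Dict, Optional


class InMemoryPasteStore:
    """Toy model of the design: text goes to an object store, metadata to a key-value store."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}   # stand-in for the object storage bucket
        self.metadata: Dict[str, dict] = {}   # stand-in for the key-value database

    def save(self, paste_id: str, text: str) -> None:
        object_key = f"pastes/{paste_id}.txt"  # hypothetical key layout
        body = text.encode("utf-8")
        # Write the object first so the metadata never points at a missing file.
        self.objects[object_key] = body
        self.metadata[paste_id] = {"s3Key": object_key, "sizeBytes": len(body)}

    def load(self, paste_id: str) -> Optional[str]:
        record = self.metadata.get(paste_id)
        if record is None:
            return None  # unknown paste id
        return self.objects[record["s3Key"]].decode("utf-8")


store = InMemoryPasteStore()
store.save("abc123", "hello paste")
print(store.load("abc123"))
print(store.load("missing"))
```

Keeping only a small record per paste in the database means the database stays compact while the large text bodies live in cheaper object storage.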
Database Design
We need two tables to store our data.
1. User table to store the user details.
userId: String [HashKey]
name: String
emailId: String
creationDateInUtc: long
2. Url table to store the pasteUrl to S3 key mapping details.
pasteUrl: String [HashKey]
s3Key: String
creationDateInUtc: long
expirationDateInUtc: long
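For illustration, an entry of the Url table could be modelled as below. The schema does not state the unit of the two date fields, so this sketch assumes epoch milliseconds in UTC, and it treats a missing expiration date as "never expires" because the expiry time is optional. The `is_expired` helper is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UrlRecord:
    """One row of the Url table; timestamps are assumed to be epoch milliseconds (UTC)."""
    paste_url: str                                 # hash key
    s3_key: str                                    # location of the text in object storage
    creation_date_utc_ms: int
    expiration_date_utc_ms: Optional[int] = None   # None means the paste never expires

    def is_expired(self, now_utc_ms: int) -> bool:
        """True once the current time has reached the expiration time."""
        if self.expiration_date_utc_ms is None:
            return False
        return now_utc_ms >= self.expiration_date_utc_ms


record = UrlRecord("aZ3x9Q", "pastes/aZ3x9Q.txt", creation_date_utc_ms=1_000, expiration_date_utc_ms=5_000)
print(record.is_expired(now_utc_ms=4_999))
print(record.is_expired(now_utc_ms=5_000))
```

A read path would typically call a check like this before returning a paste, so that an expired entry is not served even if it has not been cleaned up yet.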
High Level Design
Component Details
- PasteUrlFormat — As per our capacity estimation, we will need to store about 1.8 billion pasteUrl records (1M pastes per day * 365 * 5). If we build a pasteUrl from [a-z, A-Z, 0–9] as a 6-character string, we get 62⁶ ~ 56.8 billion possible values. So a 6-character string is sufficient for generating all pasteUrls. Once a pasteUrl expires, we can reuse it for another paste, which limits how many keys we need.
- KeyGenerationService — This service is responsible for providing a new key (aka pasteUrl). Each key should be 6 characters long. There are multiple approaches, discussed below.
— Random key generation on every request. One approach is to generate a random key on every request and return it to PastebinService. The issues with this approach are: 1) Random generation can produce a duplicate key, in which case KeyGenerationService has to retry to generate a different key, which increases the latency of the write request. 2) We cannot reuse keys once they have expired, so we would need to increase the paste key length more often to keep up with traffic.
— Pre-generate the keys and use them on every request. The other approach is to pre-generate the keys and store them in a database. The database holds each key together with a flag that indicates whether the key is already in use. For a new pasteUrl request, KeyGenerationService fetches an unused key from KeyDDB and returns it to PastebinService. Here we need strong consistency so that the same key is never returned for 2 different requests. To solve this, each KeyGenerationService host can keep some keys in a local cache: the host queries the database for a few keys (let’s say 1000), marks them as used in KeyDDB and stores them in its cache. Refilling the cache can be done by a separate task scheduled at the KeyGenerationService level. Once a pasteUrl expires, a separate executor (like an AWS Lambda) marks its entry in the database as available again.
- KeyDDB — Database to store the pre-generated keys. It contains each key and a flag that tells whether the key is already in use.
- Lambda — AWS Lambda that updates KeyDDB and marks a key as available whenever an existing paste expires. The Lambda can listen to PasteDDB update events and update KeyDDB.
- PastebinService — This service provides all the client-facing APIs. It uploads the user’s file to the S3 bucket and stores the entry in PasteDDB. It is also responsible for fetching the user’s text file and its details on a get call.
- S3 Bucket — This is the S3 bucket we are using to store the user’s text file.
- RedisCache — Our system is read heavy, which means most requests fetch a paste by its pasteUrl. To improve the latency of getPaste we can add a caching layer that stores the pasteUrl to S3 key mapping. PastebinService first tries to fetch the mapping from the cache. If it is present, the service uses it to return the response; if not, it fetches the mapping from PasteDDB.
- PasteDDB — This is the Pastebin database, using AWS DynamoDB to store the database entities defined above.
- LoadBalancer — Since we want our system to be highly available, we need a load balancer between the client and PastebinService. There will be multiple instances running PastebinService, and the load balancer’s job is to distribute traffic among them. This ensures that requests do not go to a faulty server and that load is spread evenly across servers.
- Client — The client here refers to the front-end application or a 3rd-party system that uses our exposed APIs to provide a pasteUrl service to end users.
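The pre-generated key approach described under KeyGenerationService can be sketched as follows. This is a minimal single-process model, not a production design: `InMemoryKeyDb` stands in for KeyDDB, `KeyGenerationHost` for one service host, and all class and method names are assumptions. A real key store would have to claim a batch atomically (for example with conditional writes), which a plain dictionary does not demonstrate.

```python
import itertools
import string
from collections import deque
from typing import Deque, Dict, Iterable, List

ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits  # 62 symbols
KEY_LENGTH = 6
KEY_SPACE = len(ALPHABET) ** KEY_LENGTH  # 62**6 possible keys


def pregenerate_keys(count: int) -> List[str]:
    """Return the first `count` keys in lexicographic order (sequential only to keep the demo simple)."""
    combos = itertools.product(ALPHABET, repeat=KEY_LENGTH)
    return ["".join(c) for c in itertools.islice(combos, count)]


class InMemoryKeyDb:
    """Stand-in for KeyDDB: maps each key to a 'used' flag."""

    def __init__(self, keys: Iterable[str]) -> None:
        self.used: Dict[str, bool] = {key: False for key in keys}

    def claim_batch(self, size: int) -> List[str]:
        """Mark up to `size` unused keys as used and return them."""
        batch = [key for key, used in self.used.items() if not used][:size]
        for key in batch:
            self.used[key] = True
        return batch

    def release(self, key: str) -> None:
        """Mark a key as available again, e.g. after its paste has expired."""
        self.used[key] = False


class KeyGenerationHost:
    """One service host that serves keys from a local cache refilled in batches."""

    def __init__(self, db: InMemoryKeyDb, batch_size: int = 1000) -> None:
        self.db = db
        self.batch_size = batch_size
        self.cache: Deque[str] = deque()

    def next_key(self) -> str:
        if not self.cache:
            # Refill on demand here; the text suggests a scheduled background task instead.
            self.cache.extend(self.db.claim_batch(self.batch_size))
        if not self.cache:
            raise RuntimeError("no unused keys left")
        return self.cache.popleft()


db = InMemoryKeyDb(pregenerate_keys(5))
host_a = KeyGenerationHost(db, batch_size=2)
host_b = KeyGenerationHost(db, batch_size=2)
print(KEY_SPACE)
print([host_a.next_key(), host_b.next_key(), host_a.next_key()])
```

Because each host claims its batch before serving from it, two hosts never hand out the same key. One trade-off of this scheme is that keys sitting in the cache of a host that crashes stay marked as used without ever being served, which is usually acceptable given how much larger the key space is than the expected number of pastes.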
Future Extension
- Add user access level checks. The creator of a paste can specify which users can access the paste.
- Add analytics that capture usage, such as how many times a paste was accessed and the total number of pastes created/fetched per second.