8 Software Engineering Principles (Including SOLID)
System Design Interview Preparation Slides: https://drive.google.com/file/d/1IVAWvgvvpViyvBw9NB0JBgWsrfYj11Nl/view
- https://www.geeksforgeeks.org/top-10-system-design-interview-questions-and-answers/
- https://www.geeksforgeeks.org/how-to-crack-system-design-round-in-interviews/
- https://towardsdatascience.com/the-complete-guide-to-the-system-design-interview-ba118f48bdfc
- https://www.notion.so/System-Design-d7cb433478cf47cba54a3bc1889c8bb1
How to rock a systems design interview
Introduction to Architecting Systems for Scale
Scalable System Design Patterns
Scalable Web Architecture and Distributed Systems
What is the best way to design a web site to be highly scalable?
https://github.com/donnemartin/system-design-primer
Detailed SQL/NoSQL, CAP, ACID/BASE Discussion
API GATEWAY
LOAD BALANCING METHODS
There are a variety of load balancing methods, each using a different algorithm for different needs (a short sketch of a few of them follows this list).
- Least Connection Method — This method directs traffic to the server with the fewest active connections. This approach is quite useful when there are a large number of persistent client connections which are unevenly distributed between the servers.
- Least Response Time Method — This algorithm directs traffic to the server with the fewest active connections and the lowest average response time.
- Least Bandwidth Method – This method selects the server that is currently serving the least amount of traffic measured in megabits per second (Mbps).
- Round Robin Method — This method cycles through a list of servers and sends each new request to the next server. When it reaches the end of the list, it starts over at the beginning. It is most useful when the servers are of equal specification and there are not many persistent connections.
- Weighted Round Robin Method — Weighted round-robin scheduling is designed to better handle servers with different processing capacities. Each server is assigned a weight (an integer value indicating its processing capacity); servers with higher weights receive new connections first and get proportionally more connections than servers with lower weights.
- IP Hash — Under this method, a hash of the client's IP address is calculated to route the request to a server, so the same client is consistently sent to the same backend.
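A minimal Python sketch of a few of these strategies (round robin, weighted round robin, and IP hash), assuming a hypothetical list of backend addresses and weights; real load balancers (NGINX, HAProxy, ELB) implement these natively.

```python
import hashlib
from itertools import cycle

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]            # hypothetical backends
weights = {"10.0.0.1": 5, "10.0.0.2": 3, "10.0.0.3": 1}    # hypothetical capacities

# Round Robin: cycle through the server list, one server per new request.
_rr = cycle(servers)
def round_robin():
    return next(_rr)

# Weighted Round Robin: repeat each server proportionally to its weight.
_wrr = cycle([s for s in servers for _ in range(weights[s])])
def weighted_round_robin():
    return next(_wrr)

# IP Hash: hash the client IP so the same client always lands on the same server.
def ip_hash(client_ip: str) -> str:
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

if __name__ == "__main__":
    print([round_robin() for _ in range(4)])
    print([weighted_round_robin() for _ in range(9)])
    print(ip_hash("203.0.113.7"), ip_hash("203.0.113.7"))   # sticky: same server twice
```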
How to Identify Bottlenecks?
- Performance Testing
- Load Testing (expected load: how the system behaves and whether it scales)
- Stress Testing (breaking point: which components and resources, such as memory, CPU, or disk I/O, fail first)
- Soak Testing (resource leaks over time, e.g. memory)
- Apache JMeter can be used to generate heavy load
- Health Monitoring
- Metrics, Dashboards, Alerts that show the four "golden signals" (see the short example after this list):
- Latency (how long it takes to serve a request)
- Traffic
- Errors
- Saturation (how "full" the system's resources are)
- Auditing (weak or strong auditing; Lambda system)
- Play/Game days (planned exercises that simulate failures)
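As a toy illustration of the metrics point above, the snippet below computes an error rate and latency percentiles from a handful of made-up request samples; in practice these numbers come from a monitoring stack (CloudWatch, Prometheus, Grafana dashboards) rather than hand-rolled code.

```python
from statistics import quantiles

# Hypothetical request samples: (latency in ms, HTTP status code).
requests = [(42, 200), (55, 200), (1200, 500), (38, 200), (90, 200), (2300, 503), (61, 200)]

latencies = sorted(ms for ms, _ in requests)
errors = sum(1 for _, status in requests if status >= 500)

# p50/p95/p99 via the standard library (n=100 yields percentile cut points).
cuts = quantiles(latencies, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"traffic={len(requests)} requests, error rate={errors / len(requests):.1%}")
print(f"latency p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
```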
How to Fix Bottlenecks?
- Optimise processes and increase output/products/services using the same resources (Vertical scaling)
- Prepare beforehand during non-peak hours (pre-processing / cron job)
- Keep backups and avoid a single point of failure (replication / master-slave)
- Hire more resources (Horizontal scaling)
- Manage resources based on the business needs (e.g. 7 pizza chefs, 3 on garlic bread) and separate duties and entities (Microservices)
- Distributed Systems / Positioning. Have multiple DataCenters for Disaster Recovery and Business Continuity (maybe even in different countries). This might also help with response time (depending on how the data is routed) + Fault Tolerance.
- Load balancing can help use resources more efficiently (send customer orders to Pizza Shop 1 vs. Pizza Shop 2 based on the order, location, wait time, etc.).
- Decouple the system; separate out the concerns, e.g. order processing, delivery, etc.
- Put Metrics in place (Reporting, Analytics, Auditing, Machine Learning)
- Make sure the solution is Extensible
Jeff Dean: “Achieving Rapid Response Times in Large Online Services” Keynote
Kilobytes Megabytes Gigabytes Terabytes: https://web.stanford.edu/class/cs101/bits-gigabytes.html
System Design Samples
Introduction to AWS Services by AWS Training Center
System Design Interview – Notification System
FRONTEND SERVICE COMPONENTS
Request Validation
- Required parameters are present
- Data falls within an acceptable range
Authentication/Authorization
- Authentication is the process of validating the identity of a user or a service
- Authorization is the process of determining whether or not a specific actor is permitted to take action
TLS Termination
- TLS is a protocol that aims to provide privacy and data integrity
- TLS termination refers to the process of decrypting requests and passing unencrypted requests on to the backend service
- SSL on the load balancer is expensive
- Termination is usually handled not by the FrontEnd service itself but by a separate TLS HTTP proxy that runs as a process on the same host
Server-side encryption
- Messages are encrypted as soon as FrontEnd receives them
- Messages are stored in encrypted form and FrontEnd decrypts messages only when they are sent back to a consumer
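A minimal sketch of the encrypt-on-receive / decrypt-on-deliver idea, assuming the third-party Python cryptography package is available; key management (e.g. via a KMS) and per-message data keys are left out.

```python
from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()   # in practice the key comes from a KMS, not the process itself
cipher = Fernet(key)

def store_message(plaintext: str) -> bytes:
    # FrontEnd encrypts as soon as the message is received; only ciphertext is stored.
    return cipher.encrypt(plaintext.encode())

def deliver_message(ciphertext: bytes) -> str:
    # Decrypt only at the moment the message is sent back to a consumer.
    return cipher.decrypt(ciphertext).decode()

stored = store_message("order #1234 confirmed")
print(stored)                   # opaque ciphertext at rest
print(deliver_message(stored))  # "order #1234 confirmed"
```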
Caching
- Cache stores copies of source data
- It helps reduce load on backend services, increases overall system throughput and availability, and decreases latency
- Stores metadata information about the most actively used queues
- Stores user identity information to save on calls to auth services
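A tiny in-process LRU cache sketch of the idea of keeping copies of hot data close to the FrontEnd; a production system would more likely use a dedicated cache such as Redis or Memcached with TTLs and invalidation.

```python
from collections import OrderedDict

class LRUCache:
    """Keep the most recently used entries; evict the oldest when full."""
    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.entries: OrderedDict = OrderedDict()

    def get(self, key: str):
        if key not in self.entries:
            return None                       # cache miss: caller falls back to the backend
        self.entries.move_to_end(key)         # mark as most recently used
        return self.entries[key]

    def put(self, key: str, value: object):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

# e.g. cache queue metadata or user identity lookups in front of the auth service
metadata_cache = LRUCache(capacity=10_000)
```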
Rate limiting (Throttling)
- Throttling is the process of limiting the number of requests you can submit to a given operation in a given amount of time
- Throttling protects the web service from being overwhelmed with requests
- The leaky bucket algorithm is one of the most widely used (see the sketch below)
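A minimal leaky-bucket-as-meter sketch with hypothetical capacity and leak-rate values; a real implementation would typically keep the counters in a shared store (e.g. Redis) so all FrontEnd hosts enforce the same limits.

```python
import time

class LeakyBucket:
    """Allow a request if the bucket, which leaks at a constant rate, has room for one more drop."""
    def __init__(self, capacity: int, leak_rate_per_sec: float):
        self.capacity = capacity
        self.leak_rate = leak_rate_per_sec
        self.water = 0.0
        self.last_check = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drain water proportionally to the time elapsed since the last check.
        self.water = max(0.0, self.water - (now - self.last_check) * self.leak_rate)
        self.last_check = now
        if self.water + 1 <= self.capacity:
            self.water += 1
            return True
        return False    # throttled: respond with HTTP 429

bucket = LeakyBucket(capacity=10, leak_rate_per_sec=5)     # hypothetical limits
print([bucket.allow() for _ in range(12)])                 # the last requests get throttled
```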
Request dispatching
- Responsible for all the activities associated with sending requests to backend services (client management, response handling, resource isolation, etc.)
- Bulkhead pattern helps to isolate elements of an application into pools so that if one fails, the others continue to function
- Circuit Breaker pattern prevents an application from repeatedly trying to execute an operation that’s likely to fail
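A bare-bones sketch of the Circuit Breaker pattern with hypothetical thresholds; in practice this usually comes from a library (e.g. pybreaker or resilience4j) rather than hand-rolled code.

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period, then probe it again."""
    def __init__(self, failure_threshold: int = 5, reset_timeout_sec: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout_sec
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed (calls allowed)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")  # skip the doomed call
            self.opened_at = None      # half-open: let one probe request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0              # a success resets the failure count
        return result
```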
Request de-duplication
- May occur when a response to a successful sendMessage request fails to reach the client
- Less of an issue for "at least once" delivery semantics; a bigger issue for "exactly once" and "at most once" delivery semantics
- Caching is usually used to store previously seen request ids
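A small sketch of de-duplication by caching previously seen request ids with a TTL, assuming clients attach a unique id to each sendMessage call; the id format and TTL are made up for illustration.

```python
import time

class DeduplicationCache:
    """Remember request ids for a while; a repeated id means the request was already processed."""
    def __init__(self, ttl_sec: float = 300.0):
        self.ttl = ttl_sec
        self.seen = {}                       # request_id -> time first seen

    def is_duplicate(self, request_id: str) -> bool:
        now = time.monotonic()
        # Drop expired ids so the cache does not grow without bound.
        self.seen = {rid: t for rid, t in self.seen.items() if now - t < self.ttl}
        if request_id in self.seen:
            return True
        self.seen[request_id] = now
        return False

dedup = DeduplicationCache()
print(dedup.is_duplicate("req-42"))   # False: first time, process it
print(dedup.is_duplicate("req-42"))   # True: retry of the same request, return the cached response
```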
System Design Interview – Distributed Message Queue
Top K (Most Viewed Video) – High Level Architecture
System Design Interview – Top K Problem
Designing a URL Shortening like Tiny URL: https://www.educative.io/courses/grokking-the-system-design-interview/m2ygV4E81AR
Capacity Estimation and Constraints for Tiny URL #
Our system will be read-heavy. There will be lots of redirection requests compared to new URL shortenings. Let’s assume a 100:1 ratio between read and write.
Traffic estimates: Assuming, we will have 500M new URL shortenings per month, with 100:1 read/write ratio, we can expect 50B redirections during the same period: 100 * 500M => 50B
What would be Queries Per Second (QPS) for our system? New URLs shortenings per second: 500 million / (30 days * 24 hours * 3600 seconds) = ~200 URLs/s
Considering a 100:1 read/write ratio, URL redirections per second will be: 100 * 200 URLs/s = 20K/s
Storage estimates: Let’s assume we store every URL shortening request (and associated shortened link) for 5 years. Since we expect to have 500M new URLs every month, the total number of objects we expect to store will be 30 billion: 500 million * 5 years * 12 months = 30 billion
Let’s assume that each stored object will be approximately 500 bytes (just a ballpark estimate; we will dig into it later). We will need 15TB of total storage: 30 billion * 500 bytes = 15 TB
Bandwidth estimates: For write requests, since we expect 200 new URLs every second, total incoming data for our service will be 100KB per second: 200 * 500 bytes = 100 KB/s
For read requests, since every second we expect ~20K URLs redirections, total outgoing data for our service would be 10MB per second: 20K * 500 bytes = ~10 MB/s
Memory estimates: If we want to cache some of the hot URLs that are frequently accessed, how much memory will we need to store them? If we follow the 80-20 rule, meaning 20% of URLs generate 80% of traffic, we would like to cache these 20% hot URLs.
Since we have 20K requests per second, we will be getting 1.7 billion requests per day: 20K * 3600 seconds * 24 hours = ~1.7 billion
To cache 20% of these requests, we will need 170GB of memory. 0.2 * 1.7 billion * 500 bytes = ~170GB
One thing to note here is that since there will be many duplicate requests (of the same URL), our actual memory usage will be less than 170GB.
**High-level estimates:** Assuming 500 million new URLs per month and a 100:1 read:write ratio, the following is a summary of the high-level estimates for our service:
| Metric | Estimate |
| --- | --- |
| New URLs | 200/s |
| URL redirections | 20K/s |
| Incoming data | 100KB/s |
| Outgoing data | 10MB/s |
| Storage for 5 years | 15TB |
| Memory for cache | 170GB |
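The back-of-the-envelope numbers above can be reproduced with a few lines of arithmetic; this is just the section's own assumptions (500M new URLs/month, 100:1 read:write ratio, 500 bytes per object, 5 years of storage, 80-20 caching rule) written out.

```python
new_urls_per_month = 500_000_000
read_write_ratio = 100
object_size_bytes = 500
years = 5

writes_per_sec = new_urls_per_month / (30 * 24 * 3600)        # ~200 URLs/s
reads_per_sec = writes_per_sec * read_write_ratio              # ~20K/s

total_objects = new_urls_per_month * 12 * years                # 30 billion
storage_tb = total_objects * object_size_bytes / 1e12          # ~15 TB

incoming_kb_s = writes_per_sec * object_size_bytes / 1e3       # ~100 KB/s
outgoing_mb_s = reads_per_sec * object_size_bytes / 1e6        # ~10 MB/s

requests_per_day = reads_per_sec * 3600 * 24                   # ~1.7 billion
cache_gb = 0.2 * requests_per_day * object_size_bytes / 1e9    # ~170 GB (80-20 rule)

print(f"{writes_per_sec:.0f} writes/s, {reads_per_sec:.0f} reads/s")
print(f"{storage_tb:.0f} TB storage, {incoming_kb_s:.0f} KB/s in, {outgoing_mb_s:.1f} MB/s out")
print(f"{cache_gb:.0f} GB cache")
```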
Designing Instagram: https://www.educative.io/courses/grokking-the-system-design-interview/m2yDVZnQ8lG
Data Size Estimation for Instagram #
Let’s estimate how much data will be going into each table and how much total storage we will need for 10 years.
User: Assuming each “int” and “dateTime” is four bytes, each row in the User table will be 68 bytes: UserID (4 bytes) + Name (20 bytes) + Email (32 bytes) + DateOfBirth (4 bytes) + CreationDate (4 bytes) + LastLogin (4 bytes) = 68 bytes
If we have 500 million users, we will need 32GB of total storage. 500 million * 68 ~= 32GB
Photo: Each row in the Photo table will be 284 bytes: PhotoID (4 bytes) + UserID (4 bytes) + PhotoPath (256 bytes) + PhotoLatitude (4 bytes) + PhotoLongitude (4 bytes) + UserLatitude (4 bytes) + UserLongitude (4 bytes) + CreationDate (4 bytes) = 284 bytes
If 2M new photos get uploaded every day, we will need 0.5GB of storage per day: 2M * 284 bytes ~= 0.5GB per day. For 10 years we will need 1.88TB of storage.
UserFollow: Each row in the UserFollow table will consist of 8 bytes. If we have 500 million users and on average each user follows 500 others, we will need 1.82TB of storage for the UserFollow table: 500 million users * 500 followers * 8 bytes ~= 1.82TB
Total space required for all tables for 10 years will be 3.7TB: 32GB + 1.88TB + 1.82TB ~= 3.7TB
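Likewise, the Instagram table sizes follow directly from the per-row byte counts and the stated assumptions (500M users, 2M photos/day, 500 follows per user, 10 years); the binary-prefix division below is only there to reproduce the rounded figures quoted above.

```python
users = 500_000_000
photos_per_day = 2_000_000
follows_per_user = 500
years = 10

user_row_bytes = 68      # UserID + Name + Email + DateOfBirth + CreationDate + LastLogin
photo_row_bytes = 284    # PhotoID + UserID + PhotoPath + 4 coordinates + CreationDate
follow_row_bytes = 8     # two 4-byte ids

GB, TB = 2**30, 2**40    # the quoted figures round with binary prefixes

user_table = users * user_row_bytes / GB                              # ~32 GB
photo_table = photos_per_day * photo_row_bytes * 365 * years / TB     # ~1.88 TB
follow_table = users * follows_per_user * follow_row_bytes / TB       # ~1.82 TB

print(f"User: {user_table:.0f} GB, Photo: {photo_table:.2f} TB, UserFollow: {follow_table:.2f} TB")
print(f"Total for 10 years: {user_table / 1024 + photo_table + follow_table:.1f} TB")
```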
https://cloud.google.com/docs/compare/aws
GCP-AWS Service comparisons
The following table provides a side-by-side comparison of the various services available on AWS and Google Cloud.
| Service Category | Service | AWS | Google Cloud |
| --- | --- | --- | --- |
| Compute | IaaS | Amazon Elastic Compute Cloud | Compute Engine |
| | PaaS | AWS Elastic Beanstalk | App Engine |
| | FaaS | AWS Lambda | Cloud Functions |
| Containers | CaaS | Amazon Elastic Kubernetes Service, Amazon Elastic Container Service | Google Kubernetes Engine |
| | Containers without infrastructure | AWS Fargate | Cloud Run |
| | Container registry | Amazon Elastic Container Registry | Container Registry |
| Networking | Virtual networks | Amazon Virtual Private Cloud | Virtual Private Cloud |
| | Load balancer | Elastic Load Balancer | Cloud Load Balancing |
| | Dedicated Interconnect connection | AWS Direct Connect | Cloud Interconnect |
| | Domains and DNS | Amazon Route 53 | Google Domains, Cloud DNS |
| | CDN | Amazon CloudFront | Cloud CDN |
| | DDoS firewall | AWS Shield, AWS WAF | Google Cloud Armor |
| Storage | Object storage | Amazon Simple Storage Service | Cloud Storage |
| | Block storage | Amazon Elastic Block Store | Persistent Disk |
| | Reduced-availability storage | Amazon S3 Standard-Infrequent Access, Amazon S3 One Zone-Infrequent Access | Cloud Storage Nearline and Cloud Storage Coldline |
| | Archival storage | Amazon Glacier | Cloud Storage Archive |
| | File storage | Amazon Elastic File System | Filestore |
| | In-memory data store | Amazon ElastiCache for Redis | Memorystore |
| Database | RDBMS | Amazon Relational Database Service, Amazon Aurora | Cloud SQL, Cloud Spanner |
| | Document data storage | Amazon DocumentDB | Firestore |
| | In-memory data storage | Amazon ElastiCache | Memorystore |
| | NoSQL: Key-value | Amazon DynamoDB | Firestore (document workloads), Cloud Bigtable (key/value workloads), Cloud Spanner (scale-out relational workloads) |
| | NoSQL: Indexed | Amazon SimpleDB | Firestore |
| | SQL | Amazon Aurora | Cloud Spanner |
| Data analytics | Data warehouse | Amazon Redshift | BigQuery |
| | Query service | Amazon Athena | BigQuery |
| | Messaging | Amazon Simple Notification Service, Amazon Simple Queue Service | Pub/Sub |
| | Batch data processing | Amazon Elastic MapReduce, AWS Batch | Dataproc, Dataflow |
| | Stream data processing | Amazon Kinesis | Dataflow |
| | Stream data ingest | Amazon Kinesis | Pub/Sub |
| | Workflow orchestration | Amazon Data Pipeline, AWS Glue | Cloud Composer |
| Management tools | Deployment | AWS CloudFormation | Cloud Deployment Manager |
| | Cost management | AWS Budgets | Cost Management |
| Operations | Monitoring | Amazon CloudWatch | Cloud Monitoring |
| | Logging | Amazon CloudWatch Logs | Cloud Logging |
| | Audit logging | AWS CloudTrail | Cloud Audit Logs |
| | Debugging | AWS X-Ray | Cloud Debugger |
| | Performance tracing | AWS X-Ray | Cloud Trace |
| Security & identity | IAM | AWS Identity and Access Management | Identity and Access Management |
| | Secret management | AWS Secrets Manager | Secret Manager |
| | Encrypted keys | AWS Key Management Service | Cloud Key Management Service |
| | Resource monitoring | AWS Config | Cloud Asset Inventory |
| | Vulnerability scanning | Amazon Inspector | Web Security Scanner |
| | Threat detection | Amazon GuardDuty | Event Threat Detection (beta) |
| | Microsoft Active Directory | AWS Directory Service | Managed Service for Microsoft Active Directory |
| Machine learning | Speech | Amazon Transcribe | Speech-to-Text |
| | Vision | Amazon Rekognition | Cloud Vision |
| | Natural Language Processing | Amazon Comprehend | Cloud Natural Language API |
| | Translation | Amazon Translate | Cloud Translation |
| | Conversational interface | Amazon Lex | Dialogflow Enterprise Edition |
| | Video intelligence | Amazon Rekognition Video | Video Intelligence API |
| | Auto-generated models | Amazon SageMaker Autopilot | AutoML |
| | Fully managed ML | Amazon SageMaker | AI Platform |
| Internet of Things | IoT services | Amazon IoT | Cloud IoT |
Monolithic vs. Microservices Architecture Comparison
MapReduce in Hadoop – by Simplilearn
ENTERPRISE ARCHITECTURE (EA)
The Open Group Architecture Framework (TOGAF)
Google S2 Library & Hilbert Curve
http://bit-player.org/extras/hilbert/hilbert-construction.html
Credits: Google Cloud, Grokking, System Design Interview, AWS Training Center, Simplilearn