Rate limiting restricts the flow of requests that a server will accept within a time window. When a caller exceeds the limit in that window, the rate limiter denies the excess requests, preventing them from ever reaching our server.
Many modern applications delegate rate limiting to edge components such as API gateways or reverse proxies, which can efficiently enforce limits before traffic reaches backend services.
Here we will focus on designing a standalone rate limiter that is integrated into or called by application servers directly.
Questions to clarify with the interviewer:
- the number of requests and the unit of time,
- the max size of the incoming requests,
- which endpoints to rate limit,
- how to respond when rate limiting,
- how to react on rate limiter failure,
- where to locate the rate limiter,
- where to store rate-limiting state,
- whether some endpoints require different limits,
- whether to limit per account or per IP or otherwise,
- whether to institute a global rate limit,
- whether different tiers of clients should be limited differently,
- whether to impose a strict limit or allow bursts of requests,
- which rate-limiting algorithm to use, and
- what the recovery time should be after a rate-limiting failure.
Functional Requirements
- We should be able to configure different limit values for different APIs, per minute or per hour (rate-limiting state).
- In-memory storage: not persistent across application restarts and limited in scalability; suited to small-scale applications with low traffic.
- Centralized store: adds latency, creates a potential bottleneck in our system, and runs the risk of race conditions.
- Distributed storage: increased complexity and potential data-consistency issues. We can scale out horizontally to multiple stores, so there is no single bottleneck, but state must then be synchronized across all stores; a consensus algorithm such as Raft can be used for this. Given the delay in propagating updates across instances, we introduce the possibility of inaccuracies in the rate-limiting state. Depending on the use case, these inaccuracies may or may not be tolerable.
- Redis: a highly scalable in-memory data store.
- Memcached: a distributed memory object caching system.
- The user should be able to set a recovery time after the limit is reached for each API, based on how sensitive the API's usage is.
- Use an algorithm such as the leaky bucket to build the rate limiter (see the sketch after this list).
- Add user tiers based on the user's subscription.
- Handle errors gracefully after the limit is reached.
- Block the request with 429 (Too Many Requests) or 503 (Service Unavailable).
- Shadow-ban the abuser: return 200 but do not process the request.
- Throttle the request: return the response with a delay.
- Deprioritize: process the request only after other queued requests are processed.
- Monitor the system to track success and failure metrics and user access patterns, for better configuration of the rate limiter.
- Delete an API's counter key when the API is no longer in use.
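To make the leaky-bucket requirement above concrete, here is a minimal in-memory sketch in Python. The class name, capacity, and leak rate are illustrative assumptions rather than part of the requirements, and a production limiter would keep this state in a shared store rather than in process memory.

```python
import time


class LeakyBucket:
    """Leaky bucket as a meter: each request adds one unit of 'water',
    and the bucket drains at a constant rate. A full bucket means reject."""

    def __init__(self, capacity: int, leak_rate_per_sec: float) -> None:
        self.capacity = capacity            # maximum units the bucket can hold
        self.leak_rate = leak_rate_per_sec  # drain rate (requests per second)
        self.water = 0.0                    # current fill level
        self.last_checked = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drain according to the time elapsed since the last check.
        elapsed = now - self.last_checked
        self.water = max(0.0, self.water - elapsed * self.leak_rate)
        self.last_checked = now
        if self.water + 1 <= self.capacity:
            self.water += 1   # the request fits: pour one unit in
            return True
        return False          # bucket full: deny (e.g. respond with 429)


# Illustrative usage: roughly 5 requests/second sustained, bursts up to 10.
limiter = LeakyBucket(capacity=10, leak_rate_per_sec=5)
print("allowed" if limiter.allow() else "429 Too Many Requests")
```

Because the drain rate is constant, this variant smooths outgoing traffic; a token bucket would instead allow short bursts up to its capacity, which is relevant to the burst-handling requirement below.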
Non-Functional Requirements
- The implementation should not add significant performance overhead, even under high API load.
- The limit mapping should be kept in a cache such as Redis for quick access.
- Handle bursts of traffic as long as the limit has not been reached.
- It should work in a distributed system and be scalable.
Use cases:
1. Excessive use of our own resources: when the high load is malicious, as in a DDoS attack, rate limiting serves as our first line of defense.
2. Excessive use of external resources: if we rely on a paid external service, setting a rate limit ensures that excessive usage does not lead to cost overruns.
3. Excessive use of our users' resources.
Where should we store the counter that is used to decide whether the limit has been reached?
Relational Database:
- High latency: Relational databases involve disk I/O, which introduces delays on every read/write.
- Concurrency bottlenecks: Handling thousands of concurrent updates (e.g., one per incoming request) can lead to locks and race conditions.
- Limited throughput: RDBMSs are not optimized for high-frequency, real-time counter updates.
Redis as a distributed cache:
- Sub-millisecond latency for both reads and writes.
- Atomic operations like INCR, INCRBY, and EXPIRE, ensuring safe concurrent updates without race conditions.
- TTL (time-to-live) support, allowing counters to reset automatically at the end of each time window.
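As a sketch of the Redis approach described above, the following fixed-window counter uses INCR and EXPIRE through the redis-py client. The key format, window length, limit, and connection details are assumptions for illustration only.

```python
import time

import redis  # assumes the redis-py client library is available

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

WINDOW_SECONDS = 60  # assumed window length: one minute
LIMIT = 100          # assumed allowance per window


def is_allowed(client_id: str) -> bool:
    """Fixed-window counter: INCR a per-client key for the current window
    and set a TTL so the count is cleaned up automatically."""
    window = int(time.time() // WINDOW_SECONDS)
    key = f"ratelimit:{client_id}:{window}"
    pipe = r.pipeline()
    pipe.incr(key)                    # atomic increment, safe under concurrency
    pipe.expire(key, WINDOW_SECONDS)  # TTL removes the counter after the window
    count, _ = pipe.execute()
    return int(count) <= LIMIT


# Illustrative usage inside a request handler:
if not is_allowed("user-42"):
    print("429 Too Many Requests")
```

A fixed window is simple but can admit up to twice the limit across a window boundary; sliding-window or bucket-based variants, often implemented as a Lua script for atomicity, address this at the cost of extra complexity.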