When you build APIs that bill per token—like AI workloads—rate limiting stops being just a traffic control feature.
It becomes a revenue-protection mechanism.
We learned this the hard way: if you let users run multiple concurrent AI tasks before their token usage is reconciled, you can lose real money.
So we started from NestJS’s built-in throttler, explored Redis-based options, and eventually built our own token-bucket limiter with Lua.
This post walks through that decision process—what works, what doesn’t, and how to evolve your rate limiting when you move from simple backend requests to token-based billing.
## 1. Starting Point: NestJS Throttler
NestJS ships with a throttler module:
```bash
npm install @nestjs/throttler
```
It’s simple to set up:
```ts
ThrottlerModule.forRoot({
  ttl: 60,   // seconds
  limit: 10, // max requests per TTL window
});
```
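To actually enforce the limits, bind the guard, typically as a global guard (standard `@nestjs/throttler` wiring):

```ts
import { Module } from '@nestjs/common';
import { APP_GUARD } from '@nestjs/core';
import { ThrottlerGuard, ThrottlerModule } from '@nestjs/throttler';

@Module({
  imports: [ThrottlerModule.forRoot({ ttl: 60, limit: 10 })],
  // Apply the throttler to every route in the app.
  providers: [{ provide: APP_GUARD, useClass: ThrottlerGuard }],
})
export class AppModule {}
```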
Behind the scenes, the `ThrottlerGuard` intercepts requests and counts how many times a key (like `ip:route`) appears in a local in-memory map.
### How it works internally

- Each request pushes a timestamp into an array.
- On each hit, it removes timestamps older than `Date.now() - ttl`.
- If `array.length > limit`, it throws a `ThrottlerException` (HTTP 429).
- Old entries expire automatically via `setTimeout`.
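In code, that bookkeeping looks roughly like this (a sketch of the steps above, not the library’s actual source):

```ts
// Sketch of the guard's in-memory bookkeeping.
const hits = new Map<string, number[]>(); // key ("ip:route") -> timestamps

function recordHit(key: string, ttlMs: number, limit: number): void {
  const now = Date.now();
  // Keep only timestamps still inside the window.
  const recent = (hits.get(key) ?? []).filter((t) => t > now - ttlMs);
  recent.push(now);
  hits.set(key, recent);
  if (recent.length > limit) {
    throw new Error('429 Too Many Requests'); // ThrottlerException in NestJS
  }
}
```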
It’s a fixed-window counter—fast, but not distributed. Each NestJS instance has its own counters.
### The problem

If you scale horizontally, every instance has its own throttling state.
A single user hitting multiple instances can easily bypass the limit. For example:

- Instance A: 10 requests
- Instance B: 10 requests
- Combined: 20 requests (the intended limit was 10)

This works fine for small deployments, but fails for multi-node APIs.
## 2. Making It Distributed: Redis Storage
To synchronize rate limits across instances, NestJS supports pluggable storage backends.
You can install the Redis storage adapter:
```bash
npm install nestjs-throttler-storage-redis ioredis
```
and update your module:
```ts
import { ThrottlerStorageRedisService } from 'nestjs-throttler-storage-redis';

ThrottlerModule.forRoot({
  ttl: 60,
  limit: 10,
  storage: new ThrottlerStorageRedisService({
    host: 'localhost',
    port: 6379,
  }),
});
```
### How this version works

Internally, it uses Redis sorted sets (ZSET) and commands like:

```
ZREMRANGEBYSCORE key 0 (now - ttl)
ZADD key now now
EXPIRE key ttl
ZCARD key
```
That turns the throttler into a sliding-window limiter:

- Each timestamp is recorded in Redis.
- Old entries fall off automatically as their scores expire.
- The counter is shared across all app instances.

Distributed, smoother than fixed-window, but still request-based rather than cost-based.
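For intuition, here is roughly the same sliding-window check done by hand with ioredis (an illustrative sketch, not the adapter’s actual implementation; note it uses a unique ZSET member so two hits in the same millisecond don’t collide):

```ts
import IORedis from 'ioredis';

const redis = new IORedis();

// Count requests in the last ttlSec seconds; allow if under the limit.
async function slidingWindowCheck(key: string, ttlSec: number, limit: number) {
  const now = Date.now();
  const results = await redis
    .multi()
    .zremrangebyscore(key, 0, now - ttlSec * 1000) // evict old timestamps
    .zadd(key, now, `${now}:${Math.random()}`)     // record this hit
    .expire(key, ttlSec)                           // clean up idle keys
    .zcard(key)                                    // hits left in the window
    .exec();
  const count = Number(results?.[3]?.[1]);         // result of ZCARD
  return count <= limit;
}
```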
### When it’s good enough

This Redis-backed throttler is perfect if you:

- Only care about requests per second/minute
- Don’t need per-tier token limits
- Want plug-and-play scaling across multiple app instances

But if you’re charging users for token usage, not just requests, it’s not sufficient.
## 3. Why Request Limits Weren’t Enough for AI Workloads

Our use case: users trigger AI tasks that consume tokens.
A “request” can mean anywhere from 100 to 200,000 tokens. That means:

- A user sending 100 small tasks is fine.
- A user sending 3 giant prompts could blow their budget instantly.

We needed rate limiting by token cost, not just request count. And we needed it atomic, tier-aware, and distributed. The NestJS throttler can’t calculate token cost per request. We could extend `ThrottlerGuard`, but it would still lack atomic safety under concurrency.
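Why does atomicity matter? A naive read-modify-write against Redis has a classic race: two concurrent requests both read the same balance and both pass the check. A sketch (the `tokens:` key is hypothetical, for illustration):

```ts
import IORedis from 'ioredis';

// BROKEN: check-then-act against Redis is not atomic.
async function naiveCheck(redis: IORedis, userId: string, cost: number) {
  const remaining = Number(await redis.get(`tokens:${userId}`)); // read
  if (remaining < cost) return false;
  // Another request can execute the same read here and also pass...
  await redis.decrby(`tokens:${userId}`, cost);                  // ...write
  return true; // two concurrent requests may both be allowed
}
```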
That’s when we moved to Redis + Lua.
## 4. Token Bucket with Lua (Tier-Aware and Atomic)
The token-bucket algorithm gives a smooth, fair way to rate-limit while allowing bursts.
Each user has a bucket of tokens that refills at a steady rate.
Each request consumes tokens equal to its cost.
When the bucket’s empty, new requests are rejected until refill.
### The Lua Script

```lua
-- KEYS[1] = user key, e.g. "rate:{123}"
-- ARGV[1] = capacity (max tokens)
-- ARGV[2] = fill_rate_per_ms
-- ARGV[3] = now_ms
-- ARGV[4] = cost (tokens needed)

local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local fill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])

local data = redis.call("HMGET", key, "tokens", "ts")
local tokens = tonumber(data[1])
local ts = tonumber(data[2])

-- First sight of this user: start with a full bucket.
if tokens == nil then tokens = capacity end
if ts == nil then ts = now end

-- Refill based on elapsed time, capped at capacity.
local delta = now - ts
if delta < 0 then delta = 0 end
tokens = math.min(capacity, tokens + (delta * fill_rate))

local allowed = 0
local retry_after_ms = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
else
  -- How long until enough tokens have refilled?
  retry_after_ms = math.ceil((cost - tokens) / fill_rate)
end

redis.call("HMSET", key, "tokens", tokens, "ts", now)
-- Expire idle buckets once they would be full again anyway.
redis.call("PEXPIRE", key, math.ceil(capacity / fill_rate))

-- Note: Lua numbers are truncated to integers on the way back to Redis.
return { allowed, tokens, retry_after_ms }
```
This script:

- Refills tokens based on elapsed time.
- Deducts the cost atomically.
- Returns the remaining tokens and a retry delay.

No race conditions, even under heavy concurrency: Redis executes the entire script as a single atomic operation.
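You can smoke-test the script from the CLI before wiring it in (assuming it is saved as `token_bucket.lua`, a hypothetical filename; `date +%s%3N` needs GNU date):

```bash
# capacity=100000, fill_rate=1.6 tokens/ms, now=current ms, cost=5000
redis-cli --eval token_bucket.lua 'rate:{123}' , 100000 1.6 "$(date +%s%3N)" 5000
# First call on a fresh key consumes from a full bucket:
# 1) (integer) 1       -> allowed
# 2) (integer) 95000   -> tokens remaining
# 3) (integer) 0       -> retry_after_ms
```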
## 5. Integrating the Lua Bucket in NestJS

Load it dynamically with `ioredis`:
```ts
import IORedis from 'ioredis';

export class RateLimiter {
  private sha!: string;

  constructor(private redis: IORedis, private lua: string) {}

  async init() {
    // Cache the script server-side and keep its SHA for EVALSHA.
    this.sha = (await this.redis.script('LOAD', this.lua)) as string;
  }

  async checkTokens(userId: string, capacity: number, fillRate: number, cost: number) {
    const key = `rate:{${userId}}`;
    const now = Date.now();
    const res = await this.redis.evalsha(this.sha, 1, key, capacity, fillRate, now, cost);
    const [allowed, remaining, retryAfter] = (res as any[]).map(Number);
    return { allowed: !!allowed, remaining, retryAfter };
  }
}
```
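One operational wrinkle worth handling: if Redis restarts or a cluster node fails over, the script cache is flushed and `EVALSHA` fails with a NOSCRIPT error. A one-shot fallback sketch:

```ts
import IORedis from 'ioredis';

// Sketch: EVALSHA with a NOSCRIPT fallback to EVAL.
async function evalBucket(
  redis: IORedis,
  sha: string,
  lua: string,
  key: string,
  args: (string | number)[],
) {
  try {
    return await redis.evalsha(sha, 1, key, ...args);
  } catch (err: any) {
    if (String(err?.message).includes('NOSCRIPT')) {
      // EVAL re-caches the script server-side as a side effect.
      return redis.eval(lua, 1, key, ...args);
    }
    throw err;
  }
}
```

ioredis can also handle this for you if you register the script with `defineCommand` instead of calling `evalsha` directly.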
Then wrap it in a NestJS interceptor:
```ts
import {
  CallHandler,
  ExecutionContext,
  HttpException,
  HttpStatus,
  Injectable,
  NestInterceptor,
} from '@nestjs/common';

@Injectable()
export class TokenRateLimitInterceptor implements NestInterceptor {
  constructor(private limiter: RateLimiter, private tiers: TierService) {}

  async intercept(ctx: ExecutionContext, next: CallHandler) {
    const req = ctx.switchToHttp().getRequest();
    const user = req.user;
    if (!user) return next.handle();

    const tier = this.tiers.get(user.tier);
    const capacity = tier.burstTokens;
    const fillRate = tier.tokensPerMinute / 60000; // tokens per millisecond
    const cost = req.tokenCost ?? 1000; // estimated cost, set upstream

    const { allowed, retryAfter } = await this.limiter.checkTokens(
      user.id, capacity, fillRate, cost,
    );

    if (!allowed) {
      throw new HttpException(
        { message: 'Token limit exceeded', retry_after_ms: retryAfter },
        HttpStatus.TOO_MANY_REQUESTS,
      );
    }
    return next.handle();
  }
}
```
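Register it globally so every route is covered. A sketch of the wiring (the `token_bucket.lua` path and `TierService` are your own application code):

```ts
import { readFileSync } from 'fs';
import { Module } from '@nestjs/common';
import { APP_INTERCEPTOR } from '@nestjs/core';
import IORedis from 'ioredis';

@Module({
  providers: [
    TierService,
    {
      provide: RateLimiter,
      useFactory: async () => {
        const lua = readFileSync('token_bucket.lua', 'utf8'); // hypothetical path
        const limiter = new RateLimiter(new IORedis(), lua);
        await limiter.init(); // load the script before serving traffic
        return limiter;
      },
    },
    { provide: APP_INTERCEPTOR, useClass: TokenRateLimitInterceptor },
  ],
})
export class RateLimitModule {}
```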
Use hash-tagged keys (`rate:{userId}`) so Redis Cluster routes all per-user keys to the same slot.
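In Redis Cluster, only the text inside `{...}` is hashed, so related per-user keys colocate (slot numbers below are illustrative):

```
127.0.0.1:6379> CLUSTER KEYSLOT rate:{123}
(integer) 5970   # illustrative slot value
127.0.0.1:6379> CLUSTER KEYSLOT usage:{123}
(integer) 5970   # same slot, because only "123" is hashed
```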
## 6. Real-World Examples

Several open-source projects use similar Lua-based logic:

- BitMEX/node-redis-token-bucket-ratelimiter – Token bucket in Lua + Redis for Node.js.
- WeTransfer/Prorate – Leaky bucket using the Redis `TIME` command for consistency.
- garana/o1.rate-limiter – Sliding window (ZSET-based) Lua limiter.
- Losant/redis-gcra – GCRA variant for smooth rate limiting.
- Recruitee/plug_limit – Elixir Plug using Redis Lua scripts.

All rely on Redis atomic operations: no modules, no race conditions.
## 7. Putting It All Together

| Approach | Algorithm | Scope | Atomic | Cost-aware | Recommended for |
|----------|-----------|-------|--------|------------|-----------------|
| NestJS Default | Fixed window | Single instance | No | No | Local or small apps |
| Throttler + Redis | Sliding window | Distributed | Yes | No | Multi-instance APIs |
| Lua Token Bucket | Continuous refill | Distributed | Yes | Yes | AI or token-based workloads |
## 8. Takeaways

- The NestJS throttler is easy to use but local to a single process.
- Using Redis storage turns it into a distributed sliding-window limiter.
- For workloads billed by token or requiring atomic accuracy, a Lua token-bucket limiter is safer and more flexible.
- Redis + Lua gives you fast, atomic, and fully distributed enforcement without modifying Redis or adding modules.
- These techniques are proven in production by teams like BitMEX, WeTransfer, and Losant.

If you’re building APIs where each request can have wildly different computational costs, a token-bucket limiter is the line between predictable performance and unexpected losses.
Originally published on ofeng.org.