Scaling Node Applications: Playing Defense
The Node.js Scaling Model
Node.js, by design, is single-threaded. That simplicity is part of its power and part of its limitation. The core event loop makes Node excellent for IO-bound workloads but easily blocked by any significant CPU usage. The effect is subtle until it isn't. Latency slowly creeps up. The metrics look spiky. Then one day the alerts channel goes off with pod restarts, and the system starts failing under pressure.
The twist: because Node is single-threaded and event-driven, it can make container-based orchestration more predictable, if we play defense.
In this post, we'll examine how to think about scaling Node.js applications in Kubernetes, why CPU is the right proxy metric if the team is disciplined, and how we as engineering leaders can reinforce the practices that prevent operational degradation as our systems and teams scale.
Though the context here is Kubernetes, these lessons generalize to any orchestrator: Nomad, ECS, Fly.io, etc. The fundamental patterns hold because the constraints are inherent to Node, not to the scheduler.
Rule One: API Layers Should Not Lift Heavy Things
It sounds obvious, but let’s establish the baseline. The API layer should not be CPU-intensive.
In a well-architected Node application, the role of the API layer is to:
Quickly receive a request,
Validate it,
Dispatch work to the appropriate async system or retrieve data,
Return a response.
The API layer should consistently aim to meet a latency SLO—say, 100-200ms for the 95th percentile. Any operation that threatens this SLO by being CPU-bound should be pushed into a queue and processed by a background worker. We use Node because it's great at handling many concurrent lightweight requests, not because it can crunch numbers.
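As a minimal sketch (assuming an Express app and a hypothetical enqueue helper standing in for whatever durable queue you use), the handler's only jobs are validation, dispatch, and a fast acknowledgement:

```typescript
import express from "express";

// Hypothetical stand-in for a durable queue (BullMQ, SQS, etc.). In production this
// would persist the job; a separate worker fleet does the heavy lifting.
async function enqueue(jobName: string, payload: object): Promise<string> {
  const jobId = `${jobName}-${Date.now()}`;
  console.log(`queued ${jobId}`, payload);
  return jobId;
}

const app = express();
app.use(express.json());

app.post("/reports", async (req, res) => {
  const { accountId, reportType } = req.body ?? {};

  // 1. Validate cheaply and reject early.
  if (typeof accountId !== "string" || typeof reportType !== "string") {
    res.status(400).json({ error: "accountId and reportType are required" });
    return;
  }

  // 2. Dispatch the CPU-heavy part (PDF generation, big exports) to a background worker.
  const jobId = await enqueue("generate-report", { accountId, reportType });

  // 3. Acknowledge fast; the client polls or receives a webhook when the report is ready.
  res.status(202).json({ jobId });
});

app.listen(3000);
```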
When the API layer starts doing CPU-heavy work—PDF generation, large data joins (which creep in innocently in a fast-paced environment), image manipulation, etc.—the single-threaded nature of Node becomes a liability. Latency increases. The event loop gets blocked. And scaling based on CPU usage becomes unpredictable.
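To make the failure mode concrete, here's a deliberately bad sketch: one synchronous, CPU-bound route on a plain Node HTTP server stalls every other request in the process, because they all share a single event loop.

```typescript
import { createServer } from "node:http";

// Anti-pattern: a synchronous, CPU-bound handler. While the loop below runs,
// nothing else on this process gets served.
createServer((req, res) => {
  if (req.url === "/heavy") {
    let acc = 0;
    for (let i = 0; i < 5e9; i++) acc += i; // seconds of blocked event loop
    res.end(String(acc));
    return;
  }
  res.end("ok"); // even this cheap route waits behind /heavy
}).listen(3000);
```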
Rule Two: Target CPU Scaling Works — If We're Disciplined
In Kubernetes, Horizontal Pod Autoscalers (HPAs) are often configured to scale pods based on CPU usage. It’s easy and works out of the box. For a Node application that follows Rule One, this is not just sufficient—it's ideal.
In a perfect world, we could load test our applications and calculate the expected RPM or RPS a service can handle under the happy path. In practice, especially in the world of B2B SaaS applications with a large product breadth, access patterns are numerous and constantly evolving.
Against this evolution, CPU becomes a reliable signal. If all business logic is fast, stateless, and async, then any increase in CPU utilization correlates with increased request volume. That makes target CPU utilization a good lever.
For example, a target of 60% CPU utilization means the HPA will try to maintain pods at a steady 60% usage. This makes for a clean, standard scaling rule across multiple services. It keeps scaling logic out of the application and squarely in the platform.
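The arithmetic behind that lever is simple. A simplified sketch of the scaling rule the HPA applies (per the Kubernetes documentation, ignoring tolerances and stabilization windows):

```typescript
// desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
function desiredReplicas(
  currentReplicas: number,
  currentCpuUtilization: number, // e.g. 0.9 for 90%
  targetCpuUtilization: number   // e.g. 0.6 for a 60% target
): number {
  return Math.ceil(currentReplicas * (currentCpuUtilization / targetCpuUtilization));
}

// 10 pods averaging 90% CPU against a 60% target scale out to 15 pods.
console.log(desiredReplicas(10, 0.9, 0.6)); // 15
```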
The Reality: Entropy Creeps In
In a growing B2B SaaS company, use cases evolve. Teams expand. Product demands increase. And over time, engineers will unintentionally write code that violates Rule One.
This isn't because they're negligent. It's because the incentives of feature development don't naturally include thinking about latency budgets, in-memory usage, or access patterns at scale. But the effects compound. One engineer adds a large in-memory join to meet a deadline. Another runs JSON.parse over a huge dataset. Another calls multiple downstream APIs in series.
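The last one is the easiest to illustrate. A small sketch (with hypothetical, stubbed downstream calls): three independent awaits in series pay for each round trip one after another, while Promise.all pays for them once.

```typescript
// Hypothetical downstream calls, stubbed as ~100ms of simulated network wait each.
const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));
const fetchUser = async (id: string) => { await delay(100); return { id }; };
const fetchOrders = async (id: string) => { await delay(100); return []; };
const fetchInvoices = async (id: string) => { await delay(100); return []; };

// Serial: roughly 300ms of wall-clock latency per request.
async function profileSerial(id: string) {
  const user = await fetchUser(id);
  const orders = await fetchOrders(id);
  const invoices = await fetchInvoices(id);
  return { user, orders, invoices };
}

// Concurrent: roughly 100ms, and the event loop stays free while the requests are in flight.
async function profileConcurrent(id: string) {
  const [user, orders, invoices] = await Promise.all([
    fetchUser(id),
    fetchOrders(id),
    fetchInvoices(id),
  ]);
  return { user, orders, invoices };
}
```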
The outcome is predictable: high average latencies, unpredictable CPU spikes, and eventually pod OOMs or restarts. The system goes from boring to unpredictable.
Chart: desired replicas for a service from an “architectural junk-drawer” that is being deprecated. Desired replicas are frequently pinned at the maximum, despite the pod count intuitively being enough to handle the expected RPM.
Metrics as the Defensive Line
To prevent surprises, teams should routinely monitor three simple metrics for every API-facing service:
CPU usage by pod
Memory usage by pod
Desired replicas (for the deployment)
Chart: a one-week, per-pod view of these metrics for a service exposing a business-critical REST API. This service serves 30M-40M requests a day.
This isn't just for observability; it's for operational predictability. At my current company, we’ve established a “Monthly Metrics Review” to create a dedicated space for these discussions.
If a service consistently sits at or near its maximum number of replicas, that’s a red flag (and can be modeled as an alert). Worse, if usage doesn't follow a cyclical pattern (business hours, timezones), the system isn't tracking demand; it's scaling to max and staying there.
This is the early warning system. If CPU and memory are unpredictable, or the HPA is constantly maxed out, something in the code is violating the event loop contract. Fix it before the customers feel it.
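The pod-level metrics can be complemented from inside the process. As a minimal sketch using Node's built-in perf_hooks, sustained event loop delay is a direct signal that something synchronous is violating the contract:

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

// Samples event loop delay; values are reported in nanoseconds.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  const p99Ms = histogram.percentile(99) / 1e6;
  const maxMs = histogram.max / 1e6;
  // Ship these to your metrics pipeline; logging is shown here for simplicity.
  console.log(`event loop delay p99=${p99Ms.toFixed(1)}ms max=${maxMs.toFixed(1)}ms`);
  histogram.reset();
}, 10_000);
```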
Case Study: SSR Microservice in the Wild
One of our services provides server-side rendered pages for frequently updated CMS content. It was built in Node.js for consistency with the rest of our stack. Predictably, the rendering logic and upstream data fetching resulted in spiky CPU patterns.
Chart: a two-day view. The left shows CPU per pod, which appears to average around 60%. The right shows the same data with a 1-minute max rollup, demonstrating that pods frequently hit max CPU within the sampling periods.
During high load, the average CPU usage per pod approached 100% for brief intervals, but the HPA did not scale because the spikes averaged out over the 1-minute window. Latency suffered.
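A toy illustration of why the averaging hides the problem (synthetic numbers, not our real telemetry): a pod that alternates between near-idle and fully pegged can still average well under the scaling target.

```typescript
// Synthetic 1-minute window of per-second CPU samples (fraction of one core):
// 40 seconds near-idle, 20 seconds pegged at 100% while rendering.
const samples = [...Array(40).fill(0.15), ...Array(20).fill(1.0)];

const avg = samples.reduce((sum, s) => sum + s, 0) / samples.length;
const max = Math.max(...samples);

console.log(`avg=${(avg * 100).toFixed(0)}% max=${(max * 100).toFixed(0)}%`); // avg=43% max=100%
// Against a 60% target, the averaged signal never triggers a scale-up,
// even though the pod spends a third of the window fully saturated.
```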
This can be resolved by:
Reducing the HPA scaleUp window to respond more quickly to spikes.
Lowering the CPU target to 40% to trigger earlier scale-ups.
Ultimately, SSR is not a great use case for native Node (large string interpolation, concatenation, etc.). These changes stabilized the service while we planned a refactor.
Case Study: The Stateful API Anti-pattern
Another issue we've seen: engineers writing access patterns that violate statelessness.
In one case, a typical "resource list" endpoint fetched thousands of rows into memory, performed several in-memory joins and filters, and only then returned the small filtered slice the client actually wanted. This design:
Consumed unnecessary CPU and memory,
Made latency sensitive to input size,
Ignored paging and indexing strategies that would have solved the problem at the source.
Instead, we rewrote the query to:
Push filters and joins into the database,
Implement pagination at the source,
Avoid pulling large blobs of data into memory.
At the expense of the write path (some data was denormalized), this returned the endpoint to sub-100ms latencies and reduced CPU usage by 80%.
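A hedged before/after sketch of the shape of that change (assuming a node-postgres-style client and a hypothetical invoices/customers schema):

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Before: pull everything, then join, filter, and "paginate" in the Node process.
async function listInvoicesInMemory(accountId: string, status: string) {
  const { rows: invoices } = await pool.query("SELECT * FROM invoices");
  const { rows: customers } = await pool.query("SELECT * FROM customers");
  const customersById = new Map<string, any>();
  for (const c of customers) customersById.set(c.id, c);
  return invoices
    .filter((i) => i.account_id === accountId && i.status === status)
    .map((i) => ({ ...i, customer: customersById.get(i.customer_id) }))
    .slice(0, 50); // pagination after the damage is already done
}

// After: filters, the join, and paging all happen in the database.
async function listInvoices(accountId: string, status: string, cursor: string | null) {
  const { rows } = await pool.query(
    `SELECT i.*, c.name AS customer_name
       FROM invoices i
       JOIN customers c ON c.id = i.customer_id
      WHERE i.account_id = $1
        AND i.status = $2
        AND ($3::timestamptz IS NULL OR i.created_at < $3)
      ORDER BY i.created_at DESC
      LIMIT 50`,
    [accountId, status, cursor]
  );
  return rows;
}
```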
Design docs and good team processes can catch these issues before code is written, let alone deployed. Still, slips like this are a reality for growing organizations experiencing growing pains.
Final Thoughts
Scaling Node applications is about setting and defending boundaries. If the API is lightweight, async, and stateless, then target CPU utilization becomes a powerful standard for scaling. But left unchecked, entropy will erode this foundation.
We must establish both the guidelines and the guardrails:
Educate teams on the event loop model.
Codify performance expectations (e.g., API SLOs).
Invest in dashboards with the right metrics.
Run incident reviews not just for outages, but for SLO misses.
Scaling Node is less about the container and more about the contract. Play defense early, and the system will scale predictably. Ignore the warning signs, and you'll be firefighting late into the night.