Top 10 Kubernetes Performance Pitfalls (and how to avoid them)

August 8, 2025

If you are a cloud native company, you already know that Kubernetes is a solid choice for deploying and managing modern applications at scale. It’s flexible, powerful, and capable of handling complex workloads. However, there is also the potential for serious performance headaches. 

Many people think performance optimization is just about cutting costs or squeezing efficiency out of resources. That’s part of it, but not all of it. It’s also about delivering reliable services without over-provisioning. There is a fine line between balancing performance, reliability, and costs. Nobody wants a cheap Kubernetes environment if it’s always down (or slow). That only frustrates users. Nobody wants a robust Kubernetes environment that costs so much to run that it defeats the purpose.

Kubernetes optimization is more than tuning the cluster (sorry, platform engineers). It’s a holistic effort that involves many layers, including the cluster, the workloads and application runtimes like the JVM running on it. Focusing on just one piece leaves significant optimization opportunities on the table. It’s a layered stack: the infrastructure (nodes, networking, storage), the control plane (scheduling, autoscaling), and the applications themselves. Each layer needs attention. 

In this article, we’re going to look at the top performance pitfalls that we see on a regular basis from customers in their Kubernetes journey. We’ll show you how to avoid these antipatterns and share practical tips to keep your deployments smooth. 

1. Setting CPU Limits Based on Theory Rather Than Experience

By far, pod resource requests and limits are the biggest challenge K8s teams face. If you’re not familiar, we described how requests and limits work and their impact on costs and reliability in a previous post.

One of the most debated aspects is CPU limits: is it best to set them for your workloads or avoid using them and just live with CPU requests? CPU limits can cause CPU throttling and result in significant application performance degradation, such as response time slowdowns, even if the CPU usage is low compared to CPU limits.

But CPU limits can also be critical in ensuring cluster stability and application performance in shared or multi-tenant clusters. So, as a developer or operator of K8s apps with the goal of ensuring application reliability and efficiency, what should you do?

To fully understand how CPU limits actually work, we conducted performance tests under CPU contention scenarios. We measured the impact on application performance when a misbehaving job, co-located on the same node, saturates the node CPUs. We tested this in two scenarios: with CPU requests only (no CPU limits), and with both CPU requests and limits set.

The result? CPU requests alone are not enough to protect latency-sensitive workloads from misbehaving pods: application p95 latency gets 3x higher. With CPU limits in place, the latency impact is only around 30%, as the CPU-hungry job can’t saturate the node CPUs.

Why is that? This might sound counterintuitive, especially as many people expect CPU requests to be there to prevent such situations. But there is nothing wrong here: that’s exactly how CPU resource management works on K8s (courtesy of Linux cgroups), and failing to understand that can have pretty serious consequences on your app performance and stability, as others noted too. More details in an upcoming blog.
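To make the tradeoff concrete, here is a minimal sketch of a pod spec with both CPU requests and limits set; the name, image, and values are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: checkout-api        # hypothetical latency-sensitive service
spec:
  containers:
    - name: app
      image: example/checkout-api:1.0
      resources:
        requests:
          cpu: "1"          # what the scheduler reserves; also sets the cgroup CPU weight
          memory: 1Gi
        limits:
          cpu: "2"          # hard cap (CFS quota): protects neighbors from this pod,
                            # but can throttle this pod under bursts
          memory: 1Gi
```

Whether to set the CPU limit at all, and how far above the request to place it, is exactly the tradeoff measured in the contention tests above.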

2. Picking the Wrong Instance Types for Your Nodes

Kubernetes provides built-in autoscalers at the cluster level, automatically adding nodes when demand increases and removing them when traffic dies down. The Cluster Autoscaler (or Karpenter) monitors resource requests from pending pods and spins up new nodes to accommodate them, then scales down during quiet periods to save costs.

A common issue SREs and platform engineers face is failing to keep node groups’ configuration updated to match workload resource requirements. Typically, teams simply configure node groups with standard cloud instance types, and they never revisit the choice to assess if the CPU vs memory (or GPU) ratio matches the workload shape.

This can result in a highly inefficient resource allocation at the cluster level, with significant CPU or memory resources (hence costs) being wasted. What often happens is that the cluster autoscaler adds nodes in response to pending pods, but just one resource (e.g. memory) is exhausted, while the other (e.g. CPU) may have plenty of available (and wasted) capacity.

Let’s look at an example. In this cluster, the cluster autoscaler creates nodes due to a memory shortage (right chart). In doing so, it allocates ~5,000 CPUs that go to waste, as they are not requested by the workloads (left chart).

Notice that the cluster autoscaler works perfectly here! Nevertheless, a huge amount of resources and costs are wasted due to the bad configuration of the instance types in the underlying node groups.

Choosing node types and scaling your cluster is like picking the right rental car for a vacation road trip. Too small, and everyone is cramped. There’s no room for luggage, and it’s miserable. If it’s too big, you’re burning way too much fuel, and every stop at the gas station leaves you and your wallet in shock. If you misjudge your node instance types or fail to scale dynamically, you’ll either hit bottlenecks or waste a lot of resources.
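With Karpenter, one way to keep node shape aligned with workload shape is to constrain which instance families a NodePool may provision. Below is a hedged sketch assuming AWS and the Karpenter v1 API; the families chosen and all names are hypothetical:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        # Restrict to balanced (~1:4 CPU:memory) families so that neither CPU
        # nor memory is systematically stranded when pods are memory-bound.
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m6i", "m7i"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default   # hypothetical EC2NodeClass
```

The right ratio depends on your workloads’ aggregate requests: memory-heavy fleets may be better served by r-family instances, CPU-heavy ones by c-family.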

3. JVM Misconfigurations for Java Applications

Java applications running on Kubernetes can be particularly tricky. The Java Virtual Machine (JVM) is a highly configurable engine with 500+ options that can be tuned to improve performance and resource efficiency. But JVM tuning requires deep expertise and is notoriously hard to get right. Even a common task like determining the optimal heap memory size for an application can be surprisingly complex.

Containers and Kubernetes add an extra layer of complexity: how should you configure your pod memory limit and your JVM heap size? And what about CPU limits and the garbage collector type? To achieve high-performance, reliable, and efficient Java apps, JVM resource management needs to be aligned with Kubernetes pod resource settings. But despite the JVM’s container-aware heuristics, which try to auto-adapt its behaviour to the container resources, in practice the default configurations are wasteful, prone to reliability issues, or cause unnecessary performance slowdowns.

Let’s look at a real-life example. In the chart below, you can see an out-of-memory kill event (OOMKill) that impacted application availability. Looking at the steady increase in container memory usage, the SRE team attributed this problem to a memory leak in the application. Time to fix your code, dear development team! But in reality, there was no memory leak here. The issue turned out to be a JVM misconfiguration: the pod memory limit was not properly set to account for the JVM memory demand, which includes both the heap and off-heap memory.  
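A hedged sketch of keeping the pod memory limit and the JVM heap in sync, leaving headroom for off-heap memory (metaspace, thread stacks, code cache, direct buffers); the service name and values are hypothetical:

```yaml
# Fragment of a Deployment pod template (not a complete manifest)
containers:
  - name: orders-service
    image: example/orders:2.3
    resources:
      requests:
        memory: 1Gi
      limits:
        memory: 1Gi
    env:
      - name: JAVA_TOOL_OPTIONS
        # Heap capped at ~75% of the 1Gi limit; the remaining ~25% is headroom
        # for the off-heap memory that the OOMKilled pod above did not account for.
        value: "-XX:MaxRAMPercentage=75.0"
```

How much headroom is enough depends on the app: thread-heavy or direct-buffer-heavy services may need considerably more than 25%.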

4. Poor Application Performance Due to Wrong JVM Heap Sizing

In the previous pitfall we talked about the importance of jointly optimizing your pods’ resources and your application runtime to avoid reliability issues.

But a wrong JVM configuration can significantly impact application performance too! Many developers think that the JVM achieves top performance out-of-the-box, especially thanks to the container awareness features, which allow the JVM to determine the heap size automatically from your pod memory limits.

But the reality is that – whether you set your JVM heap explicitly (e.g. via -Xmx) or leverage container awareness (e.g. via MaxRAMPercentage) – the resulting heap size may well be off and slow your application substantially.
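For reference, the two sizing approaches just mentioned look like this as JVM flags (a sketch; the values are illustrative, not recommendations):

```
-Xms768m -Xmx768m            # explicit sizing: a fixed heap, set regardless of the container limit
-XX:MaxRAMPercentage=75.0    # container-aware sizing: heap as a percentage of the memory limit
```

Either way, the resulting heap size is a guess until it is validated against the actual workload.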

To demonstrate this concept, let’s again run some benchmarks. We used the Dacapo Java benchmarking suite, which mimics real-life Java workloads. We used the G1 garbage collector on Java 21 (eclipse-temurin), and measured the latency of the Spring Boot application under different JVM max heap size configurations. Here’s what we got:

The results are quite interesting:

  • If the heap size is too small, application performance will suffer significantly: a 20% latency penalty is common, and it can go up to 65% if the heap is very tight.
  • Beyond a certain heap size, the performance doesn’t improve further. That’s a reminder that throwing gigabytes of memory at your Java apps is not helpful – the additional memory is basically wasted.

As a Java developer looking to optimize the performance of your applications and microservices, don’t forget to properly size your JVM heap!

5. Resource Management for Node.js Applications

Node.js applications often face performance and reliability issues on Kubernetes due to their unique (and honestly, under-documented) runtime characteristics.

Node.js applications run on a managed runtime, the V8 engine, which is similar to the JVM in many respects, including automatic memory management via garbage collection and a multi-threaded runtime architecture (yes, the Node.js runtime is not single-threaded!).

A common problem for Node.js applications is the out-of-memory kill (OOMKill). This is similar to pitfall #3 for the JVM: most teams misinterpret OOMKills as a memory leak, while in reality it’s caused by how the heap memory is sized relative to the container memory limits.

What is less well known is that Node.js application performance and resource usage are also very sensitive to the V8 garbage collector and related heap settings. This is an example from an earlier blog where we benchmarked a Node.js app with different sizes of the V8 heap memory pools. As you can see, application latency can be cut by almost 50% just by properly tuning the V8 heap memory pools, with no code changes. Interestingly, similar gains can be achieved on CPU usage, which translates into savings on infrastructure footprint and costs.

Check the blog for full details.

6. Misconfigured HPA Autoscaling (Autoscaling Gone Wild)

Kubernetes provides built-in application-level scaling via the Horizontal Pod Autoscaler (HPA), adding pods when demand increases and removing them when traffic dies down. HPA can be a lifesaver, but it’s not a “set it and forget it” solution.

HPA requires careful tuning to identify the best scaling metrics, thresholds, and periods. Scaling behaviour is highly application- and traffic-dependent. For example, is your application’s startup time quick or slow? Do you need to react to sudden traffic peaks, or does traffic increase slowly? Depending on your app performance requirements or SLOs, HPA needs to be configured accordingly.

Teams often don’t get HPA set up right, and we regularly see deployments that scale too slowly, too aggressively, or do not properly scale down. If your HPA is not properly tuned, it will:

  • impact end user performance
  • cause unnecessary SLO breaches and production incidents
  • waste a lot of resources
  • cause the cluster autoscaler to provision unnecessary nodes

Look at this example. This team rightfully adopted HPA for a deployment with fluctuating traffic patterns. However, the scaling threshold was set way too low. As a result, HPA scaled the deployment out far too aggressively, and the overall scaling efficiency (CPU used / CPU requested) was just 9%, resulting in significant resource and cost waste.

So the key lesson here is that, just as with cluster-level autoscaling, simply adding HPA to your workloads may not achieve the scalability and efficiency goals you are looking for.
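As an illustration, here is a hedged HPA sketch using the autoscaling/v2 API with explicit scaling behaviour; the target, threshold, and window values are hypothetical starting points, not recommendations:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # a too-low threshold (e.g. 20%) drives massive over-scaling
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # avoid flapping on brief traffic dips
```

The right threshold and windows depend on your app’s startup time and traffic shape, which is why these values need to be validated under realistic load.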

7. VPA Missteps with Requests and Limits

The Vertical Pod Autoscaler (VPA) is often the most recommended tool for setting pod resource requests and limits: it automatically adjusts container requests and limits based on observed resource usage.

While this sounds like exactly what is needed to solve the problem, the reality is different. VPA has a number of known limitations that many teams are not aware of, which can make it inapplicable in real-world production environments.

There is a very important limitation that is not even mentioned in the docs: the VPA recommendation algorithm takes a purely infrastructure-level approach. It simply looks at resource usage and adjusts resources accordingly, without any consideration of the application running inside the pod.

Why is that a problem? Modern apps run on a runtime that manages resources like heap memory and CPU threads. Such runtimes like the JVM or Node.js V8 automatically adjust based on pod resource limits. The VPA is unaware of how application runtimes work, resulting in performance and reliability issues if VPA recommendations are applied without careful consideration.

Infrastructure-only approaches like the VPA can also inadvertently slow down your application. Let’s look at this test we did with a customer. The team applied VPA recommendations to a Java app and measured the response time before and after the change. As you can see, the change introduced significant application performance slowdowns, making it unresponsive to the users.

Why did that happen? Check out this blog if you want to know more about why it’s crucial to consider application runtimes running in containers and why infrastructure-level metrics are not enough. By the way, this is true for any tool and approach that simply looks at container-level resource metrics, not just the VPA.
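One cautious way to use VPA with runtime-managed apps is recommendation-only mode, where suggestions are reviewed (against heap sizes, thread pools, and so on) before being applied. A sketch, assuming the VPA controller is installed; names are hypothetical:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payments-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments
  updatePolicy:
    updateMode: "Off"   # compute recommendations, but never evict/resize pods automatically
```

With updateMode "Off", the recommendations appear in the VPA object’s status, where a human (or a runtime-aware tool) can sanity-check them first.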

8. Pod Sizing for Disaster Recovery (DR) and High Availability (HA)

Poorly sized pods in disaster recovery or high availability scenarios can jeopardize the resilience of your clusters. If pods aren’t configured to handle failover or spikes, you risk downtime or degraded performance during critical moments.

This is a common challenge we see for SREs and development teams trying to come up with the best resource allocation strategy for their workloads. Typically, teams approach pod resource allocation based on the currently observed workload. For example, you see that your pod requests 4 CPUs but uses just 1. You want to make it more efficient and save costs, so you downsize the pod to match the 1-CPU demand.

The problem with this approach is that it doesn’t consider your application reliability, high availability, and business continuity requirements. Your applications may be required to withstand the failure of one or more data centers and keep response times acceptable during infrastructure disruptions.

Properly sizing your workloads in such degraded scenarios requires simulating the impact of 2x the traffic on your workloads. Again, it’s important not only to consider the pod resources, but crucially, your application runtime, like the JVM heap memory, needs to be considered as well, to avoid unexpected bottlenecks during critical moments.

9. Configuration Chaos and Drift

Technical aspects are not the only challenges teams face when adopting Kubernetes. A more organizational/process issue that we observe, especially in mid-to-large organizations, is that developers push frequent changes to production, which can introduce performance regressions or outright break things across the whole cluster.

Part of the problem is that developers are not necessarily well-versed in the Kubernetes stack. Understandably so: as we have shown in this blog, properly configuring your apps and clusters for efficiency and performance is no small feat. Developers’ goal is typically to ship new features quickly, not to become infrastructure experts.

At the same time, SREs and platform engineers have to cope with the fallout. Oftentimes, a poor configuration of even a single workload can impact other applications running on the same node (as we demonstrated in pitfall #1 on CPU limits) or even the entire cluster (e.g. if a node crashes).

Kubernetes is a great platform, but every user needs to be a good citizen and use the platform wisely to ensure all the other users can do the same. At scale, proper performance isolation and reliability practices are key. It turns out, this does not always happen in practice. The only way to ensure the platform remains stable, performant, and efficient is to adopt proper processes and tools to automate the optimization work as you scale.

10. Organizational Knowledge Gaps and Prioritization Conflicts

Kubernetes thrives on collaboration, but when priorities are not aligned between platform engineering and the application teams, it creates a deadlock that affects both performance and reliability. When application teams lack Kubernetes expertise and prioritize speed in delivering features over stability, it creates major problems.

It typically goes like this:

  • Platform engineers understand Kubernetes intricacies, but application teams may lack the training to optimize for their deployment. This leads to misconfigured resources or ignored best practices.
  • Application teams focus on shipping features quickly, sidelining critical tasks like performance tuning, monitoring setup, or adopting Kubernetes-native practices.
  • Without good collaboration, platform teams may enforce overly restrictive policies, while application teams deploy suboptimal workloads. The result is friction between teams and unreliable software.

To fix this problem, it’s important to invest in cross-team training to bridge the Kubernetes knowledge gap. Offer workshops or hands-on labs in interactive Kubernetes sandbox environments tailored to application teams. Establish clear service-level objectives (SLOs) that align platform and application priorities, balancing feature velocity with reliability goals.

Tools to Save the Day

Here are some tools and techniques to help you tune Kubernetes for better performance:

  • Observability: Use tools like Prometheus, OpenTelemetry, or commercial tools like Dynatrace and Datadog to understand resource usage. Don’t stop at workload-level metrics, however – make sure to collect application runtime metrics as well, as they are critical for success, as we have shown in this post.
  • AI-Powered Tuning: Leverage tools to automate full-stack Kubernetes optimization to reduce effort and skills required on engineering teams. Akamas autonomously optimizes the full-stack configuration of enterprise applications – from infrastructure to application layers – by leveraging reinforcement learning, live telemetry, and user-defined goals, both live in production and offline in testing environments. 
  • Load Testing: Simulate traffic with tools like Locust, JMeter, or commercial tools like Micro Focus LoadRunner and Tricentis NeoLoad. Load testing is often the best approach to find scalability bottlenecks and optimize your workload configurations, like the JVM or HPA, before you deploy to production.
  • Chaos Engineering: Use tools like Chaos Toolkit, Chaos Mesh, or commercial tools like Gremlin to test resilience, where the unexpected worst cases can happen. This is helpful to make sure applications can sustain traffic in disaster recovery scenarios.
  • Advanced Kubernetes features: Leverage advanced Kubernetes capabilities like taints, tolerations, affinity rules, priority classes, resource quotas, and HPAs to help you achieve reliable and efficient clusters.
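Of the features in that last bullet, a ResourceQuota is one of the simplest guardrails: it caps a team’s aggregate requests and limits per namespace, containing the blast radius of a single misconfigured workload. A hypothetical sketch (namespace and values are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a        # hypothetical team namespace
spec:
  hard:
    requests.cpu: "40"     # total CPU the namespace may request
    requests.memory: 80Gi
    limits.cpu: "60"       # total CPU limits across all pods in the namespace
    limits.memory: 100Gi
```

Note that once a quota covers a resource, every pod in the namespace must declare requests/limits for it, which also enforces the hygiene discussed in pitfall #1.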

Wrapping It Up

Kubernetes is a great platform to scale your cloud applications on, and we love it! But performance and efficiency aren’t going to come for free. You can build clusters that are fast, reliable, and cost-effective by avoiding these common pitfalls. The key is to stay proactive: monitor relentlessly, tune continuously, and create a performance engineering culture where everyone cares about performance within their own scope of work.

If you are struggling with Kubernetes performance, we can help. Check out Akamas Insights to explore how our AI-driven optimization platform can help you tackle all of these challenges and save on costs. We understand the kinds of problems that companies face when deploying Kubernetes at scale. We can help you too.

Author: Scott Moore