Preventing Kubernetes outages before they happen


How a large retail organization could have avoided a major production incident through proactive optimization with Akamas.

The company

The company featured in this case is a global leader in digital services and real-time transactional platforms in the retail market. It manages a complex technology landscape, supporting millions of customer interactions daily across thousands of physical and digital endpoints.

Over the past decade, the company has made a major shift toward cloud-native architectures, leveraging Kubernetes and Java-based microservices to support business-critical applications, including real-time customer authentication, transaction processing, and device management.

This organization is known for its strong innovation DNA, continuously evolving its infrastructure to meet new demands while maintaining rigorous standards of service continuity, reliability, and compliance. It operates multiple production-grade clusters with dedicated environments for business continuity and disaster recovery, ensuring fault tolerance for even the most demanding workloads.

Despite its engineering maturity, the company, like many at scale, faced challenges in resource efficiency and configuration safety. Default settings, evolving workloads, and rapid deployment cycles had created an environment where invisible risks could accumulate beneath the surface, eventually leading to a service disruption.

This case study examines how such an incident unfolded and how it could have been avoided through proactive optimization.

The Incident

The company experienced a production outage that severely impacted one of its most business-critical applications.

The incident occurred shortly after a new software release. While the release itself did not contain breaking changes, it exposed a pre-existing fragility in the cluster’s configuration: memory usage had been silently increasing over time across multiple nodes. 

During a normal working day, one node experienced a spike in both CPU and memory usage, causing it to transition into a NotReady state (Fig. 1).

In response, Kubernetes attempted to reschedule the affected workloads across the remaining nodes. However, all nodes were already operating dangerously close to their maximum allocatable memory, even though the memory requests configured on the pods running on those nodes did not reflect this (Fig. 2).
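This gap between requested and actual memory is what allowed the scheduler to overcommit: Kubernetes places pods based on requests, not real consumption. A minimal sketch of the mismatch, using hypothetical numbers rather than the company's actual telemetry:

```python
# Hypothetical figures for one node (GiB); illustrative only.
allocatable = 32.0                     # node allocatable memory
pod_requests = [2.0, 2.0, 1.0, 1.5]    # what the scheduler sees
pod_usage = [7.0, 8.0, 5.5, 9.0]       # what the pods actually consume

requested = sum(pod_requests)          # 6.5 GiB committed on paper
used = sum(pod_usage)                  # 29.5 GiB really in use

scheduler_free = allocatable - requested   # scheduler believes ~25 GiB is free
actual_free = allocatable - used           # in reality only ~2.5 GiB remains

print(f"Scheduler sees {scheduler_free:.1f} GiB free; "
      f"only {actual_free:.1f} GiB really is")
```

With numbers like these, the scheduler will happily place evicted pods onto a node that has almost no real headroom, which is exactly the failure mode described above.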

As a result, the cluster entered a cascading failure (Fig. 3):

  • Memory saturation spread to additional nodes
  • Kubernetes was unable to find healthy capacity for new pods
  • Multiple services crashed, requiring manual intervention and cluster restart

This failure highlighted a fundamental issue: the cluster’s configured capacity did not reflect its real usage, and the platform team had no early warning that a routine deploy could tip the environment over the edge.

To restore operations, the team was forced to increase the memory capacity of the nodes by 50%, effectively treating the symptoms rather than the root causes. Despite this intervention, a post-mortem analysis revealed that the risk had not been fully eliminated, and the environment remained vulnerable to future load spikes.

What if Akamas had been in place?

Although this organization had already adopted Akamas, the roll-out was still in progress when the incident occurred, and that specific application was not yet covered by Akamas Autonomous Optimization.

In the aftermath of the incident, we asked ourselves a key question:

“If Akamas had been analyzing this application before the failure, could the incident have been avoided?”

To answer that, we performed a retrospective simulation.

We fed Akamas Insights with the exact telemetry data available before the incident, effectively rewinding the system to the hours and days leading up to the crash.

The results were clear: Akamas would have flagged multiple critical risks in advance, including memory usage exceeding safe thresholds, workloads operating close to OOM, and misaligned resource requests.

What’s more, Akamas would have recommended a safe reconfiguration of the affected workloads and JVM settings, actions that, if applied, would have prevented the crash without requiring infrastructure changes.

This exercise validated not only the technical root causes of the incident, but also the preventive power of Akamas when deployed early enough in the application lifecycle.

Let’s dive deep into the data.

Had Akamas been active before the deployment, it would have identified several critical risk factors that directly contributed to the outage:

  • Memory usage levels dangerously high and close to maximum node allocatable memory
  • 50% of the workloads with peak memory usage above 80%, indicating imminent Out-Of-Memory (OOM) conditions
  • 50% of the workloads consuming significantly more memory than their configured requests, making them vulnerable to eviction under memory pressure as the node reclaims resources to accommodate new pods

These conditions, while not always visible through basic observability dashboards, are precisely the types of inefficiencies and risks that Akamas is designed to uncover, well before they lead to service disruption (Fig. 4).
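A check of this kind can be sketched as a simple scan over workload telemetry. The thresholds and workload figures below are illustrative assumptions, not Akamas's actual detection rules:

```python
# Illustrative workload snapshots: (name, mem_request, peak_mem_usage, mem_limit) in MiB.
workloads = [
    ("auth-service",  512, 1900, 2048),
    ("txn-processor", 512, 1750, 2048),
    ("device-mgr",    512,  400, 1024),
    ("gateway",       768,  700, 1024),
]

risks = []
for name, request, peak, limit in workloads:
    if peak > 0.8 * limit:        # peak above 80% of limit: imminent OOM risk
        risks.append((name, "near-OOM"))
    if peak > request:            # usage above request: eviction candidate
        risks.append((name, "eviction-risk"))

for name, risk in risks:
    print(f"{name}: {risk}")
```

In this hypothetical snapshot, half of the workloads trip both checks, mirroring the 50% figures reported above, while the correctly sized ones pass untouched.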

Akamas would have recommended:

  • Right-sizing memory requests for at-risk workloads to ensure stability under pressure
  • Reducing overprovisioned CPU requests, reclaiming unnecessary compute resources
  • Tuning JVM heap sizes to align memory usage with pod-level limits and prevent OOMs

Is the cluster stable now?

Even after the emergency fix, including a 50% memory increase and several manual changes, the question remained:
Is the cluster truly stable now?

At present, the cluster is no longer under pressure, thanks to those immediate interventions. However, a post-incident analysis with Akamas Insights reveals that several risks remain unresolved: workloads are still misconfigured, and many JVMs run with unsafe memory settings.
While the manual actions helped stabilize the environment, they didn’t align with what Akamas would have recommended, and they addressed symptoms rather than root causes.

Akamas instead suggests a smarter, more sustainable path forward, right-sizing workloads and tuning runtimes to improve cluster stability, performance, and efficiency:

  1. Increase Memory Requests where needed

    Several workloads still have actual memory usage that exceeds their Kubernetes request values. This misalignment means that:
    • Pods are at risk of being evicted during memory pressure events
    • The scheduler has a distorted view of the cluster’s actual available capacity

Akamas recommends increasing memory requests for affected workloads to reflect their true usage, restoring scheduling accuracy and resilience (Fig. 5).

  2. Reduce Over-Provisioned CPU Requests

    While memory is under-requested, many workloads request more CPU than they consistently use, leading to:
    • Wasted compute resources
    • Unnecessary infrastructure costs


Akamas identifies opportunities to safely lower CPU requests (Fig. 5), reclaiming resources without impacting performance, a key step toward reducing cloud spend.
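The idea can be sketched as sizing the request to a high percentile of observed usage plus a buffer. The samples and the p95-based rule here are hypothetical stand-ins for Akamas's actual analysis:

```python
def p95(samples):
    """Approximate 95th percentile by nearest-rank over sorted samples."""
    ordered = sorted(samples)
    return ordered[int(0.95 * (len(ordered) - 1))]

# Hypothetical CPU usage samples (millicores) for a pod requesting 2000m.
usage = [120, 150, 180, 140, 160, 300, 170, 155, 145, 165]
request_m = 2000

recommended = int(p95(usage) * 1.2)   # p95 plus a 20% buffer
print(f"request {request_m}m -> {recommended}m")
```

A pod like this one reserves roughly ten times the CPU it ever uses; lowering its request frees that capacity for scheduling without changing its runtime behavior, since CPU requests are not a hard cap.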

  3. Optimize JVM Heap

    Many JVM-based services are still configured with unsafe or overly aggressive heap settings, which can result in:
    • OOMKills despite high node capacity
    • GC overhead or performance degradation

Akamas analyzes live workload behavior and recommends heap sizing adjustments that keep pods within safe memory boundaries (Fig. 6) while preserving performance.
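The core constraint is that the heap (-Xmx) plus non-heap JVM memory must fit within the pod limit, or the container gets OOMKilled even though the heap itself never fills. A budget check can be sketched as follows; the 25% non-heap overhead figure is an illustrative assumption, since real JVM footprint (metaspace, thread stacks, code cache, direct buffers) varies per service:

```python
def max_safe_heap_mib(pod_limit_mib: int, non_heap_fraction: float = 0.25) -> int:
    """Largest -Xmx that leaves room for non-heap JVM memory within the pod limit."""
    return int(pod_limit_mib * (1 - non_heap_fraction))

limit = 2048
print(f"-Xmx{max_safe_heap_mib(limit)}m for a {limit} MiB pod limit")
```

A service configured with -Xmx2g inside a 2048 MiB pod limit leaves zero room for non-heap memory and will eventually be OOMKilled; capping the heap below the limit avoids this without touching the node.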

  4. Increase reliability by designing for load scenarios

    The incident was triggered by a relatively predictable event (a release). Akamas enables teams to simulate and prepare for similar or even larger events, such as:
    • Traffic spikes
    • Failover scenarios
    • Rollouts with multiple new workloads

By configuring workloads to safely handle 2x/4x traffic levels, Akamas helps ensure the cluster remains stable without resorting to emergency scaling.
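A what-if check of this kind can be sketched by scaling observed usage and comparing it against total allocatable capacity. All numbers are hypothetical, and this simplified check ignores per-node bin packing, which makes it optimistic:

```python
def survives_load(node_allocatable_gib, workload_usage_gib, factor):
    """True if workload memory, scaled by factor, still fits in total allocatable memory."""
    return factor * sum(workload_usage_gib) <= sum(node_allocatable_gib)

nodes = [32.0, 32.0, 32.0]          # allocatable memory per node (GiB)
usage = [10.0, 12.0, 9.0, 8.0]      # current per-workload memory usage (GiB)

for f in (1, 2, 4):
    print(f"{f}x load fits: {survives_load(nodes, usage, f)}")
```

Running the scenario ahead of time shows that this hypothetical cluster absorbs a 2x spike but not a 4x one, the kind of insight that turns capacity planning from guesswork into a deliberate decision.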

Together, these recommendations form a comprehensive, full-stack optimization strategy, not just reactive tuning. Akamas helps platform teams proactively fix risks, align configurations with reality, and ensure the environment remains both resilient and cost-efficient going forward.

Key Takeaways

This incident highlights a critical and common challenge: almost invisible risks accumulate in production environments, only to surface during routine events like a new release or a minor traffic spike.

The post-mortem analysis demonstrated that:

  • The incident was avoidable: warning signs were clearly present days before the crash.
  • The emergency fix (adding memory) treated symptoms, not root causes. Workloads and JVMs remained misconfigured.
  • Eviction risks, CPU inefficiencies, and unsafe JVM settings continue to threaten the cluster’s stability.

Akamas Insights provides a clear path to mitigation. Rather than scaling up infrastructure, it recommends right-sizing existing workloads and runtimes.

Securing the environment may not require adding nodes or memory, but simply applying smarter configurations, both at the Kubernetes and JVM levels.

With Akamas, platform teams can:

  • Detect risks before incidents occur
  • Safely optimize both Kubernetes and application runtimes
  • Improve resilience without overprovisioning
  • Prevent outages while saving infrastructure costs

In short: the crash didn’t need to happen, and the next one doesn’t either.