
Remove resource limits for the observability components #9785

Merged
merged 1 commit into gardener:master on May 21, 2024

Conversation

istvanballok
Contributor

How to categorize this PR?

/area monitoring
/kind enhancement

What this PR does / why we need it:

Some time ago, we decided to avoid using resource limits, but this decision was not consistently implemented for these components.

With this change, the limits are removed for the observability components. Even if there is a momentary anomaly in vertical pod autoscaling, the pod will still be able to start, and the autoscaler will correct the resource requests in the next iteration.
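
For illustration, a minimal Go sketch of the resulting pattern (placeholder values, not the actual gardener manifests): only resource requests are declared and `Limits` stays unset, so no limit is configured and VPA only manages the requests.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
    )

    // observabilityResources sketches requests-only resource settings.
    // The request values are placeholders; the point is that Limits stays
    // nil, so no limit is written to the container spec.
    func observabilityResources() corev1.ResourceRequirements {
        return corev1.ResourceRequirements{
            Requests: corev1.ResourceList{
                corev1.ResourceCPU:    resource.MustParse("10m"),
                corev1.ResourceMemory: resource.MustParse("32Mi"),
            },
            // Limits intentionally omitted: VPA adjusts the requests over time.
        }
    }

    func main() {
        fmt.Printf("%+v\n", observabilityResources())
    }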

Which issue(s) this PR fixes:

The trigger for this change is a non-self-healing condition that we observed recently: when vertical pod autoscaling has an anomaly (not yet understood) and scales the resource requests of a pod down to the lower-bound recommendation instead of the target recommendation (e.g. 1m CPU and 2.6MB memory for the blackbox exporter), the limits are scaled down proportionally (e.g. to 13MB); see the arithmetic sketch below.

The problem is that such low memory limits do not manifest as OOM failures of the application; the container cannot be started in the first place because the `runc` process is killed due to the out-of-memory condition on the cgroup. This manifests as a `RunContainerError` in the pod status and does not self-heal.

    State:          Waiting
      Reason:       RunContainerError
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task:
                    failed to create shim task: context canceled: unknown
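
For concreteness, here is a small arithmetic sketch of the proportional scaling (the original request and limit values are assumed for illustration; only the 2.6MB lower bound and the resulting ~13MB limit come from the observed incident):

    package main

    import "fmt"

    func main() {
        // Assumed original manifest values (hypothetical), chosen so that the
        // limit-to-request ratio reproduces the observed numbers.
        originalRequestMB := 50.0
        originalLimitMB := 250.0
        ratio := originalLimitMB / originalRequestMB // 5x limit-to-request ratio

        // Observed lower-bound recommendation that VPA applied instead of the target.
        lowerBoundRequestMB := 2.6

        // VPA scales the limit proportionally, preserving the original ratio.
        scaledLimitMB := lowerBoundRequestMB * ratio
        fmt.Printf("scaled memory limit: %.0f MB\n", scaledLimitMB) // ~13 MB: too low to even start the container
    }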

Special notes for your reviewer:

cc @vicwicker @rickardsjp
fyi @dguendisch @ialidzhikov

Release note:

Resource limits are removed for the observability components

@gardener-prow gardener-prow bot added area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. labels May 17, 2024
@gardener-prow gardener-prow bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 17, 2024
@vicwicker
Contributor

/lgtm

I checked the prometheus-operator documentation; setting the value to 0 is what we want:

-config-reloader-memory-limit value
    Config Reloader memory limits. Value "0" disables it and causes no limit to be configured. (default 50Mi)
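
As a minimal sketch (the wiring below is assumed for illustration, not gardener's actual deployment code), the operator would be started with the flag set to 0 so that no memory limit is configured for the config-reloader sidecar:

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        // Assumed invocation: per the help text quoted above, "0" disables the
        // limit, i.e. no memory limit is configured for the config reloader.
        args := []string{"-config-reloader-memory-limit=0"}
        fmt.Println("prometheus-operator " + strings.Join(args, " "))
    }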

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label May 17, 2024
Contributor

gardener-prow bot commented May 17, 2024

LGTM label has been added.

Git tree hash: e8a629897bac70cbecb91f81db4e36bb83d0d899

@rfranzke
Member

/approve

Contributor

gardener-prow bot commented May 21, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rfranzke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 21, 2024
@gardener-prow gardener-prow bot merged commit 0be0fdc into gardener:master May 21, 2024
18 checks passed
ialidzhikov added a commit to ialidzhikov/gardener that referenced this pull request Jun 3, 2024