
Remove resource limits for the observability components #9785

Merged
merged 1 commit into gardener:master on May 21, 2024

Conversation

istvanballok
Contributor

How to categorize this PR?

/area monitoring
/kind enhancement

What this PR does / why we need it:

Some time ago, we decided to avoid using resource limits, but this decision was not consistently implemented for these components.

With this change, the limits are removed for the observability components. Even if there is a momentary anomaly in vertical pod autoscaling, the pod will still be able to start, and the autoscaler will correct the resource requests in the next iteration.
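
For illustration, a minimal Go sketch of the resulting pattern (placeholder values, not the actual gardener manifests): only resource requests are declared and `Limits` stays unset, so no limit is configured and VPA only manages the requests.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
    )

    // observabilityResources sketches requests-only resource settings.
    // The request values are placeholders; the point is that Limits stays
    // nil, so no limit is written to the container spec.
    func observabilityResources() corev1.ResourceRequirements {
        return corev1.ResourceRequirements{
            Requests: corev1.ResourceList{
                corev1.ResourceCPU:    resource.MustParse("10m"),
                corev1.ResourceMemory: resource.MustParse("32Mi"),
            },
            // Limits intentionally omitted: VPA adjusts the requests over time.
        }
    }

    func main() {
        fmt.Printf("%+v\n", observabilityResources())
    }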

Which issue(s) this PR fixes:

The trigger for this change is a non-self-healing condition that we observed recently: when vertical pod autoscaling has an anomaly (not yet understood) and scales the resource requests of a pod down to the lower-bound recommendation instead of the target recommendation (e.g. 1m CPU and 2.6MB memory for the blackbox exporter), the limits are scaled down proportionally (e.g. to 13MB); see the arithmetic sketch below.

The problem is that such low memory limits do not manifest as OOM failures of the application; the container cannot be started in the first place because the `runc` process is killed due to the out-of-memory condition on the cgroup. This manifests as a `RunContainerError` in the pod status and does not self-heal.

    State:          Waiting
      Reason:       RunContainerError
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task:
                    failed to create shim task: context canceled: unknown
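
For concreteness, here is a small arithmetic sketch of the proportional scaling (the original request and limit values are assumed for illustration; only the 2.6MB lower bound and the resulting ~13MB limit come from the observed incident):

    package main

    import "fmt"

    func main() {
        // Assumed original manifest values (hypothetical), chosen so that the
        // limit-to-request ratio reproduces the observed numbers.
        originalRequestMB := 50.0
        originalLimitMB := 250.0
        ratio := originalLimitMB / originalRequestMB // 5x limit-to-request ratio

        // Observed lower-bound recommendation that VPA applied instead of the target.
        lowerBoundRequestMB := 2.6

        // VPA scales the limit proportionally, preserving the original ratio.
        scaledLimitMB := lowerBoundRequestMB * ratio
        fmt.Printf("scaled memory limit: %.0f MB\n", scaledLimitMB) // ~13 MB: too low to even start the container
    }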

Special notes for your reviewer:

cc @vicwicker @rickardsjp
fyi @dguendisch @ialidzhikov

Release note:

Resource limits are removed for the observability components

@gardener-prow gardener-prow bot added area/monitoring Monitoring (including availability monitoring and alerting) related kind/enhancement Enhancement, improvement, extension cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. labels May 17, 2024
@gardener-prow gardener-prow bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 17, 2024
@vicwicker
Contributor

/lgtm

I checked the prometheus-operator documentation; setting the value to 0 is what we want:

-config-reloader-memory-limit value
    Config Reloader memory limits. Value "0" disables it and causes no limit to be configured. (default 50Mi)
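
As a minimal sketch (the wiring below is assumed for illustration, not gardener's actual deployment code), the operator would be started with the flag set to 0 so that no memory limit is configured for the config-reloader sidecar:

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        // Assumed invocation: per the help text quoted above, "0" disables the
        // limit, i.e. no memory limit is configured for the config reloader.
        args := []string{"-config-reloader-memory-limit=0"}
        fmt.Println("prometheus-operator " + strings.Join(args, " "))
    }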

@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label May 17, 2024
Contributor

gardener-prow bot commented May 17, 2024

LGTM label has been added.

Git tree hash: e8a629897bac70cbecb91f81db4e36bb83d0d899

@rfranzke
Member

/approve

Contributor

gardener-prow bot commented May 21, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rfranzke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gardener-prow gardener-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 21, 2024
@gardener-prow gardener-prow bot merged commit 0be0fdc into gardener:master May 21, 2024
18 checks passed
ialidzhikov added a commit to ialidzhikov/gardener that referenced this pull request Jun 3, 2024