Remove resource limits for the observability components #9785
Conversation
Some time ago we decided to avoid using resource limits, but this was not consistently implemented for these components.

The trigger for this change is a non-self-healing condition we observed recently: when the vertical pod autoscaler has an anomaly (not yet understood) and scales the resource requests of a pod to the lower-bound recommendation instead of the target recommendation (e.g. 1m CPU and 2.6MB memory for the blackbox exporter), the limits are scaled proportionally (e.g. 13MB). The problem is that such low memory limits do not manifest in OOM failures; rather, the container cannot be started in the first place because the `runc` process is killed due to the out-of-memory condition on the cgroup. This manifests in a `RunContainerError` in the pod status and cannot self-heal:

    State:          Waiting
      Reason:       RunContainerError
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: context canceled: unknown

With this change the limits are removed. Even if there is a momentary anomaly in the vertical pod autoscaling, the pod will be able to start and the autoscaler will correct the resource requests in the next iteration.
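For illustration, a minimal Go sketch of the requests-only shape such a component spec ends up with after this change (the helper name and the concrete request values are made up here, not taken from the actual diff):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// observabilityResources is a hypothetical helper showing the requests-only
// shape: VPA can still adjust the requests, but there is no limit left that a
// mis-scaled recommendation could drive below what runc needs to start the
// container.
func observabilityResources() corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("10m"),
			corev1.ResourceMemory: resource.MustParse("32Mi"),
		},
		// Limits intentionally left unset.
	}
}

func main() {
	fmt.Printf("%+v\n", observabilityResources())
}
```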
/lgtm I checked in the prometheus-operator documentation that setting value 0 is what we want:
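A small sketch of what that could look like in Go, assuming the comment refers to passing a zero quantity where a limit would otherwise be set (illustrative only, not the actual change in this PR):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Assumption based on the review comment above: a zero quantity is the
	// documented way to express "no limit" where the field is not simply
	// omitted. Illustrative only, not the actual diff.
	limits := corev1.ResourceList{
		corev1.ResourceMemory: resource.MustParse("0"),
	}
	mem := limits[corev1.ResourceMemory]
	fmt.Println("memory limit set to:", mem.String()) // prints "0"
}
```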
LGTM label has been added. Git tree hash: e8a629897bac70cbecb91f81db4e36bb83d0d899
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rfranzke

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Revert "Remove resource limits for the observability components (gardener#9785)": This reverts commit 0be0fdc.
How to categorize this PR?
/area monitoring
/kind enhancement
What this PR does / why we need it:
Some time ago we decided to avoid using resource limits, but this was not consistently implemented for these components.
With this change the limits are removed for the observability components. Even if there is a momentary anomaly in the vertical pod autoscaling, the pod will be able to start and the autoscaler will correct the resource requests in the next iteration.
Which issue(s) this PR fixes:
The trigger for this change is a non-self-healing condition we observed recently: when the vertical pod autoscaler has an anomaly (not yet understood) and scales the resource requests of a pod to the lower-bound recommendation instead of the target recommendation (e.g. 1m CPU and 2.6MB memory for the blackbox exporter), the limits are scaled proportionally (e.g. 13MB).
The problem is that such low memory limits do not manifest in OOM failures; rather, the container cannot be started in the first place because the `runc` process is killed due to the out-of-memory condition on the cgroup. This manifests in a `RunContainerError` in the pod status and cannot self-heal.
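To make the proportional scaling concrete, a small sketch with assumed numbers (the 5x limit:request ratio is a guess; only the 2.6MB and 13MB figures come from the observation above):

```go
package main

import "fmt"

func main() {
	// VPA preserves the original limit:request ratio when it rewrites the
	// requests. With the lower-bound memory recommendation from the
	// description (2.6MB) and an assumed 5x ratio, the rewritten limit lands
	// at 13MB, matching the value that triggered the runc OOM kill.
	const limitToRequestRatio = 5.0 // assumed, not taken from the manifests
	lowerBoundMB := 2.6             // lower-bound recommendation (from the description)
	fmt.Printf("scaled memory limit: %.0fMB\n", lowerBoundMB*limitToRequestRatio)
}
```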
Special notes for your reviewer:
cc @vicwicker @rickardsjp
fyi @dguendisch @ialidzhikov
Release note: