
Can't reuse volume after delayed deletion #89

Open
kvaps opened this issue Oct 10, 2020 · 1 comment

Comments

@kvaps
Member

kvaps commented Oct 10, 2020

Hi, we're using the latest STORK plugin from upstream; by default it comes with the health monitor enabled:

   --health-monitor                           Enable health monitoring of the storage driver (default: true)
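For reference, that flag is just an argument on the stork container, so it can be flipped per deployment. A minimal sketch of how we would turn it off while debugging, assuming stork runs as a Deployment named stork in kube-system (names may differ per install):

# open the stork Deployment for editing (namespace/name are assumptions)
kubectl -n kube-system edit deployment stork
# then add or change the argument on the stork container:
#   --health-monitor=false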

Today we ran into a painful issue. We have many nodes; some of them occasionally get overloaded and flap between the Online and OFFLINE states.

STORK detects these nodes, tries to reattach the volumes, and restarts the pods in place. Example log message:

time="2020-10-10T19:46:16Z" level=info msg="Deleting Pod from Node m9c17 due to volume driver status: Offline ()" Namespace=hosting Owner=ReplicaSet/hc1-wd48-678d9888fb PodName=hc1-wd48-678d9888fb-p8gck

This causes really weird behavior in the linstor-csi driver:

Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Normal   Scheduled               10m                   default-scheduler        Successfully assigned hosting/hc1-wd48-678d9888fb-fsmcq to m9c17
  Warning  FailedMount             9m39s (x11 over 10m)  kubelet, m9c17           MountVolume.WaitForAttach failed for volume "pvc-ddd150c5-94eb-48a2-9126-4d1339811752" : volume attachment is being deleted
  Warning  FailedMount             9m35s (x10 over 10m)  kubelet, m9c17           MountVolume.SetUp failed for volume "pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4" : kubernetes.io/csi: mounter.SetupAt failed: rpc error: code = Internal desc = NodePublishVolume failed for pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4: checking device path failed: path "" does not exist
  Warning  FailedMount             9m7s                  kubelet, m9c17           MountVolume.WaitForAttach failed for volume "pvc-ddd150c5-94eb-48a2-9126-4d1339811752" : volume pvc-ddd150c5-94eb-48a2-9126-4d1339811752 has GET error for volume attachment csi-9ff6fcc944f9e40da6106d5175b34c3e53f7449ee0a990f6c2c69ba07764d9e1: volumeattachments.storage.k8s.io "csi-9ff6fcc944f9e40da6106d5175b34c3e53f7449ee0a990f6c2c69ba07764d9e1" is forbidden: User "system:node:m9c17" cannot get resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope: no relationship found between node "m9c17" and this object
  Warning  FailedMount             8m20s                 kubelet, m9c17           Unable to attach or mount volumes: unmounted volumes=[vol-data-backup vol-data-web], unattached volumes=[wd48-vol-data-global run vol-data-backup wd48-vol-shared default-token-jt2jk cgroup fuse vol-data-web wd48-vol-data-proxy]: timed out waiting for the condition
  Normal   SuccessfulAttachVolume  8m14s                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-ddd150c5-94eb-48a2-9126-4d1339811752"
  Warning  FailedMount             3m55s (x4 over 9m3s)  kubelet, m9c17           MountVolume.SetUp failed for volume "pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4" : kubernetes.io/csi: mounter.SetupAt failed: rpc error: code = Internal desc = NodePublishVolume failed for pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4: 404 Not Found
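For context, the events above come from describing the affected pod, i.e. something like (namespace and pod name taken from the log and events above):

# show the events for the pod that STORK restarted
kubectl -n hosting describe pod hc1-wd48-678d9888fb-fsmcq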

The volume can get stuck in the DELETING state:

╭──────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node  ┊ Port  ┊ Usage  ┊ Conns ┊    State ┊ CreatedOn ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4 ┊ m9c17 ┊ 55207 ┊        ┊ Ok    ┊ DELETING ┊           ┊
┊ pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4 ┊ m5c18 ┊ 55207 ┊ Unused ┊ Ok    ┊ UpToDate ┊           ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
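The table above is the output of linstor resource list filtered to the affected resource, something like:

# list the resource on all nodes to see the replica stuck in DELETING
linstor resource list --resources pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4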

The csi-attacher logs say:

I1010 16:40:34.311031       1 request.go:581] Throttling request took 1.950726244s, request: PATCH:https://10.96.0.1:443/apis/storage.k8s.io/v1/volumeattachments/csi-3df4c97cd2abde367eebc52e875e7af60add6daa8a4afe6dbbb87445ff222c8a/status
I1010 16:40:34.320360       1 csi_handler.go:612] Saved detach error to "csi-3df4c97cd2abde367eebc52e875e7af60add6daa8a4afe6dbbb87445ff222c8a"
I1010 16:40:34.320403       1 csi_handler.go:226] Error processing "csi-3df4c97cd2abde367eebc52e875e7af60add6daa8a4afe6dbbb87445ff222c8a": failed to detach: rpc error: code = Internal desc = ControllerpublishVolume failed for pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4: Message: 'Resource 'pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4' is still in use.'; Cause: 'Resource is mounted/in use.'; Details: 'Node: m9c17, Resource: pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4'; Correction: 'Un-mount resource 'pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4' on the node 'm9c17'.'; Reports: '[5F81D04D-00000-024056]'
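The "Resource is mounted/in use" part can be cross-checked directly on the node; a sketch of what we would run on m9c17 (resource name taken from the logs above):

# check whether DRBD still has the resource configured / in use on the node
drbdadm status pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4
# check whether kubelet still has a mount referencing it
findmnt | grep pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4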

After a while the diskless resource gets removed from the node:

╭──────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName                             ┊ Node  ┊ Port  ┊ Usage  ┊ Conns ┊    State ┊ CreatedOn ┊
╞══════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4 ┊ m5c18 ┊ 55207 ┊ Unused ┊ Ok    ┊ UpToDate ┊           ┊
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

But the VolumeAttachment continues to exist for the node:

# kubectl get volumeattachments.storage.k8s.io | grep pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4
csi-3df4c97cd2abde367eebc52e875e7af60add6daa8a4afe6dbbb87445ff222c8a   linstor.csi.linbit.com   pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4   m9c17    true       138m
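The detach error that csi-attacher saved above is also visible on that stale VolumeAttachment, e.g.:

# inspect the stale attachment; status.detachError carries the error saved by csi-attacher
kubectl get volumeattachments.storage.k8s.io csi-3df4c97cd2abde367eebc52e875e7af60add6daa8a4afe6dbbb87445ff222c8a -o yaml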

However, this still does not allow the pod to start, because the DRBD device is missing on the node. One possible way to fix it is to create the resource manually, to satisfy the existing VolumeAttachment:

linstor r c m9c17 pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4 --diskless
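Spelled out, and with a quick check afterwards (resource and node names from above):

# expanded form of the workaround: recreate a diskless replica on the node
linstor resource create m9c17 pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4 --diskless
# confirm the DRBD device is back so the existing volumeattachment can be satisfied
linstor resource list --resources pvc-712ea0dc-5378-41fc-8c8a-5db8f50c8db4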

I guess this is exactly the case mentioned by @rck in #52 (comment).

@kvaps kvaps changed the title Can't reuse volumeattachment after delayed deletion Can't reuse volume after delayed deletion Oct 10, 2020
kvaps added a commit to kvaps/kube-linstor that referenced this issue Oct 10, 2020
@kvaps
Member Author

kvaps commented Oct 12, 2020

Just linking this issue to the related Slack thread:
https://linbit-community.slack.com/archives/CPDJCHW2X/p1602491983047900
