Pod stuck in Terminating status and won't unbind pvc #204

Open
volodaiy opened this issue Jun 23, 2023 · 5 comments

Hi all!

I deployed LINSTOR in HA mode on k3s (3 master/controller nodes), i.e. the LINSTOR controller runs on the master nodes.
Here is what it looks like:
k3s

NAME           STATUS   ROLES                       AGE    VERSION
k3s-master01   Ready    control-plane,etcd,master   4d1h   v1.26.1+k3s1
k3s-master02   Ready    control-plane,etcd,master   4d     v1.26.1+k3s1
k3s-master03   Ready    control-plane,etcd,master   4d1h   v1.26.1+k3s1

linstor

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ Node         ┊ Diskless ┊ LVM ┊ LVMThin ┊ ZFS/Thin ┊ File/Thin ┊ SPDK ┊ EXOS ┊ Remote SPDK ┊ Storage Spaces ┊ Storage Spaces/Thin ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ k3s-master01 ┊ +        ┊ +   ┊ +       ┊ -        ┊ +         ┊ -    ┊ -    ┊ +           ┊ -              ┊ -                   ┊
┊ k3s-master02 ┊ +        ┊ +   ┊ +       ┊ -        ┊ +         ┊ -    ┊ -    ┊ +           ┊ -              ┊ -                   ┊
┊ k3s-master03 ┊ +        ┊ +   ┊ +       ┊ -        ┊ +         ┊ -    ┊ -    ┊ +           ┊ -              ┊ -                   ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

╭──────────────────────────────────────────────────────────────────────────────────────╮
┊ Node         ┊ DRBD ┊ LUKS ┊ NVMe ┊ Cache ┊ BCache ┊ WriteCache ┊ OpenFlex ┊ Storage ┊
╞══════════════════════════════════════════════════════════════════════════════════════╡
┊ k3s-master01 ┊ +    ┊ +    ┊ -    ┊ +     ┊ +      ┊ +          ┊ -        ┊ +       ┊
┊ k3s-master02 ┊ +    ┊ +    ┊ -    ┊ +     ┊ +      ┊ +          ┊ -        ┊ +       ┊
┊ k3s-master03 ┊ +    ┊ +    ┊ -    ┊ +     ┊ +      ┊ +          ┊ -        ┊ +       ┊
╰──────────────────────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node         ┊ Port ┊ Usage  ┊ Conns ┊    State ┊ CreatedOn           ┊
╞══════════════════════════════════════════════════════════════════════════════════════╡
┊ linstor_db   ┊ k3s-master01 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2023-06-21 15:58:38 ┊
┊ linstor_db   ┊ k3s-master02 ┊ 7000 ┊ Unused ┊ Ok    ┊ UpToDate ┊ 2023-06-21 15:58:38 ┊
┊ linstor_db   ┊ k3s-master03 ┊ 7000 ┊ InUse  ┊ Ok    ┊ UpToDate ┊ 2023-06-21 15:58:39 ┊
╰──────────────────────────────────────────────────────────────────────────────────────╯
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: demo-vol-claim-0
  namespace: default
spec:
  storageClassName: linstor-r3
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-pod-0
  namespace: default
spec:
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: demo-pod-0
        image: busybox
        command: ["/bin/sh"]
        args: ["-c", "while true; do sleep 1000s; done"]
        volumeMounts:
          - mountPath: "/data"
            name: demo-vol
      volumes:
        - name: demo-vol
          persistentVolumeClaim:
            claimName: demo-vol-claim-0
  • But when I test HA by disabling one node, the old pod gets stuck in Terminating:
NAME                          READY   STATUS              RESTARTS   AGE
demo-pod-0-5b87665bc8-7xnpz   1/1     Terminating         0          8m8s
demo-pod-0-5b87665bc8-f9l6k   0/1     ContainerCreating   0          23s
  • The old pod cannot release the PVC, and as a result the new pod cannot mount it:
    Warning FailedAttachVolume 118s attachdetach-controller Multi-Attach error for volume "pvc-3af7a2db-aff2-4983-b69d-2e191695c328" Volume is already used by pod(s) demo-pod-0-5b87665bc8-7xnpz

Is it possible to solve this somehow?
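
(For context: a commonly used manual workaround in this situation, not taken from this thread, is to force-delete the stuck pod and remove its stale VolumeAttachment once the powered-off node is confirmed to stay down. The pod and volume names below come from the output above; the VolumeAttachment name is a placeholder.)

# Force-remove the pod object that the dead kubelet can no longer clean up:
kubectl delete pod demo-pod-0-5b87665bc8-7xnpz --grace-period=0 --force
# Find the VolumeAttachment still referencing pvc-3af7a2db-aff2-4983-b69d-2e191695c328 on the dead node:
kubectl get volumeattachments
# Delete it so the attach/detach controller lets the new pod attach the volume:
kubectl delete volumeattachment <csi-attachment-name-from-the-listing>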

@WanzenBug (Member) commented

I assume by "disable one node" you mean you forced the node to shut down or something along those lines?

In that case, you might want to look into https://github.com/piraeusdatastore/piraeus-ha-controller

Can I recommend using the operator instead of deploying and managing LINSTOR manually? It also comes with the ha-controller deployed out of the box.
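
(For reference, the operator-based install from the linked project boils down to roughly the following; the exact version tag and manifest layout should be taken from the piraeus-operator docs, so treat this as a sketch.)

# Install the Piraeus Operator v2 via its kustomize bundle (the ref tag here is an assumption):
kubectl apply --server-side -k "https://github.com/piraeusdatastore/piraeus-operator//config/default?ref=v2"
# Then ask the operator to deploy LINSTOR (the ha-controller is included by default):
kubectl apply -f - <<EOF
apiVersion: piraeus.io/v1
kind: LinstorCluster
metadata:
  name: linstorcluster
spec: {}
EOF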

@volodaiy (Author) commented Jun 23, 2023

I assume by "disable one node" you mean you forced the node to shut down or something along those lines?

In that case, you might want to look into https://github.com/piraeusdatastore/piraeus-ha-controller

Can I recommend using the operator instead of deploying and managing LINSTOR manually? It also comes with the ha-controller deployed out of the box.

Yes, that's right.
I turned off the power on the node.

Thanks for the advice, I'll give it a try!
After the test I will post the result here; if it is positive, I will close the issue with a comment.

@volodaiy (Author) commented Jun 26, 2023

I assume by "disable one node" you mean you forced the node to shut down or something along those lines?

In that case, you might want to look into https://github.com/piraeusdatastore/piraeus-ha-controller

Can I recommend using the operator instead of deploying and managing LINSTOR manually? It also comes with the ha-controller deployed out of the box.

Hello!
I installed LINSTOR via the operator, following your instructions: https://github.com/piraeusdatastore/piraeus-operator/tree/v2/docs/tutorial

In a normal deployment without replication, everything works. But when I enable replication and deploy the example Deployment, the disks are not used and remain in the "Unused" status. Also, when I delete the example, the mounted volumes should be removed automatically, but this does not happen; instead I hit an error: 1000: State change failed: (-2) Need access to UpToDate data.

kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor resource list      
+--------------------------------------------------------------------------------------------------------------------+
| ResourceName                             | Node         | Port | Usage  | Conns |      State | CreatedOn           |
|====================================================================================================================|
| pvc-0ba1ddf5-8b4e-47d5-8d7c-5d02514c02df | k3s-master01 | 7000 | Unused | Ok    |   UpToDate | 2023-06-26 08:45:44 |
| pvc-0ba1ddf5-8b4e-47d5-8d7c-5d02514c02df | k3s-master02 | 7000 | Unused | Ok    |   UpToDate | 2023-06-26 08:45:47 |
| pvc-0ba1ddf5-8b4e-47d5-8d7c-5d02514c02df | k3s-master03 | 7000 | Unused | Ok    | TieBreaker | 2023-06-26 08:45:47 |
+--------------------------------------------------------------------------------------------------------------------+
kubectl get pvc
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                 AGE
replicated-volume   Bound    pvc-0ba1ddf5-8b4e-47d5-8d7c-5d02514c02df   1Gi        RWO            piraeus-storage-replicated   9m9s
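
(For reference, the piraeus-storage-replicated class used above looks roughly like this; the storage pool name is an assumption, and the replica count matches the two UpToDate replicas plus TieBreaker in the listing.)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: piraeus-storage-replicated
provisioner: linstor.csi.linbit.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  linstor.csi.linbit.com/storagePool: pool1        # assumed storage pool name
  linstor.csi.linbit.com/placementCount: "2"       # 2 data replicas + a TieBreaker, as shown above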

Before deploying LINSTOR through the operator, is any preparation needed on the worker nodes?

@volodaiy (Author) commented Jun 26, 2023

Additionally: I noticed a recurring error in journalctl:

09:50:27 k3s-master01 k3s[828]: E0626 09:50:27.401454     828 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"1fb74a3f-38e3-40a1-a9a7-565cd46d9342\" found, but error occurred when trying to remove the volumes dir: not a directory" numErrs=1

After manually deleting this directory from /var/lib/kubelet/pods/, the error disappeared and the PVC switched to the "InUse" status.
Perhaps this is related to kubernetes/kubernetes#105536.
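
(For reference, the manual cleanup described above amounts to something like the following, run on the affected node. The pod UID is taken from the kubelet log line; only do this after confirming that no pod with this UID exists anymore.)

# Run on the node reporting the error (k3s-master01); inspect what is left of the orphaned pod:
ls -l /var/lib/kubelet/pods/1fb74a3f-38e3-40a1-a9a7-565cd46d9342/volumes
# If a stale entry blocks the kubelet's cleanup ("not a directory"), remove the orphaned pod directory:
rm -rf /var/lib/kubelet/pods/1fb74a3f-38e3-40a1-a9a7-565cd46d9342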

@volodaiy (Author) commented

Additionally: saw a recurring error in journalctl:

09:50:27 k3s-master01 k3s[828]: E0626 09:50:27.401454     828 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"1fb74a3f-38e3-40a1-a9a7-565cd46d9342\" found, but error occurred when trying to remove the volumes dir: not a directory" numErrs=1

after manually deleting this directory from /var/lib/kubelet/pods/ the error disappeared and pvc became in "InUse" status. perhaps this is related to kubernetes/kubernetes#105536

This was a one-time issue and has not happened again, but the problem with the "Unused" status remains.
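
(Two hypothetical checks, not suggested in the thread, that could narrow this down: whether Kubernetes created a VolumeAttachment for the volume at all, and what LINSTOR reports for the resource; "Unused" in the listing generally just means no node currently has the DRBD device open.)

# Was the volume attached to a node by the CSI attacher?
kubectl get volumeattachments
# Current LINSTOR view of the resource (same command as above):
kubectl -n piraeus-datastore exec deploy/linstor-controller -- linstor resource list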
