
Auto recovery after server crash. #58

Open · azalio opened this issue Feb 7, 2020 · 6 comments

@azalio commented Feb 7, 2020

Hello!
As far as I know, if one of my servers breaks completely, the LINSTOR plugin won't automatically set up another copy on a different server, and I need to take manual action to recover.
Can you make this happen without manual intervention?
Unfortunately, I haven't worked with LINSTOR before, but I know that if I use Ceph, for example, recovery happens without my intervention.

@rck (Member) commented Feb 11, 2020

I'm not sure this is something that should even be handled at this level. To me the CSI driver is pretty dumb: it just reacts to attach/detach requests and the like. IMO it simply should not try to magically reschedule things in the cluster; it has to be told what to do. Maybe that is something for a k8s operator? @w00jay, any opinion from a higher-level k8s/operator point of view? Is this something the operator could handle (in the very long run), or am I wrong and should this somehow be part of the CSI driver?
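To make the idea concrete, here is a minimal sketch (using controller-runtime) of what such an operator *could* do: watch Node objects and, once a node has been unreachable past a grace period, ask LINSTOR to re-place its replicas. Nothing like this exists in the CSI driver or operator today, and `replaceReplicasFrom` is a made-up stand-in for a call to the LINSTOR API:

```go
// Sketch only: a reconciler that notices Nodes stuck in Ready=Unknown
// longer than a grace period and asks LINSTOR to re-place replicas.
package main

import (
	"context"
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type nodeFailoverReconciler struct {
	client.Client
	gracePeriod time.Duration
}

// replaceReplicasFrom is a hypothetical stand-in for a call to the
// LINSTOR controller that would auto-place replacement replicas for
// every resource on the dead node. It is not a real API.
func replaceReplicasFrom(ctx context.Context, nodeName string) error {
	log.Printf("would re-place replicas away from node %s", nodeName)
	return nil
}

func (r *nodeFailoverReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var node corev1.Node
	if err := r.Get(ctx, req.NamespacedName, &node); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	for _, cond := range node.Status.Conditions {
		// Ready=Unknown means the kubelet stopped reporting; wait out
		// the grace period before declaring the node dead.
		if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionUnknown &&
			time.Since(cond.LastTransitionTime.Time) > r.gracePeriod {
			if err := replaceReplicasFrom(ctx, node.Name); err != nil {
				return ctrl.Result{}, err
			}
		}
	}
	// Re-check periodically; a silent node produces no further events.
	return ctrl.Result{RequeueAfter: time.Minute}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		log.Fatal(err)
	}
	err = ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Node{}).
		Complete(&nodeFailoverReconciler{Client: mgr.GetClient(), gracePeriod: 5 * time.Minute})
	if err != nil {
		log.Fatal(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}
}
```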

@w00jay (Contributor) commented Feb 11, 2020

As @rck mentioned, the CSI driver does not, and cannot, create a new volume or replica unless the underlying LINSTOR cluster is already provisioned. The CSI driver has no reactive capability of its own.

Even with the current operator, volume placement on creation is best-effort, with no guarantee, as that is the level of service provided by the CSI framework and k8s. We are working toward resolving this at the operator level in the long term, but I'm afraid it is not possible at this time.

@azalio (Author) commented Feb 11, 2020

Thank you for the reply!
So as I understand it, if my server dies I will need to take manual action, and the LINSTOR operator can't help me with that?

@w00jay (Contributor) commented Feb 11, 2020

If a 'server dies' as in 'a k8s node fails' AND does not come back, the current implementation of k8s and the CSI framework most likely cannot deal with it very well. A StatefulSet controller cannot be sure whether the node will ever come back, so it will never drop the connection on its own.

At this stage, our operator is only concerned with registering new LINSTOR storage nodes and deploying new PVs onto those nodes. Even with the operator, storage attached to a failed node cannot be moved to a new storage node without additional intervention in the k8s controller logic, and that logic does not exist in the LINSTOR operator at this time.
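For reference, the manual intervention usually amounts to force-deleting the pods stuck on the failed node so the StatefulSet controller is free to recreate them elsewhere, i.e. what `kubectl delete pod --force --grace-period=0` does. A minimal client-go sketch, assuming in-cluster config and an illustrative node name `worker-3`:

```go
// Sketch: force-delete every pod scheduled on a dead node. Only safe
// once you are certain the node will not come back, since force
// deletion bypasses the StatefulSet at-most-one-pod guarantee.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	const deadNode = "worker-3" // assumed node name

	// List pods in all namespaces that are bound to the dead node.
	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + deadNode,
	})
	if err != nil {
		log.Fatal(err)
	}

	grace := int64(0) // force: do not wait for a kubelet that is gone
	for _, pod := range pods.Items {
		err := clientset.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name,
			metav1.DeleteOptions{GracePeriodSeconds: &grace})
		if err != nil {
			log.Printf("failed to delete %s/%s: %v", pod.Namespace, pod.Name, err)
			continue
		}
		log.Printf("force-deleted %s/%s", pod.Namespace, pod.Name)
	}
}
```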

@kvaps (Member) commented Mar 23, 2020

I think this issue is really about an auto-recovery feature for the linstor-controller, so it would be better to move it to the proper project:

https://github.com/LINBIT/linstor-server

@kvaps (Member) commented Oct 10, 2020

I guess this issue was fixed by implementing k8s-await-election for the linstor-controller; see piraeusdatastore/piraeus-operator#56 and piraeusdatastore/piraeus-operator#73.
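For context, k8s-await-election wraps a process so that only the elected leader among several replicas actually runs it, and a standby takes over when the leader dies. A rough equivalent of that pattern using client-go's leaderelection package (lease name and namespace are illustrative):

```go
// Sketch of the pattern k8s-await-election implements: several
// linstor-controller pods compete for a Lease, and only the winner
// runs the real workload, giving automatic failover when it dies.
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	id, _ := os.Hostname() // each replica identifies itself by pod name

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "linstor-controller", Namespace: "default"},
		Client:     clientset.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// In k8s-await-election this is where the wrapped
				// command (the real linstor-controller) is started.
				log.Println("elected leader; starting controller")
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// Losing the lease means another replica takes over;
				// exit so this pod restarts and rejoins as a standby.
				log.Println("lost leadership; exiting")
				os.Exit(1)
			},
		},
	})
}
```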
