
Auto recovery after server crash. #58

Open · azalio opened this issue Feb 7, 2020 · 6 comments

@azalio commented Feb 7, 2020

Hello!
As far as I know, if one of my servers breaks completely, the LINSTOR plugin won't automatically set up another copy on a different server, and I need to take manual action to recover.
Can you make this happen without manual intervention?
Unfortunately, I haven't worked with LINSTOR before, but I know that if I use Ceph, for example, recovery happens without my intervention.

@rck (Member) commented Feb 11, 2020

I'm not sure this is something that should even be handled at this level. To me the CSI driver is pretty dumb: it just reacts to attach/detach requests and the like. IMO it simply should not try to magically reschedule things in the cluster; it has to be told what to do. Maybe that is something for a k8s operator? @w00jay, any opinion from a higher-level k8s/operator point of view? Is this something the operator could handle (in the very long run), or am I wrong and should this somehow be part of the CSI driver?
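To make the idea concrete, here is a minimal sketch (using controller-runtime) of what such an operator *could* do: watch Node objects and, once a node has been unreachable past a grace period, ask LINSTOR to re-place its replicas. Nothing like this exists in the CSI driver or operator today, and `replaceReplicasFrom` is a made-up stand-in for a call to the LINSTOR API:

```go
// Sketch only: a reconciler that notices Nodes stuck in Ready=Unknown
// longer than a grace period and asks LINSTOR to re-place replicas.
package main

import (
	"context"
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type nodeFailoverReconciler struct {
	client.Client
	gracePeriod time.Duration
}

// replaceReplicasFrom is a hypothetical stand-in for a call to the
// LINSTOR controller that would auto-place replacement replicas for
// every resource on the dead node. It is not a real API.
func replaceReplicasFrom(ctx context.Context, nodeName string) error {
	log.Printf("would re-place replicas away from node %s", nodeName)
	return nil
}

func (r *nodeFailoverReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var node corev1.Node
	if err := r.Get(ctx, req.NamespacedName, &node); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	for _, cond := range node.Status.Conditions {
		// Ready=Unknown means the kubelet stopped reporting; wait out
		// the grace period before declaring the node dead.
		if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionUnknown &&
			time.Since(cond.LastTransitionTime.Time) > r.gracePeriod {
			if err := replaceReplicasFrom(ctx, node.Name); err != nil {
				return ctrl.Result{}, err
			}
		}
	}
	// Re-check periodically; a silent node produces no further events.
	return ctrl.Result{RequeueAfter: time.Minute}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		log.Fatal(err)
	}
	err = ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Node{}).
		Complete(&nodeFailoverReconciler{Client: mgr.GetClient(), gracePeriod: 5 * time.Minute})
	if err != nil {
		log.Fatal(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}
}
```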

@w00jay (Contributor) commented Feb 11, 2020

As @rck mentioned, the CSI driver does not, and cannot, create a new volume or replica unless the underlying LINSTOR cluster is already provisioned. The CSI driver has no reactive capability of its own.

Even with the current operator, volume placement on creation is best-effort, with no guarantee, as that is the level of service provided by the CSI framework and k8s. We are working toward resolving this at the operator level in the long term, but I'm afraid it is not possible at this time.

@azalio (Author) commented Feb 11, 2020

Thank you for the reply!
So as I understand it, if my server dies I will need to take manual action, and the LINSTOR operator can't help me with that?

@w00jay (Contributor) commented Feb 11, 2020

If a 'server dies' as in 'a k8s node fails' AND does not come back, the current implementation of k8s and the CSI framework most likely cannot deal with it very well. A StatefulSet controller cannot be sure whether the node will ever come back, so it will never drop the connection on its own.

At this stage, our operator is only concerned with registering new LINSTOR storage nodes and deploying new PVs onto those nodes. Even with the operator, storage attached to a failed node cannot be moved to a new storage node without additional intervention in the k8s controller logic, and that logic does not exist in the LINSTOR operator at this time.
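For reference, the manual intervention usually amounts to force-deleting the pods stuck on the failed node so the StatefulSet controller is free to recreate them elsewhere, i.e. what `kubectl delete pod --force --grace-period=0` does. A minimal client-go sketch, assuming in-cluster config and an illustrative node name `worker-3`:

```go
// Sketch: force-delete every pod scheduled on a dead node. Only safe
// once you are certain the node will not come back, since force
// deletion bypasses the StatefulSet at-most-one-pod guarantee.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	const deadNode = "worker-3" // assumed node name

	// List pods in all namespaces that are bound to the dead node.
	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + deadNode,
	})
	if err != nil {
		log.Fatal(err)
	}

	grace := int64(0) // force: do not wait for a kubelet that is gone
	for _, pod := range pods.Items {
		err := clientset.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name,
			metav1.DeleteOptions{GracePeriodSeconds: &grace})
		if err != nil {
			log.Printf("failed to delete %s/%s: %v", pod.Namespace, pod.Name, err)
			continue
		}
		log.Printf("force-deleted %s/%s", pod.Namespace, pod.Name)
	}
}
```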

@kvaps (Member) commented Mar 23, 2020

I think this issue is really about an auto-recovery feature for the linstor-controller, so it would be better to move it to the proper project:

https://github.com/LINBIT/linstor-server

@kvaps (Member) commented Oct 10, 2020

I guess this issue was fixed by implementing k8s-await-election for the linstor-controller; see piraeusdatastore/piraeus-operator#56 and piraeusdatastore/piraeus-operator#73.
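For context, k8s-await-election wraps a process so that only the elected leader among several replicas actually runs it, and a standby takes over when the leader dies. A rough equivalent of that pattern using client-go's leaderelection package (lease name and namespace are illustrative):

```go
// Sketch of the pattern k8s-await-election implements: several
// linstor-controller pods compete for a Lease, and only the winner
// runs the real workload, giving automatic failover when it dies.
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	id, _ := os.Hostname() // each replica identifies itself by pod name

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "linstor-controller", Namespace: "default"},
		Client:     clientset.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// In k8s-await-election this is where the wrapped
				// command (the real linstor-controller) is started.
				log.Println("elected leader; starting controller")
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// Losing the lease means another replica takes over;
				// exit so this pod restarts and rejoins as a standby.
				log.Println("lost leadership; exiting")
				os.Exit(1)
			},
		},
	})
}
```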
