Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NumberFormatExceptions followed by "snapshot name is in use for volume ..." #174

Open
blampe opened this issue Aug 29, 2022 · 3 comments
Open

Comments

@blampe
Copy link

blampe commented Aug 29, 2022

I'm trying to setup snapshot shipping via CSI (#142) without much success. Most recently I had a flood of errors around parse failures, but there's not enough detail to figure out what it's failing to parse. I thought maybe it had something to do with my VolumeClassName ("linstor.minio") having a period in the name maybe throwing off parsing, but I was able to create another backup (from a significantly smaller volume) using the same class. This all might be a red herring...

It seems after these NumberFormatExceptions happen the system gets into a mixed up state as you can see below -- namely it tries to re-assign an existing snapshot name to a different PVC.

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  creationTimestamp: "2022-08-29T08:31:01Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
  - snapshot.storage.kubernetes.io/volumesnapshot-bound-protection
  generateName: velero-vmsingle-vm-victoria-metrics-k8s-stack-
  generation: 1
  labels:
    velero.io/backup-name: linstor-20220829080004
  name: velero-vmsingle-vm-victoria-metrics-k8s-stack-844m6
  namespace: prometheus
  resourceVersion: "56560018"
  uid: 17660871-6cf4-4d07-8c11-8c9f5922625a
spec:
  source:
    persistentVolumeClaimName: vmsingle-vm-victoria-metrics-k8s-stack
  volumeSnapshotClassName: linstor.minio
status:
  boundVolumeSnapshotContentName: snapcontent-17660871-6cf4-4d07-8c11-8c9f5922625a
  error:
    message: 'Failed to check and update snapshot content: failed to take snapshot
      of the volume pvc-7ca5e665-9bd1-4f9d-8209-92dcb8474edf: "rpc error: code = AlreadyExists
      desc = can''t use \"snapshot-17660871-6cf4-4d07-8c11-8c9f5922625a\" for snapshot
      name for volume \"pvc-7ca5e665-9bd1-4f9d-8209-92dcb8474edf\", snapshot name
      is in use for volume \"pvc-157c3fce-5720-4ca0-ba11-e2cab638ea93\""'
    time: "2022-08-29T14:58:59Z"
  readyToUse: false

pvc-157c3fce-5720-4ca0-ba11-e2cab638ea93 is indeed another volume that was snapshotted around the same time as this one. Is there perhaps a race condition where creating multiple snapshots around the same time can cause names to not be unique?

# linstor err l
| 630C4D06-00000-000045 | 2022-08-29 13:29:19 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000046 | 2022-08-29 13:29:19 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000047 | 2022-08-29 13:30:30 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000048 | 2022-08-29 13:30:30 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000049 | 2022-08-29 13:31:19 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000050 | 2022-08-29 13:31:19 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000051 | 2022-08-29 13:38:20 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000052 | 2022-08-29 13:38:20 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000053 | 2022-08-29 13:53:45 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000054 | 2022-08-29 13:53:45 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000055 | 2022-08-29 14:00:40 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000056 | 2022-08-29 14:00:40 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000057 | 2022-08-29 14:04:54 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000058 | 2022-08-29 14:04:54 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000059 | 2022-08-29 14:08:40 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000060 | 2022-08-29 14:08:40 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000061 | 2022-08-29 14:08:41 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000062 | 2022-08-29 14:08:41 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000063 | 2022-08-29 14:16:47 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000064 | 2022-08-29 14:16:47 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000065 | 2022-08-29 14:21:52 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000066 | 2022-08-29 14:21:52 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000067 | 2022-08-29 14:26:04 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: For input string: ".2022022E4.2022022E4"              |
| 630C4D06-00000-000068 | 2022-08-29 14:28:44 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000069 | 2022-08-29 14:28:44 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000070 | 2022-08-29 14:30:09 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000071 | 2022-08-29 14:30:09 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000072 | 2022-08-29 14:31:58 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000073 | 2022-08-29 14:31:58 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000074 | 2022-08-29 14:44:20 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
| 630C4D06-00000-000075 | 2022-08-29 14:44:20 | C|piraeus-op-cs-controller-5b8ff7f4c8-lfbk2 | NumberFormatException: multiple points                                       |
# linstor err show (multiple points)
ERROR REPORT 630C4D06-00000-000066

============================================================

Application:                        LINBIT�� LINSTOR
Module:                             Controller
Version:                            1.19.1
Build ID:                           a758bf07796c374fd2004465b0d8690209b74356
Build time:                         2022-07-27T06:36:54+00:00
Error time:                         2022-08-29 14:21:52
Node:                               piraeus-op-cs-controller-5b8ff7f4c8-lfbk2

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         NumberFormatException
Class canonical name:               java.lang.NumberFormatException
Generated at:                       Method 'readJavaFormatString', Source file 'FloatingDecimal.java', Line #1890

Error message:                      multiple points

Call backtrace:

    Method                                   Native Class:Line number
    readJavaFormatString                     N      jdk.internal.math.FloatingDecimal:1890
    parseDouble                              N      jdk.internal.math.FloatingDecimal:110
    parseDouble                              N      java.lang.Double:543
    getDouble                                N      java.text.DigitList:169
    parse                                    N      java.text.DecimalFormat:2126
    subParse                                 N      java.text.SimpleDateFormat:1931
    parse                                    N      java.text.SimpleDateFormat:1541
    parse                                    N      java.text.DateFormat:393
    <init>                                   N      com.linbit.linstor.backupshipping.S3MetafileNameInfo:46
    listBackups                              N      com.linbit.linstor.core.apicallhandler.controller.backup.CtrlBackupApiCallHandler:679
    lambda$listBackups$0                     N      com.linbit.linstor.api.rest.v1.Backups:304
    doInScope                                N      com.linbit.linstor.api.rest.v1.RequestHelper:231
    doInScope                                N      com.linbit.linstor.api.rest.v1.RequestHelper:203
    listBackups                              N      com.linbit.linstor.api.rest.v1.Backups:298
    invoke                                   N      jdk.internal.reflect.GeneratedMethodAccessor194:unknown
    invoke                                   N      jdk.internal.reflect.DelegatingMethodAccessorImpl:43
    invoke                                   N      java.lang.reflect.Method:566
    lambda$static$0                          N      org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
    run                                      N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:124
    invoke                                   N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:167
    doDispatch                               N      org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
    dispatch                                 N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:79
    invoke                                   N      org.glassfish.jersey.server.model.ResourceMethodInvoker:469
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:391
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:80
    run                                      N      org.glassfish.jersey.server.ServerRuntime$1:253
    call                                     N      org.glassfish.jersey.internal.Errors$1:248
    call                                     N      org.glassfish.jersey.internal.Errors$1:244
    process                                  N      org.glassfish.jersey.internal.Errors:292
    process                                  N      org.glassfish.jersey.internal.Errors:274
    process                                  N      org.glassfish.jersey.internal.Errors:244
    runInScope                               N      org.glassfish.jersey.process.internal.RequestScope:265
    process                                  N      org.glassfish.jersey.server.ServerRuntime:232
    handle                                   N      org.glassfish.jersey.server.ApplicationHandler:680
    service                                  N      org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
    run                                      N      org.glassfish.grizzly.http.server.HttpHandler$1:200
    doWork                                   N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:569
    run                                      N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:549
    run                                      N      java.lang.Thread:829


END OF ERROR REPORT.
# linstor err show (For input string)
ERROR REPORT 630C4D06-00000-000067

============================================================

Application:                        LINBIT�� LINSTOR
Module:                             Controller
Version:                            1.19.1
Build ID:                           a758bf07796c374fd2004465b0d8690209b74356
Build time:                         2022-07-27T06:36:54+00:00
Error time:                         2022-08-29 14:26:04
Node:                               piraeus-op-cs-controller-5b8ff7f4c8-lfbk2

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         NumberFormatException
Class canonical name:               java.lang.NumberFormatException
Generated at:                       Method 'readJavaFormatString', Source file 'FloatingDecimal.java', Line #2054

Error message:                      For input string: ".2022022E4.2022022E4"

Call backtrace:

    Method                                   Native Class:Line number
    readJavaFormatString                     N      jdk.internal.math.FloatingDecimal:2054
    parseDouble                              N      jdk.internal.math.FloatingDecimal:110
    parseDouble                              N      java.lang.Double:543
    getDouble                                N      java.text.DigitList:169
    parse                                    N      java.text.DecimalFormat:2126
    subParse                                 N      java.text.SimpleDateFormat:1931
    parse                                    N      java.text.SimpleDateFormat:1541
    parse                                    N      java.text.DateFormat:393
    <init>                                   N      com.linbit.linstor.backupshipping.S3MetafileNameInfo:46
    listBackups                              N      com.linbit.linstor.core.apicallhandler.controller.backup.CtrlBackupApiCallHandler:679
    lambda$listBackups$0                     N      com.linbit.linstor.api.rest.v1.Backups:304
    doInScope                                N      com.linbit.linstor.api.rest.v1.RequestHelper:231
    doInScope                                N      com.linbit.linstor.api.rest.v1.RequestHelper:203
    listBackups                              N      com.linbit.linstor.api.rest.v1.Backups:298
    invoke                                   N      jdk.internal.reflect.GeneratedMethodAccessor194:unknown
    invoke                                   N      jdk.internal.reflect.DelegatingMethodAccessorImpl:43
    invoke                                   N      java.lang.reflect.Method:566
    lambda$static$0                          N      org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory:52
    run                                      N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1:124
    invoke                                   N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:167
    doDispatch                               N      org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker:176
    dispatch                                 N      org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher:79
    invoke                                   N      org.glassfish.jersey.server.model.ResourceMethodInvoker:469
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:391
    apply                                    N      org.glassfish.jersey.server.model.ResourceMethodInvoker:80
    run                                      N      org.glassfish.jersey.server.ServerRuntime$1:253
    call                                     N      org.glassfish.jersey.internal.Errors$1:248
    call                                     N      org.glassfish.jersey.internal.Errors$1:244
    process                                  N      org.glassfish.jersey.internal.Errors:292
    process                                  N      org.glassfish.jersey.internal.Errors:274
    process                                  N      org.glassfish.jersey.internal.Errors:244
    runInScope                               N      org.glassfish.jersey.process.internal.RequestScope:265
    process                                  N      org.glassfish.jersey.server.ServerRuntime:232
    handle                                   N      org.glassfish.jersey.server.ApplicationHandler:680
    service                                  N      org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer:356
    run                                      N      org.glassfish.grizzly.http.server.HttpHandler$1:200
    doWork                                   N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:569
    run                                      N      org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker:549
    run                                      N      java.lang.Thread:829


END OF ERROR REPORT.
@WanzenBug
Copy link
Member

Weird. Since the error seems to be in listBackups (which gets called by the CSI driver on creation) I assume it's not actually about any snapshot it tries to create, but about what files already exist in the bucket. This looks like an issue that needs to be fixed upstream: https://github.com/LINBIT/linstor-server/issues

@blampe
Copy link
Author

blampe commented Sep 2, 2022

Running with --log-level=debug shed a little more light on this...

time="2022-09-02T02:10:03Z" level=debug msg="no snapshot matching id found, trying backups" func="github.com/piraeusdatastore/linstor-csi/pkg/client.(*Linstor).snapOrBackupById" file="/src/pkg/client/linstor.go:1367" id=snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba linstorCSIComponent=client
time="2022-09-02T02:10:03Z" level=debug msg="curl -X 'GET' -H 'Accept: application/json' 'https://piraeus-op-cs.piraeus.svc:3371/v1/remotes/s3'"
time="2022-09-02T02:10:03Z" level=debug msg="listing backups in remote" func="github.com/piraeusdatastore/linstor-csi/pkg/client.(*Linstor).snapOrBackupById" file="/src/pkg/client/linstor.go:1377" id=snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba linstorCSIComponent=client remote=minio
time="2022-09-02T02:10:03Z" level=debug msg="curl -X 'GET' -H 'Accept: application/json' 'https://piraeus-op-cs.piraeus.svc:3371/v1/remotes/minio/backups?snap_name=snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba'"

time="2022-09-02T02:10:03Z" level=debug msg="found existing snapshot" func=github.com/piraeusdatastore/linstor-csi/pkg/driver.Driver.CreateSnapshot file="/src/pkg/driver/driver.go:965" existingSnapshot="snapshot_id:\"snapshot-dc295533-ba78-4e3f-b110-54175f1386f1\" source_volume_id:\"pvc-621a06d8-0503-4cf3-84af-d75e1bf40508\" creation_time:<seconds:1662034561 > " linstorCSIComponent=driver nodeID=pi4c provisioner=linstor.csi.linbit.com requestedSnapshotName=snapshot-1a9289d2-b984-4527-baf4-522b42908e0e requestedSnapshotSourceVolume=pvc-d2ec0c0d-f114-4fe2-9c30-a56279e356b6 version=v0.20.0-d514e41db7cdcb580769cc69f1c1ef2b8a5def5d

That last line is interesting... It queries for a backup with snap_name=snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba and it gets a result... but that result is a totally different one (snapshot-dc295533-ba78-4e3f-b110-54175f1386f1) than the one it asked for?

Furthermore the snapshot returned is unrecoverable. That should at least unblock me (I can delete it), although I'm surprised it wasn't deleted/cleaned up automatically (it was totally hidden in the CLI, which is as good as deleted, anyway).

root@piraeus-op-cs-controller-5b8ff7f4c8-nqns5:/# linstor backup l minio -s snapshot-6b4f6adb-7a82-4d2f-a72c-2bc987d70dba -p
+------------------------------------------------------------------------------------------------------------------------------------+
| Resource                                 | Snapshot                                      | Finished at | Based On | Status         |
|====================================================================================================================================|
| pvc-621a06d8-0503-4cf3-84af-d75e1bf40508 | snapshot-dc295533-ba78-4e3f-b110-54175f1386f1 |             |          | Not restorable |
+------------------------------------------------------------------------------------------------------------------------------------+

What's still confusing me (and where there's maybe room for the CSI to improve) is why this UUID keeps getting recycled for so many different volumes. I know very little about CSI, but naively it seems like each snapshot request should have its own UUID, making collisions like this impossible. The only way I can see that happening is if CompatibleSnapshotId gets the same invalid name for multiple snapshots... some of my PVCs do have identical names, if that's relevant? Is there any downside to adding entropy to the fallback ID that's generated?

@blampe
Copy link
Author

blampe commented Sep 18, 2022

@WanzenBug take a look at LINBIT/linstor-server#314 -- the presence of any backups in the cluster completely breaks snapshotting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants