Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: disk-stalled/wal-failover/among-stores failed #124399

Open
cockroach-teamcity opened this issue May 19, 2024 · 10 comments
Open

roachtest: disk-stalled/wal-failover/among-stores failed #124399

cockroach-teamcity opened this issue May 19, 2024 · 10 comments
Assignees
Labels
branch-release-24.1 Used to mark GA and release blockers and technical advisories for 24.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team
Projects
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented May 19, 2024

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 2a21984e2fd9b8aff8fc8bd5c9d80785168daf71:

(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.214567s at 2024-05-19T13:43:00Z
(cluster.go:2349).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-38872

@cockroach-teamcity cockroach-teamcity added branch-release-24.1 Used to mark GA and release blockers and technical advisories for 24.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-storage Storage Team labels May 19, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone May 19, 2024
@cockroach-teamcity cockroach-teamcity added this to Incoming in Storage May 19, 2024
@sumeerbhola
Copy link
Collaborator

The following disk stall caused the problem
13:42:42 disk_stall.go:128: test status: pausing 18.999452787s before next simulated disk stall on n1
13:43:01 cluster.go:2369: running cmd sudo dmsetup suspend --nofl... on nodes [:1]; details in run_134301.673602484_n1_sudo-dmsetup-suspend.log
13:43:32 cluster.go:2369: running cmd sudo dmsetup resume data1 on nodes [:1]; details in run_134332.310261019_n1_sudo-dmsetup-resume-.log
13:43:42 disk_stall.go:128: test status: pausing 9m18.999550257s before next simulated disk stall on n1

The pmax of observed WAL fsync latency is not unexpectedly high.
Screenshot 2024-05-19 at 5 11 26 PM

Unlike #122364, this is not encrypted FS.
KV p99 is 10+s and n1 lost its leases:
Screenshot 2024-05-19 at 5 14 44 PM

Screenshot 2024-05-19 at 5 15 52 PM

@cockroach-teamcity
Copy link
Member Author

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 2a21984e2fd9b8aff8fc8bd5c9d80785168daf71:

(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.19715065s at 2024-05-20T13:29:00Z
(cluster.go:2349).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@sumeerbhola
Copy link
Collaborator

The p99 goroutine scheduling latency is low, and the number of runnable goroutines is also low, but we have slot exhaustion. Which must be a side-effect of request processing being stuck. We don't have latency histograms for the storage read path, which would allow us to narrow down.
Screenshot 2024-05-21 at 1 05 16 PM
Screenshot 2024-05-21 at 1 05 53 PM
Screenshot 2024-05-21 at 1 06 03 PM

@sumeerbhola
Copy link
Collaborator

We could enhance the test to monitor the used slots on n1 and if it exceeds 500, take a goroutine dump, so we know where those goroutines are stuck.

@sumeerbhola
Copy link
Collaborator

Never mind. We have a dump.

@sumeerbhola sumeerbhola self-assigned this May 21, 2024
@RaduBerinde RaduBerinde moved this from Incoming to Tests (failures, skipped, flakes) in Storage May 21, 2024
@sumeerbhola
Copy link
Collaborator

There are > 2500 goroutines stuck waiting on [1]

<-v.loaded

in tableCacheShard, so they are waiting for the cache value (*Reader etc.) to be loaded from disk.
There are 3 goroutines that are trying to do the load

goroutine 2557115 [runnable]:
runtime/pprof.Do({0x7d2ee08?, 0xca9b180?}, {{0xc00138ed00?, 0x0?, 0x0?}}, 0xc00a4d82e8)
	GOROOT/src/runtime/pprof/runtime.go:52 +0xad
github.com/cockroachdb/pebble.(*tableCacheShard).findNodeInternal(0xc0036446e0, {0x603b?}, 0xc00175b958)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:931 +0x43c
github.com/cockroachdb/pebble.(*tableCacheShard).findNode(0xc0036446e0, 0xc002014840, 0xc00175b958)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:844 +0x3c

Which is this code
https://github.com/cockroachdb/pebble/blob/c077435aef13d640fc599ad01d10455f68510d1a/table_cache.go#L931-L933

But they are in the runtime package here https://github.com/golang/go/blob/master/src/runtime/pprof/runtime.go#L47-L52 which (since this is at the closing brace of the function) looks like defer SetGoroutineLabels(ctx), which calls into https://github.com/golang/go/blob/master/src/runtime/proflabel.go#L12-L35. Unclear why this should be stuck here.
[1]

goroutine 2942846 [runnable]:
github.com/cockroachdb/pebble.(*tableCacheShard).findNodeInternal(0xc0036446e0, {0x6a03?}, 0xc00175b958)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:862 +0x10f
github.com/cockroachdb/pebble.(*tableCacheShard).findNode(0xc0036446e0, 0xc002014840, 0xc00175b958)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:844 +0x3c
github.com/cockroachdb/pebble.(*tableCacheShard).newIters(0xc0036446e0, {0x7d30150, 0xc0e3361a70}, 0xc03fc99a00, 0xc0e301b170, {0x0, 0x0, 0xc0e301a9d0, {0x0, 0x0}}, ...)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:505 +0x97
github.com/cockroachdb/pebble.(*tableCacheContainer).newIters(0x7f82ead0a064?, {0x7d30150?, 0xc0e3361a70?}, 0x115a6ca?, 0xc00f372b40?, {0x0, 0x0, 0xc0e301a9d0, {0x0, 0x0}}, ...)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:219 +0x9e
github.com/cockroachdb/pebble.tableNewIters.TODO(0xd?, {0x7d30150?, 0xc0e3361a70?}, 0x125b0e5?, 0xc0e301b2d8?, {0x0, 0x0, 0xc0e301a9d0, {0x0, 0x0}})
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:90 +0x6e
github.com/cockroachdb/pebble.(*levelIter).loadFile(0xc0e301b108, 0xc03fc99a00, 0x1)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/level_iter.go:647 +0x2d1
github.com/cockroachdb/pebble.(*levelIter).SeekPrefixGE(0xc0e301b108, {0xc0a5d23140, 0xd, 0xd}, {0xc0a5d23670, 0xd, 0xd}, 0x0)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/level_iter.go:725 +0xb7
github.com/cockroachdb/pebble.(*mergingIter).seekGE(0xc0e301ac00, {0xc0a5d23670?, 0x0?, 0x17d0e7acb5255d37?}, 0x0?, 0x0)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/merging_iter.go:1035 +0xf1
github.com/cockroachdb/pebble.(*mergingIter).SeekPrefixGEStrict(0xc0e301ac00, {0xc0a5d23140?, 0x0?, 0x0?}, {0xc0a5d23670?, 0x1000121ebcf?, 0x7f80187fa960?}, 0x10?)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/merging_iter.go:1113 +0x65
github.com/cockroachdb/pebble.(*mergingIter).SeekPrefixGE(0xc01993b008?, {0xc0a5d23140?, 0x0?, 0x0?}, {0xc0a5d23670?, 0xc008c0e9b8?, 0x4b6789?}, 0xd?)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/merging_iter.go:1104 +0x1d
github.com/cockroachdb/pebble.(*lazyCombinedIter).SeekPrefixGE(0xc0e301aa68, {0xc0a5d23140?, 0x0?, 0xd?}, {0xc0a5d23670, 0xd, 0xd}, 0x10?)
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/range_keys.go:617 +0x57
github.com/cockroachdb/pebble.(*Iterator).SeekPrefixGE(0xc0e301a608, {0xc0a5d23670, 0xd, 0xd})
	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/iterator.go:1492 +0x564
github.com/cockroachdb/cockroach/pkg/storage.(*pebbleIterator).SeekGE(0xc08ae59858, {{0xc0a5d23640, 0xc, 0x10}, {0x0, 0x0}})
	github.com/cockroachdb/cockroach/pkg/storage/pebble_iterator.go:373 +0xc5
github.com/cockroachdb/cockroach/pkg/storage.mvccGetMetadata({0x7db7b60, 0xc08ae59858}, {{0xc0a5d23640, 0xc, 0x10}, {0x0, 0x0}}, 0xc0e30db440)
	github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:1656 +0xa5
github.com/cockroachdb/cockroach/pkg/storage.mvccPutInternal({_, _}, {_, _}, {_, _}, _, {0xc0a5d23640, 0xc, 0x10}, ...)
	github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2277 +0x3b4
github.com/cockroachdb/cockroach/pkg/storage.mvccPutUsingIter({_, _}, {_, _}, {_, _}, _, {0xc0a5d23640, 0xc, 0x10}, ...)
	github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:2032 +0x1f6
github.com/cockroachdb/cockroach/pkg/storage.MVCCPut({_, _}, {_, _}, {_, _, _}, {_, _}, {{0xc0e3428900, ...}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/storage/mvcc.go:1939 +0x494
github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval.Put({_, _}, {_, _}, {{0x7dd6d08, 0xc0e33070c0}, {{0x17d0e7afce337fe9, 0x0}, 0x0, {0x17d0e7afce378f6f, ...}, ...}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/batcheval/cmd_put.go:74 +0x391
github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateCommand({_, _}, {_, _}, {_, _}, _, _, {{0x17d0e7afce337fe9, 0x0}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_evaluate.go:488 +0x3a9
github.com/cockroachdb/cockroach/pkg/kv/kvserver.evaluateBatch({_, _}, {_, _}, {_, _}, {_, _}, _, 0xc0e33698c0, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_evaluate.go:305 +0xa5d
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWrapper(_, {_, _}, {_, _}, {_, _}, _, _, 0xc08b9b1860, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_write.go:736 +0x1d6
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluateWriteBatchWithServersideRefreshes(_, {_, _}, {_, _}, {_, _}, _, _, 0xc08b9b1860, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_write.go:703 +0x359
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).evaluate1PC(_, {_, _}, {_, _}, _, _, _)

@sumeerbhola
Copy link
Collaborator

@sumeerbhola
Copy link
Collaborator

The test failure in #124399 (comment) does not have a goroutine dump from the stall corresponding to the failure. I have not yet looked at the metrics.

14:24:38 test_runner.go:1098: ##teamcity[testFailed name='disk-stalled/wal-failover/among-stores' details='(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.19715065s at 2024-05-20T13:29:00Z|

@sumeerbhola
Copy link
Collaborator

I am going to remove the release blocker label since this is a rare case where WAL failover is not mitigating SQL-level latency, but narrowly speaking, the observed high latency is on the read path, so WAL failover itself is working. We will continue investigating, but this is not a release blocker since the same would occur on a block cache + page cache miss for a read.

@sumeerbhola sumeerbhola removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label May 28, 2024
@cockroach-teamcity
Copy link
Member Author

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 9cbd031ecc99039507957a6bbc273a4da6775397:

(disk_stall.go:174).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.166684829s at 2024-05-28T14:05:00Z
(cluster.go:2398).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-24.1 Used to mark GA and release blockers and technical advisories for 24.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team
Projects
Storage
  
Tests (failures, skipped, flakes)
Development

No branches or pull requests

2 participants