roachtest: disk-stalled/wal-failover/among-stores failed #124399
Comments
The following disk stall caused the problem. The pmax of observed WAL fsync latency is not unexpectedly high. Unlike #122364, this is not an encrypted FS. |
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 2a21984e2fd9b8aff8fc8bd5c9d80785168daf71:
Parameters:
|
We could enhance the test to monitor the used slots on n1 and, if the count exceeds 500, take a goroutine dump so we know where those goroutines are stuck. |
Never mind. We have a dump. |
There are > 2500 goroutines stuck waiting on [1]
in
Which is this code. But they are in the runtime package here: https://github.com/golang/go/blob/master/src/runtime/pprof/runtime.go#L47-L52, which (since this is at the closing brace of the function) looks like
|
The test failure in #124399 (comment) does not have a goroutine dump from the stall corresponding to the failure. I have not yet looked at the metrics.
|
I am going to remove the release blocker label since this is a rare case where WAL failover is not mitigating SQL-level latency, but narrowly speaking, the observed high latency is on the read path, so WAL failover itself is working. We will continue investigating, but this is not a release blocker since the same would occur on a block cache + page cache miss for a read. |
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 9cbd031ecc99039507957a6bbc273a4da6775397:
Parameters:
|
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on release-24.1 @ 2a21984e2fd9b8aff8fc8bd5c9d80785168daf71:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_metamorphicBuild=false
ROACHTEST_ssd=2
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
This test on roachdash | Improve this report!
Jira issue: CRDB-38872