server: TestDatabasesTablesV2 failed #124395

cockroach-teamcity · 2024-05-19T11:35:35Z

server.TestDatabasesTablesV2 failed on master @ 93ad913106b6f0f6ec98bc2cfa788ff6d8085bd4:

=== RUN   TestDatabasesTablesV2
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestDatabasesTablesV23023245521
    test_log_scope.go:81: use -show-logs to present logs inline
    test_server_shim.go:144: cluster virtualization disabled in global scope due to issue: #76378 (expected label: C-bug)
    api_v2_sql_schema_test.go:148: 
        	Error Trace:	github.com/cockroachdb/cockroach/pkg/server/api_v2_sql_schema_test.go:148
        	Error:      	Received unexpected error:
        	            	Get "https://127.0.0.1:33161/api/v2/databases/testdb/tables/public.testtable/": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
        	Test:       	TestDatabasesTablesV2
    panic.go:626: -- test log scope end --
test logs left over in: outputs.zip/logTestDatabasesTablesV23023245521
--- FAIL: TestDatabasesTablesV2 (54.20s)

Parameters:

attempt=1
race=true
run=2
shard=5

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/obs-prs @cockroachdb/server _{This test on roachdash | Improve this report!

Jira issue: CRDB-38868}

The text was updated successfully, but these errors were encountered:

abarganier · 2024-05-20T14:10:50Z

This failed last week here, which I chalked up to a flake.

Given the newest failures, I'm going to give this a closer look.

abarganier · 2024-05-20T18:07:07Z

I was able to reproduce locally under --stress --race.

From what I can tell, the test cluster/server is receiving signals to shutdown before the test completes. The specific call that's failing alone does not seem to be problematic.

It seems that this query is failing to finish executing. When we attempt to read through the result set, it.Next() returns the error:

‹admin-show-grants›: failed to read query result: query execution canceled

Looking at the timestamps in the logs, quite a bit of time passes from the time the test cluster is initialized, until the TableDetails endpoint is invoked. However, the amount of time spent within the TableDetails endpoint before the test cluster begins shutting down is only ~10 seconds, which should not trigger the request timeout.

17:51:04 - Test cluster initialized
17:51:07 - First indication of the test interacting with the test cluster (
17:52:53 - TableDetails endpoint request received
17:53:01 - Endpoint runs admin-show-grants query
17:53:03.417 - "TestCluster quiescing nodes"
17:53:03.744 - admin-show-grants query returns
17:53:03.785 - Attempting to iterate result set fails

Seems like the endpoint itself is responding in an acceptable amount of time. However, the amount of time it takes for the test to reach the point where it makes a request seems to be too long.

Plenty of things happen between the test's first interaction with the test cluster and the timeout during the specific call to TableDetails. I'm going to see if I can pinpoint the portion of the test that's taking so long.

abarganier · 2024-05-20T18:52:54Z

I added some timers to various parts of the test.

api_v2_sql_schema_test.go:85: Table & DB creation time - start: 18:42:56.787315, duration 22.221347291s
api_v2_sql_schema_test.go:97: /databases/ call - start: 18:43:37.780579, duration 1.719976542s
api_v2_sql_schema_test.go:112: /databases/testdb/ call - start: 18:43:39.50957, duration 3.060485459s
api_v2_sql_schema_test.go:125: /databases/testdb/grants/ call - start: 18:43:42.642032, duration 1.945669334s
api_v2_sql_schema_test.go:139: /databases/testdb/tables/ call - start: 18:43:44.588601, duration 2.728289625s
api_v2_sql_schema_test.go:156: /databases/testdb2/tables/ call (should 404) - start: 18:43:47.317876, duration 515.812625ms

...and the cluster log indicating that the cluster has received a signal to shut down:

I240520 18:43:57.957227 51 testutils/testcluster/testcluster.go:148 ⋮ [-] 1191  TestCluster quiescing nodes

22 seconds for creation of DBs/tables/users/roles required by the test feels a bit long, but is that the true problem?

From the first moment the test begins to interact with the test cluster until the test cluster receives a signal to shut down, only around ~1 minute passes.

I'm going to look into the default timeouts and see if this is expected, and/or if we need to increase the test timeout.

dhartunian · 2024-05-21T23:11:04Z

@abarganier I think it's significant that the test fails right after a test that checks for a 404 Not Found response. Is it possible that the helper we use to retrieve table data leaks resources in the error case? That could cause this odd error result.

This is the handler which looks innocuous:

cockroach/pkg/server/api_v2_sql_schema.go

Lines 477 to 497 in c5e3cf4

    
           func (a *apiV2Server) tableDetails(w http.ResponseWriter, r *http.Request) { 
        
           	ctx := r.Context() 
        
           	username := authserver.UserFromHTTPAuthInfoContext(ctx) 
        
           	ctx = a.sqlServer.AnnotateCtx(ctx) 
        
           	pathVars := mux.Vars(r) 
        
           	req := &serverpb.TableDetailsRequest{ 
        
           		Database: pathVars["database_name"], 
        
           		Table:    pathVars["table_name"], 
        
           	} 
        
           	resp, err := a.admin.tableDetailsHelper(ctx, req, username) 
        
           	if err != nil { 
        
           		if status.Code(err) == codes.NotFound || isNotFoundError(err) { 
        
           			http.Error(w, "database or table not found", http.StatusNotFound) 
        
           		} else { 
        
           			srverrors.APIV2InternalError(ctx, err, w) 
        
           		} 
        
           		return 
        
           	} 
        
           	apiutil.WriteJSONResponse(ctx, w, 200, tableDetailsResponse(*resp)) 
        
           }

But the tableDetailsHelper is quite hefty and has all sorts of stuff going on.

Previously, we'd use a 10s timeout on HTTP clients for all test runs. This change modifies this to be `40s` under race. Resolves cockroachdb#124395 Resolves cockroachdb#122892 Release note: None <pkg>: <short description - lowercase, no final period> <what was there before: Previously, ...> <why it needed to change: This was inadequate because ...> <what you did about it: To address this, this patch ...>

Previously, we'd use a 10s timeout on HTTP clients for all test runs. This change modifies this to be `40s` under race. Resolves cockroachdb#124395 Resolves cockroachdb#122892 Release note: None

124579: server: increase test http client timeouts under race r=kyle-a-wong a=dhartunian Previously, we'd use a 10s timeout on HTTP clients for all test runs. This change modifies this to be `30s` under race. Resolves #124395 Resolves #122892 Release note: None Co-authored-by: David Hartunian <davidh@cockroachlabs.com>

Previously, we'd use a 10s timeout on HTTP clients for all test runs. This change modifies this to be `30s` under race. Resolves #124395 Resolves #122892 Release note: None

cockroach-teamcity mentioned this issue May 20, 2024

server: TestDatabasesTablesV2 failed #124420

Closed

abarganier self-assigned this May 20, 2024

exalate-issue-sync bot assigned dhartunian and unassigned abarganier May 22, 2024

dhartunian added P-2 Issues/test failures with a fix SLA of 3 months and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels May 22, 2024

dhartunian mentioned this issue May 22, 2024

server: increase test http client timeouts under race #124579

Merged

craig bot closed this as completed in 8c99abc May 23, 2024

blathers-crl bot mentioned this issue May 23, 2024

release-24.1: server: increase test http client timeouts under race #124620

Merged

blathers-crl bot mentioned this issue May 23, 2024

release-23.2: server: increase test http client timeouts under race #124621

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server: TestDatabasesTablesV2 failed #124395

server: TestDatabasesTablesV2 failed #124395

cockroach-teamcity commented May 19, 2024 •

edited by cockroach-jira-scripts

abarganier commented May 20, 2024

abarganier commented May 20, 2024 •

edited

abarganier commented May 20, 2024

dhartunian commented May 21, 2024

server: TestDatabasesTablesV2 failed #124395

server: TestDatabasesTablesV2 failed #124395

Comments

cockroach-teamcity commented May 19, 2024 • edited by cockroach-jira-scripts

abarganier commented May 20, 2024

abarganier commented May 20, 2024 • edited

abarganier commented May 20, 2024

dhartunian commented May 21, 2024

cockroach-teamcity commented May 19, 2024 •

edited by cockroach-jira-scripts

abarganier commented May 20, 2024 •

edited