sql: correct error when logging timed out and cancelled queries #124371

xinhaoz · 2024-05-17T22:15:02Z

Note to reviewers: only the latest commit should be reviewed in this PR.

Setting a query timeout or cancellation error occurs in execStmtInOpenState,
while logging occurs in dispatchToExecutionEngine (called on by the former).
This previously resulted in the incorrect pg error code being surfaced
when logging cancelled queries. This commit moves statement logging
from dispatchToExecutionEngine to execStmtInOpenState so that we have the
correct error information at the time of logging.

Preserving pausible portal logging behaviour:
There are some nuances in logging when we have a pausable portal that
must be preserved. For pausible portals, logging was triggered as a cleanup
action in execStmtInOpenState by appending it to a list of cleanup fns
in dispatch. This is now accomplished by performing logging in the
processCleanupFunc helper within execStmtInOpenState.

Epic: none
Fixes: #122722

Release note (bug fix): Fixes a bug in logging where error code was misreported
for cancelled queries. Affected channels: SQL_PERF (slow query logs), SQL_EXEC
(sql exec logs).

blathers-crl · 2024-05-17T22:15:06Z

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

cockroach-teamcity · 2024-05-17T22:15:15Z

This change is

dhartunian

I'm continuing to think through this but publishing a few initial thoughts.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kyle-a-wong and @xinhaoz)

pkg/sql/conn_executor_exec.go line 1869 at r3 (raw file):

				bulkJobId = res.GetBulkJobId()
			}
			planner.maybeLogStatement(

can you explain how removing this maybeLogStatement instance doesn't reduce logging in certain scenarios? I assume it's here to capture scenarios that the one in line 853 does not.

pkg/sql/conn_executor_exec.go line 853 at r4 (raw file):

			}

			p.maybeLogStatement(

Maybe it will help to discuss what's going on in person, but is it possible and perhaps less disruptive to move the location of maybeLogStatement instead of moving the error rewriting logic into this defer?

pkg/sql/conn_executor_exec.go line 983 at r4 (raw file):

	}

	dispatchToExecEngine = true

why is this here?

xinhaoz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @kyle-a-wong)

pkg/sql/conn_executor_exec.go line 1869 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

can you explain how removing this maybeLogStatement instance doesn't reduce logging in certain scenarios? I assume it's here to capture scenarios that the one in line 853 does not.

Hmm I'll try that. I thought it might be important to preserve where logging was occuring in the defer / cleanup process but maybe moving it further down in the defer chain might just work 🤔 It'll be cleaner if we can keep the error setting where it currently is.

pkg/sql/conn_executor_exec.go line 983 at r4 (raw file):

Previously, dhartunian (David Hartunian) wrote…

why is this here?

Added a comment for this at declaration. The previous behaviour here was to log for the admin audit log if we did not dispatch to the exec engine. I'm trying to preserve what seems like was 'special case' behaviour for logging in that case, where we needed to init the plan in a particular way.

rafiss

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @kyle-a-wong)

pkg/sql/conn_executor_exec.go line 853 at r4 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Maybe it will help to discuss what's going on in person, but is it possible and perhaps less disruptive to move the location of maybeLogStatement instead of moving the error rewriting logic into this defer?

on the main issue, i had a similar instinct: #122722 (comment)

i feel like it would be less disruptive, which is especially relevant here since we'd want this fix backported. (so it should be the smallest diff we can achieve.) i'm happy to discuss the challenges with that here or in person.

xinhaoz · 2024-05-24T15:15:23Z

pkg/sql/conn_executor_exec.go line 853 at r4 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

on the main issue, i had a similar instinct: #122722 (comment)

i feel like it would be less disruptive, which is especially relevant here since we'd want this fix backported. (so it should be the smallest diff we can achieve.) i'm happy to discuss the challenges with that here or in person.

@rafiss Ah yeah I ended up doing both here (moving the logging to execStmt and also moving the error setting logic). Logging already existed in execStmtInOpenState for statements that don't get dispatched to the execution engine. I was not sure if the order of operations in execStmtInOpenState was significant w.r.t logging so I kept logging where it currently was and moved the error writing there. Perhaps moving the logging to happen after query cancellation is innocuous though, and then we can keep the error writing logic as is. I'm going to try that and see if anything breaks :)

xinhaoz · 2024-05-28T14:18:48Z

pkg/sql/conn_executor_exec.go line 853 at r4 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

@rafiss Ah yeah I ended up doing both here (moving the logging to execStmt and also moving the error setting logic). Logging already existed in execStmtInOpenState for statements that don't get dispatched to the execution engine. I was not sure if the order of operations in execStmtInOpenState was significant w.r.t logging so I kept logging where it currently was and moved the error writing there. Perhaps moving the logging to happen after query cancellation is innocuous though, and then we can keep the error writing logic as is. I'm going to try that and see if anything breaks :)

There are a few things that make moving logging within execStmtInOpenState to occur after the error setting more difficult than the current solution of moving the error setting to the logging.

Currently the order of operations is (omitting ops irrelevant to this situation): log, finish instrumentation helper (which finishes the tracing span used for some logging operations), set the query cancellation and timeout errors + cancel the query (these last 2 things happen in the same deferred function).

The order above needs to be preserved as far as I've been able to tell since logging relies on the I.H. not being cleaned up, and parts of the instrumentation helper cleanup rely on the query's context still being active.

It seems to me like it's easier to move the error setting logic, but I might be missing some context - @rafiss is there a specific reason it needs to occur right before query cancellation? I also noticed setting a query timeout error is not in the processCleanupFn for pausible portals - what's important there in terms of the order of operations that need to be preserved for pausible portals? I've currently moved it to be within the processCleanupFn for logging but can mirror the old behaviour with some additional comments if given the context.

Setting a query timeout or cancellation error occurs in execStmtInOpenState, while logging occurs in dispatchToExecutionEngine (called on by the former). This previously resulted in the incorrect pg error code being surfaced when logging cancelled queries. This commit moves statement logging from dispatchToExecutionEngine to execStmtInOpenState so that we have the correct error information at the time of logging. Preserving pausible portal logging behaviour: There are some nuances in logging when we have a pausable portal that must be preserved. For pausible portals, logging was triggered as a cleanup action in `execStmtInOpenState` by appending it to a list of cleanup fns in `dispatch`. This is now accomplished by performing logging in the `processCleanupFunc` helper within `execStmtInOpenState`. Epic: none Fixes: cockroachdb#122722 Release note (bug fix): Fixes a bug in logging where error code was misreported for cancelled queries. Affected channels: SQL_PERF (slow query logs), SQL_EXEC (sql exec logs).

xinhaoz changed the title ~~logtestutils: generalize structured logging spy~~ sql: correct error when logging timed out and cancelled queries May 17, 2024

xinhaoz force-pushed the move-logging-from-dispatch branch 2 times, most recently from 298f0dd to 58d99cd Compare May 21, 2024 13:44

xinhaoz marked this pull request as ready for review May 21, 2024 13:44

xinhaoz requested a review from a team as a code owner May 21, 2024 13:44

xinhaoz requested review from kyle-a-wong, a team and dhartunian and removed request for a team May 21, 2024 13:44

dhartunian reviewed May 21, 2024

View reviewed changes

xinhaoz force-pushed the move-logging-from-dispatch branch from 58d99cd to db0b2af Compare May 21, 2024 15:54

xinhaoz commented May 21, 2024

View reviewed changes

xinhaoz force-pushed the move-logging-from-dispatch branch from db0b2af to fcc9e83 Compare May 21, 2024 17:29

xinhaoz requested a review from rafiss May 21, 2024 17:45

rafiss reviewed May 22, 2024

View reviewed changes

xinhaoz added 2 commits May 29, 2024 16:33

sql: move HasAdminRole check for logging

f2e86d0

xinhaoz force-pushed the move-logging-from-dispatch branch from fcc9e83 to f2e86d0 Compare May 29, 2024 20:34

xinhaoz requested a review from rafiss May 29, 2024 20:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: correct error when logging timed out and cancelled queries #124371

sql: correct error when logging timed out and cancelled queries #124371

xinhaoz commented May 17, 2024 •

edited

blathers-crl bot commented May 17, 2024

cockroach-teamcity commented May 17, 2024

dhartunian left a comment

xinhaoz left a comment

rafiss left a comment

xinhaoz commented May 24, 2024

xinhaoz commented May 28, 2024

sql: correct error when logging timed out and cancelled queries #124371

Are you sure you want to change the base?

sql: correct error when logging timed out and cancelled queries #124371

Conversation

xinhaoz commented May 17, 2024 • edited

Note to reviewers: only the latest commit should be reviewed in this PR.

blathers-crl bot commented May 17, 2024

cockroach-teamcity commented May 17, 2024

dhartunian left a comment

Choose a reason for hiding this comment

xinhaoz left a comment

Choose a reason for hiding this comment

rafiss left a comment

Choose a reason for hiding this comment

xinhaoz commented May 24, 2024

xinhaoz commented May 28, 2024

xinhaoz commented May 17, 2024 •

edited