Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: correct error when logging timed out and cancelled queries #124371

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

xinhaoz
Copy link
Member

@xinhaoz xinhaoz commented May 17, 2024

Note to reviewers: only the latest commit should be reviewed in this PR.

Setting a query timeout or cancellation error occurs in execStmtInOpenState,
while logging occurs in dispatchToExecutionEngine (called on by the former).
This previously resulted in the incorrect pg error code being surfaced
when logging cancelled queries. This commit moves statement logging
from dispatchToExecutionEngine to execStmtInOpenState so that we have the
correct error information at the time of logging.

Preserving pausible portal logging behaviour:
There are some nuances in logging when we have a pausable portal that
must be preserved. For pausible portals, logging was triggered as a cleanup
action in execStmtInOpenState by appending it to a list of cleanup fns
in dispatch. This is now accomplished by performing logging in the
processCleanupFunc helper within execStmtInOpenState.

Epic: none
Fixes: #122722

Release note (bug fix): Fixes a bug in logging where error code was misreported
for cancelled queries. Affected channels: SQL_PERF (slow query logs), SQL_EXEC
(sql exec logs).

Copy link

blathers-crl bot commented May 17, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@xinhaoz xinhaoz changed the title logtestutils: generalize structured logging spy sql: correct error when logging timed out and cancelled queries May 17, 2024
@xinhaoz xinhaoz force-pushed the move-logging-from-dispatch branch 2 times, most recently from 298f0dd to 58d99cd Compare May 21, 2024 13:44
@xinhaoz xinhaoz marked this pull request as ready for review May 21, 2024 13:44
@xinhaoz xinhaoz requested a review from a team as a code owner May 21, 2024 13:44
@xinhaoz xinhaoz requested review from kyle-a-wong, a team and dhartunian and removed request for a team May 21, 2024 13:44
Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm continuing to think through this but publishing a few initial thoughts.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kyle-a-wong and @xinhaoz)


pkg/sql/conn_executor_exec.go line 1869 at r3 (raw file):

				bulkJobId = res.GetBulkJobId()
			}
			planner.maybeLogStatement(

can you explain how removing this maybeLogStatement instance doesn't reduce logging in certain scenarios? I assume it's here to capture scenarios that the one in line 853 does not.


pkg/sql/conn_executor_exec.go line 853 at r4 (raw file):

			}

			p.maybeLogStatement(

Maybe it will help to discuss what's going on in person, but is it possible and perhaps less disruptive to move the location of maybeLogStatement instead of moving the error rewriting logic into this defer?


pkg/sql/conn_executor_exec.go line 983 at r4 (raw file):

	}

	dispatchToExecEngine = true

why is this here?

@xinhaoz xinhaoz force-pushed the move-logging-from-dispatch branch from 58d99cd to db0b2af Compare May 21, 2024 15:54
Copy link
Member Author

@xinhaoz xinhaoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @kyle-a-wong)


pkg/sql/conn_executor_exec.go line 1869 at r3 (raw file):

Previously, dhartunian (David Hartunian) wrote…

can you explain how removing this maybeLogStatement instance doesn't reduce logging in certain scenarios? I assume it's here to capture scenarios that the one in line 853 does not.

Hmm I'll try that. I thought it might be important to preserve where logging was occuring in the defer / cleanup process but maybe moving it further down in the defer chain might just work 🤔 It'll be cleaner if we can keep the error setting where it currently is.


pkg/sql/conn_executor_exec.go line 983 at r4 (raw file):

Previously, dhartunian (David Hartunian) wrote…

why is this here?

Added a comment for this at declaration. The previous behaviour here was to log for the admin audit log if we did not dispatch to the exec engine. I'm trying to preserve what seems like was 'special case' behaviour for logging in that case, where we needed to init the plan in a particular way.

@xinhaoz xinhaoz force-pushed the move-logging-from-dispatch branch from db0b2af to fcc9e83 Compare May 21, 2024 17:29
@xinhaoz xinhaoz requested a review from rafiss May 21, 2024 17:45
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @kyle-a-wong)


pkg/sql/conn_executor_exec.go line 853 at r4 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Maybe it will help to discuss what's going on in person, but is it possible and perhaps less disruptive to move the location of maybeLogStatement instead of moving the error rewriting logic into this defer?

on the main issue, i had a similar instinct: #122722 (comment)

i feel like it would be less disruptive, which is especially relevant here since we'd want this fix backported. (so it should be the smallest diff we can achieve.) i'm happy to discuss the challenges with that here or in person.

@xinhaoz
Copy link
Member Author

xinhaoz commented May 24, 2024

pkg/sql/conn_executor_exec.go line 853 at r4 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

on the main issue, i had a similar instinct: #122722 (comment)

i feel like it would be less disruptive, which is especially relevant here since we'd want this fix backported. (so it should be the smallest diff we can achieve.) i'm happy to discuss the challenges with that here or in person.

@rafiss Ah yeah I ended up doing both here (moving the logging to execStmt and also moving the error setting logic). Logging already existed in execStmtInOpenState for statements that don't get dispatched to the execution engine. I was not sure if the order of operations in execStmtInOpenState was significant w.r.t logging so I kept logging where it currently was and moved the error writing there. Perhaps moving the logging to happen after query cancellation is innocuous though, and then we can keep the error writing logic as is. I'm going to try that and see if anything breaks :)

@xinhaoz
Copy link
Member Author

xinhaoz commented May 28, 2024

pkg/sql/conn_executor_exec.go line 853 at r4 (raw file):

Previously, xinhaoz (Xin Hao Zhang) wrote…

@rafiss Ah yeah I ended up doing both here (moving the logging to execStmt and also moving the error setting logic). Logging already existed in execStmtInOpenState for statements that don't get dispatched to the execution engine. I was not sure if the order of operations in execStmtInOpenState was significant w.r.t logging so I kept logging where it currently was and moved the error writing there. Perhaps moving the logging to happen after query cancellation is innocuous though, and then we can keep the error writing logic as is. I'm going to try that and see if anything breaks :)

There are a few things that make moving logging within execStmtInOpenState to occur after the error setting more difficult than the current solution of moving the error setting to the logging.

Currently the order of operations is (omitting ops irrelevant to this situation): log, finish instrumentation helper (which finishes the tracing span used for some logging operations), set the query cancellation and timeout errors + cancel the query (these last 2 things happen in the same deferred function).

The order above needs to be preserved as far as I've been able to tell since logging relies on the I.H. not being cleaned up, and parts of the instrumentation helper cleanup rely on the query's context still being active.

It seems to me like it's easier to move the error setting logic, but I might be missing some context - @rafiss is there a specific reason it needs to occur right before query cancellation? I also noticed setting a query timeout error is not in the processCleanupFn for pausible portals - what's important there in terms of the order of operations that need to be preserved for pausible portals? I've currently moved it to be within the processCleanupFn for logging but can mirror the old behaviour with some additional comments if given the context.

Setting a query timeout or cancellation error occurs in execStmtInOpenState,
while logging occurs in dispatchToExecutionEngine (called on  by the former).
This previously resulted in the incorrect pg error code being surfaced
when logging cancelled queries.  This commit moves statement logging
from dispatchToExecutionEngine to execStmtInOpenState so that we have the
correct error information at the time of logging.

Preserving pausible portal logging behaviour:
There are some nuances in logging when we have a pausable portal that
must be preserved. For pausible portals, logging was triggered as a cleanup
action in `execStmtInOpenState` by appending it to a list of cleanup fns
in `dispatch`. This is now accomplished by performing logging in the
`processCleanupFunc` helper within `execStmtInOpenState`.

Epic: none
Fixes: cockroachdb#122722

Release note (bug fix): Fixes a bug in logging where error code was misreported
for cancelled queries. Affected channels: SQL_PERF (slow query logs), SQL_EXEC
(sql exec logs).
@xinhaoz xinhaoz force-pushed the move-logging-from-dispatch branch from fcc9e83 to f2e86d0 Compare May 29, 2024 20:34
@xinhaoz xinhaoz requested a review from rafiss May 29, 2024 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: slow query log does not report cancellation error code
4 participants