Copy current WAL in consistent state during checkpoint #12671

Open: andlr wants to merge 1 commit into main
@andlr andlr commented May 16, 2024

Due to a few data race issues, the active WAL file is sometimes copied in an inconsistent state during a checkpoint operation.
Opening a database from such a checkpoint fails with one of these errors when wal_recovery_mode=kAbsoluteConsistency:

  • Corruption: truncated record body
  • Corruption: error reading trailing data

This happens because the size of the active WAL file is captured at an arbitrary moment:

  • The truncated record body error occurs when the WAL file size is captured right after a WritableFileWriter flush that was triggered because the in-memory buffer had no space left for new data, so the file ends partway through a record.
  • The error reading trailing data error occurs when a WAL record is split into multiple physical records (fragments) and the WAL file size is captured before the last fragment has been written.
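
To make the second case concrete, here is a toy model of how one logical record can be split into fragments across fixed-size blocks. It is only a sketch and not RocksDB code: FragmentRecord, EndsOnRecordBoundary, and the tiny block and header sizes are hypothetical and chosen to keep the numbers small. Only the fragment type names follow the WAL format.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only. Names mirror the WAL fragment types; the layout is simplified.
enum class FragmentType { kFull, kFirst, kMiddle, kLast };

struct Fragment {
  FragmentType type;
  std::size_t end_offset;  // file size right after this fragment is written
};

// Splits one logical record of `payload_size` bytes, starting at file offset
// `start`, into physical fragments. Assumes block_size > header_size.
std::vector<Fragment> FragmentRecord(std::size_t start, std::size_t payload_size,
                                     std::size_t block_size,
                                     std::size_t header_size) {
  std::vector<Fragment> fragments;
  std::size_t offset = start;
  std::size_t left = payload_size;
  bool begin = true;
  do {
    std::size_t space = block_size - (offset % block_size);
    if (space < header_size) {
      // Not enough room for a header: pad out the block and start a new one.
      offset += space;
      space = block_size;
    }
    const std::size_t avail = space - header_size;
    const std::size_t len = std::min(left, avail);
    const bool end = (len == left);
    FragmentType type = FragmentType::kMiddle;
    if (begin && end) {
      type = FragmentType::kFull;
    } else if (begin) {
      type = FragmentType::kFirst;
    } else if (end) {
      type = FragmentType::kLast;
    }
    offset += header_size + len;
    left -= len;
    begin = false;
    fragments.push_back({type, offset});
  } while (left > 0);
  return fragments;
}

// A captured size is safe to copy only if it falls before the record starts
// or after its last fragment ends. Anything in between cuts the record.
bool EndsOnRecordBoundary(const std::vector<Fragment>& fragments,
                          std::size_t start, std::size_t captured_size) {
  return captured_size == start ||
         captured_size == fragments.back().end_offset;
}
```

With a 32-byte block and a 7-byte header, a 60-byte record starting at offset 0 becomes three fragments (first, middle, last) ending at offsets 32, 64 and 81. A file size captured at 32 or 64 lands between fragments, which is the situation described for the trailing data error.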

The fix does the following:

  • keeps track of the latest offset at which the WAL file is consistent in log::Writer;
  • in GetLiveFilesStorageInfo, captures the current log number (if there is an active WAL) and its latest consistent offset, right after capturing the manifest and options state;
  • after enumerating the WAL files in the directory, sets trim_to_size=true and size to the previously captured consistent offset for the active WAL;
  • ignores WAL files created during the GetLiveFilesStorageInfo call (between mutex_.Unlock and GetSortedWalFiles), since we don't know which portion of such a WAL would have to be copied for it to be in a consistent state.
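
The approach can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: ToyWalWriter, WalFileInfo, and PlanWalCopy are hypothetical stand-ins, and only the trim_to_size and size field names are taken from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a WAL writer; not the real log::Writer.
class ToyWalWriter {
 public:
  // Appends one physical fragment. The consistent offset only advances once
  // the last fragment of a logical record has been appended.
  void AppendFragment(const std::string& bytes, bool is_last_fragment) {
    file_ += bytes;
    if (is_last_fragment) {
      last_consistent_offset_ = file_.size();
    }
  }

  std::uint64_t FileSize() const { return file_.size(); }
  std::uint64_t LastConsistentOffset() const { return last_consistent_offset_; }

 private:
  std::string file_;  // in-memory stand-in for the WAL file contents
  std::uint64_t last_consistent_offset_ = 0;
};

// Hypothetical description of one WAL file to be copied into a checkpoint.
struct WalFileInfo {
  std::uint64_t log_number;
  std::uint64_t size;
  bool trim_to_size;
};

// Decides what to copy, given the log number and consistent offset that were
// captured earlier (before the directory was enumerated).
std::vector<WalFileInfo> PlanWalCopy(const std::vector<WalFileInfo>& wals_on_disk,
                                     std::uint64_t current_log_number,
                                     std::uint64_t consistent_offset) {
  std::vector<WalFileInfo> plan;
  for (const WalFileInfo& wal : wals_on_disk) {
    if (wal.log_number > current_log_number) {
      // Created after the capture: no known consistent prefix, so skip it.
      continue;
    }
    WalFileInfo entry = wal;
    if (wal.log_number == current_log_number) {
      // Active WAL: copy only the prefix known to end on a record boundary.
      entry.size = consistent_offset;
      entry.trim_to_size = true;
    }
    plan.push_back(entry);
  }
  return plan;
}
```

For example, if the writer has appended a complete 4-byte record and then the first 2 bytes of the next one, FileSize() is 6 while LastConsistentOffset() is still 4, so the active WAL would be planned with size 4 and trim_to_size set.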

Fixes #12670

@andlr andlr force-pushed the wal-corruption-in-checkpoints branch from 80f2e17 to c9ee3c8 Compare May 16, 2024 21:07
@andlr andlr marked this pull request as draft May 16, 2024 22:50
@andlr andlr force-pushed the wal-corruption-in-checkpoints branch from c9ee3c8 to 65d8634 Compare May 19, 2024 10:30
@andlr andlr marked this pull request as ready for review May 19, 2024 11:34
@andlr andlr changed the title Avoid copying active WAL in inconsistent state during checkpoint Copy active WAL in consistent state during checkpoint May 22, 2024
@andlr andlr changed the title Copy active WAL in consistent state during checkpoint Copy current WAL in consistent state during checkpoint May 22, 2024
@andlr andlr force-pushed the wal-corruption-in-checkpoints branch from 65d8634 to 1299772 Compare May 28, 2024 20:53
@andlr andlr force-pushed the wal-corruption-in-checkpoints branch from c4c99d1 to 7675457 Compare June 2, 2024 14:58
Successfully merging this pull request may close these issues.

Some checkpoints cannot be opened with kAbsoluteConsistency WAL recovery mode