
Bug Report: VPlayer does not detect stalls #15974

Open
mattlord opened this issue May 18, 2024 · 3 comments · May be fixed by #15797

Comments

mattlord commented May 18, 2024

Overview of the Issue

The VPlayer can become stuck/stalled without detecting the condition itself, and thus without generating a helpful error or log message. Instead, the user only sees vague EOF errors/messages after things have been stuck/stalled for some time. It is very unclear what is happening and what the user could or should do to try and correct it.

Detailed description and explanation of a case seen in production

MySQL connection:

  • The filtered DB user that VReplication uses (vplayer in this case) on the target is executing replicated inserts/updates/deletes that are part of a transaction wrapping the current batch of events (along with updating the workflow's position when the batch commits/completes).

VReplication / OnlineDDL:

  • We're regularly going into the error state and then retrying.
  • But we're not making any progress, we're stuck on the same GTID position.
  • In the end, we saw that this was because each query the vplayer ended up executing on the target (generated from the replicated binlog events) was doing a full table scan on a large table. This was caused by an OnlineDDL bug in how we handle a charset + collation change on PK columns, but that is beside the point here: table scans during replication are a general scenario that can occur in VReplication.

Thesis:

  • The vplayer is stuck trying to process the last batch read from the relay log. It cannot complete/commit that batch quickly because each table write is doing a full scan (each one taking roughly 3-5 seconds in the case seen), and thus it cannot complete the last batch read from the relay log in a typically reasonable amount of time.
  • This means that the goroutine writing the next batch to the relay log ends up blocked as we hit the configured max relay log items / bytes on the target.
  • This in turn means that we eventually cannot read more events from the open com_binlog_dump_gtid connection on the source tablet, as those events are what is sent to the vplayer on the target tablet (via the VStream RPC call that the target made to the source) to be written to the relay log.
  • When in this state, the load from replication combined with a high write rate coming in from the user/application (for OnlineDDL everything is done on the shard primary, which is both the source and target for the workflow) means we are not able to generate new heartbeat events, and mysqld ends up closing the binlog dump connection once no IO has been done by the time we hit slave_net_timeout. When this error eventually ends the VStream RPC from the target to the source (it takes quite some time, as the vreplicator retries internally when an ephemeral error is encountered in the stream), the workflow goes into the error state, to be retried 5 seconds later by default.
  • So we are stuck in this loop indefinitely and not making any progress.
  • ⚠️ Once in this state there is a backlog of pending lock requests for the workflow's _vt.vreplication record, so you cannot even STOP the workflow, since stopping is done by updating the record and setting state = Stopped (the vreplicator keeps trying to send heartbeats through the MySQL replication stream from the source to the target via vreplicator.updateHeartbeatTime(), with each attempt hanging, at least until innodb_lock_wait_timeout).
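The backpressure chain in the thesis can be illustrated with a tiny sketch. This is not Vitess code; it only models the mechanism: the relay log is a bounded buffer, so when the consumer (the vplayer) stalls on slow writes, the producer (the relay-log writer goroutine) blocks as soon as the buffer fills, and upstream binlog reads stop too.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A tiny "max relay log items" bound, modeled as channel capacity.
	relayLog := make(chan string, 2)

	// Producer: tries to append batches to the relay log.
	blocked := make(chan struct{})
	go func() {
		for i := 1; i <= 3; i++ {
			select {
			case relayLog <- fmt.Sprintf("batch-%d", i):
			default:
				// Buffer full: in the real system the writer would
				// block here, and reads from the source stall.
				close(blocked)
				return
			}
		}
	}()

	// Consumer: stalled on slow queries, never reads.
	select {
	case <-blocked:
		fmt.Println("producer blocked: relay log full, upstream binlog reads stop")
	case <-time.After(time.Second):
		fmt.Println("producer never blocked")
	}
}
```

With a stalled consumer and capacity 2, the third batch cannot be appended, which is the point at which the whole pipeline backs up toward the source.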

This is a valid scenario that can be encountered. The real problem here is that it was very much NOT obvious what was happening and where the real problem lay. Ideally we should provide information that allows the user to zero in on the actual problem described above: the queries being executed via the replication stream are doing table scans, and we need to address that somehow: new indexes, workflow adjustments (in the case seen we needed to adjust the workflow binlog source filter that OnlineDDL created), source and/or target tablet config adjustments (e.g. increasing slave_net_timeout on the source), etc.

Reproduction Steps

A simple test case that demonstrates the problem

git checkout main && make build

pushd examples/local

./101_initial_cluster.sh; mysql < ../common/insert_commerce_data.sql

./201_customer_tablets.sh; ./202_move_tables.sh

primaryuid=$(vtctldclient GetTablets --keyspace customer --tablet-type primary --shard "0" | awk '{print $1}' | cut -d- -f2 | bc)

command mysql -u root --socket ${VTDATAROOT}/vt_0000000${primaryuid}/mysql.sock -e "begin; select * from _vt.vreplication for update; select sleep(600); rollback"

Test case which demonstrates a more typical scenario that would be seen in production

git checkout main && make build

pushd examples/local

./101_initial_cluster.sh

primaryuid=$(vtctldclient GetTablets --keyspace commerce --tablet-type primary --shard "0" | awk '{print $1}' | cut -d- -f2 | bc)

mysql -e "CREATE TABLE t1 (id varchar(128), id2 varchar(32) NOT NULL, updated_at datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, created_at datetime NOT NULL DEFAULT CURRENT_TIMESTAMP, colstuff varchar(64) NOT NULL, PRIMARY KEY (id,id2))"

# Disable semi-sync to load data much faster...
../common/scripts/vtorc-down.sh
command mysql -u root --socket ${VTDATAROOT}/vt_0000000${primaryuid}/mysql.sock -e "set global rpl_semi_sync_replica_enabled=OFF; set global rpl_semi_sync_source_enabled=OFF;"

# Let this rip for a while ...
while true; do mysql commerce -e "insert into t1 (id, id2, colstuff) values ('${RANDOM}${RANDOM}${RANDOM}_id', '${RANDOM}${RANDOM}${RANDOM}_id2', '${RANDOM}${RANDOM}${RANDOM}_colstuff')"; done

# Wait until you have ~ 500k rows or more. Let it continue to run and generate new binlog events...
while true; do count=$(command mysql -u root -Ns --socket ${VTDATAROOT}/vt_0000000${primaryuid}/mysql.sock -e "select count(*) from vt_commerce.t1"); if [[ ${count} -ge 500000 ]]; then say "We are done with the data load"; break; fi; sleep 60; done

uuid=$(vtctldclient ApplySchema --ddl-strategy "vitess --postpone-completion" --sql "ALTER TABLE t1 MODIFY id varchar(128) character set ascii" commerce)

The OnlineDDL generated binlog source filter results in all updates/deletes doing full table scans because the PK index cannot be used due to the charset conversion going in the wrong direction for the target side during the running/replicating phase:

❯ command mysql -u root --socket ${VTDATAROOT}/vt_0000000${primaryuid}/mysql.sock -e "select id, workflow, source, pos from _vt.vreplication where workflow = '${uuid}'\G"
*************************** 1. row ***************************
      id: 1
workflow: 134cecfc_14bb_11ef_80d1_920702940ee1
  source: keyspace:"commerce" shard:"0" filter:{rules:{match:"_vt_vrp_7f8b11f81dc811efa47d920702940ee1_20240529103405_" filter:"select convert(`id` using utf8mb4) as `id`, `id2` as `id2`, `updated_at` as `updated_at`, `created_at` as `created_at`, `colstuff` as `colstuff` from `t1`" convert_charset:{key:"id" value:{from_charset:"utf8mb4" to_charset:"ascii"}} source_unique_key_columns:"id,id2" target_unique_key_columns:"id,id2" source_unique_key_target_columns:"id,id2" force_unique_key:"PRIMARY"}}
     pos: MySQL56/e30d35c4-14ba-11ef-8ec4-0de55702061a:1-629527
# Now start generating random updates, which will do a table scan for each row changed when replicated
command mysql -u root --socket ${VTDATAROOT}/vt_0000000${primaryuid}/mysql.sock -e "create database mysqlslap"
mysqlslap -c 50 -i 100000 -q "update vt_commerce.t1 set colstuff = 'foobar' order by rand() limit 20" -u root --socket ${VTDATAROOT}/vt_0000000${primaryuid}/mysql.sock

# While that's running, stop the OnlineDDL vreplication workflow to allow binlog events to build up a bit
vtctldclient workflow --keyspace commerce stop --workflow ${uuid}

# Wait 5 mins or so...

# Start it back up again:
vtctldclient workflow --keyspace commerce start --workflow ${uuid}

Binary Version

❯ vtgate --version
vtgate version Version: 20.0.0-SNAPSHOT (Git revision 76988cf52cf4bdb785d106c9d5e8c2cede9a2b8b branch 'vplayer_batch_trx_timeout') built on Sat May 18 14:25:41 EDT 2024 by matt@pslord.local using go1.22.3 darwin/arm64

Operating System and Environment details

N/A

Log Fragments

  • vstreamer.go:1022] stream (at source tablet) error @ 056f7cce-9679-11ee-93de-7a43718bd292:1-88,05b63ad0-9679-11ee-8848-b6dd624d298a:1-339542,4e956b44-99ec-11ee-b993-76445c58387d:1-6192085,4e980878-99ec-11ee-9055-cae1ac6f6e27:1-17946122,b912bee5-99ec-11ee-8938-46a387219895:1-10463847: EOF (errno 2013) (sqlstate HY000)
  • dbclient.go:125] error in stream 50, will retry after 5s: vttablet: rpc error: code = Unknown desc = stream (at source tablet) error @ 056f7cce-9679-11ee-93de-7a43718bd292:1-88,05b63ad0-9679-11ee-8848-b6dd624d298a:1-339542,4e956b44-99ec-11ee-b993-76445c58387d:1-6196344,4e980878-99ec-11ee-9055-cae1ac6f6e27:1-17946122,b912bee5-99ec-11ee-8938-46a387219895:1-10463847: unexpected EOF
  • binlog_connection.go:164] connection closed during binlog stream (possibly intentional): unexpected EOF
mattlord added this to the v20.0.0 milestone on May 18, 2024
mattlord self-assigned this on May 18, 2024
mattlord added this to Backlog in VReplication via automation on May 18, 2024
mattlord moved this from Backlog to In progress in VReplication on May 18, 2024
mattlord linked a pull request on May 18, 2024 that will close this issue
@shlomi-noach (Contributor) commented:

“for OnlineDDL everything is done on the shard primary (it's the source and target for the workflow)”

Clarification: this is not the case today. The default in Online DDL is for vreplication to read row data and binlog events from a REPLICA tablet:

func (v *VRepl) generateInsertStatement(ctx context.Context) (string, error) {
	ig := vreplication.NewInsertGenerator(binlogdatapb.VReplicationWorkflowState_Stopped, v.dbName)
	ig.AddRow(v.workflow, v.bls, v.pos, "", "in_order:REPLICA,PRIMARY",
		binlogdatapb.VReplicationWorkflowType_OnlineDDL, binlogdatapb.VReplicationWorkflowSubType_None, false)
	return ig.String(), nil
}

@shlomi-noach (Contributor) commented:

Is there an open/closed issue for handling vreplication PK charset conversions? I can work on that.

@mattlord (Contributor, Author) commented:

@shlomi-noach I just created one: #16023
