Notifications for RAFT status changes (e.g. leadership changes) #1549

Open
psvensson opened this issue Jan 1, 2024 · 14 comments
@psvensson

This is related to, but different from, the changeset issue discussed here: #127

My use case is that I have logic on top of rqlite which needs to know whether it is running on a leader or a follower, so that it can be sure that recurring administrative tasks (or reactions to writes being passed down) are only done once.

This could be done by polling the local instance to see whether it is the leader, but for write-related activities this does incur a bit of a penalty.

If it were possible to register a callback URL with rqlite which reports leadership changes for the cluster, the correct information would always be ready, so to speak.

I'm not sure how hard this would be to implement, but thought I'd do my best to ask anyhow.
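
For reference, a minimal sketch of the polling approach mentioned above, assuming the local node's `/status` endpoint reports the Raft state under `store.raft.state` (exact field names may differ between rqlite versions):

```go
// poll_leader.go: a minimal sketch of polling the local rqlite node to learn
// whether it is currently the Raft leader. Assumes the /status endpoint
// exposes the Raft state under store.raft.state (field names may vary).
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type status struct {
	Store struct {
		Raft struct {
			State string `json:"state"`
		} `json:"raft"`
	} `json:"store"`
}

func isLeader(addr string) (bool, error) {
	resp, err := http.Get(addr + "/status")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var s status
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return false, err
	}
	return s.Store.Raft.State == "Leader", nil
}

func main() {
	// Poll every few seconds; the address is an assumption for the example.
	for range time.Tick(5 * time.Second) {
		leader, err := isLeader("http://localhost:4001")
		if err != nil {
			fmt.Println("status check failed:", err)
			continue
		}
		fmt.Println("this node is leader:", leader)
	}
}
```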

@otoolep
Member

otoolep commented Jan 1, 2024

Thanks for the suggestion. Let me ask you a question -- what if, by the time the recipient received the callback, the node was no longer the Leader? What kind of guarantees do you actually need? It's a small window, but it would be possible that your application would receive out-of-date information. It's the nature of distributed systems.

@otoolep
Member

otoolep commented Jan 1, 2024

The same issue applies to simple polling too. There is no guarantee that a node is still Leader by the time you make a decision based on a polled value.

@psvensson
Author

Yes, you are right. If there is a need to be 100% sure at all times that one is running on the same node as the leader (for example), then there would be a gap which would eventually lead to inconsistencies.

Hmm, this is probably the reason for those weird 'at least once' delivery guarantees in distributed messaging systems.

For context on my use case: I'm implementing a kind of live query service in a layer above rqlite, which runs on the same node and acts as the gateway to it.

What could happen (or would eventually happen) is that during an rqlite leadership change, two nodes (the former and the new leader) receive write queries for the same data; both would trigger a live query, and in half of the cases the updates would arrive in the wrong order.

Now, the communication between rqlite and the other logic would go through the loopback interface so it would be pretty fast, but it could still happen.

I could mitigate the problem by making sure to include, on all live queries, a timestamp of when the original write query was received, which would make it possible for a buffering consumer to sort out the ordering.

So by accepting a system with slower updating I think I could make it work.
What do you think?
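
To make the timestamp mitigation concrete, here is a rough sketch of a buffering consumer that reorders updates by the time the original write was received. All names here (Update, Buffer, Flush) are hypothetical, not part of rqlite:

```go
// A hypothetical buffering consumer: each live-query update carries the time
// the original write was received, and updates older than a small window are
// released in that order, correcting out-of-order delivery during a
// leadership change.
package livequery

import (
	"sort"
	"time"
)

type Update struct {
	ReceivedAt time.Time // when the original write query was received
	SQL        string    // the statement that triggered the live query
}

type Buffer struct {
	pending []Update
}

func (b *Buffer) Add(u Update) {
	b.pending = append(b.pending, u)
}

// Flush returns all buffered updates older than the given window,
// sorted by the time the original writes were received.
func (b *Buffer) Flush(window time.Duration) []Update {
	cutoff := time.Now().Add(-window)
	sort.Slice(b.pending, func(i, j int) bool {
		return b.pending[i].ReceivedAt.Before(b.pending[j].ReceivedAt)
	})
	var ready, rest []Update
	for _, u := range b.pending {
		if u.ReceivedAt.Before(cutoff) {
			ready = append(ready, u)
		} else {
			rest = append(rest, u)
		}
	}
	b.pending = rest
	return ready
}
```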

@otoolep
Member

otoolep commented Jan 1, 2024

I need to understand better what you're trying to do, and why. What problem are you actually trying to solve? https://xyproblem.info/

In my experience, systems that have to accept multiple writes but then dedupe them on the fly, or that require only a special node to perform them, are often modeling the problem incorrectly. Instead, try to design the system so that if multiple writes do happen it doesn't really matter.

Another way of looking at it: you're trying to rebuild Raft, an algorithm that guarantees there will be only one special node making changes at any time.

@psvensson
Author

OK, in short, I'm trying to build (right now I'm just modelling, documenting and thinking, looking at what can be done and trying to find problems beforehand) a simple database which is split into smaller partitions. Each partition could be modeled by an rqlite cluster (for example).

The system will store partition and replication information about itself in itself, and each node needs to be updated when changes occur.

I would like to implement what I call 'live queries' so that modifying queries to a partition can result in updates being sent to all the nodes in the system. As described, this means the total number of nodes will not be large, but it's still a good start. The live query functionality could then also be used for other, less global things.

I think that probably the best way to implement the live query functionality that I'm after would be to run code on the raft instances themselves.

I wouldn't say I'm trying to rebuild raft, rather leveraging that specific behavior to do an additional thing, if that makes sense.

@otoolep
Member

otoolep commented Jan 1, 2024

The system will store partition and replication information about itself in itself, and each node needs to be updated when changes occur.

Is "node" in this context a node in a single rqlite cluster? Or is "node" an entire rqlite cluster, with a cluster mapping to one of these "smaller" partitions?

@psvensson
Author

Sorry to be unclear: a node is a virtual or physical computer which will run an instance of the service I'm describing, and might also run an rqlite replica.

@otoolep
Member

otoolep commented Jan 1, 2024

I would like to implement what I call 'live queries' so that modifying queries to a partition can result in updates being sent to all the nodes in the system.

But isn't that exactly what rqlite does? You can send a SQL statement (INSERT, UPDATE, DELETE) to any node in an rqlite cluster, and rqlite ensures that that statement is sent to every node in the cluster (at least every node that is connected). So write to the SQLite database itself to communicate changes around your cluster.

I would like to implement what I call 'live queries' so that modifying queries to a partition can result in updates being sent to all the nodes in the system.

In a sense you're still not answering my question. :-) Why do you need to send updates to all the nodes in the system? You don't have to answer; it's just difficult to know what a good solution is otherwise. For example your answer might be "I'm afraid of losing data, hence I want the updates reflected on each node so I have multiple copies" or "I want to serve queries locally on each node", etc. All of which rqlite does for you.
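
To illustrate the point above about rqlite already propagating writes: a statement sent to any node via the documented /db/execute endpoint is replicated across the cluster. The host, port, table and column names below are assumptions for the example:

```go
// A small illustration: send a write to any rqlite node and the cluster
// replicates it to every connected node. Uses the documented /db/execute
// endpoint, which accepts a JSON array of SQL statements.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical table used only for this example.
	stmts := []string{
		`INSERT INTO partitions(name, leader_addr) VALUES("p1", "10.0.0.5:4001")`,
	}
	body, _ := json.Marshal(stmts)

	resp, err := http.Post("http://localhost:4001/db/execute",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("execute status:", resp.Status)
}
```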

@psvensson
Author

OK, I hear you. I agree that it might feel superfluous to implement a service that does a thing on top of another service doing the same.

The reason for this is that a Raft group can't comfortably grow to a size of, let's say, 100 nodes, which is what I would like to achieve (or more, but just to give you a number).

My goal is a distributed database that works a bit like Google's Spanner or CockroachDB. Each partition is its own replica (raft) group. The underlying engine need not be as powerful as r/sqlite, but it is nice to have of course.

I hope I answer your question by saying that I need all the nodes to know where all partitions are, so that any code running on them (or requests they serve from a client to the system) can determine, by looking at the WHERE clause (or similar), which partition(s) are affected by the query.

Without keeping the nodes up to date with partitioning and replication status, they would not know to which other node(s) to relay the request.

I do feel that we might be veering a little bit off target from the original request, but I'm happy to describe the system in full if you wish :)

@otoolep
Member

otoolep commented Jan 1, 2024

OK, clearly there is a bigger picture here. Fair enough. I presume you've read about the existing solutions that sound similar:

Without keeping the nodes up to date with partitioning and replication status, they would not know to which other node(s) to relay the request.

There are other ways to solve this. Spanner (and InfluxDB) keep this "metadata" in a dedicated HA system, such as a 3-node rqlite cluster. Then all other nodes (which are not running Raft, just serving queries) check this dedicated system when they need to find the location of a given shard. Not every node has a full copy; they simply consult the system-of-truth. ZooKeeper can do this for you too. That is how those systems actually work.

Anyway, I hope this helps. I'm not convinced that what you originally requested would actually meet your needs. I'd need to see it in the context of a larger design to understand (which I'm not looking for).
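
A rough sketch of the metadata-lookup pattern described above, where serving nodes query a small dedicated rqlite cluster for partition locations rather than each holding a full copy. The table and column names (partitions, name, leader_addr) are hypothetical, and the query-response fields are an assumption based on rqlite's /db/query JSON format:

```go
// Hypothetical lookup against a dedicated metadata cluster: nodes ask the
// system-of-truth where a partition lives instead of replicating the map.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

type queryResult struct {
	Results []struct {
		Columns []string        `json:"columns"`
		Values  [][]interface{} `json:"values"`
	} `json:"results"`
}

// lookupPartition asks the metadata cluster for the address serving a partition.
func lookupPartition(metaAddr, partition string) (string, error) {
	q := fmt.Sprintf(`SELECT leader_addr FROM partitions WHERE name = "%s"`, partition)
	resp, err := http.Get(metaAddr + "/db/query?q=" + url.QueryEscape(q))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var r queryResult
	if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
		return "", err
	}
	if len(r.Results) == 0 || len(r.Results[0].Values) == 0 {
		return "", fmt.Errorf("partition %q not found", partition)
	}
	addr, _ := r.Results[0].Values[0][0].(string)
	return addr, nil
}

func main() {
	addr, err := lookupPartition("http://meta-node:4001", "p1")
	if err != nil {
		panic(err)
	}
	fmt.Println("partition p1 is served at", addr)
}
```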

@otoolep
Member

otoolep commented Jan 1, 2024

Check out the last few slides of this presentation I gave back in 2015:

@psvensson
Author

Thank you, I hadn't actually seen multi-raft before.
Yes, one could use a separate service of course, but now that I've figured out a way to use the system to serve itself, it feels like a much neater solution.

I'll take a look at the slides, also.

I hear you, I'll go another route then, no problem. Thanks for the conversation.

@otoolep
Member

otoolep commented Jan 1, 2024

I'll reopen the request, because I do think it's a reasonable feature request. It may be something I implement at some point.

@otoolep otoolep reopened this Jan 1, 2024
@psvensson
Author

OK, thank you
