
Fix CRC16 Hashslot Calculation #399

Merged: 4 commits from vazois/fix-crc16 into microsoft:main on May 20, 2024

Conversation

vazois
Contributor

@vazois vazois commented May 18, 2024

This PR addresses the hash slot calculation issue #395 by ensuring that key strings that do not contain both a left and a right bracket (e.g., "Hm{W\x13\x1c") are used in their entirety in the CRC16 hash function.
I also added more tests and separated the CRC16 hash slot implementation into its own folder.
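
For reference, a minimal sketch of the hashtag rule this fix enforces (illustrative names only, not Garnet's actual API): the whole key is hashed unless it contains a '{...}' pair with a non-empty tag between the braces.

using System;

// Sketch of the Redis cluster hashtag rule; GetHashSlotSpan is a hypothetical helper.
static ReadOnlySpan<byte> GetHashSlotSpan(ReadOnlySpan<byte> key)
{
    int open = key.IndexOf((byte)'{');
    if (open >= 0)
    {
        // Look for the first '}' after the '{'.
        int close = key.Slice(open + 1).IndexOf((byte)'}');
        // Hash only the tag if it is non-empty; a key like "Hm{W\x13\x1c"
        // (no closing brace) falls through and is hashed in its entirety.
        if (close > 0)
            return key.Slice(open + 1, close);
    }
    return key;
}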

@PaulusParssinen
Contributor

PaulusParssinen commented May 18, 2024

Not directly related to the PR, but I thought this would be a good place to ask 😄

Any background on why CRC16 was chosen as the key slot hash function? Much more effort has been spent in industry on optimizing CRC32 variants. We have CRC32C available as an intrinsic, and there are even faster tricks with SIMD. The intrinsic would probably win against SIMD for smaller data, and very likely against the current LUT approach, which will incur L1-L2 misses (would have to measure this).

cc @badrishc
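
For context, a rough sketch of what the CRC32C intrinsic in .NET looks like (System.Runtime.Intrinsics.X86.Sse42; byte-at-a-time for brevity, a tuned version would consume 8-byte chunks via Sse42.X64.Crc32). This is only an illustration of the intrinsic; as the discussion below notes, switching away from CRC16 would break Redis slot compatibility.

using System;
using System.Runtime.Intrinsics.X86;

static uint Crc32C(ReadOnlySpan<byte> data)
{
    if (!Sse42.IsSupported)
        throw new PlatformNotSupportedException();
    uint crc = ~0u;                      // standard CRC32C initial value
    foreach (byte b in data)
        crc = Sse42.Crc32(crc, b);       // hardware CRC32C step, one byte at a time
    return ~crc;                         // standard final inversion
}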

@vazois
Contributor Author

vazois commented May 18, 2024

Not directly related to the PR, but I thought this would be a good place to ask 😄

Any background on why CRC16 was chosen as the key slot hash function? Much more effort has been spent in industry on optimizing CRC32 variants. We have CRC32C available as an intrinsic, and there are even faster tricks with SIMD. The intrinsic would probably win against SIMD for smaller data, and very likely against the current LUT approach, which will incur L1-L2 misses (would have to measure this).

cc @badrishc

I believe it was done as a precaution, since people expect it to be CRC16, like what Redis uses.
If there is no reason for Garnet's consumers to expect the CRC16 mapping, we could use any hash function (even a non-CRC-based one).
I didn't know that people expected it to be explicitly CRC16 until an issue was opened, which actually relates to a problem with how we extract the hashtag substring, not with the hash function itself.

@PaulusParssinen
Contributor

PaulusParssinen commented May 18, 2024

I believe it was done as a precaution, since people expect it to be CRC16, like what Redis uses. If there is no reason for Garnet's consumers to expect the CRC16 mapping, we could use any hash function (even a non-CRC-based one). I didn't know that people expected it to be explicitly CRC16 until an issue was opened, which actually relates to a problem with how we extract the hashtag substring, not with the hash function itself.

Ah right, I didn't know it was specced to match Redis. Looking at it now, it makes sense, as there are 16384 slots. Disregard my message 😄

@badrishc
Contributor

badrishc commented May 18, 2024

Not directly related to the PR, but I thought this would be a good place to ask 😄

Any background on why CRC16 was chosen as the key slot hash function? Much more effort has been spent in industry on optimizing CRC32 variants. We have CRC32C available as an intrinsic, and there are even faster tricks with SIMD. The intrinsic would probably win against SIMD for smaller data, and very likely against the current LUT approach, which will incur L1-L2 misses (would have to measure this).

cc @badrishc

Hi Paulus, can you confirm that the changes you made to the CRC logic do not change the hash computation compared to what was there before, so that we retain compatibility with what Redis uses?

@PaulusParssinen
Contributor

PaulusParssinen commented May 18, 2024

There was no behaviour change in that minor optimization.

CRC16 Loop body (#198)

Before
;  arg1          int  ->  rdx         single-def
;  loc0       ushort  ->  rsi        
;  loc1         long  ->  rdi         single-def
;  OutArgs    struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  tmp1         long  ->  rbx         "impSpillLclRefs"
;  cse1         long  ->  rax         hoist "CSE #02: aggressive"

; ushort result = 0;
       xor      esi, esi
; byte* end = data + len;
       movsxd   rdi, edx
       add      rdi, rbx
; while (data < end)
       cmp      rbx, rdi
       jae      SHORT LOOP_EXIT
; ...
       mov      LUT_PTR, 0x7FFC59808C18    ; LUT_PTR = 0x7FFC59808C18;
LOOP_ENTRY:
       lea      rcx, [rbx+0x01]            ; rcx = rbx + 1; 
       mov      rbp, rcx                   ; rbp = rcx;
       mov      rcx, qword ptr [LUT_PTR]   ; rcx = [LUT_PTR];
	  
; tmp = (((result >> 8) ^ *data++) & 0xff)
       mov      edx, esi                   ; edx = esi; 
       sar      edx, 8                     ; edx >>= 8; 
       movzx    r8, byte  ptr [rbx]        ; r8 = [rbx]; Load byte from address in rbx into r8
       xor      edx, r8d                   ; edx = edx ^ r8d;
       movzx    rdx, dl                    ; rdx = dl;

; result = (ushort)(*(LUT_PTR + tmp) ^ (result << 8));
       movzx    rcx, word  ptr [rcx+2*rdx] ; rcx = [rcx + 2*rdx];
       shl      esi, 8                     ; esi <<= 8;
       xor      ecx, esi                   ; ecx = ecx ^ esi;
       movzx    rsi, cx                    ; rsi = cx;

       cmp      rbp, rdi
       mov      rbx, rbp
       jb       SHORT LOOP_ENTRY
LOOP_EXIT:
; ...
After
;  arg1           int  ->  rdx         single-def
;  loc0        ushort  ->  rax        
;  loc1          long  ->  rdx         single-def
;  loc2          long  ->  rcx        
;  cse0          long  ->   r8         hoist "CSE #01: aggressive"

; ushort result = 0;
       xor      eax, eax
; byte* end = data + len;
       movsxd   rdx, edx
       add      rdx, rcx
; while (data < end)
       cmp      rcx, rdx
       jae      SHORT EXIT

       mov      LUT_PTR, 0x2195CA42EB0      ; LUT_PTR = 0x2195CA42EB0;
LOOP_ENTRY:  ;; offset=0x0017
; tmp = data++
       lea      r10, [rcx+0x01]

; nuint index = (nuint)(uint)((result >> 8) ^ *(data + 1)) & 0xff;
       mov      r9d, eax                    ; r9d = eax;
       sar      r9d, 8                      ; r9d >>= 8
       movzx    rcx, byte  ptr [rcx]        ; rcx = [rcx]  
       xor      ecx, r9d                    ; ecx = ecx ^ r9d
       movzx    rcx, cl                     ; zero extend

; result = (ushort)(Unsafe.Add(LUT_PTR, rcx) ^ (result << 8));
       movzx    rcx, word  ptr [LUT_PTR+2*rcx] ; rcx = [LUT_PTR + 2*rcx];
       shl      eax, 8                         ; eax <<= 8; 
       xor      eax, ecx                       ; eax = eax ^ ecx;
       movzx    rax, ax                        ; rax = ax;

       cmp      r10, rdx
       mov      rcx, r10
       jb       SHORT LOOP_ENTRY
EXIT:
; ...
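
Reconstructed from the inline comments in the two listings above, the loop both versions are JIT-compiled from is roughly the following (a sketch, not the verbatim Garnet source; lut stands for the 256-entry CRC16 lookup table):

static unsafe ushort Crc16(byte* data, int len, ushort* lut)
{
    ushort result = 0;
    byte* end = data + len;
    while (data < end)
        // Fold the next input byte into the running CRC via the lookup table.
        result = (ushort)(lut[((result >> 8) ^ *data++) & 0xff] ^ (result << 8));
    return result;
}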

@badrishc
Contributor

badrishc commented May 18, 2024

The intrinsic would probably win against SIMD for smaller data, and very likely against the current LUT approach, which will incur L1-L2 misses (would have to measure this).

The cluster spec requires us to map keys to the 16384 hash slot space. This is a very common server-side operation and has a measurable perf impact, so if there is an opportunity to speed it up, that would be good.

But I think existing clients expect the specific CRC mapping chosen by Redis, as they might use it to route keys? (@mgravell can confirm).
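
For concreteness, a hedged reference sketch of that mapping: Redis cluster uses CRC-16/XMODEM (polynomial 0x1021, initial value 0) and reduces the result into the 16384-slot space; since 16384 is 2^14, the modulo can be a mask. This is the bit-at-a-time form for clarity; Garnet's implementation uses a lookup table, as shown in the listings above.

using System;

static ushort Crc16Xmodem(ReadOnlySpan<byte> data)
{
    ushort crc = 0;
    foreach (byte b in data)
    {
        crc ^= (ushort)(b << 8);
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000) != 0
                ? (ushort)((crc << 1) ^ 0x1021)   // shift out MSB, apply polynomial
                : (ushort)(crc << 1);
    }
    return crc;
}

// Map a key (or its hashtag) to a slot; & 0x3FFF is equivalent to % 16384.
static int HashSlot(ReadOnlySpan<byte> keyOrTag) => Crc16Xmodem(keyOrTag) & 0x3FFF;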

@badrishc
Contributor

Unrelated, but speaking of hash code logic, our server-side key hash logic is here: https://github.com/microsoft/garnet/blob/main/libs/storage/Tsavorite/cs/src/core/Utilities/Utility.cs#L179

Any way to improve its speed, while maintaining (or improving) the good hash distribution/spread property, would be interesting as well.

@vazois
Contributor Author

vazois commented May 18, 2024

The intrinsic would probably win against SIMD for smaller data, and very likely against the current LUT approach, which will incur L1-L2 misses (would have to measure this).

The cluster spec requires us to map keys to the 16384 hash slot space. This is a very common server-side operation and has a measurable perf impact, so if there is an opportunity to speed it up, that would be good.

But I think existing clients expect the specific CRC mapping chosen by Redis, as they might use it to route keys? (@mgravell can confirm).

I guess the issue is that when routing requests, the client sees only the key, and it has to perform the hash slot calculation to figure out which shard is responsible for that key.
We could use a different hash function, but the client would need to be aware of it to perform the correct hash slot calculation and route the request appropriately.
I am wondering if there is a way in C# to provide a code snippet as a string and execute it at runtime. That could be a way for the client to acquire the hash slot calculation implementation, though it sounds risky from a security perspective.
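
For what it's worth, a purely hypothetical sketch of that idea using the Roslyn scripting API (the Microsoft.CodeAnalysis.CSharp.Scripting package); the snippet text and delegate shape are invented for illustration, and, as noted, executing server-supplied code in a client raises real security concerns:

using System;
using System.Threading.Tasks;
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;

static async Task<Func<byte[], int>> CompileHashSlotAsync(string snippet)
{
    // Compile and evaluate the snippet, expecting it to produce a key -> slot delegate.
    var options = ScriptOptions.Default.WithImports("System");
    return await CSharpScript.EvaluateAsync<Func<byte[], int>>(snippet, options);
}

// Example usage:
// var hashSlot = await CompileHashSlotAsync("(Func<byte[], int>)(key => key.Length % 16384)");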

@badrishc
Contributor

badrishc commented May 18, 2024

Have you verified that clients are doing this CRC computation, or are they just querying some server and using the redirect to determine and cache the mapping?

@vazois
Contributor Author

vazois commented May 18, 2024

Have you verified that clients are doing this CRC computation, or are they just querying some server and using the redirect to determine and cache the mapping?

I did not check this at first; I just assumed this is how they would be doing it. However, a quick search reveals that SERedis is doing something like what I described above, because I can see they have their own CRC16 implementation:

https://github.com/StackExchange/StackExchange.Redis/blob/61c13c21844ff3e92eb077523dc876688878ba25/src/StackExchange.Redis/ServerSelectionStrategy.cs#L63

@mgravell
Contributor

mgravell commented May 18, 2024 via email

@PaulusParssinen
Contributor

PaulusParssinen commented May 19, 2024

I think the cost of switching the hash function greatly outweighs any benefit of slightly faster hashing for the cluster slot, given that it's apparently not an implementation detail. Also, I think the naive CRC16 lookup-table loop does well enough here, given that the inputs are small.

Unrelated, but speaking of hash code logic, our server-side key hash logic is here: https://github.com/microsoft/garnet/blob/main/libs/storage/Tsavorite/cs/src/core/Utilities/Utility.cs#L179

Any way to improve its speed, while maintaining (or improving) the good hash distribution/spread property, would be interesting as well.

This one, on the other hand, does seem to be an implementation detail and could probably be improved, but with good benchmark/profiling data as proof, of course.
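
As a sketch of the kind of evidence meant here (BenchmarkDotNet; the two hash methods are placeholders for whatever is being compared, not Garnet code):

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class KeyHashBenchmarks
{
    private readonly byte[] _key = new byte[16];

    [GlobalSetup]
    public void Setup() => new Random(42).NextBytes(_key);

    [Benchmark(Baseline = true)]
    public long Current() => CurrentKeyHash(_key);

    [Benchmark]
    public long Candidate() => CandidateKeyHash(_key);

    // Placeholders standing in for the existing key hash and a proposed alternative.
    private static long CurrentKeyHash(byte[] key) => key.Length;
    private static long CandidateKeyHash(byte[] key) => key.Length;
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<KeyHashBenchmarks>();
}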

@vazois vazois merged commit 8ef45f9 into microsoft:main May 20, 2024
23 checks passed
@vazois vazois deleted the vazois/fix-crc16 branch June 4, 2024 19:00
@vazois vazois restored the vazois/fix-crc16 branch June 4, 2024 19:00