
Fix CRC16 Hashslot Calculation #399

Merged: 4 commits from vazois/fix-crc16 into microsoft:main on May 20, 2024

Conversation

vazois
Contributor

@vazois vazois commented May 18, 2024

This PR addresses the hash slot calculation issue #395 by ensuring that key strings that do not contain both a left and a right bracket (e.g., "Hm{W\x13\x1c") are used in their entirety in the CRC16 hash function.
I also added more tests and separated the CRC16 hash slot implementation into its own folder.
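
For reference, a minimal sketch of the hashtag rule this fix enforces (illustrative names only, not Garnet's actual API): the whole key is hashed unless it contains a '{...}' pair with a non-empty tag between the braces.

using System;

// Sketch of the Redis cluster hashtag rule; GetHashSlotSpan is a hypothetical helper.
static ReadOnlySpan<byte> GetHashSlotSpan(ReadOnlySpan<byte> key)
{
    int open = key.IndexOf((byte)'{');
    if (open >= 0)
    {
        // Look for the first '}' after the '{'.
        int close = key.Slice(open + 1).IndexOf((byte)'}');
        // Hash only the tag if it is non-empty; a key like "Hm{W\x13\x1c"
        // (no closing brace) falls through and is hashed in its entirety.
        if (close > 0)
            return key.Slice(open + 1, close);
    }
    return key;
}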

@PaulusParssinen
Contributor

PaulusParssinen commented May 18, 2024

Not directly related to the PR, but I thought this would be a good place to ask 😄

Any background on why CRC16 was chosen as the key slot hash function? Much more effort has been spent in industry on optimizing CRC32 variants. We have CRC32C available as an intrinsic, and there are even faster tricks with SIMD. The intrinsic would probably win against SIMD for smaller data, and very likely against the current LUT approach, which will incur L1-L2 misses (would have to measure this).

cc @badrishc
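
For context, a rough sketch of what the CRC32C intrinsic in .NET looks like (System.Runtime.Intrinsics.X86.Sse42; byte-at-a-time for brevity, a tuned version would consume 8-byte chunks via Sse42.X64.Crc32). This is only an illustration of the intrinsic; as the discussion below notes, switching away from CRC16 would break Redis slot compatibility.

using System;
using System.Runtime.Intrinsics.X86;

static uint Crc32C(ReadOnlySpan<byte> data)
{
    if (!Sse42.IsSupported)
        throw new PlatformNotSupportedException();
    uint crc = ~0u;                      // standard CRC32C initial value
    foreach (byte b in data)
        crc = Sse42.Crc32(crc, b);       // hardware CRC32C step, one byte at a time
    return ~crc;                         // standard final inversion
}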

@vazois
Contributor Author

vazois commented May 18, 2024

Not directly related to the PR, but I thought this would be a good place to ask 😄

Any background on why CRC16 was chosen as the key slot hash function? Much more effort has been spent in industry on optimizing CRC32 variants. We have CRC32C available as an intrinsic, and there are even faster tricks with SIMD. The intrinsic would probably win against SIMD for smaller data, and very likely against the current LUT approach, which will incur L1-L2 misses (would have to measure this).

cc @badrishc

I believe it was done as a precaution, since people expect it to be CRC16, like what Redis uses.
If there is no reason for Garnet's consumers to expect the CRC16 mapping, we could use any hash function (even a non-CRC-based one).
I didn't know that people expected it to be explicitly CRC16 until an issue was opened, which actually relates to a problem with how we extract the hashtag substring, not with the hash function itself.

@PaulusParssinen
Contributor

PaulusParssinen commented May 18, 2024

I believe it was done as a precaution, since people expect it to be CRC16, like what Redis uses. If there is no reason for Garnet's consumers to expect the CRC16 mapping, we could use any hash function (even a non-CRC-based one). I didn't know that people expected it to be explicitly CRC16 until an issue was opened, which actually relates to a problem with how we extract the hashtag substring, not with the hash function itself.

Ah right, I didn't know it was specced to match Redis. Looking at it now, it makes sense, as there are 16384 slots. Disregard my message 😄

@badrishc
Contributor

badrishc commented May 18, 2024

Not directly related to the PR, but I thought this would be a good place to ask 😄

Any background on why CRC16 was chosen as the key slot hash function? Much more effort has been spent in industry on optimizing CRC32 variants. We have CRC32C available as an intrinsic, and there are even faster tricks with SIMD. The intrinsic would probably win against SIMD for smaller data, and very likely against the current LUT approach, which will incur L1-L2 misses (would have to measure this).

cc @badrishc

Hi Paulus, can you confirm that the changes you made to the CRC logic do not change the hash computation compared to what was there before, so that we retain compatibility with what Redis uses?

@PaulusParssinen
Contributor

PaulusParssinen commented May 18, 2024

There was no behaviour change in that minor optimization.

CRC16 Loop body (#198)

Before
;  arg1          int  ->  rdx         single-def
;  loc0       ushort  ->  rsi        
;  loc1         long  ->  rdi         single-def
;  OutArgs    struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  tmp1         long  ->  rbx         "impSpillLclRefs"
;  cse1         long  ->  rax         hoist "CSE #02: aggressive"

; ushort result = 0;
       xor      esi, esi
; byte* end = data + len;
       movsxd   rdi, edx
       add      rdi, rbx
; while (data < end)
       cmp      rbx, rdi
       jae      SHORT LOOP_EXIT
; ...
       mov      LUT_PTR, 0x7FFC59808C18    ; LUT_PTR = 0x7FFC59808C18;
LOOP_ENTRY:
       lea      rcx, [rbx+0x01]            ; rcx = rbx + 1; 
       mov      rbp, rcx                   ; rbp = rcx;
       mov      rcx, qword ptr [LUT_PTR]   ; rcx = [LUT_PTR];
	  
; tmp = (((result >> 8) ^ *data++) & 0xff)
       mov      edx, esi                   ; edx = esi; 
       sar      edx, 8                     ; edx >>= 8; 
       movzx    r8, byte  ptr [rbx]        ; r8 = [rbx]; Load byte from address in rbx into r8
       xor      edx, r8d                   ; edx = edx ^ r8d;
       movzx    rdx, dl                    ; rdx = dl;

; result = (ushort)(*(LUT_PTR + tmp) ^ (result << 8));
       movzx    rcx, word  ptr [rcx+2*rdx] ; rcx = [rcx + 2*rdx];
       shl      esi, 8                     ; esi <<= 8;
       xor      ecx, esi                   ; ecx = ecx ^ esi;
       movzx    rsi, cx                    ; rsi = cx;

       cmp      rbp, rdi
       mov      rbx, rbp
       jb       SHORT LOOP_ENTRY
LOOP_EXIT:
; ...
After
;  arg1           int  ->  rdx         single-def
;  loc0        ushort  ->  rax        
;  loc1          long  ->  rdx         single-def
;  loc2          long  ->  rcx        
;  cse0          long  ->   r8         hoist "CSE #01: aggressive"

; ushort result = 0;
       xor      eax, eax
; byte* end = data + len;
       movsxd   rdx, edx
       add      rdx, rcx
; while (data < end)
       cmp      rcx, rdx
       jae      SHORT EXIT

       mov      LUT_PTR, 0x2195CA42EB0      ; LUT_PTR = 0x2195CA42EB0;
LOOP_ENTRY:  ;; offset=0x0017
; tmp = data++
       lea      r10, [rcx+0x01]

; nuint index = (nuint)(uint)((result >> 8) ^ *(data + 1)) & 0xff;
       mov      r9d, eax                    ; r9d = eax;
       sar      r9d, 8                      ; r9d >>= 8
       movzx    rcx, byte  ptr [rcx]        ; rcx = [rcx]  
       xor      ecx, r9d                    ; ecx = ecx ^ r9d
       movzx    rcx, cl                     ; zero extend

; result = (ushort)(Unsafe.Add(LUT_PTR, rcx) ^ (result << 8));
       movzx    rcx, word  ptr [LUT_PTR+2*rcx] ; rcx = [LUT_PTR + 2*rcx];
       shl      eax, 8                         ; eax <<= 8; 
       xor      eax, ecx                       ; eax = eax ^ ecx;
       movzx    rax, ax                        ; rax = ax;

       cmp      r10, rdx
       mov      rcx, r10
       jb       SHORT LOOP_ENTRY
EXIT:
; ...
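
Reconstructed from the inline comments in the two listings above, the loop both versions are JIT-compiled from is roughly the following (a sketch, not the verbatim Garnet source; lut stands for the 256-entry CRC16 lookup table):

static unsafe ushort Crc16(byte* data, int len, ushort* lut)
{
    ushort result = 0;
    byte* end = data + len;
    while (data < end)
        // Fold the next input byte into the running CRC via the lookup table.
        result = (ushort)(lut[((result >> 8) ^ *data++) & 0xff] ^ (result << 8));
    return result;
}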

@badrishc
Contributor

badrishc commented May 18, 2024

The intrinsic would probably win against SIMD for smaller data, and very likely against the current LUT approach, which will incur L1-L2 misses (would have to measure this).

The cluster spec requires us to map keys to the 16384 hash slot space. This is a very common server-side operation and has a measurable perf impact, so if there is an opportunity to speed it up, that would be good.

But I think existing clients expect the specific CRC mapping chosen by Redis, as they might use it to route keys? (@mgravell can confirm).
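
For concreteness, a hedged reference sketch of that mapping: Redis cluster uses CRC-16/XMODEM (polynomial 0x1021, initial value 0) and reduces the result into the 16384-slot space; since 16384 is 2^14, the modulo can be a mask. This is the bit-at-a-time form for clarity; Garnet's implementation uses a lookup table, as shown in the listings above.

using System;

static ushort Crc16Xmodem(ReadOnlySpan<byte> data)
{
    ushort crc = 0;
    foreach (byte b in data)
    {
        crc ^= (ushort)(b << 8);
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000) != 0
                ? (ushort)((crc << 1) ^ 0x1021)   // shift out MSB, apply polynomial
                : (ushort)(crc << 1);
    }
    return crc;
}

// Map a key (or its hashtag) to a slot; & 0x3FFF is equivalent to % 16384.
static int HashSlot(ReadOnlySpan<byte> keyOrTag) => Crc16Xmodem(keyOrTag) & 0x3FFF;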

@badrishc
Contributor

Unrelated, but speaking of hash code logic, our server-side key hash logic is here: https://github.com/microsoft/garnet/blob/main/libs/storage/Tsavorite/cs/src/core/Utilities/Utility.cs#L179

Any way to improve its speed, while maintaining (or improving) the good hash distribution/spread property, would be interesting as well.

@vazois
Contributor Author

vazois commented May 18, 2024

The intrinsic would probably win against SIMD for smaller data, and very likely against the current LUT approach, which will incur L1-L2 misses (would have to measure this).

The cluster spec requires us to map keys to the 16384 hash slot space. This is a very common server-side operation and has a measurable perf impact, so if there is an opportunity to speed it up, that would be good.

But I think existing clients expect the specific CRC mapping chosen by Redis, as they might use it to route keys? (@mgravell can confirm).

I guess the issue is that when routing requests, the client sees only the key, and it has to perform the hash slot calculation to figure out which shard is responsible for that key.
We could use a different hash function, but the client would need to be aware of it to perform the correct hash slot calculation and route the request appropriately.
I am wondering if there is a way in C# to provide a code snippet as a string and execute it at runtime. That could be a way for the client to acquire the hash slot calculation implementation, though it sounds risky from a security perspective.
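
For what it's worth, a purely hypothetical sketch of that idea using the Roslyn scripting API (the Microsoft.CodeAnalysis.CSharp.Scripting package); the snippet text and delegate shape are invented for illustration, and, as noted, executing server-supplied code in a client raises real security concerns:

using System;
using System.Threading.Tasks;
using Microsoft.CodeAnalysis.CSharp.Scripting;
using Microsoft.CodeAnalysis.Scripting;

static async Task<Func<byte[], int>> CompileHashSlotAsync(string snippet)
{
    // Compile and evaluate the snippet, expecting it to produce a key -> slot delegate.
    var options = ScriptOptions.Default.WithImports("System");
    return await CSharpScript.EvaluateAsync<Func<byte[], int>>(snippet, options);
}

// Example usage:
// var hashSlot = await CompileHashSlotAsync("(Func<byte[], int>)(key => key.Length % 16384)");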

@badrishc
Contributor

badrishc commented May 18, 2024

Have you verified that clients are doing this CRC computation, or are they just querying some server and using the redirect to determine and cache the mapping?

@vazois
Contributor Author

vazois commented May 18, 2024

Have you verified that clients are doing this CRC computation, or are they just querying some server and using the redirect to determine and cache the mapping?

I did not check this at first; I just assumed this is how they would be doing it. However, a quick search reveals that SERedis is doing something like what I described above, because I can see they have their own CRC16 implementation:

https://github.com/StackExchange/StackExchange.Redis/blob/61c13c21844ff3e92eb077523dc876688878ba25/src/StackExchange.Redis/ServerSelectionStrategy.cs#L63

@mgravell
Contributor

mgravell commented May 18, 2024 via email

@PaulusParssinen
Contributor

PaulusParssinen commented May 19, 2024

I think the cost of switching the hash function greatly outweighs any benefit of slightly faster hashing for the cluster slot, given that it's apparently not an implementation detail. Also, I think the naive CRC16 lookup-table loop does well enough here, given that the inputs are small.

Unrelated, but speaking of hash code logic, our server-side key hash logic is here: https://github.com/microsoft/garnet/blob/main/libs/storage/Tsavorite/cs/src/core/Utilities/Utility.cs#L179

Any way to improve its speed, while maintaining (or improving) the good hash distribution/spread property, would be interesting as well.

This one, on the other hand, does seem to be an implementation detail and could probably be improved, but with good benchmark/profiling data as proof, of course.
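
As a sketch of the kind of evidence meant here (BenchmarkDotNet; the two hash methods are placeholders for whatever is being compared, not Garnet code):

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class KeyHashBenchmarks
{
    private readonly byte[] _key = new byte[16];

    [GlobalSetup]
    public void Setup() => new Random(42).NextBytes(_key);

    [Benchmark(Baseline = true)]
    public long Current() => CurrentKeyHash(_key);

    [Benchmark]
    public long Candidate() => CandidateKeyHash(_key);

    // Placeholders standing in for the existing key hash and a proposed alternative.
    private static long CurrentKeyHash(byte[] key) => key.Length;
    private static long CandidateKeyHash(byte[] key) => key.Length;
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<KeyHashBenchmarks>();
}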

@vazois vazois merged commit 8ef45f9 into microsoft:main May 20, 2024
23 checks passed
@vazois vazois deleted the vazois/fix-crc16 branch June 4, 2024 19:00
@vazois vazois restored the vazois/fix-crc16 branch June 4, 2024 19:00