Mix And Crawl: A worse reason for not mixing integer types
Code: bit-bcast/mix-and-crawl
Mixing integer types is annoying and error prone. What I didn’t know is that it can also prevent vectorization.
Consider the following:
int idx(int i, int j, int m) {
return i*m + j;
}

where, because you know the matrix is small-ish, you chose int instead of
size_t. Maybe you heard somewhere that signed integers are faster.
Next, you write the equivalent of:
double sum(double const* x, size_t n, size_t m) {
double s = 0.0;
for (size_t i = 0; i < n; i++) {
for (size_t j = 0; j < m; j++) {
s += x[idx(i, j, m)];
}
}
return s;
}

which we know can't vectorize because of strict adherence to floating-point
semantics: unlike real numbers, floating-point addition isn't associative,
i.e. (a + b) + c != a + (b + c) in general.
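A tiny illustration (not from the post) of the non-associativity:

#include <cstdio>

int main() {
    double a = 0.1, b = 0.2, c = 0.3;
    // The two groupings round differently; on a typical IEEE-754 setup the
    // two lines print slightly different values.
    std::printf("%.17g\n", (a + b) + c);
    std::printf("%.17g\n", a + (b + c));
}

Since reordering the additions can change the result, the compiler isn't
allowed to vectorize the accumulation on its own (absent flags like
-ffast-math that permit reassociation). The usual workaround is to unroll by
hand with independent partial sums. Easy: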
double sum1(double const* x, size_t n, size_t m) {
std::array<double, 4> s{};
for (size_t i = 0; i < n; i++) {
for (size_t j = 0; j < m; j += 4) {
s[0] += x[idx(i, j+0, m)];
s[1] += x[idx(i, j+1, m)];
s[2] += x[idx(i, j+2, m)];
s[3] += x[idx(i, j+3, m)];
}
}
return s[0] + s[1] + s[2] + s[3];
}

Beautiful! Except, muscle memory led to typing size_t a couple of times
instead of int. It's not really a problem, because size_t is wider than int
and we know that the matrix is small enough for ints. The compiler doesn't
warn, even with -Wall -Wextra -pedantic. So we should be good to go!
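Just to spell out what the compiler accepts silently here, this is sum1's
inner-loop access with the implicit conversions written as casts (an
illustration only; load_one is a made-up name, not part of the original code):

// i, j and m are size_t but idx takes int, so each argument is narrowed;
// the int result is then sign-extended to a 64-bit offset for the subscript.
// -Wall -Wextra -pedantic say nothing about this; -Wsign-conversion would.
double load_one(double const* x, size_t i, size_t j, size_t m) {
    return x[idx(static_cast<int>(i),    // size_t -> int
                 static_cast<int>(j),    // size_t -> int
                 static_cast<int>(m))];  // size_t -> int
}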
I was preparing slide code for an interview and I knew the above worked, because I’d checked it earlier1. I’d already written on the slide that “obviously” this version vectorized, and wanted to move on to measuring the number of SIMD operations.
Except, it doesn’t:
0000000000000000 <_Z4sum1PKdmm>:
0: test rsi,rsi
3: mov r10,rsi
6: mov r11,rdx
9: je 8e <_Z4sum1PKdmm+0x8e>
f: mov rsi,rdx
12: shr rsi,0x2
16: je 93 <_Z4sum1PKdmm+0x93>
18: vxorpd xmm1,xmm1,xmm1
1c: xor r9d,r9d
1f: xor r8d,r8d
22: vmovapd xmm2,xmm1
26: vmovapd xmm0,xmm1
2a: vmovapd xmm3,xmm1
2e: xchg ax,ax
30: mov eax,r9d
33: xor ecx,ecx
40: movsxd rdx,eax
43: add rcx,0x1
47: vaddsd xmm3,xmm3,QWORD PTR [rdi+rdx*8]
4c: lea edx,[rax+0x1]
4f: movsxd rdx,edx
52: vaddsd xmm0,xmm0,QWORD PTR [rdi+rdx*8]
57: lea edx,[rax+0x2]
5a: movsxd rdx,edx
5d: vaddsd xmm2,xmm2,QWORD PTR [rdi+rdx*8]
62: lea edx,[rax+0x3]
65: add eax,0x4
68: cmp rcx,rsi
6b: movsxd rdx,edx
6e: vaddsd xmm1,xmm1,QWORD PTR [rdi+rdx*8]
73: jne 40 <_Z4sum1PKdmm+0x40>
75: add r8,0x1
79: add r9d,r11d
7c: cmp r10,r8
7f: jne 30 <_Z4sum1PKdmm+0x30>
81: vaddsd xmm0,xmm0,xmm3
85: vaddsd xmm0,xmm0,xmm2
89: vaddsd xmm0,xmm0,xmm1
8d: ret
8e: vxorpd xmm0,xmm0,xmm0
92: ret
93: vxorpd xmm1,xmm1,xmm1
97: vmovapd xmm2,xmm1
9b: vmovapd xmm0,xmm1
9f: vmovapd xmm3,xmm1
a3: jmp 81 <_Z4sum1PKdmm+0x81>

There are four separate additions:

47: vaddsd xmm3,xmm3,QWORD PTR [rdi+rdx*8]
52: vaddsd xmm0,xmm0,QWORD PTR [rdi+rdx*8]
5d: vaddsd xmm2,xmm2,QWORD PTR [rdi+rdx*8]
6e: vaddsd xmm1,xmm1,QWORD PTR [rdi+rdx*8]

Note that it's using vaddsd (s for scalar; not p for packed). Also notice
that it’s doing some really pointless integer operations:
30: mov eax,r9d # for each i-iteration
40: movsxd rdx,eax
4c: lea edx,[rax+0x1]
4f: movsxd rdx,edx
57: lea edx,[rax+0x2]
5a: movsxd rdx,edx
62: lea edx,[rax+0x3]
6b: movsxd rdx,edx
65: add eax,0x4 # for each j-iteration

Therefore, the issue isn't that it doesn't "understand" idx, because it has
seen right through it and realized that:

idx(i, j+k, m) == idx(i, j, m) + k

It then also correctly hoists idx(i, j, m) out of the inner-most loop.
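In source terms, the optimizer has effectively turned the body into something
like this (a hand-written sketch of that transformation, not code from the
post; sum1_hoisted is a made-up name):

// sum1 with the hoisting the compiler performs written out by hand:
// one base index per group of four, still computed and carried as an int.
double sum1_hoisted(double const* x, size_t n, size_t m) {
    std::array<double, 4> s{};
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < m; j += 4) {
            int base = idx(i, j, m);   // hoisted; idx(i, j+k, m) == base + k
            s[0] += x[base + 0];
            s[1] += x[base + 1];
            s[2] += x[base + 2];
            s[3] += x[base + 3];
        }
    }
    return s[0] + s[1] + s[2] + s[3];
}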
What’s odd is that
4c: lea edx,[rax+0x1]
4f: movsxd rdx,edx

translates to:
- Increment the 64-bit register rax by one and store the lower 32 bits in edx.
- Sign-extend the 32-bit register edx into the 64-bit register rdx.
In pictures (with 4- and 8-bit integers):
lea edx,[rax+0x1]
rax: 0000 0110
edx: 0111
movsxd rdx,edx
rdx: 0000 0111
lea edx,[rax+0x2]
rax: 0000 0110
edx: 1000
movsxd rdx,edx
rdx: 1111 1000

It's odd, because signed integer overflow is UB. Hence, the optimizer is free
to assume it never happens. Therefore, my guess is likely inaccurate and
there's some additional subtlety that's preventing the optimizer from knowing
that it's accessing four consecutive doubles, i.e. that there's no "wrapping
around" and no need for sign extension.
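My reading of the disassembly above (a sketch, nothing the post verifies):
each of the four element indices is produced as its own 32-bit value and
sign-extended separately, so as far as the emitted code is concerned the four
loads could land anywhere, not necessarily on 32 contiguous bytes:

#include <cstdint>

// Roughly what one group of four loads computes in the machine code above
// (add4 is a made-up helper): the base lives in a 32-bit register, and each
// offset is computed in 32 bits (lea edx,...) and sign-extended (movsxd) on
// its own before being used as a 64-bit index.
void add4(double const* x, std::uint32_t base, double* s) {
    s[0] += x[static_cast<std::int64_t>(static_cast<std::int32_t>(base + 0))];
    s[1] += x[static_cast<std::int64_t>(static_cast<std::int32_t>(base + 1))];
    s[2] += x[static_cast<std::int64_t>(static_cast<std::int32_t>(base + 2))];
    s[3] += x[static_cast<std::int64_t>(static_cast<std::int32_t>(base + 3))];
}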
The solution is to not mix integer types: either by using int for the loop
variable or by rewriting idx to use size_t. Let’s do the former:
double sum2(double const* x, int n, int m) {
std::array<double, 4> s{};
for (int i = 0; i < n; i++) {
for (int j = 0; j < m; j += 4) {
s[0] += x[idx(i, j+0, m)];
s[1] += x[idx(i, j+1, m)];
s[2] += x[idx(i, j+2, m)];
s[3] += x[idx(i, j+3, m)];
}
}
return s[0] + s[1] + s[2] + s[3];
}

which results in:
00000000000000b0 <_Z4sum2PKdii>:
b0: test esi,esi
b2: mov r10,rdi
b5: mov r9d,esi
b8: mov r8d,edx
bb: jle 12e <_Z4sum2PKdii+0x7e>
bd: cmp edx,0x3
c0: jle 133 <_Z4sum2PKdii+0x83>
c2: test edx,edx
c4: lea edi,[rdx+0x3]
c7: vxorpd xmm0,xmm0,xmm0
cb: cmovns edi,edx
ce: xor esi,esi
d0: xor ecx,ecx
d2: sar edi,0x2
d5: shl rdi,0x5
e0: movsxd rax,ecx
e3: lea rax,[r10+rax*8]
e7: lea rdx,[rax+rdi*1]
f0: vaddpd ymm0,ymm0,YMMWORD PTR [rax]
f4: add rax,0x20
f8: cmp rdx,rax
fb: jne f0 <_Z4sum2PKdii+0x40>
fd: add esi,0x1
100: add ecx,r8d
103: cmp r9d,esi
106: jne e0 <_Z4sum2PKdii+0x30>
108: vmovapd xmm1,xmm0
10c: vunpckhpd xmm2,xmm0,xmm0
110: vextractf128 xmm0,ymm0,0x1
116: vmovapd xmm3,xmm0
11a: vunpckhpd xmm0,xmm0,xmm0
11e: vzeroupper
121: vaddsd xmm1,xmm1,xmm2
125: vaddsd xmm1,xmm1,xmm3
129: vaddsd xmm0,xmm1,xmm0
12d: ret
12e: vxorpd xmm0,xmm0,xmm0
132: ret
133: vxorpd xmm0,xmm0,xmm0
137: vmovapd xmm3,xmm0
13b: vmovapd xmm2,xmm0
13f: vmovapd xmm1,xmm0
143: jmp 121 <_Z4sum2PKdii+0x71>

Which is very nice: the inner loop is now a single packed vaddpd on a ymm
register, i.e. four doubles per addition. That gives us additional motivation
to turn on -Wsign-conversion and gratefully tolerate the relentless
complaints, since we know it just might make the code run a little faster =)
Not sure if it's a better reason2; but it could be an effective motivator.
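For completeness, the other fix mentioned above, rewriting idx so that
everything stays size_t, would look something like this (a sketch; idx2 is a
made-up name, and the post only shows the int-everywhere version):

// Alternative fix: index in size_t throughout, so there is no narrowing and
// no sign extension for the optimizer to preserve.
size_t idx2(size_t i, size_t j, size_t m) {
    return i*m + j;
}

With that, the original sum1 (whose loop variables are already size_t) no
longer mixes integer types at all.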
1. The precise setting for the slides involved mdspan with
   index_type == int, and the crucial thing to check is that
   mdspan::operator[] is "transparent" to the optimizer and doesn't prevent
   auto-vectorization. Some hastily written slide code used size_t, unlike
   the earlier checks, which used auto.
2. Nobody likes silent value-changing conversions; and 32-bit ints really
   aren't that big (so eventually that small-ish matrix won't be small
   enough).