Mix And Crawl: A worse reason for not mixing integer types

Code: bit-bcast/mix-and-crawl

Mixing integer types is annoying and error-prone. What I didn’t know is that it can also prevent vectorization.

Consider the following:

int idx(int i, int j, int m) {
    return i*m + j;
}

because you know that the matrix is small-ish, you chose int instead of size_t. Maybe you heard somewhere that signed integers are faster.

Next, you write the equivalent of:

double sum(double const* x, size_t n, size_t m) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < m; j++) {
            s += x[idx(i, j, m)];
        }
    }
    return s;
}

which we know can’t vectorize because of strict adherence to floating-point math: unlike real numbers, floating-point addition isn’t associative, i.e. (a + b) + c != a + (b + c). Easy:

double sum1(double const* x, size_t n, size_t m) {
    std::array<double, 4> s{};
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < m; j += 4) {
            s[0] += x[idx(i, j+0, m)];
            s[1] += x[idx(i, j+1, m)];
            s[2] += x[idx(i, j+2, m)];
            s[3] += x[idx(i, j+3, m)];
        }
    }
    return s[0] + s[1] + s[2] + s[3];
}

Beautiful! Except, muscle memory led to typing size_t a couple of times instead of int. It’s not really a problem because size_t is wider than int and we know that the matrix is small enough for ints. The compiler doesn’t warn when using -Wall -Wextra -pedantic. So we should be good to go!
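
To see just how silent it is, here’s a minimal sketch of the situation (my own reproducer with a made-up first_of_row, not the slide code): it compiles cleanly with -Wall -Wextra -pedantic, and only -Wsign-conversion, which comes up again at the end of the post, flags the call site.

#include <cstddef>

int idx(int i, int j, int m) {
    return i*m + j;
}

double first_of_row(double const* x, std::size_t i, std::size_t m) {
    return x[idx(i, 0, m)];  // size_t -> int narrowing happens here, silently
}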

I was preparing slide code for an interview and I knew the above worked, because I’d checked it earlier1. I’d already written on the slide that “obviously” this version vectorized, and wanted to move on to measuring the number of SIMD operations.

Except, it doesn’t:

0000000000000000 <_Z4sum1PKdmm>:
   0:	test   rsi,rsi
   3:	mov    r10,rsi
   6:	mov    r11,rdx
   9:	je     8e <_Z4sum1PKdmm+0x8e>
   f:	mov    rsi,rdx
  12:	shr    rsi,0x2
  16:	je     93 <_Z4sum1PKdmm+0x93>
  18:	vxorpd xmm1,xmm1,xmm1
  1c:	xor    r9d,r9d
  1f:	xor    r8d,r8d
  22:	vmovapd xmm2,xmm1
  26:	vmovapd xmm0,xmm1
  2a:	vmovapd xmm3,xmm1
  2e:	xchg   ax,ax
  30:	mov    eax,r9d
  33:	xor    ecx,ecx
  40:	movsxd rdx,eax
  43:	add    rcx,0x1
  47:	vaddsd xmm3,xmm3,QWORD PTR [rdi+rdx*8]
  4c:	lea    edx,[rax+0x1]
  4f:	movsxd rdx,edx
  52:	vaddsd xmm0,xmm0,QWORD PTR [rdi+rdx*8]
  57:	lea    edx,[rax+0x2]
  5a:	movsxd rdx,edx
  5d:	vaddsd xmm2,xmm2,QWORD PTR [rdi+rdx*8]
  62:	lea    edx,[rax+0x3]
  65:	add    eax,0x4
  68:	cmp    rcx,rsi
  6b:	movsxd rdx,edx
  6e:	vaddsd xmm1,xmm1,QWORD PTR [rdi+rdx*8]
  73:	jne    40 <_Z4sum1PKdmm+0x40>
  75:	add    r8,0x1
  79:	add    r9d,r11d
  7c:	cmp    r10,r8
  7f:	jne    30 <_Z4sum1PKdmm+0x30>
  81:	vaddsd xmm0,xmm0,xmm3
  85:	vaddsd xmm0,xmm0,xmm2
  89:	vaddsd xmm0,xmm0,xmm1
  8d:	ret
  8e:	vxorpd xmm0,xmm0,xmm0
  92:	ret
  93:	vxorpd xmm1,xmm1,xmm1
  97:	vmovapd xmm2,xmm1
  9b:	vmovapd xmm0,xmm1
  9f:	vmovapd xmm3,xmm1
  a3:	jmp    81 <_Z4sum1PKdmm+0x81>

There are four separate additions:

  47:	vaddsd xmm3,xmm3,QWORD PTR [rdi+rdx*8]
  52:	vaddsd xmm0,xmm0,QWORD PTR [rdi+rdx*8]
  5d:	vaddsd xmm2,xmm2,QWORD PTR [rdi+rdx*8]
  6e:	vaddsd xmm1,xmm1,QWORD PTR [rdi+rdx*8]

Note that it’s using vaddsd (s for scalar; not p for packed). Also notice that it’s doing some really pointless integer operations:

  30:	mov    eax,r9d  # for each i-iteration

  40:	movsxd rdx,eax

  4c:	lea    edx,[rax+0x1]
  4f:	movsxd rdx,edx

  57:	lea    edx,[rax+0x2]
  5a:	movsxd rdx,edx

  62:	lea    edx,[rax+0x3]
  6b:	movsxd rdx,edx

  65:	add    eax,0x4  # for each j-iteration

Therefore, the issue isn’t that it doesn’t “understand” idx, because it has seen right through it and realized that:

idx(i, j+k, m) == idx(i, j, m) + k

It then also correctly hoists the multiplication out of the inner-most loop: i*m lives in r9d, and idx(i, j, m) is kept as a running counter in eax that is bumped by 4 every iteration.
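
Written back out as C++, my reading of that inner loop is roughly the following (a sketch of what the asm above appears to do, with made-up names; not literal compiler output):

#include <array>
#include <cstddef>

double sum1_as_compiled(double const* x, std::size_t n, std::size_t m) {
    std::array<double, 4> s{};               // xmm3, xmm0, xmm2, xmm1
    int base = 0;                            // r9d: i*m kept as a 32-bit int
    for (std::size_t i = 0; i < n; i++) {
        int k = base;                        // eax: idx(i, j, m) for the current j
        for (std::size_t c = 0; c < m / 4; c++) {
            s[0] += x[static_cast<std::ptrdiff_t>(k + 0)];  // 32-bit add, movsxd, scalar vaddsd
            s[1] += x[static_cast<std::ptrdiff_t>(k + 1)];
            s[2] += x[static_cast<std::ptrdiff_t>(k + 2)];
            s[3] += x[static_cast<std::ptrdiff_t>(k + 3)];
            k += 4;                          // add eax,0x4
        }
        base += static_cast<int>(m);         // add r9d,r11d
    }
    return s[0] + s[1] + s[2] + s[3];
}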

What’s odd is that

  4c:	lea    edx,[rax+0x1]
  4f:	movsxd rdx,edx

translates to:

  1. Add one to the 64-bit register rax; and store the lower 32 bits of the result in edx.
  2. Sign-extend the 32-bit register edx into the 64-bit register rdx.

In pictures (with 4- and 8-bit integers):

lea edx,[rax+0x1]
      rax:  0000 0110
      edx:       0111
movsxd rdx,edx
      rdx:  0000 0111

lea edx,[rax+0x2]
      rax:  0000 0110
      edx:       1000
movsxd rdx,edx
      rdx:  1111 1000

It’s odd, because signed integer overflow is UB. Hence, the optimizer is free to assume it never happens. Therefore, my guess is likely inaccurate and there’s some additional subtlety that’s preventing the optimizer from knowing that it’s accessing four consecutive doubles, i.e. that there’s no “wrapping around” and no need for sign extension.
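
For what it’s worth, one place where wrapping genuinely can’t be assumed away is the size_t-to-int narrowing at the call site: j + 1 is computed in size_t and only then narrowed, and that narrowing is defined to wrap (modular since C++20, implementation-defined before, and GCC wraps). Whether that’s the actual blocker here is speculation on my part, but the wrap itself is easy to demonstrate:

#include <climits>
#include <cstddef>
#include <cstdio>

int main() {
    std::size_t j = INT_MAX;
    int a = static_cast<int>(j);      // 2147483647
    int b = static_cast<int>(j + 1);  // wraps to -2147483648: not "one past" a
    std::printf("%d %d\n", a, b);
}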

The solution is to not mix integer types: either by using int for the loop variable or by rewriting idx to use size_t. Let’s do the former:

double sum2(double const* x, int n, int m) {
    std::array<double, 4> s{};
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j += 4) {
            s[0] += x[idx(i, j+0, m)];
            s[1] += x[idx(i, j+1, m)];
            s[2] += x[idx(i, j+2, m)];
            s[3] += x[idx(i, j+3, m)];
        }
    }
    return s[0] + s[1] + s[2] + s[3];
}

which results in:

00000000000000b0 <_Z4sum2PKdii>:
  b0:	test   esi,esi
  b2:	mov    r10,rdi
  b5:	mov    r9d,esi
  b8:	mov    r8d,edx
  bb:	jle    12e <_Z4sum2PKdii+0x7e>
  bd:	cmp    edx,0x3
  c0:	jle    133 <_Z4sum2PKdii+0x83>
  c2:	test   edx,edx
  c4:	lea    edi,[rdx+0x3]
  c7:	vxorpd xmm0,xmm0,xmm0
  cb:	cmovns edi,edx
  ce:	xor    esi,esi
  d0:	xor    ecx,ecx
  d2:	sar    edi,0x2
  d5:	shl    rdi,0x5
  e0:	movsxd rax,ecx
  e3:	lea    rax,[r10+rax*8]
  e7:	lea    rdx,[rax+rdi*1]
  f0:	vaddpd ymm0,ymm0,YMMWORD PTR [rax]
  f4:	add    rax,0x20
  f8:	cmp    rdx,rax
  fb:	jne    f0 <_Z4sum2PKdii+0x40>
  fd:	add    esi,0x1
 100:	add    ecx,r8d
 103:	cmp    r9d,esi
 106:	jne    e0 <_Z4sum2PKdii+0x30>
 108:	vmovapd xmm1,xmm0
 10c:	vunpckhpd xmm2,xmm0,xmm0
 110:	vextractf128 xmm0,ymm0,0x1
 116:	vmovapd xmm3,xmm0
 11a:	vunpckhpd xmm0,xmm0,xmm0
 11e:	vzeroupper
 121:	vaddsd xmm1,xmm1,xmm2
 125:	vaddsd xmm1,xmm1,xmm3
 129:	vaddsd xmm0,xmm1,xmm0
 12d:	ret
 12e:	vxorpd xmm0,xmm0,xmm0
 132:	ret
 133:	vxorpd xmm0,xmm0,xmm0
 137:	vmovapd xmm3,xmm0
 13b:	vmovapd xmm2,xmm0
 13f:	vmovapd xmm1,xmm0
 143:	jmp    121 <_Z4sum2PKdii+0x71>

Which is very nice: now we have additional motivation to turn on -Wsign-conversion and gratefully tolerate its relentless complaints, since we know doing so just might make the code run a little faster =) Not sure if it’s a better reason2; but it could be an effective motivator.
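
For completeness, the other fix mentioned above, rewriting idx in terms of size_t, would look something like this (an untested sketch; the point is simply that nothing narrows anymore):

#include <cstddef>

std::size_t idx(std::size_t i, std::size_t j, std::size_t m) {
    return i*m + j;  // every index computation stays in size_t
}

With that, the size_t loop variables in sum1 pass through idx without any mixed-width index arithmetic.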


  1. The precise setting for the slides involved mdspan with index_type == int, and the crucial thing to check was that mdspan::operator[] is “transparent” to the optimizer and doesn’t prevent auto-vectorization. Some hastily written slide code used size_t, unlike the earlier checks, which used auto. ↩︎

  2. Nobody likes silent value-changing conversions; and 32-bit ints really aren’t that big (so eventually that small-ish matrix won’t be small enough). ↩︎