Alexander Kjeldaas posted a nice cll article about using the SB-SPROF disassembler integration. While reading the disassemblies that Alexander had annotated, I noticed we were using a memory-to-register XCHG instrution on the memory-allocation fast-path for x86-64.

Quote from the message:

;       4A:       41C684249800000000 MOV BYTE PTR [R12+152], 0
                                           (CLEAR INTERRUPT FLAG?)
;       53:       41C684249000000008 MOV BYTE PTR [R12+144], 8
                                           (INDICATE UNINTERRUPTIBLE)
;       5C:       48C7C110000000   MOV RCX, 16
;       63:       49034C2440       ADD RCX, [R12+64]
;       68:       49394C2448       CMP [R12+72], RCX
;       6D:       765B             JBE L2
                                           (CALCULATE NEW END-OF-HEAP
                                            AND SEE IF WE NEED TO
                                            ALLOCATE MOVE MEMORY)
;       6F:       49874C2440       XCHG RCX, [R12+64]
                                           (GET ALLOCATED MEMORY/
                                            UPDATE END-OF-HEAP)

That looks reasonable, right? We want to exchange two values, and XCHG is the shortest way to do that. Unfortunately it's not reasonable. A memory-referencing XCHG has an implicit LOCK prefix, and therefore acquires a lock on the memory system to ensure that the XCHG is atomic. This is very slow compared to executing a simpler instruction, and really not something that should be done every time we want to allocate 16 bytes for a cons cell.

Luckily we don't actually need an atomic exchange here, since the code can't be interrupted due to running inside a pseudo-atomic section, and the data that's being modified is thread-local on threaded SBCL builds. The only reason XCHG is being used is that when the code was written, no spare register was available as a temporary for shuffling the values. Later on other needs for a general scratch register surfaced, and R13 was assigned for this (i.e. the register allocator never assigns any TN to R13). Now that the register is available, getting rid of the XCHG is a three-line change. (Alexander sent a further optimization to the allocation sequence to sbcl-devel, but I haven't looked it yet).

Here's a graph of the CL-BENCH results before and after the change (the versions between 0.9.6.8 and 0.9.6.14 were fixes to esoteric CLOS/MOP bugs, which had no major performance effect according to the automated benchmarks). Y-axis is relative execution time compared to the reference implementation, in this case 0.9.6.8. X-axis is individual CL-BENCH tests, in the arbitrary order that CL-BENCH outputs them. Ideas on better ways to visualize this data are welcome.

Some of the more consy tests get a 25% speedup from this change, which is pretty good for such a tiny change. The 5% speedup on the COMPILER benchmark that I reported in the commit message was actually caused by some unrelated pretty-printer changes in the same tree, so the winnage from this change wasn't quite as major as I thought. I have no idea of what caused the large error bars for .14.

The x86 port currently does all memory allocation by calling to C instead of using this sort of inline allocation, since the inline allocation code caused mysterious slowdowns on Pentium 4. The disabled code also does a memory-referencing XCHG, which might well be the reason for the slowdowns. Getting rid of the XCHG and re-enabling inline allocation might give similar speedups on x86. (This could be a reasonable project for someone who wants to start hacking SBCL internals.)

In other news, I managed to advance the thesis a fair bit this week, and don't need to feel guilty about spending some quality time with SBCL this weekend instead of these small hacks. Some of that time will probably be spent merging stuff from the overflowing "unapplied patches"-mailbox, and the rest in finishing and committing some long-dormant changes from my local trees. The latter assumes I can find the changes; for some reason the good stuff always ends up in trees with names like clean, clean4, really-clean or pristine.