GCC Built-in Atomic Operations on Different ARM Architectures

ARMv6 introduced a pair of synchronisation primitives, LDREX and STREX, in the ARM instruction set. These instructions are used to implement atomic operations. Note, however, that baseline ARMv6 does not support single-byte exclusive operations. Byte-level exclusive access (LDREXB/STREXB) is only available from ARMv6K/ARMv6ZK and ARMv7 onward.

One way to implement an atomic operation is to write it in assembly yourself, like this:

locked    EQU 1
unlocked  EQU 0

; lock_mutex
; Declare for use from C as extern void lock_mutex(void * mutex);
    EXPORT lock_mutex
lock_mutex PROC
    LDR     r1, =locked
1   LDREX   r2, [r0]
    CMP     r2, r1        ; Test if mutex is locked or unlocked
    BEQ     %f2           ; If locked - wait for it to be released, from 2
    STREXNE r2, r1, [r0]  ; Not locked, attempt to lock it
    CMPNE   r2, #1        ; Check if Store-Exclusive failed
    BEQ     %b1           ; Failed - retry from 1
    ; Lock acquired
    DMB                   ; Required before accessing protected resource
    BX      lr

2   ; Take appropriate action while waiting for mutex to become unlocked
    WAIT_FOR_UPDATE
    B       %b1           ; Retry from 1
    ENDP


; unlock_mutex
; Declare for use from C as extern void unlock_mutex(void * mutex);
    EXPORT unlock_mutex
unlock_mutex PROC
    LDR     r1, =unlocked
    DMB                   ; Required before releasing protected resource
    STR     r1, [r0]      ; Unlock mutex
    SIGNAL_UPDATE
    BX      lr
    ENDP
    

Alternatively, if you happen to be using GCC, you can rely on its __atomic_xxx built-in functions (available since GCC 4.7.0). But there is a tricky problem if you go this way.
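Before getting to that problem, here is a rough C counterpart of the assembly mutex above, written with the __atomic builtins. It is only a minimal sketch: the names lock_mutex_c, try_lock_mutex_c and unlock_mutex_c are made up for illustration, the mutex is assumed to be a plain int, and the wait loop simply spins instead of using anything like WAIT_FOR_UPDATE/SIGNAL_UPDATE.

```c
#include <stdbool.h>

#define MUTEX_UNLOCKED 0
#define MUTEX_LOCKED   1

/* Try once to move *mutex from UNLOCKED to LOCKED.
 * Returns true if this caller now owns the lock. */
bool try_lock_mutex_c(int *mutex)
{
    int expected = MUTEX_UNLOCKED;
    return __atomic_compare_exchange_n(mutex, &expected, MUTEX_LOCKED,
                                       false, /* strong compare-and-swap */
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

/* Spin until the lock is acquired (hypothetical helper, busy-waits). */
void lock_mutex_c(int *mutex)
{
    while (!try_lock_mutex_c(mutex)) {
        /* busy-wait; a real implementation would back off or sleep here */
    }
}

/* Release the lock; the release ordering plays the role of the DMB
 * placed before the STR in the assembly version. */
void unlock_mutex_c(int *mutex)
{
    __atomic_store_n(mutex, MUTEX_UNLOCKED, __ATOMIC_RELEASE);
}
```

Because the mutex here is a 4-byte word, this sketch should not run into the single-byte problem described next.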

As we know, 32-bit ARM GCC supports many architectures, from armv2 to armv8-a, so you must be careful about which architecture you tell the compiler to target. For example, if you compile a single-byte atomic increment for armv6, GCC emits a call to a legacy __sync_xxx library function instead of inline code. That function does not use LDREX/STREX itself at all! As the disassembly further down shows, it loops around a call to address 0xffff0fc0, which is the Linux kernel's __kuser_cmpxchg user helper.

Let’s look at the sample below. It atomically adds 1 to the byte stored at address 0x80000000.


__atomic_add_fetch((char *)0x80000000, (unsigned int)1, __ATOMIC_SEQ_CST);
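The one-liner above is enough for looking at the generated code, but in an ordinary process 0x80000000 is most likely not a mapped address, so it is not something to actually run. As a sketch for experimenting, the variant below uses a static byte instead. The names counter and bump_counter are assumptions made for this example only.

```c
#include <stdint.h>

/* Hypothetical one-byte counter standing in for the byte at 0x80000000. */
static uint8_t counter;

/* Atomically add 1 to the byte and return the new value. */
uint8_t bump_counter(void)
{
    return __atomic_add_fetch(&counter, 1, __ATOMIC_SEQ_CST);
}
```

Since the operand is still a single byte, compiling this with different -march values should show the same difference in generated code as the original sample.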

On my Raspberry Pi 3 Model B (ARMv8-A hardware running a 32-bit OS), GCC's default architecture is armv6.

[user]% gcc -Q --help=target | grep arch  
  -march=  armv6
  

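So the native toolchain on the target compiles the code above into the assembly below. Note that main calls __sync_add_and_fetch_1, while the helper listed here is its sibling __sync_fetch_and_add_1, which should follow the same pattern.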

[user]% gcc -o sample.out sample.c
[user]% objdump -D sample.out 

000103e8 <main>:
   103e8:       e92d4800        push    {fp, lr}
   103ec:       e28db004        add     fp, sp, #4
   103f0:       e3a03102        mov     r3, #-2147483648        ; 0x80000000
   103f4:       e1a00003        mov     r0, r3
   103f8:       e3a01001        mov     r1, #1
   103fc:       eb000265        bl      10d98 <__sync_add_and_fetch_1>
   10400:       e3a03000        mov     r3, #0
   10404:       e1a00003        mov     r0, r3
   10408:       e8bd8800        pop     {fp, pc}

000107a4 <__sync_fetch_and_add_1>:
  107a4:       e92d47f0        push    {r4, r5, r6, r7, r8, r9, sl, lr}
  107a8:       e200a003        and     sl, r0, #3
  107ac:       e3a050ff        mov     r5, #255        ; 0xff
  107b0:       e1a0a18a        lsl     sl, sl, #3
  107b4:       e59f8040        ldr     r8, [pc, #64]   ; 107fc <__sync_fetch_and_add_1+0x58>
  107b8:       e1a05a15        lsl     r5, r5, sl
  107bc:       e1a07001        mov     r7, r1
  107c0:       e3c09003        bic     r9, r0, #3
  107c4:       e1e06005        mvn     r6, r5
  107c8:       e5990000        ldr     r0, [r9]
  107cc:       e1a02009        mov     r2, r9
  107d0:       e0004005        and     r4, r0, r5
  107d4:       e0061000        and     r1, r6, r0
  107d8:       e1a04a34        lsr     r4, r4, sl
  107dc:       e0843007        add     r3, r4, r7
  107e0:       e0053a13        and     r3, r5, r3, lsl sl
  107e4:       e1831001        orr     r1, r3, r1
  107e8:       e12fff38        blx     r8
  107ec:       e3500000        cmp     r0, #0
  107f0:       1afffff4        bne     107c8 <__sync_fetch_and_add_1+0x24>
  107f4:       e6af0074        sxtb    r0, r4
  107f8:       e8bd87f0        pop     {r4, r5, r6, r7, r8, r9, sl, pc}
  107fc:       ffff0fc0                        ; <UNDEFINED> instruction: 0xffff0fc0
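In C terms, the helper above appears to do roughly the following: it rounds the byte address down to its containing 32-bit word, builds the new word with only the target byte changed, and retries a word-sized compare-and-swap until it succeeds. The sketch below is not libgcc's source. The names word_cmpxchg and emulated_fetch_and_add_1 are made up, a GCC builtin stands in for the kernel helper at 0xffff0fc0, and a little-endian target is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel helper reached through "blx r8":
 * word-sized compare-and-swap that returns 0 on success. */
static int word_cmpxchg(uint32_t oldval, uint32_t newval, uint32_t *ptr)
{
    return __atomic_compare_exchange_n(ptr, &oldval, newval, false,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST) ? 0 : 1;
}

/* Add val to the byte at p and return the old byte (little-endian only).
 * The byte must live inside a naturally aligned 32-bit word. */
int8_t emulated_fetch_and_add_1(int8_t *p, int8_t val)
{
    uintptr_t addr = (uintptr_t)p;
    uint32_t *wordptr = (uint32_t *)(addr & ~(uintptr_t)3); /* bic r9, r0, #3 */
    unsigned shift = (unsigned)(addr & 3) * 8;              /* and + lsl #3   */
    uint32_t mask = (uint32_t)0xff << shift;                /* 0xff << shift  */
    uint32_t oldword, newword, oldbyte;

    do {
        oldword = __atomic_load_n(wordptr, __ATOMIC_RELAXED);
        oldbyte = (oldword & mask) >> shift;
        /* Keep the other three bytes, replace only the target byte. */
        newword = (oldword & ~mask) |
                  (((oldbyte + (uint8_t)val) << shift) & mask);
    } while (word_cmpxchg(oldword, newword, wordptr) != 0);

    return (int8_t)oldbyte;                                 /* sxtb r0, r4 */
}
```

The design point is that the only atomic primitive needed is a 4-byte compare-and-swap, which is presumably why this path is chosen when the target architecture has no byte-sized exclusive instructions.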
  

But if you set the GCC architecture to armv7 or armv8-a, you get what you expect: the exclusive load/store instructions are emitted inline.

[user]% gcc -march=armv8-a -o sample.out sample.c
[user]% objdump -D sample.out 

000103e8 <main>:
   103e8:       e52db004        push    {fp}            ; (str fp, [sp, #-4]!)
   103ec:       e28db000        add     fp, sp, #0
   103f0:       e3a03102        mov     r3, #-2147483648        ; 0x80000000
   103f4:       e1d32e9f        ldaexb  r2, [r3]
   103f8:       e2822001        add     r2, r2, #1
   103fc:       e1c31e92        stlexb  r1, r2, [r3]
   10400:       e3510000        cmp     r1, #0
   10404:       1afffffa        bne     103f4 <main+0xc>
   10408:       e3a03000        mov     r3, #0
   1040c:       e1a00003        mov     r0, r3
   10410:       e24bd000        sub     sp, fp, #0
   10414:       e49db004        pop     {fp}            ; (ldr fp, [sp], #4) 
   10418:       e12fff1e        bx      lr
   

This is easy to miss: you are compiling on an ARMv8 Cortex-A53 board with its native toolchain, yet it actually generates ARMv6 code for you. So be careful when you build code natively on an ARM platform, and always pass -march=xxx so that GCC knows which architecture it is targeting.
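One way to catch this early is to ask the compiler what it thinks it is targeting. The sketch below relies on two predefined macros: __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1, which GCC documents as defined when the target supports 1-byte atomic compare-and-swap, and __ARM_ARCH, which gives the ARM architecture level. The function names are hypothetical, and I have not checked the macro values on every -march setting.

```c
/* Returns 1 if the compiler reports native 1-byte compare-and-swap
 * support for the current target, 0 otherwise. */
int byte_atomics_are_native(void)
{
#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_1)
    return 1;
#else
    return 0;
#endif
}

/* Returns the ARM architecture level the compiler is targeting
 * (for example 6, 7 or 8), or 0 when not building for ARM. */
int target_arm_arch(void)
{
#if defined(__ARM_ARCH)
    return __ARM_ARCH;
#else
    return 0;
#endif
}
```

In a real build you might prefer turning the first check into an #error, so that a wrong -march fails at compile time instead of silently falling back to the library helper.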