Dialectronics RISC-V corner

Dialectronics - the dialect of electronic communications

About Us

Our Mission

Minix

RISCV

PowerPC

Lua

XCOFF Bootloader

Old World Macs

Open Firmware

Words

Journeys

Home Repair

Top Level

RISC-V

11/3/22: Patches to fix pcc optimization in one-step comparison and branch operations

optim2.c.patch and match.c.patch are two patches to fix an issue with optimization changing a comparison without changing the template. This only occurs with one-step comparison and branch architectures (RISC-V, MIPS64) where optimization removes all instructions after the comparison and branch. This can occur when one of the operands to be returned is already in the return register. Without optimization the return result will be moved between temp registers; with optimization nothing needs to be done, so the instructions are removed.

For example, using the test code from 11/1/22 (gcd), without optimizations,

  if (a == b)
        c = a;
  else {
  ...

is emitted as

.L249:
.L253:
        bne a0, a1, .L255       ; rleft (rs1) != rright (rs2)
        jal x0, .L257
.L255:
        bge a0, a1, .L259       ; rright (rs2) <= rleft (rs1)

With optimizations, the code was being emitted as

.L249:
.L253:
        bne a1, a0, .L257       ; rleft (rs1) == rright (rleft)
        bge a0, a1, .L259       ; rright (rs2) <= rleft (rs1)

The jal jump to the epilog (.L257) was deleted and the label for the bne was changed to jump there instead of over to the else statement. However, this changes the logic of the operation, such that a will be returned only if a != b. That is the opposite of what we want.

Internally, pcc was actually (correctly) changing the comparison in the node. Before optimizing:

DEFLAB (0x35f98): .L253
NODE (0x35fb0):
0x2f680) CBRANCH, int, REG , SU= 0(@REG,,,,)
    0x2f6a8) !=, int, REG , SU= 122(@REG,,,,)
        0x2f6e4) REG a0, int, REG a0 , SU= 0(AREG,,,,)
        0x2f70c) REG a1, int, REG a1 , SU= 0(AREG,,,,)
    0x2f734) ICON 255, int, REG , SU= 0(@REG,,,,)

After optimizing:

.L253:
emit IP_NODE:
0x2f698) CBRANCH, int, REG , SU= 0(@REG,,,,)
    0x2f6c0) ==, int, REG , SU= 122(@REG,,,,)
        0x2f6fc) REG a0, int, REG a0 , SU= 0(AREG,,,,)
        0x2f724) REG a1, int, REG a1 , SU= 0(AREG,,,,)
    0x2f74c) ICON 257, int, REG , SU= 0(@REG,,,,)
q->cstring:     nop
q->cstring:     bne AL, AR, LC       ; rleft (rs1) != rright (rs2)

The code calls

void
gencode(NODE *p, int cookie)
{
        struct optab *q = &table[TBLIDX(p->n_su)];
        ...
        expand(p, cookie, q->cstring);
...

in mip/reader.c:1287, but as you can see from the snippet above, the node p and the optab q have lost syncronicity. The cstring comes from the template before optimizing.

When the comparison is changed, the template for the node needs to be matched again, to avoid a situation where the matched template is no longer correct. A call to relops() in mip/optim2.c:iterate() does the trick:

if (i < EQ || i - EQ >= (int)negrelsize)
   comperr("deljumps: unexpected op");
p->dlip->ip_node->n_left->n_op = negrel[i - EQ];
if ((sh = relops(p->dlip->ip_node->n_left)) != 0)
   comperr("could not rematch op after negrel (sh = 0x%x)\n", sh);

Now the emitted template is correct:

.L249:
.L253:
        beq a1, a0, .L257       ; rleft (rs1) == rright (rleft)
        bge a0, a1, .L259       ; rright (rs2) <= rleft (rs1)

It doesn't show up in MIPS64 because there is an error that is not treating the return register v0 properly, which results in an extra move between registers. It doesn't show up in powerpc and aarch64 because they are two-step comparison and branch. I also added a slight change to the mip/match.c code to return the value of relops, instead of defaulting to 0.

11/1/22: Updated code (links below). The updated code fixes some of the inefficiencies previously noted. Code to calculate the greatest common denominator (GCD) of two numbers

int gcd ( int a, int b ) {
  int c;
  if (a == b)
        c = a;
  else {
        if(a > b)
                a = a - b;
        else
                b = b - a;
        c = gcd(a,b);
  }

return c;
}

now looks like:

  
	.file "gcd.c"
; r: 0x36128, op: 48, n_left: 0x36168 n_right: 0x36108 last = 0
; movargs: reg = 10, numregs 7
; movargs: reg = 11, numregs 6
	.text
	.align 4
	.globl gcd
gcd:

;	p2maxautooff: 0, addto: 0
	mv t1, fp
	mv fp, sp
	subi sp, sp, 16
	sw t1, 0(sp)
	sw ra, 4(sp)
.L249:
.L253:
	bne a0, a1, .L257       ; rleft (rs1) != rright (rs2)
	bge a0, a1, .L259       ; rright (rs2) <= rleft (rs1) 
	sub a0, a0, a1	       ; opsimp
.L261:
	jal ra, gcd	       ; call scon/sareg
	mv t0, a0       ; rmove (6)
	mv a0, t0	       ; move between registers
	jal x0, .L257
.L259:
	sub a1, a1, a0	       ; opsimp
	jal x0, .L261
.L257:
	lw ra, 4(sp)
	lw t1, 0(sp)
	mv sp, fp
	mv fp, t1
	jr ra
	.ident "PCC: Portable C Compiler 1.2.0.DEVEL 20200630 for riscv32-unknown-minix"

If you know your RISC-V code you will see an unnecessary move to a temporary register as well as an incorrect conditional. I know the cause of the incorrect conditional and I am working to fix it (it is in the pcc optimizer) (Update: fixed in the 11/3/22 post). I know the conditions that lead to the unnecessary move but I have not yet identified the code I need to fix. You might also notice the stack moves 16 bytes but only needs 8. This is because the optimizer allocates the stack before it does the optimization, so when space is not needed, the stack is not corrected. These are issues that are coming out with RISC-V run through the optimizer (-xtemps -xdeljumps, etc) instead of relying on stack-based operations.

10/25/22: Alpha release of RISCV32 C backend for PCC.

There is a tarball but you might want to download each individual file because until these are committed to the tree these are moving targets. If I forget to rebuild the tarball it will be out-of-date.

There are six files:

code.c
flocal.c
local.c
local2.c
order.c
table.c
macdefs.h

Here is some sample output:

.L761:
        lw t0, 568(fp)         ; sub for soreg tword (right)
        addi t0, t0, 2        ; zp
        mv s4, t0              ; move between registers
        mv a0, s4              ; move between registers
        jal ra, malloc
        mv t0, a0       ; rmove
        mv s3, t0              ; move between registers
        lw t0, 532(fp)         ; sub for soreg tword (right)
        addi t0, t0, 2        ; zp
        mv s2, t0              ; move between registers     
        mv a0, s2              ; move between registers
        jal ra, malloc
        mv t0, a0       ; rmove   
        mv a1, t0              ; move between registers    
        mr t0, fp              ; caught3
        addi t0, t0, 256        ; zp
        mv a0, t0              ; move between registers
        mv t2, fp              ; move between registers
        lw t0, 2656(fp)        ; umul soreg nareg
        lw t1, 8(t0)           ; assign oreg to areg
        lw t0, 2656(fp)        ; umul soreg nareg      
        lw t0, 4(t0)           ; assign oreg to areg     
        mv s8, s2              ; move between registers
        mv s9, s4              ; move between registers
        mv s10, t0             ; move between registers
        mv s11, t1             ; move between registers
        mv s2, t2              ; move between registers
        mv s4, a0              ; move between registers
        mv s5, a1              ; move between registers
        jal x0, .L763

And some simple floating point calculations:

.L623:
        addi t0, t0, 1        ; zp
        mv a0, t0              ; move between registers
        jal x0, .L619
.L621:
        fcvt.d.w ft1, t2       ; aaa
        fcvt.d.w ft0, s3       ; aaa
        fdiv.d ft1, ft1, ft0
        fld ft0, .L625         ; double from sname (left)
        fsub.d ft0, ft0, ft1
        fmv.d ft1, ft0
        flt.d t1, fs1, ft1
        beq x0, t1, .L617
        add t0, t0, s6
        jal x0, .L609
.L634:
        fmv.d ft1, fs0
.L617:
        mv a4, s7              ; move between registers
        mv a3, s8              ; move between registers
        fld ft0, .L631         ; double from sname (left)
        fmul.d fa0, ft1, ft0
        mv a2, s3              ; move between registers
        mv a1, s9              ; move between registers
        lla a0, .L629          ; assign label to areg
        jal ra, printf

The development environment was a combination of TextWrangler on OS X and Minix-3.1.6. Editing was done in TextWrangler, but the build environment was all Minix 3.1.6.

.ident "PCC: Portable C Compiler 1.2.0.DEVEL 20200630 for riscv32-unknown-minix"

# riscv32-minix-pcc -v
Portable C Compiler 1.2.0.DEVEL 20200630 for riscv32-unknown-minix

# file /usr/bin/riscv32-minix-pcc
/usr/bin/riscv32-minix-pcc: MINIX-PC 32-bit executable, sep I&D not stripped

# pcc -v
Portable C Compiler 1.2.0.DEVEL 20190328 for i686-pc-minix3

To cross-compile you will need something like

old_path=${PATH}
export LIBS='-L/usr/lib/pcc -L/usr/lib/pcc/i386 -ld -lc -lfp -le'
export YACC=/usr/bin/yacc
./configure --with-libdir=/usr/lib --with-linker='ack_linker_shim' --prefix=/usr \
        --with-libexec=/usr/libexec --with-assembler='as_shim' \
        CC=pcc CPPFLAGS='-DUSE_YASM -DYYERROR_VERBOSE -DEBX_IS_SCRATCH -D_MINIX -D_POSIX_SOURCE -DPCC_DEBUG' \
        --disable-stripping --target=riscv32-minix
make clean && make && make install
export PATH=${old_path}

for the configure parameters.

Some caveats:

1) This is only tested against -O3

2) varargs is not yet working

Please do not send me bug reports about code that won't compile without -O3. I focused on code that emphasized the register-centric architecture of RISC-V. Yes, some store and load operations might not work. If it also fails with -O3, that is a bug.

I have an issue with needing to move the stack pointer at a specific point to implement varargs, but I have been unable to get the tree to cooperate.

There are some issues in efficiency that I am hoping is simply because I have not yet matched the templates with enough granularity. For example, in the above sample:

	lw t0, 568(fp)			; sub for soreg tword (right)
	addi t0, t0, 2			; zp
	mv s4, t0			; move between registers
	mv a0, s4			; move between registers
	jal ra, malloc
	mv t0, a0			; rmove
	mv s3, t0			; move between registers

which corresponds to C code:

compare_sz = o_st.st_size+2;
compare = malloc(compare_sz);

should be two instructions shorter:

	lw s4, 568(fp)			; sub for soreg tword (right)
	addi s4, s4, 2
	mv a0, s4
	jal ra, malloc
	mv s3, a0			; rmove

Another example (not in the above code):

	li t0, 0			; sub for scon tword (right)
	sb t0, 0(fp)			; assign byte

should be

	sb x0, 0(fp)

but I suspect that when I do some wholesale substitutions of scalar conversions I am preventing the tree from seeing this.

Additionally,

	mr t0, fp			; caught3
	addi a0, t0, 256        ; zp

should be

	addi a0, fp, 256

but I am not sure exactly the problem here with the OPLTYPE matching or if because this is a two-step optimization I am not looking up the tree as I should be.

Within my own development environment, I do not (yet) have a RISCV assembler. Therefore, testing has only gone as far as -S compiling and manual review of the output. As time permits I will run some of the more complex output Assembly through an online emulator. Early tests of basic math operation and other algorithms were successful.

There are a lot of comments in the code and the output. Please do not alter them when reporting bugs. To report a bug, please include the line of code, preferable isolated to a single file, and the tree output that failed, which should be in the .s file.

The code base I compiled against should be -current, so this should merge. However, I have a bit of trouble with cvs in my development environment (Minix-3.1.6 with a custom-built BSD-licensed toolchain), so I have not updated DATESTAMP and it is possible I have missed something with manual updates.

(If you want to contact tim, you know the routine - gtkelly at this domain.)