About Us Our Mission Minix RISCV PowerPC Lua XCOFF Bootloader Old World Macs Open Firmware Words Journeys Home Repair Top Level |
RISC-Voptim2.c.patch and match.c.patch are two patches to fix an issue with optimization changing a comparison without changing the template. This only occurs with one-step comparison and branch architectures (RISC-V, MIPS64) where optimization removes all instructions after the comparison and branch. This can occur when one of the operands to be returned is already in the return register. Without optimization the return result will be moved between temp registers; with optimization nothing needs to be done, so the instructions are removed. For example, using the test code from 11/1/22 (gcd), without optimizations, if (a == b) c = a; else { ...is emitted as .L249: .L253: bne a0, a1, .L255 ; rleft (rs1) != rright (rs2) jal x0, .L257 .L255: bge a0, a1, .L259 ; rright (rs2) <= rleft (rs1)With optimizations, the code was being emitted as .L249: .L253: bne a1, a0, .L257 ; rleft (rs1) == rright (rleft) bge a0, a1, .L259 ; rright (rs2) <= rleft (rs1)The jal jump to the epilog (.L257 ) was deleted and the label for the bne was changed to jump there instead of over to the else statement. However, this changes the logic of the operation, such that a will be returned only if a != b . That is the opposite of what we want. Internally, pcc was actually (correctly) changing the comparison in the node. Before optimizing: DEFLAB (0x35f98): .L253 NODE (0x35fb0): 0x2f680) CBRANCH, int, REGAfter optimizing: .L253: emit IP_NODE: 0x2f698) CBRANCH, int, REGThe code calls void gencode(NODE *p, int cookie) { struct optab *q = &table[TBLIDX(p->n_su)]; ... expand(p, cookie, q->cstring); ...in mip/reader.c:1287, but as you can see from the snippet above, the node p and the optab q have lost syncronicity. The cstring comes from the template before optimizing.When the comparison is changed, the template for the node needs to be matched again, to avoid a situation where the matched template is no longer correct. A call to relops() in mip/optim2.c:iterate() does the trick:
if (i < EQ || i - EQ >= (int)negrelsize) comperr("deljumps: unexpected op"); p->dlip->ip_node->n_left->n_op = negrel[i - EQ]; if ((sh = relops(p->dlip->ip_node->n_left)) != 0) comperr("could not rematch op after negrel (sh = 0x%x)\n", sh);Now the emitted template is correct: .L249: .L253: beq a1, a0, .L257 ; rleft (rs1) == rright (rleft) bge a0, a1, .L259 ; rright (rs2) <= rleft (rs1)It doesn't show up in MIPS64 because there is an error that is not treating the return register v0 properly, which results in an extra move between registers. It doesn't show up in powerpc and aarch64 because they are two-step comparison and branch.
I also added a slight change to the mip/match.c code to return the value of relops, instead of defaulting to 0 .11/1/22: Updated code (links below). The updated code fixes some of the inefficiencies previously noted. Code to calculate the greatest common denominator (GCD) of two numbers int gcd ( int a, int b ) { int c; if (a == b) c = a; else { if(a > b) a = a - b; else b = b - a; c = gcd(a,b); } return c; }now looks like: .file "gcd.c" ; r: 0x36128, op: 48, n_left: 0x36168 n_right: 0x36108 last = 0 ; movargs: reg = 10, numregs 7 ; movargs: reg = 11, numregs 6 .text .align 4 .globl gcd gcd: ; p2maxautooff: 0, addto: 0 mv t1, fp mv fp, sp subi sp, sp, 16 sw t1, 0(sp) sw ra, 4(sp) .L249: .L253: bne a0, a1, .L257 ; rleft (rs1) != rright (rs2) bge a0, a1, .L259 ; rright (rs2) <= rleft (rs1) sub a0, a0, a1 ; opsimp .L261: jal ra, gcd ; call scon/sareg mv t0, a0 ; rmove (6) mv a0, t0 ; move between registers jal x0, .L257 .L259: sub a1, a1, a0 ; opsimp jal x0, .L261 .L257: lw ra, 4(sp) lw t1, 0(sp) mv sp, fp mv fp, t1 jr ra .ident "PCC: Portable C Compiler 1.2.0.DEVEL 20200630 for riscv32-unknown-minix"If you know your RISC-V code you will see an unnecessary move to a temporary register as well as an incorrect conditional. I know the cause of the incorrect conditional and I am working to fix it (it is in the pcc optimizer) (Update: fixed in the 11/3/22 post). I know the conditions that lead to the unnecessary move but I have not yet identified the code I need to fix. You might also notice the stack moves 16 bytes but only needs 8. This is because the optimizer allocates the stack before it does the optimization, so when space is not needed, the stack is not corrected. These are issues that are coming out with RISC-V run through the optimizer (-xtemps -xdeljumps, etc) instead of relying on stack-based operations. 10/25/22: Alpha release of RISCV32 C backend for PCC. There is a tarball but you might want to download each individual file because until these are committed to the tree these are moving targets. If I forget to rebuild the tarball it will be out-of-date. There are six files: code.c flocal.c local.c local2.c order.c table.c macdefs.h Here is some sample output: .L761: lw t0, 568(fp) ; sub for soreg tword (right) addi t0, t0, 2 ; zp mv s4, t0 ; move between registers mv a0, s4 ; move between registers jal ra, malloc mv t0, a0 ; rmove mv s3, t0 ; move between registers lw t0, 532(fp) ; sub for soreg tword (right) addi t0, t0, 2 ; zp mv s2, t0 ; move between registers mv a0, s2 ; move between registers jal ra, malloc mv t0, a0 ; rmove mv a1, t0 ; move between registers mr t0, fp ; caught3 addi t0, t0, 256 ; zp mv a0, t0 ; move between registers mv t2, fp ; move between registers lw t0, 2656(fp) ; umul soreg nareg lw t1, 8(t0) ; assign oreg to areg lw t0, 2656(fp) ; umul soreg nareg lw t0, 4(t0) ; assign oreg to areg mv s8, s2 ; move between registers mv s9, s4 ; move between registers mv s10, t0 ; move between registers mv s11, t1 ; move between registers mv s2, t2 ; move between registers mv s4, a0 ; move between registers mv s5, a1 ; move between registers jal x0, .L763 And some simple floating point calculations: .L623: addi t0, t0, 1 ; zp mv a0, t0 ; move between registers jal x0, .L619 .L621: fcvt.d.w ft1, t2 ; aaa fcvt.d.w ft0, s3 ; aaa fdiv.d ft1, ft1, ft0 fld ft0, .L625 ; double from sname (left) fsub.d ft0, ft0, ft1 fmv.d ft1, ft0 flt.d t1, fs1, ft1 beq x0, t1, .L617 add t0, t0, s6 jal x0, .L609 .L634: fmv.d ft1, fs0 .L617: mv a4, s7 ; move between registers mv a3, s8 ; move between registers fld ft0, .L631 ; double from sname (left) fmul.d fa0, ft1, ft0 mv a2, s3 ; move between registers mv a1, s9 ; move between registers lla a0, .L629 ; assign label to areg jal ra, printfThe development environment was a combination of TextWrangler on OS X and Minix-3.1.6. Editing was done in TextWrangler, but the build environment was all Minix 3.1.6. .ident "PCC: Portable C Compiler 1.2.0.DEVEL 20200630 for riscv32-unknown-minix" # riscv32-minix-pcc -v Portable C Compiler 1.2.0.DEVEL 20200630 for riscv32-unknown-minix # file /usr/bin/riscv32-minix-pcc /usr/bin/riscv32-minix-pcc: MINIX-PC 32-bit executable, sep I&D not stripped # pcc -v Portable C Compiler 1.2.0.DEVEL 20190328 for i686-pc-minix3To cross-compile you will need something like old_path=${PATH} export LIBS='-L/usr/lib/pcc -L/usr/lib/pcc/i386 -ld -lc -lfp -le' export YACC=/usr/bin/yacc ./configure --with-libdir=/usr/lib --with-linker='ack_linker_shim' --prefix=/usr \ --with-libexec=/usr/libexec --with-assembler='as_shim' \ CC=pcc CPPFLAGS='-DUSE_YASM -DYYERROR_VERBOSE -DEBX_IS_SCRATCH -D_MINIX -D_POSIX_SOURCE -DPCC_DEBUG' \ --disable-stripping --target=riscv32-minix make clean && make && make install export PATH=${old_path}for the configure parameters. Some caveats: 1) This is only tested against -O3 2) varargs is not yet working Please do not send me bug reports about code that won't compile without -O3. I focused on code that emphasized the register-centric architecture of RISC-V. Yes, some store and load operations might not work. If it also fails with -O3, that is a bug. I have an issue with needing to move the stack pointer at a specific point to implement varargs, but I have been unable to get the tree to cooperate. There are some issues in efficiency that I am hoping is simply because I have not yet matched the templates with enough granularity. For example, in the above sample: lw t0, 568(fp) ; sub for soreg tword (right) addi t0, t0, 2 ; zp mv s4, t0 ; move between registers mv a0, s4 ; move between registers jal ra, malloc mv t0, a0 ; rmove mv s3, t0 ; move between registerswhich corresponds to C code: compare_sz = o_st.st_size+2; compare = malloc(compare_sz);should be two instructions shorter: lw s4, 568(fp) ; sub for soreg tword (right) addi s4, s4, 2 mv a0, s4 jal ra, malloc mv s3, a0 ; rmoveAnother example (not in the above code): li t0, 0 ; sub for scon tword (right) sb t0, 0(fp) ; assign byteshould be sb x0, 0(fp)but I suspect that when I do some wholesale substitutions of scalar conversions I am preventing the tree from seeing this. Additionally, mr t0, fp ; caught3 addi a0, t0, 256 ; zpshould be addi a0, fp, 256but I am not sure exactly the problem here with the OPLTYPE matching or if because this is a two-step optimization I am not looking up the tree as I should be. Within my own development environment, I do not (yet) have a RISCV assembler. Therefore, testing has only gone as far as -S compiling and manual review of the output. As time permits I will run some of the more complex output Assembly through an online emulator. Early tests of basic math operation and other algorithms were successful. There are a lot of comments in the code and the output. Please do not alter them when reporting bugs. To report a bug, please include the line of code, preferable isolated to a single file, and the tree output that failed, which should be in the .s file. The code base I compiled against should be -current, so this should merge. However, I have a bit of trouble with cvs in my development environment (Minix-3.1.6 with a custom-built BSD-licensed toolchain), so I have not updated DATESTAMP and it is possible I have missed something with manual updates. (If you want to contact tim, you know the routine - gtkelly at this domain.) |