As an antidote to list traffic about teething problems in FB^3, consider the simple benchmark below. It is modified from "Examples:Neat Apps:FB II vs FB^3 Example" on the FB^3 CD. The results give cause for celebration (champagne and cigars for Staz'n'Andy), as well as food for thought.
'--------Simple floating point benchMark-------
register off
DIM &&,x#,y#,z#,t#
register on
dim i&,t&
x# = 12345678.90123456789
y# = 123: z# = .01: t# = 100
t& = FN TICKCOUNT
FOR i&=1 TO 10000000: x#=x#+y#*z#-y#/t#: NEXT
PRINT FN TICKCOUNT-t&;" ticks, x#=";x#
'-----------------------------------------------
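For anyone wanting to sanity-check the printed x# on another system, here is a rough Python port of the loop. It is only a sketch: the function name and the reduced default iteration count are my own choices, not part of the FB^3 example. Mathematically y*z and y/t are both 1.23, so x should finish very close to where it started; any difference is accumulated rounding error.

```python
START_X = 12345678.90123456789  # stored as the nearest double


def float_benchmark(iterations=100_000):
    """Repeat x = x + y*z - y/t and return the final x.

    The default iteration count is reduced from the benchmark's 10000000
    so that the check runs quickly in an interpreter (an assumption made
    for convenience, not part of the original listing).
    """
    x = START_X
    y, z, t = 123.0, 0.01, 100.0
    for _ in range(iterations):
        x = x + y * z - y / t
    return x


final = float_benchmark()
print("x =", final, " drift =", final - START_X)
```

The timing is of no interest here; the point is only that the arithmetic leaves x essentially unchanged, which makes the final PRINT a handy check that the loop really ran.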
Results from iMac (233 MHz G3):
| Compiler (note) | Time (ticks, to nearest 10) | MFLOPS |
| FB2 | 11870 | 0.2 |
| FB^3 68K (1) | 3920 | 0.6 |
| FB^3 PPC (2) | 3870 | 0.6 |
| PPC ASM (2) | 710 | 3.4 |
| FB^3 PPC (3) | 230 | 10.4 |
| FB^3 PPC (4) | 230 | 10.4 |
| PPC ASM (3) | 100 | 24 |
| PPC ASM (4) | 100 | 24 |
1. Unaffected by alignment of variables
2. Variables aligned on 2-byte boundary but not 4
3. Variables aligned on 4-byte boundary but not 8
4. Variables aligned on 8-byte boundary
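The MFLOPS column can be reproduced from the tick counts. The sketch below assumes one tick is 1/60 s and counts four floating point operations per pass of the loop (multiply, add, divide, subtract); that count is my assumption, chosen because it matches the figures in the table.

```python
TICKS_PER_SECOND = 60
ITERATIONS = 10_000_000
FLOPS_PER_ITERATION = 4  # assumption: multiply, add, divide, subtract


def mflops(ticks):
    """Convert an elapsed time in ticks to millions of FP operations per second."""
    seconds = ticks / TICKS_PER_SECOND
    return ITERATIONS * FLOPS_PER_ITERATION / seconds / 1e6


# Tick counts copied from the table above.
timings = {
    "FB2": 11870,
    "FB^3 68K (1)": 3920,
    "FB^3 PPC (2)": 3870,
    "PPC ASM (2)": 710,
    "FB^3 PPC (3)": 230,
    "PPC ASM (3)": 100,
}

for label, ticks in timings.items():
    print(f"{label:14s} {ticks:6d} ticks  {mflops(ticks):5.1f} MFLOPS")
```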
Explanation to Note (2)
Misalignment can be forced by the following DIM statement:
DIM &&,silly%,x#,y#,z#,t#
Explanation to Notes (3) and (4)
DIM &&,x#,y#,z#,t# sometimes fails to produce 8-byte alignment, owing to an anomaly that has been reported to Staz. On some processors, though not the G3, the resulting 4-byte alignment could reduce performance.
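The three alignment cases can be pictured with a toy model. This is a sketch under assumptions of my own, not a description of what the compiler really does: variables are placed consecutively in DIM order, silly% occupies 2 bytes, each double occupies 8 bytes, and the anomaly in case (3) is modelled simply as a start address that is a multiple of 4 but not of 8.

```python
def largest_alignment(address):
    """Return the largest of 8, 4, 2 that divides address, else 1."""
    for boundary in (8, 4, 2):
        if address % boundary == 0:
            return boundary
    return 1


def layout(variables, start=0):
    """Assign consecutive addresses to (name, size) pairs, beginning at start."""
    addresses = {}
    address = start
    for name, size in variables:
        addresses[name] = address
        address += size
    return addresses


DOUBLES = [("x#", 8), ("y#", 8), ("z#", 8), ("t#", 8)]

cases = {
    "note 4: DIM &&,x#,y#,z#,t#": layout(DOUBLES, start=0),
    "note 3: same DIM, assumed start at 4": layout(DOUBLES, start=4),
    "note 2: DIM &&,silly%,x#,y#,z#,t#": layout([("silly%", 2)] + DOUBLES),
}

for label, addresses in cases.items():
    print(label, "-> x# on a", largest_alignment(addresses["x#"]), "byte boundary")
```

In this model the 2-byte silly% pushes every double to an address that is even but not a multiple of 4, which is the slow case in the table.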
Assembly stuff, for bold explorers:
'-------------PPC ASM equivalent------
//FOR i&=1 TO 10000000: x#=x#+y#*z#-y#/t#: NEXT
countFP&=10000000
` lwz r3,^countFP& ; r3=10000000
` lfd f1,^x#
` lfd f2,^y#
` lfd f3,^z#
` lfd f4,^y#
` lfd f5,^t#
`loopFP
` fdiv f0,f4,f5 ; f0=f4/f5
` addic. r3,r3,$FFFF ; r3=r3-1 (subic. r3,r3,1)
` fmadd f1,f2,f3,f1 ; f1=f2*f3+f1
` fsub f1,f1,f0 ; f1=f1-f0
` stfd f1,^x# ; save f1->x#
` bc 4,2,loopFP ; bne loopFP
'--------------------------------------
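A word on `addic. r3,r3,$FFFF`: the immediate operand is a 16-bit signed value, so $FFFF is sign-extended to -1 and the instruction decrements r3. The Python sketch below models that and the countdown it drives; the function names are made up for illustration.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def count_loop_passes(count):
    """Model the countdown: decrement r3 on every pass, repeat while r3 != 0."""
    r3 = count
    passes = 0
    while True:
        passes += 1
        # addic. r3,r3,$FFFF : add the sign-extended immediate, keep 32 bits
        r3 = (r3 + sign_extend_16(0xFFFF)) & 0xFFFFFFFF
        if r3 == 0:  # bne falls through once the result is zero
            break
    return passes


print(sign_extend_16(0xFFFF), count_loop_passes(5))
```

Because the test comes at the bottom of the loop, a starting count of N gives N passes, matching FOR i&=1 TO N.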
Corresponding to the floating point benchmarks posted recently, here are integer benchmark results (iMac 266MHz G3), using a modification of the Sieve of Eratosthenes program in "Examples:Neat Apps:FB II vs FB^3 Example" on the FB^3 CD.
Even in 68K, FB^3 is measurably faster than FB2. PPC native code is 2-3 times faster again. Tuning (by careful application of REGISTER ON) gives a moderate but worthwhile improvement. Lastly, there is scope for the speed-hungry assembly programmer. Further congratulations to Staz'n'Andy seem in order.
The overall pattern is similar to that of the floating point benchmark. In that benchmark, however, REGISTER had no effect and tuning was merely a matter of avoiding disastrous variable misalignment.
| Compiler | Time in ticks (1/60s) | Notes |
| FB2 | 355 | |
| FB^3 68K | 322 | 1 All RAM variables |
| FB^3 68K | 311 | 2 Some REGISTER variables |
| FB^3 68K | 276 | 3 All REGISTER variables |
| FB^3 PPC | 125 | 1 All RAM variables |
| FB^3 PPC | 116 | 2 Some REGISTER variables |
| FB^3 PPC | 93 | 3 All REGISTER variables |
| PPC ASM | 29 | |
Note (1). REGISTER ON changed to REGISTER OFF
Note (2). In the original program, two variables (i and k) were unskilfully DIMmed in such a way that they could not become REGISTER variables:-
DIM &&,i,k///Align Vars to even address <-- misguided
DIM f(8191)
DIM t&,loops,c,p
In fact no alignment directive is needed at all; the compiler always uses an even address. Finally, the && alignment directive (supposedly 8-byte) is larger than needed for integer (2-byte) variables.
Note (3). As in listing below.
'-----------Integer BenchMark--------------
LOCAL FN doSieveFB
REGISTER ON ' with compiler preferences register variables ON, too
DIM i, k, loops, c, p, f(8191), t&
t& = FN TICKCOUNT
FOR loops = 1 TO 1000
c = 0
FOR i = 0 TO 8191
f(i) = 1
NEXT i
FOR i = 0 TO 8191
LONG IF f(i) <> 0
p=i+i+3
LONG IF i+p <= 8191
FOR k = i+p TO 8191 STEP p
f(k)=0
NEXT k
END IF
c = c+1
END IF
NEXT i
NEXT loops
t&=FN TICKCOUNT-t&
PRINT c;" primes "; t&;" ticks"
END FN
'-------------------------------------------
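To check the prime count independently of FB, the inner body of the listing can be ported more or less line for line. The Python sketch below does one pass only (the 1000 repetitions matter for timing, not for the result); the function name and the `limit` parameter are mine. Flag i stands for the odd number 2i+3, so the sieve covers 3 to 16385.

```python
def sieve_count(limit=8191):
    """One pass of the benchmark sieve; returns the number of primes found."""
    flags = [1] * (limit + 1)          # FOR i = 0 TO 8191: f(i) = 1
    count = 0
    for i in range(limit + 1):
        if flags[i]:
            p = i + i + 3              # the odd number represented by flag i
            # FOR k = i+p TO 8191 STEP p; the range is empty when i+p > limit,
            # which plays the part of the LONG IF guard in the listing.
            for k in range(i + p, limit + 1, p):
                flags[k] = 0
            count += 1
    return count


print(sieve_count(), "primes")
```

This reports 1899 primes, which is the figure the FB and assembly versions should print.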
'-----------Assembly equivalent-------------
#IF cpuPPC
LOCAL FN doSieveAssembler
REGISTER OFF ' disable because we need addresses of variables
DIM &, t&, fPtr&, i, k, loops, c, p, f(8191)
REGISTER ON
t& = FN TICKCOUNT
fPtr&=@f(0)
FOR loops = 1 TO 1000
` addi r6,0,0; c = 0
` lwz r9,^fPtr& ; address of f(0)
` addi r4,0,2 ; 2
` subf r11,r4,r9 ; address-2 for sthux
` addi r5,0,8192 ; loop count
` mtspr ctr,r5 ; loop count
` addi r3,0,1 ; r3 = 1
`iClearLoop
` sthux r3,r11,r4 ; f(i)=1
` bc 16,0,iClearLoop ; bdnz iClearLoop
` addi r10,0,0 ; i=0
`iLoop
` add r4,r10,r10 ; i*2
` lhzx r3,r9,r4 ; f(i)
` cmpi cr0,0,r3,0 ; cmpwi r3,0
` bc 4,1,skip ; ble skip
` addi r4,r4,3 ; p = i*2 + 3
` add r5,r4,r10 ; k = i+p
` cmpi cr0,0,r5,8192 ; cmpwi r5,8192
` bc 4,0,incrementC ; bge incrementC
` add r8,r5,r5 ; k*2 index into INT array
` add r7,r4,r4 ; p*2
` addi r3,0,0 ; r3=0
` add r11,r9,r8
` subf r11,r7,r11 ; adjust index for sthux
`kLoop ; the inner loop
` add r5,r5,r4
` cmpi cr0,0,r5,8191 ; cmpwi r5,8191
` sthux r3,r11,r7 ; f(k)=0
` bc 4,1,kLoop ; ble kLoop
`incrementC
` addi r6,r6,1 ; c=c+1
`skip
` addi r10,r10,1 ; i=i+1
` cmpi cr0,0,r10,8191; cmpwi r10,8191
` bc 4,1,iLoop ; ble iLoop
` sth r6,^c ; store c
NEXT loops
t&=FN TICKCOUNT-t&
PRINT c;" primes "; t&;" ticks"
END FN
#ENDIF
'----------------------------------------------
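The listings above use the raw `bc BO,BI,label` form, with the familiar mnemonic given in each comment. As a reading aid, here is a small Python sketch of that mapping. It is deliberately incomplete: it handles only the BO values that appear above (4 and 16) plus 12 for completeness, and only the four bits of cr0. The table and function names are mine.

```python
CR0_BITS = {0: "lt", 1: "gt", 2: "eq", 3: "so"}
NEGATED = {"lt": "ge", "gt": "le", "eq": "ne", "so": "ns"}


def simplified_mnemonic(bo, bi):
    """Map a (BO, BI) pair to its simplified branch mnemonic (cr0 only)."""
    if bo == 16:
        # Decrement CTR and branch if it is still non-zero; BI is ignored.
        return "bdnz"
    bit = CR0_BITS[bi]
    if bo == 12:
        return "b" + bit               # branch if the condition bit is set
    if bo == 4:
        return "b" + NEGATED[bit]      # branch if the condition bit is clear
    raise ValueError("BO value not covered by this sketch")


for bo, bi in [(4, 2), (16, 0), (4, 1), (4, 0)]:
    print(f"bc {bo},{bi} -> {simplified_mnemonic(bo, bi)}")
```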