Thursday 6 March 2014

Re: Changing default CFLAGS on i386

hi,

On Thu, Mar 6, 2014, at 2:23, Adam Conrad wrote:
> I wouldn't be entirely against this option, if the performance hit is
> measurably not awful in general purpose usage.

So I did some measurements against the 'radial-perf-test' in pixman,
compiled with all of the special asm/mmx/sse2/etc. backends disabled
(i.e. plain C floating-point code). I have no idea what this code is
doing, but I figured it might be a good test. I might have accidentally
picked something hideously non-representative. I only wanted to get a
rough idea, without spending too much time on this.

The baseline for 32bit with i686 march is "Average time to composite:
0.037647".

Adding -fexcess-precision=standard gives 0.040273 (+ 7%). That's a
reasonable hit on FP-heavy code.

SSE2 beats -fexcess-precision but it doesn't really improve on the
baseline -- in fact,
'-march=pentium4 -mfpmath=sse -mtune=generic' gives almost exactly the
same result as where we are today: Average time to composite: 0.037669.
The advantage here is that we now have a standards-compliant C compiler.
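
For context, C99 exposes the evaluation mode through FLT_EVAL_METHOD in
<float.h>. The following is only a minimal sketch for checking which
mode a given set of flags produces; the helper names are made up for
illustration, and the expectation that x87 math reports 2 while
-mfpmath=sse reports 0 is an assumption worth verifying on the compiler
in question.

```c
#include <float.h>

/* Hypothetical helper: describe a C99 FLT_EVAL_METHOD value. */
const char *eval_method_name(int method)
{
    switch (method) {
    case 0:
        return "no excess precision";
    case 1:
        return "float evaluated as double";
    case 2:
        return "everything evaluated as long double";
    default:
        return "indeterminate or implementation-defined";
    }
}

/* Hypothetical helper: the mode this translation unit was built with. */
int current_eval_method(void)
{
    return (int)FLT_EVAL_METHOD;
}
```

Printing eval_method_name(current_eval_method()) from a build with each
candidate set of CFLAGS should show whether intermediates are being
kept in extended precision.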

We get a slight improvement if we turn on '-march=pentium4
-mtune=generic' without forcing the compiler into SSE for math: Average
time to composite: 0.036601. That's ~3% better than today.
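
The percentages quoted so far are plain relative changes against the
0.037647 baseline. A small sketch of the arithmetic (the function name
is just for illustration):

```c
/* Relative change of `measured` against `baseline`, in percent.
 * Positive means slower (more time), negative means faster. */
double rel_change_percent(double baseline, double measured)
{
    return (measured - baseline) / baseline * 100.0;
}
```

With the numbers above, rel_change_percent(0.037647, 0.040273) comes
out at roughly +7.0 and rel_change_percent(0.037647, 0.036601) at
roughly -2.8, which is where the "+7%" and "~3% better" come from.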

I'm slightly surprised that pentium4+sse2 only ties the existing
-march=i686 flags (although it beats it by actually being
standards-correct) and in particular I'm surprised that forcing SSE math
slows things down vs. -march=pentium4 alone. I'm not sure of the reason
for this. It could be that the SSE2 instructions are truly a slower way
of doing the math. It could also be that the compiler has received less
optimisation attention here due to it being a non-default option.

I did another test with a simple program that approximates the tight
inner loop in a Mandelbrot set calculation. It saw similar results in
terms of i686 vs. pentium4 and sse (i686 ~= sse, plain pentium4 ~2%
faster). In this case the performance hit of
-fexcess-precision=standard was much worse, though: +40%.
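
For anyone who wants to try something similar, here is a rough sketch
of that kind of loop. It is not the program measured above, just an
assumed equivalent: the function names, the grid bounds and the
checksum are all made up for illustration.

```c
/* Escape-time iteration for one point c = cr + ci*i.
 * Returns the number of iterations performed before |z|^2 exceeds 4,
 * capped at max_iter. */
int mandel_iters(double cr, double ci, int max_iter)
{
    double zr = 0.0, zi = 0.0;
    int i = 0;

    while (i < max_iter && zr * zr + zi * zi <= 4.0) {
        double t = zr * zr - zi * zi + cr;  /* real part of z^2 + c */
        zi = 2.0 * zr * zi + ci;            /* imaginary part */
        zr = t;
        i++;
    }
    return i;
}

/* Run the loop over an n x n grid covering [-2, 2) x [-2, 2) and
 * return the summed iteration counts, so the compiler cannot discard
 * the work when this is timed. */
long mandel_checksum(int n, int max_iter)
{
    long sum = 0;

    for (int y = 0; y < n; y++) {
        for (int x = 0; x < n; x++) {
            double cr = -2.0 + 4.0 * x / n;
            double ci = -2.0 + 4.0 * y / n;
            sum += mandel_iters(cr, ci, max_iter);
        }
    }
    return sum;
}
```

Timing a call to mandel_checksum() under each set of flags gives a
comparison of the same shape as the one described above, though the
absolute numbers will of course differ.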

In short: I'm dismayed to report that turning on '-march=pentium4
-mfpmath=sse -mtune=generic' gives no performance improvement on this
particular piece of code.

If we approach this problem from the standpoint of "we must provide a C
compiler that adheres to standards" then using these options does give a
substantial improvement on fp-heavy code over the alternative of using
-fexcess-precision=standard.

Cheers

--
ubuntu-devel mailing list
ubuntu-devel@lists.ubuntu.com