[Coco] A Great Old-Timey Game-Programming Hack - Tom Moertel’s Blog
Mathieu Bouchard
matju at artengine.ca
Wed Jan 1 15:45:15 EST 2014
On 2013-12-17 at 14:39:00, Allen Huffman wrote:
> http://blog.moertel.com/posts/2013-12-14-great-old-timey-game-programming-hack.html
>
> A discussion of techniques used to speed up a 6809 graphical copy loop.
> He was working on a CoCo 3 on a friend's game, but didn't reveal what
> the game was. A very good read with plenty of example 6809 code on how
> he sped up the block copies quite a bit. - Allen Huffman - PO Box 22031
The part where the INCs of the unrolled loop are merged (the one that
has stuff like LDU 10,X in it) is something that I did explicitly for
a project about 10 years ago (a library for multidimensional arrays),
except that it was in C++. For example:
  while (n>0) {
    out[x  ] = in[x  ]*k;
    out[x+1] = in[x+1]*k;
    out[x+2] = in[x+2]*k;
    out[x+3] = in[x+3]*k;
    x+=4;
    n-=4;
  }
... if n is a multiple of 4. Otherwise, there are tricks involving
wrapping a while() in a switch() and having the case-labels of the switch
inside of the while(). The compiler typically optimises it as an indirect
jump through a table of labels. There are other equivalent ways of writing
it, but you need one goto so that the first iteration has a variable
number of statements in it (and so that you don't have a bunch of
if()break;).
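A minimal sketch of that switch-around-a-loop trick (usually called Duff's
device), assuming float data. The name scale_unrolled and its signature are
made up for illustration, and it uses do/while rather than while(), which
is the usual form:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: out[i] = in[i] * k for i in [0, n).
void scale_unrolled(const float* in, float* out, std::size_t n, float k) {
    if (n == 0) return;                // the do/while below always runs once
    assert(in != nullptr && out != nullptr);
    std::size_t x = 0;
    std::size_t passes = (n + 3) / 4;  // trips through the loop body, rounded up
    switch (n % 4) {                   // jump into the middle of the first pass
    case 0: do { out[x] = in[x] * k; ++x;
    case 3:      out[x] = in[x] * k; ++x;
    case 2:      out[x] = in[x] * k; ++x;
    case 1:      out[x] = in[x] * k; ++x;
            } while (--passes > 0);
    }
}
```

With n = 7, the switch enters at case 3, so the first pass does three
statements and the second pass does all four.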
But at about the same time I did that, compilers of the GCC family started
optimising that kind of thing automatically... They had done some amount
of basic unrolling for some time, but I think it was only for fixed
numbers of iterations, and then it was extended to variable numbers of
iterations too. Nowadays, you can get quite tight machine code by
just saying:
  while (n>0) {out[x]=in[x]*k; x++; n--;}
There are more tricks, but PSH/PUL isn't faster than LD/ST on
modern processors. Perhaps it was on early x86 processors, I don't recall
well. But the ratios of speeds of different tactics have changed a lot over
time: loop-unrolling has become much more of a speed boost, because
conditional jumps are hard for the cpu to accelerate, compared to almost
everything else; so a single BRLE could easily appear to take as much cpu
as sixteen ADD or more.
BTW, in the early 90s on the x86, there was "sprite compilation": the
immediate mode of the cpu was so much faster than loading array data from
the normal data path, that in some cases it was worth unrolling the loop
so massively that there isn't a loop anymore, and then pouring ALL of
the array's data into it so that the machine code doesn't even look at the
array. Did anybody do that on the CoCo? Was it ever useful on that CPU?
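To make the idea concrete, here is a rough sketch of a sprite compiler that
emits one immediate-mode store per visible pixel. Everything in it is an
assumption for illustration: the name compile_sprite, the one-byte-per-pixel
layout, the pitch parameter (bytes per screen row), and the fact that it
emits x86-style assembly text rather than actual machine code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical generator: turns sprite data into straight-line stores whose
// pixel values are immediate operands, so the drawing code never reads the array.
std::vector<std::string> compile_sprite(const std::vector<std::uint8_t>& pixels,
                                        std::size_t width, std::size_t pitch,
                                        std::uint8_t transparent) {
    assert(width > 0 && pixels.size() % width == 0);
    std::vector<std::string> code;
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        if (pixels[i] == transparent) continue;  // transparent pixels emit nothing
        // Screen offset of this pixel relative to the sprite's top-left corner.
        std::size_t offset = (i / width) * pitch + (i % width);
        code.push_back("mov byte ptr es:[di+" + std::to_string(offset) + "], " +
                       std::to_string(pixels[i]));
    }
    code.push_back("ret");
    return code;
}
```

For a 2x2 sprite {0, 7, 9, 0} with pitch 320 and 0 as the transparent
colour, this yields two stores (at offsets 1 and 320) followed by ret. No
loop and no test for transparency remain in the generated code.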
______________________________________________________________________
| Mathieu BOUCHARD ----- téléphone : +1.514.383.3801 ----- Montréal, QC