[Coco] A Great Old-Timey Game-Programming Hack - Tom Moertel’s Blog

Wed Jan 1 15:45:15 EST 2014

On 2013-12-17 at 14:39:00, Allen Huffman wrote:

> http://blog.moertel.com/posts/2013-12-14-great-old-timey-game-programming-hack.html
>
> A discussion of techniques used to speed up a 6809 graphical copy loop. 
> He was working on a CoCo 3 on a friend's game, but didn't reveal what 
> the game was. A very good read with plenty of example 6809 code on how 
> he sped up the block copies quite a bit. - Allen Huffman - PO Box 22031

The part where the INCs of the unrolled loop are merged into a single 
one (the one that has stuff like LDU 10,X in it) is something that I did 
explicitly for a project about 10 years ago (a library for 
multidimensional arrays), except that it was in C++. For example:

while (n>0) {
   out[x  ] = in[x  ]*k;
   out[x+1] = in[x+1]*k;
   out[x+2] = in[x+2]*k;
   out[x+3] = in[x+3]*k;
   x+=4;
   n-=4;
}

... if n is a multiple of 4. Otherwise, there are tricks (Duff's device 
is the best-known one) that involve wrapping a while() in a switch() and 
putting the case labels of the switch inside the while(). The compiler 
typically optimises the switch into an indirect jump through a table of 
labels. There are other equivalent ways of writing it, but you need one 
goto so that the first iteration can run a variable number of statements 
(and so that you don't end up with a bunch of if()break;).
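
To make the switch-around-a-loop trick concrete, here is a minimal sketch 
of it. The function name scale_unrolled is made up for this example and 
is not from the library mentioned above, and it uses a do/while rather 
than a while() so that the jump into the middle of the loop body stays 
simple:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (name invented for illustration): writes in[i]*k
// into out[i] for every i, unrolled four ways. The switch jumps into the
// middle of the loop body so that the first pass handles the n % 4
// leftover elements and every later pass handles exactly four.
void scale_unrolled(const std::vector<int>& in, std::vector<int>& out, int k) {
    const std::size_t n = in.size();
    out.resize(n);
    if (n == 0) return;                 // nothing to do, and no pass to enter
    assert(out.size() == in.size());
    std::size_t x = 0;
    std::size_t passes = (n + 3) / 4;   // number of trips through the body
    switch (n % 4) {
    case 0: do { out[x] = in[x] * k; ++x;
    case 3:      out[x] = in[x] * k; ++x;
    case 2:      out[x] = in[x] * k; ++x;
    case 1:      out[x] = in[x] * k; ++x;
            } while (--passes > 0);
    }
}
```

With 7 input elements, for instance, the switch enters at case 3, so the 
first pass does 3 elements and the second pass does the remaining 4.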

But at about the same time I did that, compilers of the GCC family started 
doing that kind of optimisation automatically. They had been doing some 
basic unrolling for a while, but I think it only covered loops with a 
fixed number of iterations, and then it was extended to variable numbers 
of iterations too. Nowadays, you can get quite tight machine code by 
just writing:

while (n>0) {out[x]=in[x]*k; x++; n--;}

And though there are more tricks, PSH/PUL isn't faster than LD/ST on 
modern processors. Perhaps it was on early x86 processors, I don't recall 
well. But the relative speeds of different tactics have changed a lot over 
time: loop unrolling has become much more of a speed boost, because 
conditional jumps are hard for the CPU to accelerate compared to almost 
everything else, so a single BRLE could easily appear to take as much CPU 
time as sixteen ADDs or more.

BTW, in the early 90s on the x86, there was "sprite compilation": the 
immediate mode of the CPU was so much faster than loading array data 
through the normal data path that in some cases it was worth unrolling 
the loop so massively that there was no loop left, and then pouring ALL 
of the array's data into it as immediate operands, so that the machine 
code never even looks at the array. Did anybody do that on the CoCo? Was 
it ever useful on that CPU?
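
As a rough illustration of what "pouring the data into the code" means, 
here is a small sketch of a generator that turns one row of sprite bytes 
into straight-line 6809-style assembly text. Everything in it is an 
assumption made for the example: the function name compile_sprite_row, 
the convention that X points at the destination, and the choice to treat 
byte 0 as transparent. It says nothing about whether this pays off on a 
real CoCo.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical generator (invented for illustration): every sprite byte
// becomes an immediate operand, so the emitted code never reads the
// sprite array. Assumes X holds the destination address and that a byte
// value of 0 means "transparent", in which case nothing is emitted.
std::string compile_sprite_row(const std::vector<std::uint8_t>& pixels) {
    std::string code;
    char line[32];
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        if (pixels[i] == 0) continue;   // transparent: no load, no store
        std::snprintf(line, sizeof line, " LDA #$%02X\n",
                      static_cast<unsigned>(pixels[i]));
        code += line;                   // load the byte as an immediate
        std::snprintf(line, sizeof line, " STA %zu,X\n", i);
        code += line;                   // store it at offset i from X
    }
    code += " RTS\n";
    return code;
}
```

The generated text has no loop and no branches, and transparent pixels 
cost nothing at all, which is where the technique gets its appeal.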

  ______________________________________________________________________
| Mathieu BOUCHARD ----- telephone: +1.514.383.3801 ----- Montréal, QC