[Coco] Mod10 Suggestions
Dave Philipsen
dave at davebiz.com
Sun Feb 19 01:25:10 EST 2017
And of course you can apply the same principle to stack blasting:
PSHS <blah, blah, blah>
PSHS <blah, blah, blah>
PSHS <blah, blah, blah>
etc. before doing a test and a branch.
On another note, the CoCo3FPGA has some hardware tricks that help in that regard too. There is a system in place where you can set a 24-bit direct hardware pointer to memory and then write a byte directly to the memory with the hardware pointer being auto-incremented. This enables very fast writes to memory without using the MMU and without needing an (software) index register although you still have to somehow keep track of how much memory you're writing to. With this system and a 25 MHz core and some 'non-spaghetti' code it can clear a 288,000-byte graphics page (640x450x256-color) in less than the blink of an eye!
Dave
Dave Philipsen
> On Feb 18, 2017, at 11:54 PM, L. Curtis Boyle <curtisboyle at sasktel.net> wrote:
>
> Or do something like:
> CLR ,x
> CLR 1,x
> ...
> CLR 7,x
> LEAX 8,x
>
> But if you are doing chunks of contiguous RAM, stack blasting is the way to go.
>
> Sent from my iPhone
>
>> On Feb 18, 2017, at 11:24 PM, Dave Philipsen <dave at davebiz.com> wrote:
>>
>> Yep. I thought I had remembered that from somewhere.
>>
>> When you're constrained by the size of a ROM or just the amount of memory on hand or in an embedded system or MCU you sometimes must optimize for size. My first CoCo had 4K! But nowadays there's usually enough memory available unless you're working on a pretty big program. I'll admit that I don't always have time to optimize code but sometimes I look at repetitive code sections like that and string them out in order to speed things up especially if I know the code section is going to get used a lot. One example would be to clear data in memory or to clear graphics memory. Instead of a tight loop like:
>>
>> LOOP CLR ,X+
>> CMPX <something>
>> BNE LOOP
>>
>> I might use:
>>
>> LOOP CLR ,X+
>> CLR ,X+
>> CLR ,X+
>> CLR ,X+
>> CLR ,X+
>> CLR ,X+
>> CLR ,X+
>> CLR ,X+
>> CMPX <something>
>> BNE LOOP
>>
>> If you're working on a large chunk of memory it can make a discernible difference.
>>
>>
>> Dave
>>
>>
>>> On 2/18/2017 11:06 PM, William Astle wrote:
>>> You're thinking of the long branches. The "short" branches are the same either way.
>>>
>> Yep. I thought I had remembered that from somewhere.
>>
>> When you're constrained by the size of a ROM or just the amount of memory on hand or in an embedded system or MCU you sometimes must optimize for size. My first CoCo had 4K! But nowadays there's usually enough memory available unless you're working on a pretty big program. I'll admit that I don't always have time to optimize code but sometimes I look at repetitive code sections like that and string them out in order to speed things up especially if I know the code section is going to get used a lot. One example would be to clear data in memory or to clear graphics memory. Instead of a tight loop like:
>>
>> LOOP CLR ,X+
>> CMPX <something>
>> BNE LOOP
>>
>> I might use:
>>
>> LOOP CLR ,X+
>> CLR ,X+
>> CLR ,X+
>> CLR ,X+
>> CLR ,X+
>> CLR ,X+
>> CLR ,X+
>> CLR ,X+
>> CMPX <something>
>> BNE LOOP
>>
>> If you're working on a large chunk of memory it can make a discernible difference.
>>
>>
>> Dave
>>
>>
>>> On 2/18/2017 11:06 PM, William Astle wrote:
>>> You're thinking of the long branches. The "short" branches are the same either way.
>>>
>>>> On 2017-02-18 10:02 PM, Dave Philipsen wrote:
>>>> Or maybe not, after all....
>>>>
>>>>> On 2/18/2017 10:57 PM, Dave Philipsen wrote:
>>>>> Yeah, I think the BNE is one less cycle if the branch isn't taken, right?
>>>>>
>>>>> Dave
>>>>>
>>>>>
>>>>>> On 2/18/2017 10:53 PM, William Astle wrote:
>>>>>> It would be 8 BNEs actually. It's executed even for the last loop.
>>>>>>
>>>>>> BNE is 3 cycles and DECB is 2 cycles so 40 cycles total.
>>>>>>
>>>>>> You can also save a cycle for each "temporary" reference by just
>>>>>> using RESULT as the temporary instead of using the stack. It's one
>>>>>> byte longer but one cycle faster as long as RESULT is in range of an
>>>>>> 8 bit offset from PC. That would be 2 cycles gained per iteration for
>>>>>> a total of 16 cycles. It's faster to use the stack if a PCR access to
>>>>>> result would need a 16 bit offset.
>>>>>>
>>>>>>
>>>>>>> On 2017-02-18 09:15 PM, Dave Philipsen wrote:
>>>>>>> How much speed would you gain by completely eliminating 8 DECBs and 7
>>>>>>> BNEs?:
>>>>>>>
>>>>>>> ORG $1200
>>>>>>> CCD RMB 16
>>>>>>> RESULT RMB 1
>>>>>>>
>>>>>>> START LEAX CCD+16,PCR
>>>>>>> CLRA
>>>>>>>
>>>>>>> LOOP ADDA ,-X
>>>>>>> DAA
>>>>>>> PSHS A
>>>>>>> LDA ,-X
>>>>>>> LSLA
>>>>>>> CMPA #10
>>>>>>> BLO LOOP2
>>>>>>> SUBA #9
>>>>>>> LOOP2 ADDA ,S+
>>>>>>> DAA
>>>>>>>
>>>>>>> ADDA ,-X
>>>>>>> DAA
>>>>>>> PSHS A
>>>>>>> LDA ,-X
>>>>>>> LSLA
>>>>>>> CMPA #10
>>>>>>> BLO LOOP3
>>>>>>> SUBA #9
>>>>>>> LOOP3 ADDA ,S+
>>>>>>> DAA
>>>>>>>
>>>>>>> ADDA ,-X
>>>>>>> DAA
>>>>>>> PSHS A
>>>>>>> LDA ,-X
>>>>>>> LSLA
>>>>>>> CMPA #10
>>>>>>> BLO LOOP4
>>>>>>> SUBA #9
>>>>>>> LOOP4 ADDA ,S+
>>>>>>> DAA
>>>>>>>
>>>>>>> ADDA ,-X
>>>>>>> DAA
>>>>>>> PSHS A
>>>>>>> LDA ,-X
>>>>>>> LSLA
>>>>>>> CMPA #10
>>>>>>> BLO LOOP5
>>>>>>> SUBA #9
>>>>>>> LOOP5 ADDA ,S+
>>>>>>> DAA
>>>>>>>
>>>>>>> ADDA ,-X
>>>>>>> DAA
>>>>>>> PSHS A
>>>>>>> LDA ,-X
>>>>>>> LSLA
>>>>>>> CMPA #10
>>>>>>> BLO LOOP6
>>>>>>> SUBA #9
>>>>>>> LOOP6 ADDA ,S+
>>>>>>> DAA
>>>>>>>
>>>>>>> ADDA ,-X
>>>>>>> DAA
>>>>>>> PSHS A
>>>>>>> LDA ,-X
>>>>>>> LSLA
>>>>>>> CMPA #10
>>>>>>> BLO LOOP7
>>>>>>> SUBA #9
>>>>>>> LOOP7 ADDA ,S+
>>>>>>> DAA
>>>>>>>
>>>>>>> ADDA ,-X
>>>>>>> DAA
>>>>>>> PSHS A
>>>>>>> LDA ,-X
>>>>>>> LSLA
>>>>>>> CMPA #10
>>>>>>> BLO LOOP8
>>>>>>> SUBA #9
>>>>>>> LOOP8 ADDA ,S+
>>>>>>> DAA
>>>>>>>
>>>>>>> ADDA ,-X
>>>>>>> DAA
>>>>>>> PSHS A
>>>>>>> LDA ,-X
>>>>>>> LSLA
>>>>>>> CMPA #10
>>>>>>> BLO LOOP9
>>>>>>> SUBA #9
>>>>>>> LOOP9 ADDA ,S+
>>>>>>> DAA
>>>>>>>
>>>>>>> ANDA #$0F
>>>>>>> STA RESULT,PCR
>>>>>>> ENDPGM RTS
>>>>>>> END START
>>>>>>>
>>>>>>>> On 2/18/2017 8:22 PM, William Mikrut wrote:
>>>>>>>> Which is the beauty of this project.
>>>>>>>>
>>>>>>>> Clearly there are at least 3 ways to do this...each with a slightly
>>>>>>>> different outcome.
>>>>>>>>
>>>>>>>> Some optimization for size,speed... or both.
>>>>>>>>
>>>>>>>> There is a wealth of information and experience here from everone
>>>>>>>> and I
>>>>>>>> truly appreciate all the input!
>>>>>>>>
>>>>>>>> I can't wait to start the next project and see where it leads!!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Feb 18, 2017 8:10 PM, "L. Curtis Boyle" <curtisboyle at sasktel.net>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I was just going to mention that if speed is more important, doing an
>>>>>>>>> leas
>>>>>>>>> -1,s before the loop, and then just a sta ,a /adda ,s (instead of
>>>>>>>>> pshs
>>>>>>>>> a/add ,s+), and then a final leas 1,s after the loop is done would be
>>>>>>>>> a bit
>>>>>>>>> longer, but a bit faster.
>>>>>>>>>
>>>>>>>>> L. Curtis Boyle
>>>>>>>>> curtisboyle at sasktel.net
>>>>>>>>>
>>>>>>>>> TRS-80 Color Computer Games website
>>>>>>>>> http://www.lcurtisboyle.com/nitros9/coco_game_list.html
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Feb 18, 2017, at 7:41 PM, Dave Philipsen <dave at davebiz.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> That's pretty well optimized! Have you ever considered the
>>>>>>>>>> difference
>>>>>>>>> between optimizing for size and optimizing for speed? So, for
>>>>>>>>> instance, if
>>>>>>>>> you weren't necessarily constrained for size but you knew you were
>>>>>>>>> going to
>>>>>>>>> process a list of jillions of cc numbers would you write it
>>>>>>>>> differently?
>>>>>>>>>> Dave Philipsen
>>>>>>>>>>
>>>>>>>>>>> On Feb 18, 2017, at 5:06 PM, William Mikrut <wmikrut72 at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>> Some slight re ordering of the code and it works perfectly!
>>>>>>>>>>> 48 Bytes total, Less 17 for storage -- 31 program bytes to get
>>>>>>>>>>> the job
>>>>>>>>> done.
>>>>>>>>>>> My original code was 61 program bytes... down to half the size and
>>>>>>>>>>> does
>>>>>>>>> the
>>>>>>>>>>> exact same thing.
>>>>>>>>>>> Absolutely amazing!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ORG $1200
>>>>>>>>>>> CCD RMB 16
>>>>>>>>>>> RESULT RMB 1
>>>>>>>>>>>
>>>>>>>>>>> START LEAX CCD+16,PCR
>>>>>>>>>>> CLRA
>>>>>>>>>>> LDB #8
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> LOOP ADDA ,-X
>>>>>>>>>>> DAA
>>>>>>>>>>> PSHS A
>>>>>>>>>>> LDA ,-X
>>>>>>>>>>> LSLA
>>>>>>>>>>> CMPA #10
>>>>>>>>>>> BLO LOOP2
>>>>>>>>>>> SUBA #9
>>>>>>>>>>> LOOP2 ADDA ,S+
>>>>>>>>>>> DAA
>>>>>>>>>>>
>>>>>>>>>>> DECB
>>>>>>>>>>> BNE LOOP
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ANDA #$0F
>>>>>>>>>>> STA RESULT,PCR
>>>>>>>>>>> ENDPGM RTS
>>>>>>>>>>> END START
>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Feb 18, 2017 at 1:03 PM, William Mikrut
>>>>>>>>>>>> <wmikrut72 at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>>> You are right -- I looked at is closer.
>>>>>>>>>>>> One thing I need to do is reverse the order of operations.
>>>>>>>>>>>>
>>>>>>>>>>>> The LSLA is performed first.
>>>>>>>>>>>> First I need to store the byte and LSLA the next byte.
>>>>>>>>>>>>
>>>>>>>>>>>> Otherwise if I flip it from left to right:
>>>>>>>>>>>> (LEAX CCD,PCR
>>>>>>>>>>>> ...
>>>>>>>>>>>> LDA ,X+
>>>>>>>>>>>> ...
>>>>>>>>>>>> ADDA ,X+)
>>>>>>>>>>>>
>>>>>>>>>>>> it works perfectly.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Feb 18, 2017 at 11:35 AM, William Astle <lost at l-w.ca>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Take a closer look. It only does the LSLA on every other
>>>>>>>>>>>>> digit. It
>>>>>>>>> does
>>>>>>>>>>>>> *two* digits per loop, just like Brett's version.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You can easily pretend all numbers are 16 digits by right
>>>>>>>>>>>>> justifying
>>>>>>>>> the
>>>>>>>>>>>>> numbers in your buffer and padding with zeros.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2017-02-18 10:06 AM, William Mikrut wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I like how this works from right to left.
>>>>>>>>>>>>>> The only issue is the LSLA on every number.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The algo is to double every other number, starting with the
>>>>>>>>>>>>>> right
>>>>>>>>> most
>>>>>>>>>>>>>> digit, and sub 9 if the result is 10 or more.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Now if the number is always 16 digits, Brett's 16 bit word seems
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> easiest way to go.
>>>>>>>>>>>>>> If the number is 13 digits long the 16 bit word method won't
>>>>>>>>>>>>>> work,
>>>>>>>>> but I
>>>>>>>>>>>>>> am
>>>>>>>>>>>>>> happy to pretend all numbers are 16 digits!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am going to try to include a couple things you showed me into
>>>>>>>>> Brett's
>>>>>>>>>>>>>> 16
>>>>>>>>>>>>>> bit chunk method and try a slightly different routine!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Feb 18, 2017 at 10:22 AM, William Astle <lost at l-w.ca>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2017-02-18 12:43 AM, msmcdoug wrote:
>>>>>>>>>>>>>>> Actually I'm surprised noone has suggested bcd arithmetic on
>>>>>>>>>>>>>>> the
>>>>>>>>> result
>>>>>>>>>>>>>>>> to eliminate divide by 10 loop
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> BCD would certainly give a predictable overall cycle count. It
>>>>>>>>>>>>>>> would
>>>>>>>>>>>>>>> require a significantly different approach, though. The only
>>>>>>>>> register
>>>>>>>>>>>>>>> you
>>>>>>>>>>>>>>> can use for BCD arithmetic is A and DAA is only useful after
>>>>>>>>>>>>>>> ADDA or
>>>>>>>>>>>>>>> ADCA.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I had thought about using BCD but had initially dismissed it
>>>>>>>>>>>>>>> due to
>>>>>>>>>>>>>>> possible complexity. However, upon reflection, the extra
>>>>>>>>>>>>>>> cycles to
>>>>>>>>> use
>>>>>>>>>>>>>>> BCD
>>>>>>>>>>>>>>> would probably be less than the average cycle time of the
>>>>>>>>>>>>>>> modulus
>>>>>>>>> loop
>>>>>>>>>>>>>>> combined or checking for digit overflow during the loop.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think you could use code that looks something like the
>>>>>>>>>>>>>>> following
>>>>>>>>> which
>>>>>>>>>>>>>>> is based off Mr. Mikrut's most recent posted code. (warning:
>>>>>>>>>>>>>>> mailer
>>>>>>>>>>>>>>> codeā¢
>>>>>>>>>>>>>>> follows so it may have errors)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ORG $1200
>>>>>>>>>>>>>>> CCD RMB 16
>>>>>>>>>>>>>>> RESULT RMB 1
>>>>>>>>>>>>>>> START LEAX CCD+16,PCR
>>>>>>>>>>>>>>> CLRA
>>>>>>>>>>>>>>> LDB #8
>>>>>>>>>>>>>>> LOOP PSHS A
>>>>>>>>>>>>>>> LDA ,-X
>>>>>>>>>>>>>>> LSLA
>>>>>>>>>>>>>>> CMPA #10
>>>>>>>>>>>>>>> BLO LOOP2
>>>>>>>>>>>>>>> SUBA #9
>>>>>>>>>>>>>>> LOOP2 ADDA ,S+
>>>>>>>>>>>>>>> DAA
>>>>>>>>>>>>>>> ADDA ,-X
>>>>>>>>>>>>>>> DAA
>>>>>>>>>>>>>>> DECB
>>>>>>>>>>>>>>> BNE LOOP
>>>>>>>>>>>>>>> ANDA #$0F
>>>>>>>>>>>>>>> STA RESULT,PCR
>>>>>>>>>>>>>>> ENDPGM RTS
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm using the stack for a temporary storage location instead of
>>>>>>>>>>>>>>> something
>>>>>>>>>>>>>>> PCR relative for code size reasons. You could use the "RESULT
>>>>>>>>> variable
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>> the temporary to eliminate stack usage. That would probably be
>>>>>>>>> slightly
>>>>>>>>>>>>>>> faster at the expense of two more code bytes. This is one of
>>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>> size/speed trade-offs.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> DAA has to be used after every addition and only applies to A.
>>>>>>>>> Using BCD
>>>>>>>>>>>>>>> means we can eliminate the mod 10 loop and just mask off the
>>>>>>>>>>>>>>> upper
>>>>>>>>> digit
>>>>>>>>>>>>>>> (BCD stores two decimal digits in a byte). That gives a
>>>>>>>>>>>>>>> constant
>>>>>>>>> time
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>> the "mod 10" result and also only takes 2 bytes (and 2 cycles).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have also eliminated the STATUS variable and just store the
>>>>>>>>> result.
>>>>>>>>>>>>>>> You
>>>>>>>>>>>>>>> can test RESULT for non-zero trivially so there's no need for a
>>>>>>>>> separate
>>>>>>>>>>>>>>> STATUS value.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> By my calculation, this version is 32 bytes, requires 1 byte of
>>>>>>>>> stack
>>>>>>>>>>>>>>> space, 17 bytes of data space, and runs in a maximum of 351
>>>>>>>>>>>>>>> cycles
>>>>>>>>> (and
>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>> minimum of 336 cycles if none of the doubled digits goes
>>>>>>>>>>>>>>> above 9).
>>>>>>>>> For
>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>> analysis, I've assumed 8 bit offsets for the PCR references. 16
>>>>>>>>>>>>>>> bit
>>>>>>>>>>>>>>> offsets
>>>>>>>>>>>>>>> in PCR mode are quite a bit more expensive (4 extra cycles
>>>>>>>>>>>>>>> and 1
>>>>>>>>> extra
>>>>>>>>>>>>>>> byte).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Coco mailing list
>>>>>>>>>>>>>>> Coco at maltedmedia.com
>>>>>>>>>>>>>>> https://pairlist5.pair.net/mailman/listinfo/coco
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Coco mailing list
>>>>>>>>>>>>> Coco at maltedmedia.com
>>>>>>>>>>>>> https://pairlist5.pair.net/mailman/listinfo/coco
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Coco mailing list
>>>>>>>>>>> Coco at maltedmedia.com
>>>>>>>>>>> https://pairlist5.pair.net/mailman/listinfo/coco
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Coco mailing list
>>>>>>>>>> Coco at maltedmedia.com
>>>>>>>>>> https://pairlist5.pair.net/mailman/listinfo/coco
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Coco mailing list
>>>>>>>>> Coco at maltedmedia.com
>>>>>>>>> https://pairlist5.pair.net/mailman/listinfo/coco
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> Coco mailing list
>> Coco at maltedmedia.com
>> https://pairlist5.pair.net/mailman/listinfo/coco
>>
>
>
> --
> Coco mailing list
> Coco at maltedmedia.com
> https://pairlist5.pair.net/mailman/listinfo/coco
More information about the Coco
mailing list