[Coco] Mod10 Suggestions

Sat Feb 18 23:15:15 EST 2017

How much speed would you gain by completely eliminating 8 DECBs and 7 BNEs?

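         ORG $1200
CCD     RMB 16
RESULT  RMB 1

START   LEAX CCD+16,PCR
         CLRA

* Each stage handles two digits, right to left: add one digit as-is,
* then double the next one and subtract 9 if the result is 10 or more.
LOOP    ADDA ,-X
         DAA
         PSHS A
         LDA ,-X
         LSLA
         CMPA #10
         BLO LOOP2
         SUBA #9
LOOP2   ADDA ,S+
         DAA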

         ADDA ,-X
         DAA
         PSHS A
         LDA ,-X
         LSLA
         CMPA #10
         BLO LOOP3
         SUBA #9
LOOP3   ADDA ,S+
         DAA

         ADDA ,-X
         DAA
         PSHS A
         LDA ,-X
         LSLA
         CMPA #10
         BLO LOOP4
         SUBA #9
LOOP4   ADDA ,S+
         DAA

         ADDA ,-X
         DAA
         PSHS A
         LDA ,-X
         LSLA
         CMPA #10
         BLO LOOP5
         SUBA #9
LOOP5   ADDA ,S+
         DAA

         ADDA ,-X
         DAA
         PSHS A
         LDA ,-X
         LSLA
         CMPA #10
         BLO LOOP6
         SUBA #9
LOOP6   ADDA ,S+
         DAA

         ADDA ,-X
         DAA
         PSHS A
         LDA ,-X
         LSLA
         CMPA #10
         BLO LOOP7
         SUBA #9
LOOP7   ADDA ,S+
         DAA

         ADDA ,-X
         DAA
         PSHS A
         LDA ,-X
         LSLA
         CMPA #10
         BLO LOOP8
         SUBA #9
LOOP8   ADDA ,S+
         DAA

         ADDA ,-X
         DAA
         PSHS A
         LDA ,-X
         LSLA
         CMPA #10
         BLO LOOP9
         SUBA #9
LOOP9   ADDA ,S+
         DAA

         ANDA #$0F
         STA RESULT,PCR
ENDPGM  RTS
         END START
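
As a rough answer, here is a back-of-the-envelope tally in Python (not CoCo code). The cycle counts are my assumptions from the MC6809 instruction tables: indexed auto-increment/decrement by one costs 2 extra cycles, and a short branch costs 3 cycles whether taken or not. The names CYCLES, BODY and LOOP_CONTROL are made up for this sketch. It counts eight executed BNEs, because the final fall-through one still costs cycles.

```python
# Assumed MC6809 cycle counts (worst case: SUBA #9 is executed every pass).
CYCLES = {"LDB #8": 2, "DECB": 2, "BNE": 3}

# One two-digit stage, as written in the listing above.
BODY = [
    ("ADDA ,-X", 6), ("DAA", 2), ("PSHS A", 6), ("LDA ,-X", 6),
    ("LSLA", 2), ("CMPA #10", 2), ("BLO", 3), ("SUBA #9", 2),
    ("ADDA ,S+", 6), ("DAA", 2),
]
STAGES = 8
LOOP_CONTROL = CYCLES["DECB"] + CYCLES["BNE"]  # paid once per pass

body_cycles = sum(cycles for _, cycles in BODY)
unrolled = STAGES * body_cycles
looped = CYCLES["LDB #8"] + STAGES * (body_cycles + LOOP_CONTROL)
saved = looped - unrolled

print("looped:", looped, "unrolled:", unrolled, "saved:", saved)
print("saving: %.1f%%" % (100.0 * saved / looped))
```

Under those assumptions the unrolled version saves 42 cycles in the worst case, roughly an eighth of the loop time, at the cost of the extra code bytes.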

On 2/18/2017 8:22 PM, William Mikrut wrote:
> Which is the beauty of this project.
>
> Clearly there are at least 3 ways to do this...each with a slightly
> different outcome.
>
> Some optimization for size, speed... or both.
>
> There is a wealth of information and experience here from everyone and I
> truly appreciate all the input!
>
> I can't wait to start the next project and see where it leads!!
>
>
>
> On Feb 18, 2017 8:10 PM, "L. Curtis Boyle" <curtisboyle at sasktel.net> wrote:
>
>> I was just going to mention that if speed is more important, doing an leas
>> -1,s before the loop, and then just a sta ,s / adda ,s (instead of pshs
>> a / adda ,s+), and then a final leas 1,s after the loop is done would be a bit
>> longer, but a bit faster.
>>
>> L. Curtis Boyle
>> curtisboyle at sasktel.net
>>
>> TRS-80 Color Computer Games website
>> http://www.lcurtisboyle.com/nitros9/coco_game_list.html
>>
>>
>>
>>> On Feb 18, 2017, at 7:41 PM, Dave Philipsen <dave at davebiz.com> wrote:
>>>
>>> That's pretty well optimized!  Have you ever considered the difference
>>> between optimizing for size and optimizing for speed?  So, for instance, if
>>> you weren't necessarily constrained for size but you knew you were going to
>>> process a list of jillions of cc numbers, would you write it differently?
>>>
>>> Dave Philipsen
>>>
>>>> On Feb 18, 2017, at 5:06 PM, William Mikrut <wmikrut72 at gmail.com> wrote:
>>>> Some slight reordering of the code and it works perfectly!
>>>> 48 bytes total, less 17 for storage -- 31 program bytes to get the job done.
>>>> My original code was 61 program bytes... down to half the size and it does
>>>> the exact same thing.
>>>> Absolutely amazing!
>>>>
>>>>
>>>>        ORG $1200
>>>> CCD     RMB 16
>>>> RESULT  RMB 1
>>>>
>>>> START   LEAX CCD+16,PCR
>>>>        CLRA
>>>>        LDB #8
>>>>
>>>>
>>>> LOOP    ADDA ,-X
>>>>        DAA
>>>>        PSHS A
>>>>        LDA ,-X
>>>>        LSLA
>>>>        CMPA #10
>>>>        BLO LOOP2
>>>>        SUBA #9
>>>> LOOP2   ADDA ,S+
>>>>        DAA
>>>>
>>>>        DECB
>>>>        BNE LOOP
>>>>
>>>>
>>>>
>>>>        ANDA #$0F
>>>>        STA RESULT,PCR
>>>> ENDPGM  RTS
>>>>        END START
>>>>
>>>>> On Sat, Feb 18, 2017 at 1:03 PM, William Mikrut <wmikrut72 at gmail.com> wrote:
>>>>> You are right -- I looked at it closer.
>>>>> One thing I need to do is reverse the order of operations.
>>>>>
>>>>> The LSLA is performed first.
>>>>> First I need to store the byte and LSLA the next byte.
>>>>>
>>>>> Otherwise if I flip it from left to right:
>>>>> (LEAX CCD,PCR
>>>>> ...
>>>>> LDA ,X+
>>>>> ...
>>>>> ADDA ,X+)
>>>>>
>>>>> it works perfectly.
>>>>>
>>>>>
>>>>>> On Sat, Feb 18, 2017 at 11:35 AM, William Astle <lost at l-w.ca> wrote:
>>>>>>
>>>>>> Take a closer look. It only does the LSLA on every other digit. It does
>>>>>> *two* digits per loop, just like Brett's version.
>>>>>>
>>>>>> You can easily pretend all numbers are 16 digits by right justifying the
>>>>>> numbers in your buffer and padding with zeros.
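
The padding idea can be checked with a small model in Python (a sketch, not the 6809 routine; luhn_remainder and pad16 are names invented here). It doubles every second digit counting from the right, as the posted code does, and suggests that leading zeros do not change the result. The 11-digit value is just a test number that works out as valid by hand.

```python
def luhn_remainder(digits):
    """Return the mod 10 check sum for a list of decimal digits (0 = valid)."""
    total = 0
    # Walk right to left; position 0 is the rightmost digit.
    for pos, d in enumerate(reversed(digits)):
        if pos % 2 == 1:      # every second digit from the right is doubled
            d *= 2
            if d >= 10:       # same as adding the two digits of the product
                d -= 9
        total += d
    return total % 10


def pad16(digits):
    """Right-justify the digits in a 16-entry buffer, padding with zeros."""
    return [0] * (16 - len(digits)) + list(digits)


short = [int(c) for c in "79927398713"]
# Both values are 0 for this valid test number.
print(luhn_remainder(short), luhn_remainder(pad16(short)))
```

The zeros contribute nothing to the sum, and because they are added on the left, the digits keep their positions relative to the right-hand end.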
>>>>>>
>>>>>>
>>>>>>> On 2017-02-18 10:06 AM, William Mikrut wrote:
>>>>>>>
>>>>>>> I like how this works from right to left.
>>>>>>> The only issue is the LSLA on every number.
>>>>>>>
>>>>>>> The algo is to double every other number, starting with the rightmost
>>>>>>> digit, and sub 9 if the result is 10 or more.
>>>>>>>
>>>>>>> Now if the number is always 16 digits, Brett's 16 bit word seems the
>>>>>>> easiest way to go.
>>>>>>> If the number is 13 digits long the 16 bit word method won't work, but I
>>>>>>> am happy to pretend all numbers are 16 digits!
>>>>>>>
>>>>>>> I am going to try to include a couple things you showed me into Brett's
>>>>>>> 16 bit chunk method and try a slightly different routine!
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Feb 18, 2017 at 10:22 AM, William Astle <lost at l-w.ca> wrote:
>>>>>>>
>>>>>>> On 2017-02-18 12:43 AM, msmcdoug wrote:
>>>>>>>>> Actually I'm surprised no one has suggested BCD arithmetic on the result
>>>>>>>>> to eliminate the divide-by-10 loop
>>>>>>>>>
>>>>>>>>>
>>>>>>>> BCD would certainly give a predictable overall cycle count. It would
>>>>>>>> require a significantly different approach, though. The only register you
>>>>>>>> can use for BCD arithmetic is A and DAA is only useful after ADDA or ADCA.
>>>>>>>>
>>>>>>>> I had thought about using BCD but had initially dismissed it due to
>>>>>>>> possible complexity. However, upon reflection, the extra cycles to use BCD
>>>>>>>> would probably be less than the average cycle time of the modulus loop
>>>>>>>> combined or checking for digit overflow during the loop.
>>>>>>>>
>>>>>>>> I think you could use code that looks something like the following, which
>>>>>>>> is based off Mr. Mikrut's most recent posted code. (Warning: mailer code™
>>>>>>>> follows, so it may have errors.)
>>>>>>>>
>>>>>>>>        ORG $1200
>>>>>>>> CCD     RMB 16
>>>>>>>> RESULT  RMB 1
>>>>>>>> START   LEAX CCD+16,PCR
>>>>>>>>        CLRA
>>>>>>>>        LDB #8
>>>>>>>> LOOP    PSHS A
>>>>>>>>        LDA ,-X
>>>>>>>>        LSLA
>>>>>>>>        CMPA #10
>>>>>>>>        BLO LOOP2
>>>>>>>>        SUBA #9
>>>>>>>> LOOP2   ADDA ,S+
>>>>>>>>        DAA
>>>>>>>>        ADDA ,-X
>>>>>>>>        DAA
>>>>>>>>        DECB
>>>>>>>>        BNE LOOP
>>>>>>>>        ANDA #$0F
>>>>>>>>        STA RESULT,PCR
>>>>>>>> ENDPGM  RTS
>>>>>>>>
>>>>>>>> I'm using the stack for a temporary storage location instead of something
>>>>>>>> PCR relative for code size reasons. You could use the "RESULT" variable for
>>>>>>>> the temporary to eliminate stack usage. That would probably be slightly
>>>>>>>> faster at the expense of two more code bytes. This is one of those
>>>>>>>> size/speed trade-offs.
>>>>>>>>
>>>>>>>> DAA has to be used after every addition and only applies to A. Using BCD
>>>>>>>> means we can eliminate the mod 10 loop and just mask off the upper digit
>>>>>>>> (BCD stores two decimal digits in a byte). That gives a constant time for
>>>>>>>> the "mod 10" result and also only takes 2 bytes (and 2 cycles).
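
To see why masking the low nibble works, here is a Python model of ADDA followed by DAA (a sketch under my reading of the 6809 DAA rules; adda_daa and mod10_bcd are hypothetical names, and the input is assumed to be 16 digits stored one per byte). The BCD sum can wrap past 99, but wrapping drops a multiple of 100, so the low digit is still the sum modulo 10.

```python
def adda_daa(a, operand):
    """Model ADDA followed by DAA on an 8-bit packed-BCD accumulator."""
    raw = a + operand
    half_carry = (a & 0x0F) + (operand & 0x0F) > 0x0F
    carry = raw > 0xFF
    result = raw & 0xFF
    low, high = result & 0x0F, result >> 4
    correction = 0
    if half_carry or low > 9:
        correction |= 0x06          # fix up the low decimal digit
    if carry or high > 9 or (high > 8 and low > 9):
        correction |= 0x60          # fix up the high decimal digit
    return (result + correction) & 0xFF


def mod10_bcd(digits):
    """Mirror the posted routine: two digits per pass, right to left."""
    assert len(digits) % 2 == 0, "sketch assumes an even digit count"
    a = 0
    for i in range(len(digits) - 1, 0, -2):
        a = adda_daa(a, digits[i])      # ADDA ,-X / DAA
        doubled = digits[i - 1] << 1    # LDA ,-X / LSLA
        if doubled >= 10:               # CMPA #10 / BLO
            doubled -= 9                # SUBA #9
        a = adda_daa(doubled, a)        # ADDA ,S+ / DAA
    return a & 0x0F                     # ANDA #$0F


print(mod10_bcd([4] + [1] * 15))
```

The final mask is all that remains of the "mod 10" step, which is where the constant 2 bytes and 2 cycles come from.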
>>>>>>>>
>>>>>>>> I have also eliminated the STATUS variable and just store the result. You
>>>>>>>> can test RESULT for non-zero trivially so there's no need for a separate
>>>>>>>> STATUS value.
>>>>>>>>
>>>>>>>> By my calculation, this version is 32 bytes, requires 1 byte of stack
>>>>>>>> space, 17 bytes of data space, and runs in a maximum of 351 cycles (and a
>>>>>>>> minimum of 336 cycles if none of the doubled digits goes above 9). For this
>>>>>>>> analysis, I've assumed 8 bit offsets for the PCR references. 16 bit offsets
>>>>>>>> in PCR mode are quite a bit more expensive (4 extra cycles and 1 extra
>>>>>>>> byte).
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Coco mailing list
>>>>>>>> Coco at maltedmedia.com
>>>>>>>> https://pairlist5.pair.net/mailman/listinfo/coco
>>>>>>>>
>>>>>>>>