[Coco] Mod10 Suggestions

Sun Feb 19 00:06:44 EST 2017

You're thinking of the long branches. The "short" branches are the same 
either way.
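
For comparison, a rough sketch of the two forms. The label and the
surrounding instructions are placeholders for illustration, and the timings
are the usual published 6809 figures, so they are worth checking against the
datasheet:

```asm
        ORG  $1200
LOOP    DECB
* Short conditional branch: 8 bit offset, target within -128..+127 bytes.
* 2 bytes, 3 cycles whether or not the branch is taken.
        BNE  LOOP
* Long conditional branch: 16 bit offset, reaches anywhere in memory.
* 4 bytes, 5 cycles if not taken, 6 cycles if taken.
        LBNE LOOP
        RTS
        END
```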

On 2017-02-18 10:02 PM, Dave Philipsen wrote:
> Or maybe not, after all....
>
> On 2/18/2017 10:57 PM, Dave Philipsen wrote:
>> Yeah, I think the BNE is one less cycle if the branch isn't taken, right?
>>
>> Dave
>>
>>
>> On 2/18/2017 10:53 PM, William Astle wrote:
>>> It would be 8 BNEs actually. It's executed even for the last loop.
>>>
>>> BNE is 3 cycles and DECB is 2 cycles so 40 cycles total.
>>>
>>> You can also save a cycle for each "temporary" reference by just
>>> using RESULT as the temporary instead of using the stack. It's one
>>> byte longer but one cycle faster as long as RESULT is in range of an
>>> 8 bit offset from PC. That would be 2 cycles gained per iteration for
>>> a total of 16 cycles. It's faster to use the stack if a PCR access to
>>> result would need a 16 bit offset.
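
A sketch of that change, for illustration only (not assembled or tested, and
assuming RESULT sits close enough to the code for an 8 bit PCR offset). It
uses the looped form of the routine for brevity; the same two substitutions
would apply to each block of the unrolled version:

```asm
        ORG  $1200
CCD     RMB  16
RESULT  RMB  1

START   LEAX CCD+16,PCR
        CLRA
        LDB  #8
LOOP    ADDA ,-X
        DAA
* Was PSHS A (2 bytes, 6 cycles); now 3 bytes, 5 cycles with an 8 bit offset.
        STA  RESULT,PCR
        LDA  ,-X
        LSLA
        CMPA #10
        BLO  LOOP2
        SUBA #9
* Was ADDA ,S+ (2 bytes, 6 cycles); now 3 bytes, 5 cycles.
LOOP2   ADDA RESULT,PCR
        DAA
        DECB
        BNE  LOOP
        ANDA #$0F
        STA  RESULT,PCR
ENDPGM  RTS
        END  START
```

RESULT is overwritten with the final check value at the end, so borrowing it
as scratch space inside the loop costs nothing extra.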
>>>
>>>
>>> On 2017-02-18 09:15 PM, Dave Philipsen wrote:
>>>> How much speed would you gain by completely eliminating 8 DECBs and 7
>>>> BNEs?:
>>>>
>>>>         ORG $1200
>>>> CCD     RMB 16
>>>> RESULT  RMB 1
>>>>
>>>> START   LEAX CCD+16,PCR
>>>>         CLRA
>>>>
>>>> LOOP    ADDA ,-X
>>>>         DAA
>>>>         PSHS A
>>>>         LDA ,-X
>>>>         LSLA
>>>>         CMPA #10
>>>>         BLO LOOP2
>>>>         SUBA #9
>>>> LOOP2   ADDA ,S+
>>>>         DAA
>>>>
>>>>         ADDA ,-X
>>>>         DAA
>>>>         PSHS A
>>>>         LDA ,-X
>>>>         LSLA
>>>>         CMPA #10
>>>>         BLO LOOP3
>>>>         SUBA #9
>>>> LOOP3   ADDA ,S+
>>>>         DAA
>>>>
>>>>         ADDA ,-X
>>>>         DAA
>>>>         PSHS A
>>>>         LDA ,-X
>>>>         LSLA
>>>>         CMPA #10
>>>>         BLO LOOP4
>>>>         SUBA #9
>>>> LOOP4   ADDA ,S+
>>>>         DAA
>>>>
>>>>         ADDA ,-X
>>>>         DAA
>>>>         PSHS A
>>>>         LDA ,-X
>>>>         LSLA
>>>>         CMPA #10
>>>>         BLO LOOP5
>>>>         SUBA #9
>>>> LOOP5   ADDA ,S+
>>>>         DAA
>>>>
>>>>         ADDA ,-X
>>>>         DAA
>>>>         PSHS A
>>>>         LDA ,-X
>>>>         LSLA
>>>>         CMPA #10
>>>>         BLO LOOP6
>>>>         SUBA #9
>>>> LOOP6   ADDA ,S+
>>>>         DAA
>>>>
>>>>         ADDA ,-X
>>>>         DAA
>>>>         PSHS A
>>>>         LDA ,-X
>>>>         LSLA
>>>>         CMPA #10
>>>>         BLO LOOP7
>>>>         SUBA #9
>>>> LOOP7   ADDA ,S+
>>>>         DAA
>>>>
>>>>         ADDA ,-X
>>>>         DAA
>>>>         PSHS A
>>>>         LDA ,-X
>>>>         LSLA
>>>>         CMPA #10
>>>>         BLO LOOP8
>>>>         SUBA #9
>>>> LOOP8   ADDA ,S+
>>>>         DAA
>>>>
>>>>         ADDA ,-X
>>>>         DAA
>>>>         PSHS A
>>>>         LDA ,-X
>>>>         LSLA
>>>>         CMPA #10
>>>>         BLO LOOP9
>>>>         SUBA #9
>>>> LOOP9   ADDA ,S+
>>>>         DAA
>>>>
>>>>         ANDA #$0F
>>>>         STA RESULT,PCR
>>>> ENDPGM  RTS
>>>>         END START
>>>>
>>>> On 2/18/2017 8:22 PM, William Mikrut wrote:
>>>>> Which is the beauty of this project.
>>>>>
>>>>> Clearly there are at least 3 ways to do this...each with a slightly
>>>>> different outcome.
>>>>>
>>>>> Some optimize for size, some for speed... or both.
>>>>>
>>>>> There is a wealth of information and experience here from everyone
>>>>> and I truly appreciate all the input!
>>>>>
>>>>> I can't wait to start the next project and see where it leads!!
>>>>>
>>>>>
>>>>>
>>>>> On Feb 18, 2017 8:10 PM, "L. Curtis Boyle" <curtisboyle at sasktel.net>
>>>>> wrote:
>>>>>
>>>>>> I was just going to mention that if speed is more important, doing an
>>>>>> leas -1,s before the loop, then just a sta ,s / adda ,s (instead of
>>>>>> pshs a / adda ,s+), and a final leas 1,s after the loop is done would
>>>>>> be a bit longer, but a bit faster.
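
A sketch of that suggestion, for illustration only (not assembled or tested),
applied to the looped routine posted earlier in the thread:

```asm
        ORG  $1200
CCD     RMB  16
RESULT  RMB  1

START   LEAX CCD+16,PCR
        CLRA
        LDB  #8
* Reserve one byte of stack once, outside the loop.
        LEAS -1,S
LOOP    ADDA ,-X
        DAA
* Was PSHS A: store to the reserved byte without moving S.
        STA  ,S
        LDA  ,-X
        LSLA
        CMPA #10
        BLO  LOOP2
        SUBA #9
* Was ADDA ,S+: read the reserved byte without moving S.
LOOP2   ADDA ,S
        DAA
        DECB
        BNE  LOOP
* Release the reserved byte before returning.
        LEAS 1,S
        ANDA #$0F
        STA  RESULT,PCR
ENDPGM  RTS
        END  START
```

By my count that is 4 bytes longer (the two LEAS instructions) and saves 4
cycles per pass, since STA ,S and ADDA ,S are 4 cycles each against 6 for
PSHS A and ADDA ,S+, less 10 cycles for the two LEAS.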
>>>>>>
>>>>>> L. Curtis Boyle
>>>>>> curtisboyle at sasktel.net
>>>>>>
>>>>>> TRS-80 Color Computer Games website
>>>>>> http://www.lcurtisboyle.com/nitros9/coco_game_list.html
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Feb 18, 2017, at 7:41 PM, Dave Philipsen <dave at davebiz.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> That's pretty well optimized!  Have you ever considered the
>>>>>>> difference between optimizing for size and optimizing for speed?
>>>>>>> So, for instance, if you weren't necessarily constrained for size
>>>>>>> but you knew you were going to process a list of jillions of cc
>>>>>>> numbers, would you write it differently?
>>>>>>>
>>>>>>> Dave Philipsen
>>>>>>>
>>>>>>>> On Feb 18, 2017, at 5:06 PM, William Mikrut <wmikrut72 at gmail.com>
>>>>>>>> wrote:
>>>>>>>> Some slight reordering of the code and it works perfectly!
>>>>>>>> 48 bytes total, less 17 for storage -- 31 program bytes to get the
>>>>>>>> job done.
>>>>>>>> My original code was 61 program bytes... down to half the size and
>>>>>>>> does the exact same thing.
>>>>>>>> Absolutely amazing!
>>>>>>>>
>>>>>>>>
>>>>>>>>        ORG $1200
>>>>>>>> CCD     RMB 16
>>>>>>>> RESULT  RMB 1
>>>>>>>>
>>>>>>>> START   LEAX CCD+16,PCR
>>>>>>>>        CLRA
>>>>>>>>        LDB #8
>>>>>>>>
>>>>>>>>
>>>>>>>> LOOP    ADDA ,-X
>>>>>>>>        DAA
>>>>>>>>        PSHS A
>>>>>>>>        LDA ,-X
>>>>>>>>        LSLA
>>>>>>>>        CMPA #10
>>>>>>>>        BLO LOOP2
>>>>>>>>        SUBA #9
>>>>>>>> LOOP2   ADDA ,S+
>>>>>>>>        DAA
>>>>>>>>
>>>>>>>>        DECB
>>>>>>>>        BNE LOOP
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>        ANDA #$0F
>>>>>>>>        STA RESULT,PCR
>>>>>>>> ENDPGM  RTS
>>>>>>>>        END START
>>>>>>>>
>>>>>>>>> On Sat, Feb 18, 2017 at 1:03 PM, William Mikrut
>>>>>>>>> <wmikrut72 at gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> You are right -- I looked at it more closely.
>>>>>>>>> One thing I need to do is reverse the order of operations.
>>>>>>>>>
>>>>>>>>> The LSLA is performed first.
>>>>>>>>> First I need to store the byte and LSLA the next byte.
>>>>>>>>>
>>>>>>>>> Otherwise if I flip it from left to right:
>>>>>>>>> (LEAX CCD,PCR
>>>>>>>>> ...
>>>>>>>>> LDA ,X+
>>>>>>>>> ...
>>>>>>>>> ADDA ,X+)
>>>>>>>>>
>>>>>>>>> it works perfectly.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Sat, Feb 18, 2017 at 11:35 AM, William Astle <lost at l-w.ca>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Take a closer look. It only does the LSLA on every other digit. It
>>>>>>>>>> does *two* digits per loop, just like Brett's version.
>>>>>>>>>>
>>>>>>>>>> You can easily pretend all numbers are 16 digits by right
>>>>>>>>>> justifying the numbers in your buffer and padding with zeros.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 2017-02-18 10:06 AM, William Mikrut wrote:
>>>>>>>>>>>
>>>>>>>>>>> I like how this works from right to left.
>>>>>>>>>>> The only issue is the LSLA on every number.
>>>>>>>>>>>
>>>>>>>>>>> The algo is to double every other number, starting with the
>>>>>>>>>>> rightmost digit, and sub 9 if the result is 10 or more.
>>>>>>>>>>>
>>>>>>>>>>> Now if the number is always 16 digits, Brett's 16 bit word seems
>>>>>>>>>>> the easiest way to go.
>>>>>>>>>>> If the number is 13 digits long the 16 bit word method won't
>>>>>>>>>>> work, but I am happy to pretend all numbers are 16 digits!
>>>>>>>>>>>
>>>>>>>>>>> I am going to try to include a couple things you showed me into
>>>>>>>>>>> Brett's 16 bit chunk method and try a slightly different routine!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Feb 18, 2017 at 10:22 AM, William Astle <lost at l-w.ca>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 2017-02-18 12:43 AM, msmcdoug wrote:
>>>>>>>>>>>>> Actually I'm surprised no one has suggested BCD arithmetic on
>>>>>>>>>>>>> the result to eliminate the divide by 10 loop
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> BCD would certainly give a predictable overall cycle count. It
>>>>>>>>>>>> would require a significantly different approach, though. The
>>>>>>>>>>>> only register you can use for BCD arithmetic is A and DAA is
>>>>>>>>>>>> only useful after ADDA or ADCA.
>>>>>>>>>>>>
>>>>>>>>>>>> I had thought about using BCD but had initially dismissed it
>>>>>>>>>>>> due to possible complexity. However, upon reflection, the extra
>>>>>>>>>>>> cycles to use BCD would probably be fewer than the average
>>>>>>>>>>>> cycle time of the modulus loop, or of checking for digit
>>>>>>>>>>>> overflow during the loop.
>>>>>>>>>>>>
>>>>>>>>>>>> I think you could use code that looks something like the
>>>>>>>>>>>> following, which is based off Mr. Mikrut's most recent posted
>>>>>>>>>>>> code. (warning: mailer code™ follows so it may have errors)
>>>>>>>>>>>>
>>>>>>>>>>>>        ORG $1200
>>>>>>>>>>>> CCD     RMB 16
>>>>>>>>>>>> RESULT  RMB 1
>>>>>>>>>>>> START   LEAX CCD+16,PCR
>>>>>>>>>>>>        CLRA
>>>>>>>>>>>>        LDB #8
>>>>>>>>>>>> LOOP    PSHS A
>>>>>>>>>>>>        LDA ,-X
>>>>>>>>>>>>        LSLA
>>>>>>>>>>>>        CMPA #10
>>>>>>>>>>>>        BLO LOOP2
>>>>>>>>>>>>        SUBA #9
>>>>>>>>>>>> LOOP2   ADDA ,S+
>>>>>>>>>>>>        DAA
>>>>>>>>>>>>        ADDA ,-X
>>>>>>>>>>>>        DAA
>>>>>>>>>>>>        DECB
>>>>>>>>>>>>        BNE LOOP
>>>>>>>>>>>>        ANDA #$0F
>>>>>>>>>>>>        STA RESULT,PCR
>>>>>>>>>>>> ENDPGM  RTS
>>>>>>>>>>>>
>>>>>>>>>>>> I'm using the stack for a temporary storage location instead of
>>>>>>>>>>>> something PCR relative for code size reasons. You could use the
>>>>>>>>>>>> RESULT variable for the temporary to eliminate stack usage.
>>>>>>>>>>>> That would probably be slightly faster at the expense of two
>>>>>>>>>>>> more code bytes. This is one of those size/speed trade-offs.
>>>>>>>>>>>>
>>>>>>>>>>>> DAA has to be used after every addition and only applies to A.
>>>>>>>>>>>> Using BCD means we can eliminate the mod 10 loop and just mask
>>>>>>>>>>>> off the upper digit (BCD stores two decimal digits in a byte).
>>>>>>>>>>>> That gives a constant time for the "mod 10" result and also
>>>>>>>>>>>> only takes 2 bytes (and 2 cycles).
>>>>>>>>>>>>
>>>>>>>>>>>> I have also eliminated the STATUS variable and just store the
>>>>>>>>>>>> result. You can test RESULT for non-zero trivially so there's
>>>>>>>>>>>> no need for a separate STATUS value.
>>>>>>>>>>>>
>>>>>>>>>>>> By my calculation, this version is 32 bytes, requires 1 byte of
>>>>>>>>>>>> stack space, 17 bytes of data space, and runs in a maximum of
>>>>>>>>>>>> 351 cycles (and a minimum of 336 cycles if none of the doubled
>>>>>>>>>>>> digits goes above 9). For this analysis, I've assumed 8 bit
>>>>>>>>>>>> offsets for the PCR references. 16 bit offsets in PCR mode are
>>>>>>>>>>>> quite a bit more expensive (4 extra cycles and 1 extra byte).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Coco mailing list
>>>>>>>>>>>> Coco at maltedmedia.com
>>>>>>>>>>>> https://pairlist5.pair.net/mailman/listinfo/coco