[Coco] Mod10 Suggestions

Sun Feb 19 00:02:46 EST 2017

Or maybe not, after all....

On 2/18/2017 10:57 PM, Dave Philipsen wrote:
> Yeah, I think the BNE is one less cycle if the branch isn't taken, right?
>
> Dave
>
>
> On 2/18/2017 10:53 PM, William Astle wrote:
>> It would be 8 BNEs actually. It's executed even for the last loop.
>>
>> BNE is 3 cycles and DECB is 2 cycles so 40 cycles total.
>>
>> You can also save a cycle for each "temporary" reference by just using
>> RESULT as the temporary instead of using the stack. It's one byte longer
>> but one cycle faster as long as RESULT is in range of an 8 bit offset
>> from PC. That would be 2 cycles gained per iteration for a total of 16
>> cycles. It's faster to use the stack if a PCR access to RESULT would
>> need a 16 bit offset.
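As a rough check of the arithmetic above, the totals can be tallied in a few lines of Python. The DECB and BNE counts are the ones quoted in the message. The figures for PSHS A, ADDA ,S+ and the PCR accesses are assumed from the standard MC6809 timing tables, and the names are only illustrative.

```python
# Cycle counts per instruction. DECB and BNE are quoted in the thread;
# the others are assumed from the standard MC6809 timing tables.
DECB = 2
BNE = 3            # short branch
PSHS_A = 6         # 5 + 1 per byte pushed
ADDA_S_POP = 6     # ADDA ,S+ (4 + 2 for the auto-increment)
PCR_8BIT = 5       # STA or ADDA RESULT,PCR with an 8 bit offset (4 + 1)
PCR_16BIT = 9      # the same access with a 16 bit offset (4 + 5)

ITERATIONS = 8


def loop_overhead():
    """Cycles spent on DECB/BNE across all iterations of the loop."""
    return ITERATIONS * (DECB + BNE)


def temp_savings(pcr_cost):
    """Cycles saved (negative means lost) by using RESULT, not the stack."""
    stack = PSHS_A + ADDA_S_POP      # one store and one add per iteration
    pcr = 2 * pcr_cost
    return ITERATIONS * (stack - pcr)


print(loop_overhead())           # 40
print(temp_savings(PCR_8BIT))    # 16
print(temp_savings(PCR_16BIT))   # -48
```

With those assumed timings the 16 bit offset case comes out 48 cycles slower than the stack, which agrees with the advice to keep the stack when RESULT is out of 8 bit range.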
>>
>>
>> On 2017-02-18 09:15 PM, Dave Philipsen wrote:
>>> How much speed would you gain by completely eliminating 8 DECBs and 7
>>> BNEs?:
>>>
>>>         ORG $1200
>>> CCD     RMB 16
>>> RESULT  RMB 1
>>>
>>> START   LEAX CCD+16,PCR
>>>         CLRA
>>>
>>> LOOP    ADDA ,-X
>>>         DAA
>>>         PSHS A
>>>         LDA ,-X
>>>         LSLA
>>>         CMPA #10
>>>         BLO LOOP2
>>>         SUBA #9
>>> LOOP2   ADDA ,S+
>>>         DAA
>>>
>>>         ADDA ,-X
>>>         DAA
>>>         PSHS A
>>>         LDA ,-X
>>>         LSLA
>>>         CMPA #10
>>>         BLO LOOP3
>>>         SUBA #9
>>> LOOP3   ADDA ,S+
>>>         DAA
>>>
>>>         ADDA ,-X
>>>         DAA
>>>         PSHS A
>>>         LDA ,-X
>>>         LSLA
>>>         CMPA #10
>>>         BLO LOOP4
>>>         SUBA #9
>>> LOOP4   ADDA ,S+
>>>         DAA
>>>
>>>         ADDA ,-X
>>>         DAA
>>>         PSHS A
>>>         LDA ,-X
>>>         LSLA
>>>         CMPA #10
>>>         BLO LOOP5
>>>         SUBA #9
>>> LOOP5   ADDA ,S+
>>>         DAA
>>>
>>>         ADDA ,-X
>>>         DAA
>>>         PSHS A
>>>         LDA ,-X
>>>         LSLA
>>>         CMPA #10
>>>         BLO LOOP6
>>>         SUBA #9
>>> LOOP6   ADDA ,S+
>>>         DAA
>>>
>>>         ADDA ,-X
>>>         DAA
>>>         PSHS A
>>>         LDA ,-X
>>>         LSLA
>>>         CMPA #10
>>>         BLO LOOP7
>>>         SUBA #9
>>> LOOP7   ADDA ,S+
>>>         DAA
>>>
>>>         ADDA ,-X
>>>         DAA
>>>         PSHS A
>>>         LDA ,-X
>>>         LSLA
>>>         CMPA #10
>>>         BLO LOOP8
>>>         SUBA #9
>>> LOOP8   ADDA ,S+
>>>         DAA
>>>
>>>         ADDA ,-X
>>>         DAA
>>>         PSHS A
>>>         LDA ,-X
>>>         LSLA
>>>         CMPA #10
>>>         BLO LOOP9
>>>         SUBA #9
>>> LOOP9   ADDA ,S+
>>>         DAA
>>>
>>>         ANDA #$0F
>>>         STA RESULT,PCR
>>> ENDPGM  RTS
>>>         END START
>>>
>>> On 2/18/2017 8:22 PM, William Mikrut wrote:
>>>> Which is the beauty of this project.
>>>>
>>>> Clearly there are at least 3 ways to do this...each with a slightly
>>>> different outcome.
>>>>
>>>> Some optimization for size, speed... or both.
>>>>
>>>> There is a wealth of information and experience here from everyone and I
>>>> truly appreciate all the input!
>>>>
>>>> I can't wait to start the next project and see where it leads!!
>>>>
>>>>
>>>>
>>>> On Feb 18, 2017 8:10 PM, "L. Curtis Boyle" <curtisboyle at sasktel.net>
>>>> wrote:
>>>>
>>>>> I was just going to mention that if speed is more important, doing an
>>>>> leas -1,s before the loop, then just sta ,s / adda ,s (instead of
>>>>> pshs a / adda ,s+), and a final leas 1,s after the loop is done would
>>>>> be a bit longer, but a bit faster.
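To put rough numbers on "a bit longer, but a bit faster", here is a small Python tally of the two ways of holding the temporary on the stack. The cycle and byte counts are assumed from the standard MC6809 tables (they are not stated in the message), and it models the 8-iteration looped version, so the loop body is counted once for size.

```python
ITERATIONS = 8

# (cycles, bytes) per instruction; assumed from standard MC6809 tables
PSHS_A = (6, 2)
ADDA_S_POP = (6, 2)   # ADDA ,S+
STA_S = (4, 2)        # STA ,S
ADDA_S = (4, 2)       # ADDA ,S
LEAS_5BIT = (5, 2)    # LEAS -1,S or LEAS 1,S


def totals(per_iteration, once):
    """Return (total cycles, code bytes) for the temporary handling only."""
    cycles = ITERATIONS * sum(c for c, _ in per_iteration)
    cycles += sum(c for c, _ in once)
    size = sum(b for _, b in per_iteration) + sum(b for _, b in once)
    return cycles, size


push_pop = totals([PSHS_A, ADDA_S_POP], [])
reserved = totals([STA_S, ADDA_S], [LEAS_5BIT, LEAS_5BIT])

print(push_pop)   # (96, 4)
print(reserved)   # (74, 8)
```

Under those assumptions the reserved slot costs 4 extra bytes and saves 22 cycles over the whole loop, after paying for the two LEAS instructions.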
>>>>>
>>>>> L. Curtis Boyle
>>>>> curtisboyle at sasktel.net
>>>>>
>>>>> TRS-80 Color Computer Games website
>>>>> http://www.lcurtisboyle.com/nitros9/coco_game_list.html
>>>>>
>>>>>
>>>>>
>>>>>> On Feb 18, 2017, at 7:41 PM, Dave Philipsen <dave at davebiz.com>
>>>>>> wrote:
>>>>>>
>>>>>> That's pretty well optimized!  Have you ever considered the difference
>>>>>> between optimizing for size and optimizing for speed?  So, for
>>>>>> instance, if you weren't necessarily constrained for size but you knew
>>>>>> you were going to process a list of jillions of cc numbers, would you
>>>>>> write it differently?
>>>>>>
>>>>>> Dave Philipsen
>>>>>>
>>>>>>> On Feb 18, 2017, at 5:06 PM, William Mikrut <wmikrut72 at gmail.com>
>>>>>>> wrote:
>>>>>>> Some slight reordering of the code and it works perfectly!
>>>>>>> 48 bytes total, less 17 for storage -- 31 program bytes to get the job
>>>>>>> done.
>>>>>>> My original code was 61 program bytes... down to half the size and it
>>>>>>> does the exact same thing.
>>>>>>> Absolutely amazing!
>>>>>>>
>>>>>>>
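>>>>>>>         ORG $1200
>>>>>>> CCD     RMB 16          ; 16 digits, one per byte, values 0-9
>>>>>>> RESULT  RMB 1           ; ends up 0 when the mod 10 check passes
>>>>>>>
>>>>>>> START   LEAX CCD+16,PCR ; X points just past the last digit
>>>>>>>         CLRA            ; A holds the running BCD sum
>>>>>>>         LDB #8          ; 8 iterations, two digits each
>>>>>>>
>>>>>>> LOOP    ADDA ,-X        ; add the next digit as-is
>>>>>>>         DAA             ; keep the sum in BCD
>>>>>>>         PSHS A          ; park the sum on the stack
>>>>>>>         LDA ,-X         ; fetch the digit to be doubled
>>>>>>>         LSLA            ; double it
>>>>>>>         CMPA #10
>>>>>>>         BLO LOOP2       ; below 10: use it unchanged
>>>>>>>         SUBA #9         ; 10 or more: subtract 9
>>>>>>> LOOP2   ADDA ,S+        ; add the parked sum back in
>>>>>>>         DAA
>>>>>>>
>>>>>>>         DECB
>>>>>>>         BNE LOOP
>>>>>>>
>>>>>>>         ANDA #$0F       ; low BCD digit is the sum mod 10
>>>>>>>         STA RESULT,PCR
>>>>>>> ENDPGM  RTS
>>>>>>>         END START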
>>>>>>>
>>>>>>>> On Sat, Feb 18, 2017 at 1:03 PM, William Mikrut
>>>>>>>> <wmikrut72 at gmail.com> wrote:
>>>>>>>> You are right -- I looked at it closer.
>>>>>>>> One thing I need to do is reverse the order of operations.
>>>>>>>>
>>>>>>>> The LSLA is performed first.
>>>>>>>> First I need to store the byte and LSLA the next byte.
>>>>>>>>
>>>>>>>> Otherwise if I flip it from left to right:
>>>>>>>> (LEAX CCD,PCR
>>>>>>>> ...
>>>>>>>> LDA ,X+
>>>>>>>> ...
>>>>>>>> ADDA ,X+)
>>>>>>>>
>>>>>>>> it works perfectly.
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Sat, Feb 18, 2017 at 11:35 AM, William Astle <lost at l-w.ca>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Take a closer look. It only does the LSLA on every other digit. It
>>>>>>>>> does *two* digits per loop, just like Brett's version.
>>>>>>>>>
>>>>>>>>> You can easily pretend all numbers are 16 digits by right justifying
>>>>>>>>> the numbers in your buffer and padding with zeros.
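The padding trick can be sketched in Python as below. The function name and the 16-entry default are only illustrative. A padded zero adds nothing to the sum whether or not it lands on a doubled position, so a routine that walks the buffer from the right sees the same total.

```python
def to_digit_buffer(number, width=16):
    """Return `width` digit values, right-justified and zero-padded."""
    if not (number.isascii() and number.isdigit()):
        raise ValueError("expected ASCII digits only")
    if len(number) > width:
        raise ValueError("number does not fit in the buffer")
    # rjust pads on the left, so the last digit stays in the last slot
    return [int(ch) for ch in number.rjust(width, "0")]


buf = to_digit_buffer("4222222222222")   # a 13 digit number
print(buf)
# [0, 0, 0, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
```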
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 2017-02-18 10:06 AM, William Mikrut wrote:
>>>>>>>>>>
>>>>>>>>>> I like how this works from right to left.
>>>>>>>>>> The only issue is the LSLA on every number.
>>>>>>>>>>
>>>>>>>>>> The algo is to double every other number, starting with the rightmost
>>>>>>>>>> digit, and sub 9 if the result is 10 or more.
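For reference, here is the check written out in plain Python. It is a sketch that assumes the rightmost digit is added as-is and the doubling starts with the second digit from the right, which is how the reordered routine posted in this thread behaves. The function names are made up for the example.

```python
def mod10_sum(digits):
    """Sum used by the mod 10 check, walking the digits right to left."""
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:        # every other digit, from second-rightmost
            digit *= 2
            if digit >= 10:          # same effect as adding the two digits
                digit -= 9
        total += digit
    return total


def passes_mod10(digits):
    """True when the sum is a multiple of 10."""
    return mod10_sum(digits) % 10 == 0


digits = [int(ch) for ch in "4111111111111111"]
print(mod10_sum(digits), passes_mod10(digits))   # 30 True
```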
>>>>>>>>>>
>>>>>>>>>> Now if the number is always 16 digits, Brett's 16 bit word seems the
>>>>>>>>>> easiest way to go.
>>>>>>>>>> If the number is 13 digits long the 16 bit word method won't work, but
>>>>>>>>>> I am happy to pretend all numbers are 16 digits!
>>>>>>>>>>
>>>>>>>>>> I am going to try to include a couple things you showed me into
>>>>>>>>>> Brett's 16 bit chunk method and try a slightly different routine!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Feb 18, 2017 at 10:22 AM, William Astle <lost at l-w.ca>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> On 2017-02-18 12:43 AM, msmcdoug wrote:
>>>>>>>>>>>> Actually I'm surprised no one has suggested BCD arithmetic on the
>>>>>>>>>>>> result to eliminate the divide by 10 loop
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>> BCD would certainly give a predictable overall cycle count. It would
>>>>>>>>>>> require a significantly different approach, though. The only register
>>>>>>>>>>> you can use for BCD arithmetic is A and DAA is only useful after ADDA
>>>>>>>>>>> or ADCA.
>>>>>>>>>>>
>>>>>>>>>>> I had thought about using BCD but had initially dismissed it due to
>>>>>>>>>>> possible complexity. However, upon reflection, the extra cycles to use
>>>>>>>>>>> BCD would probably be less than the average cycle time of the modulus
>>>>>>>>>>> loop combined or checking for digit overflow during the loop.
>>>>>>>>>>>
>>>>>>>>>>> I think you could use code that looks something like the following,
>>>>>>>>>>> which is based off Mr. Mikrut's most recent posted code. (warning:
>>>>>>>>>>> mailer code™ follows so it may have errors)
>>>>>>>>>>>
>>>>>>>>>>>         ORG $1200
>>>>>>>>>>> CCD     RMB 16
>>>>>>>>>>> RESULT  RMB 1
>>>>>>>>>>> START   LEAX CCD+16,PCR
>>>>>>>>>>>         CLRA
>>>>>>>>>>>         LDB #8
>>>>>>>>>>> LOOP    PSHS A
>>>>>>>>>>>         LDA ,-X
>>>>>>>>>>>         LSLA
>>>>>>>>>>>         CMPA #10
>>>>>>>>>>>         BLO LOOP2
>>>>>>>>>>>         SUBA #9
>>>>>>>>>>> LOOP2   ADDA ,S+
>>>>>>>>>>>         DAA
>>>>>>>>>>>         ADDA ,-X
>>>>>>>>>>>         DAA
>>>>>>>>>>>         DECB
>>>>>>>>>>>         BNE LOOP
>>>>>>>>>>>         ANDA #$0F
>>>>>>>>>>>         STA RESULT,PCR
>>>>>>>>>>> ENDPGM  RTS
>>>>>>>>>>>
>>>>>>>>>>> I'm using the stack for a temporary storage location instead of
>>>>>>>>>>> something PCR relative for code size reasons. You could use the
>>>>>>>>>>> "RESULT" variable for the temporary to eliminate stack usage. That
>>>>>>>>>>> would probably be slightly faster at the expense of two more code
>>>>>>>>>>> bytes. This is one of those size/speed trade-offs.
>>>>>>>>>>>
>>>>>>>>>>> DAA has to be used after every addition and only applies to A. Using
>>>>>>>>>>> BCD means we can eliminate the mod 10 loop and just mask off the upper
>>>>>>>>>>> digit (BCD stores two decimal digits in a byte). That gives a constant
>>>>>>>>>>> time for the "mod 10" result and also only takes 2 bytes (and 2
>>>>>>>>>>> cycles).
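The effect of DAA plus the final mask can be modelled in Python as below. This is only a sketch: the adda and daa helpers are hypothetical stand-ins that follow the 6809 rules as I understand them (correct the low nibble on half carry or a value above 9, correct the high nibble on carry or overflow past 9), not code from any emulator.

```python
def adda(a, operand):
    """8 bit add; returns (result, half_carry, carry) like ADDA."""
    total = a + operand
    half = ((a & 0x0F) + (operand & 0x0F)) > 0x0F
    return total & 0xFF, half, total > 0xFF


def daa(a, half, carry):
    """Decimal adjust after an add, modelled on the 6809 DAA rules."""
    lo, hi = a & 0x0F, a >> 4
    fix = 0
    if half or lo > 9:
        fix |= 0x06
    if carry or hi > 9 or (hi > 8 and lo > 9):
        fix |= 0x60
    return (a + fix) & 0xFF          # carry out of the top digit is dropped


def bcd_accumulate(addends):
    """Keep a running BCD sum in one byte, adjusting after every add."""
    a = 0
    for value in addends:
        a = daa(*adda(a, value))     # ADDA followed by DAA
    return a


addends = [9] * 16                   # worst case: every term is 9
a = bcd_accumulate(addends)
print(hex(a), a & 0x0F, sum(addends) % 10)   # 0x44 4 4
```

The sum of 144 no longer fits in two BCD digits, so the byte wraps to 44, but the low digit is still the sum mod 10, which is all the AND with $0F has to pick out.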
>>>>>>>>>>>
>>>>>>>>>>> I have also eliminated the STATUS variable and just store the result.
>>>>>>>>>>> You can test RESULT for non-zero trivially so there's no need for a
>>>>>>>>>>> separate STATUS value.
>>>>>>>>>>>
>>>>>>>>>>> By my calculation, this version is 32 bytes, requires 1 byte of stack
>>>>>>>>>>> space, 17 bytes of data space, and runs in a maximum of 351 cycles
>>>>>>>>>>> (and a minimum of 336 cycles if none of the doubled digits goes above
>>>>>>>>>>> 9). For this analysis, I've assumed 8 bit offsets for the PCR
>>>>>>>>>>> references. 16 bit offsets in PCR mode are quite a bit more expensive
>>>>>>>>>>> (4 extra cycles and 1 extra byte).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -- 
>>>>>>>>>>> Coco mailing list
>>>>>>>>>>> Coco at maltedmedia.com
>>>>>>>>>>> https://pairlist5.pair.net/mailman/listinfo/coco
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>>>>