[Coco] Mod10 Suggestions
Dave Philipsen
dave at davebiz.com
Sat Feb 18 23:57:38 EST 2017
Yeah, I think the BNE is one less cycle if the branch isn't taken, right?
Dave
On 2/18/2017 10:53 PM, William Astle wrote:
> It would be 8 BNEs actually. It's executed even for the last loop.
>
> BNE is 3 cycles and DECB is 2 cycles so 40 cycles total.
>
> You can also save a cycle for each "temporary" reference by just using
> RESULT as the temporary instead of using the stack. It's one byte
> longer but one cycle faster as long as RESULT is in range of an 8 bit
> offset from PC. That would be 2 cycles gained per iteration for a
> total of 16 cycles. It's faster to use the stack if a PCR access to
> result would need a 16 bit offset.
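
To put numbers on the two savings above, here is a quick tally in Python (not 6809 code). It uses only the cycle figures quoted in this message; the variable names are made up for the illustration.

```python
# Cycle figures as quoted in the message above.
BNE_CYCLES = 3    # short BNE
DECB_CYCLES = 2   # DECB
ITERATIONS = 8    # the loop body runs 8 times, two digits per pass

# Loop overhead that unrolling removes: DECB and BNE run on every pass,
# including the last one.
loop_overhead = ITERATIONS * (BNE_CYCLES + DECB_CYCLES)

# Using RESULT as the temporary saves one cycle on the store and one on
# the add, so two cycles per pass (assuming an 8 bit PCR offset).
temp_savings = ITERATIONS * 2

print(loop_overhead, temp_savings)  # 40 16
```

The two savings are independent, so an unrolled version that also uses RESULT as the temporary would gain both.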
>
>
> On 2017-02-18 09:15 PM, Dave Philipsen wrote:
>> How much speed would you gain by completely eliminating 8 DECBs and 7
>> BNEs?:
>>
>>         ORG   $1200
>> CCD     RMB   16
>> RESULT  RMB   1
>>
>> START   LEAX  CCD+16,PCR
>>         CLRA
>>
>> LOOP    ADDA  ,-X
>>         DAA
>>         PSHS  A
>>         LDA   ,-X
>>         LSLA
>>         CMPA  #10
>>         BLO   LOOP2
>>         SUBA  #9
>> LOOP2   ADDA  ,S+
>>         DAA
>>
>>         ADDA  ,-X
>>         DAA
>>         PSHS  A
>>         LDA   ,-X
>>         LSLA
>>         CMPA  #10
>>         BLO   LOOP3
>>         SUBA  #9
>> LOOP3   ADDA  ,S+
>>         DAA
>>
>>         ADDA  ,-X
>>         DAA
>>         PSHS  A
>>         LDA   ,-X
>>         LSLA
>>         CMPA  #10
>>         BLO   LOOP4
>>         SUBA  #9
>> LOOP4   ADDA  ,S+
>>         DAA
>>
>>         ADDA  ,-X
>>         DAA
>>         PSHS  A
>>         LDA   ,-X
>>         LSLA
>>         CMPA  #10
>>         BLO   LOOP5
>>         SUBA  #9
>> LOOP5   ADDA  ,S+
>>         DAA
>>
>>         ADDA  ,-X
>>         DAA
>>         PSHS  A
>>         LDA   ,-X
>>         LSLA
>>         CMPA  #10
>>         BLO   LOOP6
>>         SUBA  #9
>> LOOP6   ADDA  ,S+
>>         DAA
>>
>>         ADDA  ,-X
>>         DAA
>>         PSHS  A
>>         LDA   ,-X
>>         LSLA
>>         CMPA  #10
>>         BLO   LOOP7
>>         SUBA  #9
>> LOOP7   ADDA  ,S+
>>         DAA
>>
>>         ADDA  ,-X
>>         DAA
>>         PSHS  A
>>         LDA   ,-X
>>         LSLA
>>         CMPA  #10
>>         BLO   LOOP8
>>         SUBA  #9
>> LOOP8   ADDA  ,S+
>>         DAA
>>
>>         ADDA  ,-X
>>         DAA
>>         PSHS  A
>>         LDA   ,-X
>>         LSLA
>>         CMPA  #10
>>         BLO   LOOP9
>>         SUBA  #9
>> LOOP9   ADDA  ,S+
>>         DAA
>>
>>         ANDA  #$0F
>>         STA   RESULT,PCR
>> ENDPGM  RTS
>>         END   START
>>
>> On 2/18/2017 8:22 PM, William Mikrut wrote:
>>> Which is the beauty of this project.
>>>
>>> Clearly there are at least 3 ways to do this...each with a slightly
>>> different outcome.
>>>
>>> Some optimization for size, speed... or both.
>>>
>>> There is a wealth of information and experience here from everyone and I
>>> truly appreciate all the input!
>>>
>>> I can't wait to start the next project and see where it leads!!
>>>
>>>
>>>
>>> On Feb 18, 2017 8:10 PM, "L. Curtis Boyle" <curtisboyle at sasktel.net>
>>> wrote:
>>>
>>>> I was just going to mention that if speed is more important, doing an
>>>> leas -1,s before the loop, then just a sta ,s / adda ,s (instead of
>>>> pshs a / adda ,s+), and a final leas 1,s after the loop is done would
>>>> be a bit longer, but a bit faster.
>>>>
>>>> L. Curtis Boyle
>>>> curtisboyle at sasktel.net
>>>>
>>>> TRS-80 Color Computer Games website
>>>> http://www.lcurtisboyle.com/nitros9/coco_game_list.html
>>>>
>>>>
>>>>
>>>>> On Feb 18, 2017, at 7:41 PM, Dave Philipsen <dave at davebiz.com> wrote:
>>>>>
>>>>> That's pretty well optimized! Have you ever considered the difference
>>>>> between optimizing for size and optimizing for speed? So, for instance,
>>>>> if you weren't necessarily constrained for size but you knew you were
>>>>> going to process a list of jillions of cc numbers, would you write it
>>>>> differently?
>>>>> Dave Philipsen
>>>>>
>>>>>> On Feb 18, 2017, at 5:06 PM, William Mikrut <wmikrut72 at gmail.com> wrote:
>>>>>> Some slight reordering of the code and it works perfectly!
>>>>>> 48 bytes total, less 17 for storage -- 31 program bytes to get the job done.
>>>>>> My original code was 61 program bytes... down to half the size and it does
>>>>>> the exact same thing.
>>>>>> Absolutely amazing!
>>>>>>
>>>>>>
>>>>>>         ORG   $1200
>>>>>> CCD     RMB   16
>>>>>> RESULT  RMB   1
>>>>>>
>>>>>> START   LEAX  CCD+16,PCR
>>>>>>         CLRA
>>>>>>         LDB   #8
>>>>>>
>>>>>> LOOP    ADDA  ,-X
>>>>>>         DAA
>>>>>>         PSHS  A
>>>>>>         LDA   ,-X
>>>>>>         LSLA
>>>>>>         CMPA  #10
>>>>>>         BLO   LOOP2
>>>>>>         SUBA  #9
>>>>>> LOOP2   ADDA  ,S+
>>>>>>         DAA
>>>>>>
>>>>>>         DECB
>>>>>>         BNE   LOOP
>>>>>>
>>>>>>         ANDA  #$0F
>>>>>>         STA   RESULT,PCR
>>>>>> ENDPGM  RTS
>>>>>>         END   START
>>>>>>
>>>>>>> On Sat, Feb 18, 2017 at 1:03 PM, William Mikrut <wmikrut72 at gmail.com> wrote:
>>>>>>> You are right -- I looked at it closer.
>>>>>>> One thing I need to do is reverse the order of operations.
>>>>>>>
>>>>>>> The LSLA is performed first.
>>>>>>> First I need to store the byte and LSLA the next byte.
>>>>>>>
>>>>>>> Otherwise if I flip it from left to right:
>>>>>>> (LEAX CCD,PCR
>>>>>>> ...
>>>>>>> LDA ,X+
>>>>>>> ...
>>>>>>> ADDA ,X+)
>>>>>>>
>>>>>>> it works perfectly.
>>>>>>>
>>>>>>>
>>>>>>>> On Sat, Feb 18, 2017 at 11:35 AM, William Astle <lost at l-w.ca>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Take a closer look. It only does the LSLA on every other digit. It does
>>>>>>>> *two* digits per loop, just like Brett's version.
>>>>>>>>
>>>>>>>> You can easily pretend all numbers are 16 digits by right justifying the
>>>>>>>> numbers in your buffer and padding with zeros.
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 2017-02-18 10:06 AM, William Mikrut wrote:
>>>>>>>>>
>>>>>>>>> I like how this works from right to left.
>>>>>>>>> The only issue is the LSLA on every number.
>>>>>>>>>
>>>>>>>>> The algo is to double every other number, starting with the right most
>>>>>>>>> digit, and sub 9 if the result is 10 or more.
>>>>>>>>>
>>>>>>>>> Now if the number is always 16 digits, Brett's 16 bit word seems the
>>>>>>>>> easiest way to go.
>>>>>>>>> If the number is 13 digits long the 16 bit word method won't work, but I
>>>>>>>>> am happy to pretend all numbers are 16 digits!
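
For reference, the check being described can be sketched in Python as below. This is a plain-arithmetic model, not the 6809 routine. It assumes the usual Luhn convention: the rightmost digit is added as-is and doubling starts with the second digit from the right, which matches the ADDA ,-X / LDA ,-X / LSLA ordering in the revised routine. The function name and the `width` parameter are hypothetical, and the card numbers are just sample inputs.

```python
def luhn_sum_mod10(number: str, width: int = 16) -> int:
    """Return the Luhn sum modulo 10; 0 means the number passes the check."""
    # Right justify and pad with zeros so every number looks 16 digits long.
    digits = [int(ch) for ch in number.zfill(width)]
    total = 0
    # Walk from right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d >= 10:   # same as adding the two digits of the product
                d -= 9
        total += d
    return total % 10


print(luhn_sum_mod10("4111111111111111"))  # 16 digit sample
print(luhn_sum_mod10("4222222222222"))     # 13 digit sample, zero padded
```

Padding on the left does not change which digits get doubled, because positions are counted from the right. That is why pretending a 13 digit number is 16 digits long is safe.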
>>>>>>>>>
>>>>>>>>> I am going to try to include a couple things you showed me into Brett's
>>>>>>>>> 16 bit chunk method and try a slightly different routine!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Feb 18, 2017 at 10:22 AM, William Astle <lost at l-w.ca>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On 2017-02-18 12:43 AM, msmcdoug wrote:
>>>>>>>>>>> Actually I'm surprised no one has suggested bcd arithmetic on the result
>>>>>>>>>>> to eliminate divide by 10 loop
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> BCD would certainly give a predictable overall cycle count. It would
>>>>>>>>>> require a significantly different approach, though. The only register
>>>>>>>>>> you can use for BCD arithmetic is A and DAA is only useful after ADDA
>>>>>>>>>> or ADCA.
>>>>>>>>>>
>>>>>>>>>> I had thought about using BCD but had initially dismissed it due to
>>>>>>>>>> possible complexity. However, upon reflection, the extra cycles to use
>>>>>>>>>> BCD would probably be less than the average cycle time of the modulus
>>>>>>>>>> loop combined or checking for digit overflow during the loop.
>>>>>>>>>>
>>>>>>>>>> I think you could use code that looks something like the following,
>>>>>>>>>> which is based off Mr. Mikrut's most recent posted code. (warning:
>>>>>>>>>> mailer code™ follows so it may have errors)
>>>>>>>>>>
>>>>>>>>>>         ORG   $1200
>>>>>>>>>> CCD     RMB   16
>>>>>>>>>> RESULT  RMB   1
>>>>>>>>>> START   LEAX  CCD+16,PCR
>>>>>>>>>>         CLRA
>>>>>>>>>>         LDB   #8
>>>>>>>>>> LOOP    PSHS  A
>>>>>>>>>>         LDA   ,-X
>>>>>>>>>>         LSLA
>>>>>>>>>>         CMPA  #10
>>>>>>>>>>         BLO   LOOP2
>>>>>>>>>>         SUBA  #9
>>>>>>>>>> LOOP2   ADDA  ,S+
>>>>>>>>>>         DAA
>>>>>>>>>>         ADDA  ,-X
>>>>>>>>>>         DAA
>>>>>>>>>>         DECB
>>>>>>>>>>         BNE   LOOP
>>>>>>>>>>         ANDA  #$0F
>>>>>>>>>>         STA   RESULT,PCR
>>>>>>>>>> ENDPGM  RTS
>>>>>>>>>>
>>>>>>>>>> I'm using the stack for a temporary storage location instead of
>>>>>>>>>> something PCR relative for code size reasons. You could use the RESULT
>>>>>>>>>> variable for the temporary to eliminate stack usage. That would
>>>>>>>>>> probably be slightly faster at the expense of two more code bytes. This
>>>>>>>>>> is one of those size/speed trade-offs.
>>>>>>>>>>
>>>>>>>>>> DAA has to be used after every addition and only applies to A. Using
>>>>>>>>>> BCD means we can eliminate the mod 10 loop and just mask off the upper
>>>>>>>>>> digit (BCD stores two decimal digits in a byte). That gives a constant
>>>>>>>>>> time for the "mod 10" result and also only takes 2 bytes (and 2 cycles).
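
The masking trick can be checked with a small Python model. This is a sketch under stated assumptions, not the 6809 code itself: `adda_daa` is a hypothetical helper that imitates ADDA followed by DAA on an 8 bit accumulator, based on my understanding of the documented DAA correction rules. `mod10_bcd` follows the instruction order of the revised routine in this thread, where the rightmost digit is added undoubled first.

```python
def adda_daa(a: int, operand: int) -> int:
    """Model of ADDA followed by DAA on an 8 bit accumulator (assumed rules)."""
    raw = a + operand
    half_carry = ((a & 0x0F) + (operand & 0x0F)) > 0x0F
    carry = raw > 0xFF
    result = raw & 0xFF
    low, high = result & 0x0F, result >> 4
    correction = 0
    if half_carry or low > 9:
        correction |= 0x06
    if carry or high > 9 or (high > 8 and low > 9):
        correction |= 0x60
    return (result + correction) & 0xFF


def mod10_bcd(digits):
    """digits: 16 values 0..9, leftmost first, as stored in the CCD buffer."""
    a = 0                                   # CLRA
    for i in range(len(digits) - 1, 0, -2):
        a = adda_daa(a, digits[i])          # ADDA ,-X / DAA
        temp = a                            # PSHS A
        a = digits[i - 1] << 1              # LDA ,-X / LSLA
        if a >= 10:                         # CMPA #10 / BLO
            a -= 9                          # SUBA #9
        a = adda_daa(a, temp)               # ADDA ,S+ / DAA
    return a & 0x0F                         # ANDA #$0F


print(mod10_bcd([4] + [1] * 15))  # 0
```

The running total can pass 99 and wrap (sixteen 9s sum to 144), but the low nibble still holds the ones digit, so the final mask gives the same answer as a true mod 10.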
>>>>>>>>>>
>>>>>>>>>> I have also eliminated the STATUS variable and just store the result.
>>>>>>>>>> You can test RESULT for non-zero trivially so there's no need for a
>>>>>>>>>> separate STATUS value.
>>>>>>>>>>
>>>>>>>>>> By my calculation, this version is 32 bytes, requires 1 byte of stack
>>>>>>>>>> space, 17 bytes of data space, and runs in a maximum of 351 cycles (and
>>>>>>>>>> a minimum of 336 cycles if none of the doubled digits goes above 9).
>>>>>>>>>> For this analysis, I've assumed 8 bit offsets for the PCR references.
>>>>>>>>>> 16 bit offsets in PCR mode are quite a bit more expensive (4 extra
>>>>>>>>>> cycles and 1 extra byte).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Coco mailing list
>>>>>>>>>> Coco at maltedmedia.com
>>>>>>>>>> https://pairlist5.pair.net/mailman/listinfo/coco
>>>>>>>>>>
>>>>>>>>>>