[Coco] [Color Computer] the final stage of optimization
asa.rand at yahoo.com
Wed Nov 25 19:42:58 EST 2009
Thanks again for the explanations. I am going to be playing with this to see what I can do with it. In the meantime, and since the literals reference is not necessary for the purpose of reconstructing source code from I-Code, I have removed that code from decode. I will implement it later, since its primary function is as a programmer's aid for determining more efficient use of literals.
I have added arrays to decode for use with the line-reference and variable-reference records. Based on your description here, they are not hash tables, but rather simple lookup tables. The references in the I-Code consist of a byte and an integer. It was easy to compare the byte reference to the byte field of the array elements to determine whether they match. If they do, the int value is compared; if that matches too, it's a "duplicate", otherwise it's a new reference.
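The check described above can be sketched in a few lines. This is only an illustration, not the actual Basic09 code; `find_or_add` and the tuple layout are names I'm introducing for the (byte, int) reference pairs:

```python
# Hypothetical sketch of the lookup-table check described above:
# each I-Code reference is a (byte, int) pair, and the table is
# scanned linearly (a simple lookup table, not a hash table).

def find_or_add(table, byte_ref, int_ref):
    """Return True if (byte_ref, int_ref) is already in the table
    (a duplicate); otherwise append it and return False."""
    for entry_byte, entry_int in table:
        # cheap byte comparison first, int comparison only on a byte match
        if entry_byte == byte_ref and entry_int == int_ref:
            return True   # duplicate reference
    table.append((byte_ref, int_ref))
    return False          # new reference

refs = []
find_or_add(refs, 0x8D, 100)   # new reference
find_or_add(refs, 0x8D, 100)   # now a duplicate
```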
It's working tremendously well. The instruction decode for decode itself is down from 2 hours to just under 18 minutes, and createmsg decodes in under 2 minutes.
Now to see what's involved in using those arrays to speed up the line-reference sort, and to begin work on the preliminary variable sort. I have much more work to do on the variables, and I'm thinking of splitting the data into 2 files to make the work go faster. I'll find out whether it works that way, but I think it should help, since the primary record would then have only 3 or 4 fields instead of 8 (now), and possibly 14 in the final version of the record type.
From: jdiffendaffer <jdiffendaffer at yahoo.com>
To: ColorComputer at yahoogroups.com
Sent: Tue, November 24, 2009 11:09:51 AM
Subject: [Coco] [Color Computer] the final stage of optimization
A hash is a data key, normally produced by a mathematical formula that attempts to create a unique value for a specific data item. It does not have to be pure math: it can combine part of the actual data with a calculation, or whatever yields a more unique value for the data set.
You have to be aware of potential pitfalls of a given formula.
If you just add the ASCII values together, you end up with the same total for any two strings that are anagrams of each other. You can also have very different strings that end up with the same total, unless you use very large numbers.
Odds are such collisions won't happen with programmer-chosen names, but program-generated names could be a huge problem.
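The collision problem is easy to demonstrate; `ascii_sum` below is a name introduced just for this sketch:

```python
# Why a plain ASCII sum makes a poor hash: anagrams always collide,
# and even non-anagrams can share a total.

def ascii_sum(s):
    return sum(ord(c) for c in s)

ascii_sum("LOOP")                        # 76+79+79+80 = 314
ascii_sum("POOL")                        # same letters, same total: 314
ascii_sum("AD") == ascii_sum("BC")       # 133 == 133: not even anagrams
```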
Using a CRC may work better, since it has very low odds of repeating, but there is still the potential for a collision (in this case, two strings having the same CRC). One weakness of a CRC here is that it works best when strings are the same length; if string lengths vary, you will probably see more collisions. Every collision forces a comparison against the actual file records. Adding a cache reduces disk reads considerably but does not eliminate them: the colliding records are always the same ones, but their number may exceed the size of your cache.
A packed string reduces each character to the minimum number of bits required for the alphabet you need.
If your application only needs upper-case letters, you only need enough bits to represent 26 values: 5 bits. The bits from successive letters are then shifted so they are packed together.
Numeric only: 10 values, 4 bits.
Upper and lower case: 52 values, 6 bits.
Alphanumeric (upper, lower, digits): 62 values, 6 bits.
Add special characters and you are out to 7 or 8 bits.
If you squeeze the first few characters into an integer at 5 bits each, you save 3 bits per character. Over a 32-bit number that's 12 bits saved, so more than 2 extra characters fit (6 instead of 4).
At 6 bits you save 2 bits per character, and 1 extra character fits in 32 bits (5 instead of 4).
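The 5-bit upper-case case works out like this; `pack5` is a name introduced for the sketch:

```python
# Packing upper-case-only names at 5 bits per character:
# 'A'..'Z' map to 0..25, and six characters fit in a 32-bit word
# (6 * 5 = 30 bits) versus only four characters at 8 bits each.

def pack5(name):
    value = 0
    for c in name[:6].upper():              # keep at most 6 characters
        value = (value << 5) | (ord(c) - ord("A"))
    return value & 0xFFFFFFFF

pack5("COUNTR")   # six characters in a single 32-bit key
```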
The packed string is easy to implement and may be faster to calculate, but it will result in more disk hits than a CRC unless it represents more bytes of the string. This is especially true if there is a lot of repetition at the front of variable names. If the names are very different, then it works very well.
Just remember, packed strings only work with strings, and the more characters you must support, the less return you get for the effort.
In your case... you have to look at the data to decide what works best for you.
---- quote ----------
James (it's nice to know your name),
The program is a decompiler, unpacker, decoder, however you want to say it. Basic09 compiles into intermediate code, and I am using that to reconstruct the source code.
I am still working at understanding exactly what a hash is, but I think I get it. I was thinking about an idea to create an array for the literals that included a field that would contain the numeric value of a string. That value would be the total of the numeric values of each character in the string. I'm not sure if that's what you mean by "packed ascii" or not, but it sounds like it might be the same thing. Would it be feasible, though? What are the chances that 2 different strings, even of differing lengths, could have the same value? I haven't started working on testing that idea yet, but I'm planning on it. If I could do that, then I could make the array for the literals only include 2 fields, one byte and one integer. Of course, dealing with real values would still be a trick, since a real uses 5 bytes, and I'm not sure how to reduce that to 2.
Coco mailing list
Coco at maltedmedia.com
More information about the Coco mailing list