[Coco] disk files and program speed
Aaron Wolfe
aawolfe at gmail.com
Tue Nov 24 10:53:33 EST 2009
On Mon, Nov 23, 2009 at 6:39 PM, Wayne Campbell <asa.rand at yahoo.com> wrote:
> I have come to the conclusion that fast and disk files do not belong in the same sentence together. "fast disk search" is an oxymoron.
>
> I was going over the post from jdiffendaffer concerning searches. I knew that I was getting the entire record before comparing the fields, and thought, "What if I only get the fields that are to be compared? Surely getting 1 byte and 1 other field for comparison is faster than getting the whole record. Then, if a match is found, I can seek to the correct position and update the reference count." Well, the extra seeks involved in getting 2 fields of a record from a disk file slowed the program down alot. Decoding createmsg went from 2:10 to 2:49.
>
> It looks like the only way to speed the searches and sorts up is to find a way to deal with them in memory, at least the parts that need to be compared. I would use arrays, but there is a fundamental problem. Differing sizes of arrays. One procedure may contain 300 variable references, while another contains over 2000. One may contain no line references while another contains over 200. And the literals? Decode (before the string optimizations) contained 549 unique literal references, and over 1000 total. I wish there was a way to dynamically add to the elements of an array, but there isn't. Because of this, I'd have to establish arrays large enough to handle the largest number of references that could be contained in a procedure.
>
An array is not usually a good structure for this type of thing, for
the reasons you mention. Inserting an element in the middle of an
array is painfully slow. You often must know the dimensions prior to
starting. Finding an matching element can mean searching the entire
structure. In a nutshell, an array is the wrong structure for what
you want to do :)
I don't know the language you are using for this (I assume it's basic
09) but you might want to explore what other options for memory
structures it offers. A simple linked list can be an order of
magnitude faster for inserts. A hash can do lookups very very
quickly. Btrees can be fast if you need to find records that are
"near" the one you are searching on's sort position. You really might
want to learn about all these options and their strengths/weaknesses
before choosing one.
If your language of choice doesn't support anything but arrays, you
can implement these things within an array although it's extra work
for you and for the computer. You can get around the memory size
limitations, depending on the strategy you use, by loading/unloading
pages of the index structure in and out of memory. this might be a
good compromise between ram usage and speed.
good luck
-Aaron
> I have calculated that, even if the arrays only contain the fields for comparison, the variable reference array would have to be 600 minimum elements, each element composed of 1 byte and 1 integer, for a total of 1800 bytes. The line reference array would be structured similarly, but wouldn't need more than 300 elements, since I have yet to see a procedure with more than 150 line references, but that may not be the max I could run into. So, that's another 900 bytes. The literals references would need 2 bytes, 1 integer, 1 real and a string of 80 per element, and (based on decode) I would need at least 600 elements. That's 89 bytes times 600 elements, for a total of 53,400 bytes.
>
> Add all three together, and you get a total size, just for the arrays, of 56,100 bytes of data space. I don't think OS9 or NitrOS9 will allow data of that size in memory.
>
> You may ask, why do you need to count all those literals anyway? I knew, back when I wrote DCom, that I needed to find a way to correlate variables passed as parameters to the procedures they are passed to. While this is not necessary to the idea of reproducing the source code of any one procedure, it would help with figuring out how your procedures were related.
>
> Named subroutines are the easy ones. The procedure name exists in the subroutine section of the VDT. If you used a string variable to name the procedure to be called, all you get is a reference to a string variable. Without knowing what the name of the procedure is, it's impossible to match parameters between them.
>
> In addition, counting the literals can help show you where you can optimize the literals and reduce code size. While I still haven't gotten to the point where I can say literal X contains the name of the procedure called in RUN statement Y, I am getting closer to that mark.
>
> I can probably put sorting the literals (and dealing with that association) in a different program, but I still have to be able to identify them, and doing it while going through the first pass through the instruction code is the best time. But maybe, for the above reasons, it isn't. Maybe I should put literals on the shelf for now and concentrate on getting decode to correctly identify everything else.
>
> Wayne
>
>
>
>
> --
> Coco mailing list
> Coco at maltedmedia.com
> http://five.pairlist.net/mailman/listinfo/coco
>
More information about the Coco
mailing list