[Coco] RainbowArchive . The Rainbow Archive Project

Michael Wayne Harwood michael at musicheadproductions.org
Thu Jun 16 14:21:19 EDT 2005


John,

You have some excellent points and I appreciate you sharing them.  Thank you
for volunteering - I think it would definitely be worth having a "first pass"
OCR.  I would be happy to provide you with the original scans (300 ppi,
24-bit) on DVD or CD media for this purpose if you would like to send me
your address in a private email.


Would a spell check engine be useful for article text?  I would think we
would be able to cull out most of the obvious misspellings using this
technique, though it might not work as well for program listings.  The worst
case scenario would be that the output would be somewhat spelling and/or
grammar challenged but still useful for a quick search for article text,
etc.  Ideally I would think we would want to include the full text on a
single disc so that a search across the entire publication could be done
easily - would you agree?
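
To make the spell-check idea a little more concrete, here is a minimal Python
sketch. It is only an illustration under some assumptions: the tiny WORDS set
is a made-up stand-in for a real dictionary, the rule "a line that starts with
a number is a BASIC listing line" is a guess, and suspect_words is a
hypothetical name, not part of any existing tool.

```python
import re

# Hypothetical stand-in for a real dictionary word list.
WORDS = {"the", "color", "computer", "is", "a", "great", "machine"}

# Assumption: program listing lines begin with a BASIC line number, e.g. "10 PRINT".
LISTING_LINE = re.compile(r"^\s*\d+\s")


def suspect_words(text, words=WORDS):
    """Return (line_number, token) pairs for tokens not found in the word list."""
    suspects = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if LISTING_LINE.match(line):
            continue  # skip listings; a spell checker would flag most of them
        for token in re.findall(r"[A-Za-z0-9]+", line):
            if token.isdigit():
                continue  # plain numbers (years, page numbers) are not misspellings
            if token.lower() not in words:
                suspects.append((line_no, token))
    return suspects


sample = 'The Co1or Computer is a great rnachine\n10 PRINT "HELLO"'
print(suspect_words(sample))
```

A pass like this would only point at likely OCR errors in article text; it
would not correct them, and it leaves program listings alone.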

One other thing to note - Benoit Bleau has provided a fair amount of Rainbow
on Disk resources (kudos to Ben for his hard work!) and if we are able to
pull together the full range of Rainbow on Disk/Cassette that was originally
offered by Falsoft we will have a significant amount of program listings
that will not need to be OCR'd or proofread.


Regards,
Michael Harwood


-----Original Message-----
From: John R. Hogerhuis [mailto:jhoger at pobox.com] 
Sent: Thursday, June 16, 2005 11:39 AM
To: CoCoList for Color Computer Enthusiasts;
michael at musicheadproductions.org
Subject: RE: [Coco] RainbowArchive . The Rainbow Archive Project

As to a study, we already have lots of examples of that. All of these tools
work reasonably well. I used Transym OCR on Thinking Forth, and I believe I
can let you see the raw OCR output from that project, before it was cleaned
up. Useful, but certainly not readable on its own.

My (informed) opinion, having done this before, is that an accurate OCR
without, say, one volunteer per issue to do cleanup is simply not possible.

So there is not really a need for any study of an accurate OCR... no point
in volunteering to lead a study I already know the answer to. The problem is
the constraint of not being able to add enough volunteers to do it.
Volunteers ready to proofread are available. If you want to ask a question,
ask how many people are willing to proofread a given issue. If you would rather
ask how many individuals are willing to completely proofread War and Peace
with random OCR errors, I don't think you'll get a lot of takers.

Given that we can't add (a lot) more people, all you have to think about is
whether a software-only OCR would be sufficient. Personally, I think it
would add significant value. I volunteer to do such a first-pass OCR over
all the volumes... but no more than that unless (a lot) more volunteers can
be added.

-- John.




