[Coco] [Color Computer] 1984 djvu conversion
Gene Heskett
gene.heskett at verizon.net
Mon May 18 23:46:43 EDT 2009
On Monday 18 May 2009, Bob Devries wrote:
>I have difficulty understanding why the PDF files of these magazines are so
>large.
That is because even though you are scanning text, it's being saved as a
bitmapped image, and that can easily be 200-4000 times the size of the ASCII
text itself, and far more in color at high resolution. 2000 characters can
moderately well fill a magazine page. The uncompressed scan of that same page
runs about a megabyte at 300 dpi in 1-bit black & white, and well over 100
megs at 1200 dpi in full color.
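
For anyone who wants to check the arithmetic, here is a rough sketch of the
uncompressed size of one scanned page. It assumes the 275mm x 205mm page size
Bob gives further down, plain raster data with no headers or row padding, and
three bit depths I picked for illustration. The function name is made up.

```python
MM_PER_INCH = 25.4
TEXT_BYTES = 2000  # the ~2000 characters of ASCII per page mentioned above


def raw_scan_bytes(width_mm, height_mm, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page (no headers, no padding)."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed bit depths: 1-bit black & white, 8-bit gray, 24-bit color.
for dpi in (300, 600, 1200):
    for label, bpp in (("1-bit b&w", 1), ("8-bit gray", 8), ("24-bit color", 24)):
        size = raw_scan_bytes(205, 275, dpi, bpp)
        print(f"{dpi:5d} dpi {label:13s} {size / 1e6:8.1f} MB "
              f"({size / TEXT_BYTES:,.0f}x the text)")
```

These are sizes before any compression, so the PDF on disk will be smaller.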
>I have scanned some of my copies of Australian Rainbow. All pages except the
>front and back covers are black & white, and I scanned them at 300 DPI. They
>are readable... as much as the originals were readable (but that's another
>story)... but they are not text searchable.
That's because they don't contain the text, only the bitmapped images of the
characters.
>Of the 30 magazines I scanned, the largest PDF file size was 6939KB from 76
>pages (275mm x 205mm) and the smallest 1800KB from 60 pages.
And just this spring I read a 756 page book, published as a .pdf, that is
3965952 bytes. That is just under 4 megabytes for 756 pages with only a
few line art drawings as chapter headers.
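
Putting those figures on a per-page footing makes the gap easier to see. A
quick sketch, assuming Bob's "KB" means 1024 bytes (my guess, it could as
easily be 1000):

```python
KB = 1024  # assumption: the scanner software reports sizes in units of 1024 bytes

# (description, total bytes, page count), all taken from the figures quoted above
files = [
    ("Australian Rainbow, largest scan", 6939 * KB, 76),
    ("Australian Rainbow, smallest scan", 1800 * KB, 60),
    ("756 page text PDF", 3965952, 756),
]


def bytes_per_page(total_bytes, pages):
    """Average file size per page."""
    return total_bytes / pages


for name, total, pages in files:
    print(f"{name:34s} {bytes_per_page(total, pages):10,.0f} bytes/page")
```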
PDF, if it starts with the original text, first does a heavy compression on
that text, which can shrink text with a common 'dictionary' to well under half
of its original size. When it has to deal with a noisy grayscale or color
scan, which looks almost random to the compressor, it is dealing with a
relatively huge file per page, and lossless compression can have difficulty
achieving even a 50% compression ratio with data that is comparatively random.
Clean 1-bit black & white pages fare better, but still end up many times the
size of the text.
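
The difference is easy to demonstrate with zlib's deflate, which is the same
family of compression as PDF's Flate filter. This is only a sketch: the "text"
is words drawn from a small vocabulary I made up, and the "scan" is pure random
bytes, which is the worst case rather than what a real scanner produces.

```python
import random
import zlib

rng = random.Random(1984)  # fixed seed so every run gives the same data

# Stand-in for magazine text: words picked from a small, made-up vocabulary.
VOCAB = ("the color computer basic program disk drive memory screen "
         "print line number load save rainbow magazine listing").split()
text = " ".join(rng.choice(VOCAB) for _ in range(20000)).encode("ascii")

# Stand-in for a very noisy scan: bytes with no structure at all.
noise = bytes(rng.getrandbits(8) for _ in range(len(text)))


def ratio(data):
    """Compressed size as a fraction of the original size."""
    return len(zlib.compress(data, 9)) / len(data)


print(f"text : {ratio(text):.0%} of original size")
print(f"noise: {ratio(noise):.0%} of original size")
```

A real scan sits somewhere between the two, depending on how clean it is.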
If you really want small files, the _only_ way to get them is to do the scan
at 600 dpi or more, giving an OCR program as much help as you can, and then OCR
it to convert it back into ASCII text, possibly with some markup if the
OCR can truly recognize the precise font used in the original. OCR programs
aren't that good, and likely never will be. Below 300 dpi input they are not
that accurate, 600 is better, and 1200 is even better. The images will be huge
at 1200, but those can be nuked once the OCR is done and edited. The OCR output
will quite likely need corrections, so it's going to be very time consuming,
particularly since many of the code listings published were printed by an
early Shack dot matrix printer, and those things don't have an el distinct from
a 1, nor a recognizable font that matches anything used in publishing, or even
on this screen I'm looking at.
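
Some of the proofreading can at least be pointed in the right direction. The
sketch below is a hypothetical helper, not part of any OCR package: it flags
tokens in an OCR'd listing that mix digits with l, I, O or o, since those are
the spots where an el/1 or oh/0 mixup is most likely. It will also flag
legitimate variable names such as O1, so it only tells you where to look.

```python
import re

# Hypothetical rule: a token containing at least one digit AND at least one
# of l, I, O, o (for example "l0" or "1O") deserves a human look.
SUSPECT = re.compile(r"\b(?=\w*\d)(?=\w*[lIOo])\w+\b")


def flag_suspects(listing):
    """Return (line_number, token) pairs that need manual checking."""
    hits = []
    for number, line in enumerate(listing.splitlines(), start=1):
        for match in SUSPECT.finditer(line):
            hits.append((number, match.group()))
    return hits


# Made-up sample: "l0" should be 10 and "1O" should be 10.
sample = "l0 FOR I=1 TO 1O\n20 PRINT I\n30 NEXT I"
for number, token in flag_suspects(sample):
    print(f"line {number}: check '{token}'")
```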
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
"Do you believe in intuition?"
"No, but I have a strange feeling that someday I will."