[Coco] [Color Computer] 1984 djvu conversion

Mon May 18 23:46:43 EDT 2009

On Monday 18 May 2009, Bob Devries wrote:
>I have difficulty understanding why the PDF files of these magazines are so
>large.

That is because even though you are scanning text, it's being saved as a 
bitmapped image, and that can easily be 200-4000 times the size of the ASCII 
text itself.  2000 characters can moderately well fill a magazine page.  The 
uncompressed scan of that same page runs about a megabyte in 1-bit black & 
white at 300 dpi, tens of megabytes in 24-bit color, and hundreds of 
megabytes in color at 1200 dpi.
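
For a rough sense of the raw numbers, the uncompressed size of a scan is just 
pixels times bits per pixel.  The sketch below is only a back-of-envelope 
estimate: it assumes the 275mm x 205mm page size Bob quotes for Australian 
Rainbow, ignores file headers, row padding and any compression, and the 
helper name raw_scan_bytes is made up for illustration.

```python
MM_PER_INCH = 25.4


def raw_scan_bytes(width_mm, height_mm, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page.

    Assumption: no file header, no row padding, no compression.
    """
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    # Page size taken from the quoted message: 275mm x 205mm.
    for dpi in (300, 600, 1200):
        for label, bpp in (("1-bit b&w", 1), ("8-bit gray", 8), ("24-bit color", 24)):
            size = raw_scan_bytes(205, 275, dpi, bpp)
            print(f"{dpi:>4} dpi  {label:<12}  {size / 1e6:8.1f} MB")
```

By this estimate a 300 dpi page in 1-bit black & white is about 1 MB raw, 
roughly 500 times the 2000 characters of text on it.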

>I have scanned some of my copies of Australian Rainbow. All pages except the
>front and back covers are black & white, and I scanned them at 300 DPI. They
>are readable... as much as the originals were readable (but that's another
>story)... but they are not text searchable.

Because they don't contain the text, but the bitmapped images of the 
characters.

>Of the 30 magazines I scanned, the largest PDF file size was 6939KB from 76
>pages (275mm x 205mm) and the smallest 1800KB from 60 pages.

And just this spring I read a 756 page book, published as a .pdf, that is 
3965952 bytes.  That is just under 4 megabytes for 756 pages, with only a 
few line art drawings as chapter headers.
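
Putting the two sets of figures on a per-page basis makes the gap easier to 
see.  This is only arithmetic on the numbers quoted above; it assumes Bob's 
"KB" means 1024 bytes, and the function name is made up for illustration.

```python
def bytes_per_page(total_bytes, pages):
    """Average file size per page, in bytes."""
    return total_bytes / pages


text_pdf = bytes_per_page(3965952, 756)       # the 756 page book
scan_large = bytes_per_page(6939 * 1024, 76)  # largest scanned magazine (assumes KB = 1024 bytes)
scan_small = bytes_per_page(1800 * 1024, 60)  # smallest scanned magazine

if __name__ == "__main__":
    print(f"text PDF:      {text_pdf:8.0f} bytes/page")
    print(f"largest scan:  {scan_large:8.0f} bytes/page ({scan_large / text_pdf:.1f}x)")
    print(f"smallest scan: {scan_small:8.0f} bytes/page ({scan_small / text_pdf:.1f}x)")
```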

PDF, if it starts with the original text, first does a heavy compression on 
that text, which can reduce text with a common 'dictionary' to a small 
fraction of its original size.  When it has to deal with what is almost 
random data in a scanned bitmap image, it is dealing with a relatively huge 
file per page, and a general-purpose compressor will have extreme difficulty 
achieving even a 50% compression ratio on it.
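
The difference is easy to try for yourself.  The sketch below uses Python's 
zlib module, which implements the same deflate method as PDF's Flate filter, 
on a short piece of English text and on the same number of random bytes.  The 
random bytes are only a stand-in for a worst case: a real scan is noisy but 
not fully random, so treat this as an illustration, not a measurement of any 
actual PDF.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, 9)) / len(data)


# A short sample of plain English text; longer text usually compresses better.
SAMPLE_TEXT = (
    "If you really want small files, the only way to get them is to do the "
    "scan at 600 dpi or more, giving an OCR program as much help as you can, "
    "and then OCR it to convert it back into the ASCII text. The OCR output "
    "will quite likely need corrections, so it is going to be very time "
    "consuming, particularly since many of the code listings were printed "
    "by an early dot matrix printer. "
).encode("ascii")

if __name__ == "__main__":
    # Random bytes: an assumed worst case standing in for noisy image data.
    noise = os.urandom(len(SAMPLE_TEXT))
    print(f"text:   {compression_ratio(SAMPLE_TEXT):.2f} of original size")
    print(f"random: {compression_ratio(noise):.2f} of original size")
```

The random data should come out at, or slightly above, its original size, 
while the text shrinks noticeably even in a sample this short.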

If you really want small files, the _only_ way to get them is to do the scan 
at 600 dpi or more, giving an OCR program as much help as you can, and then 
OCR it to convert it back into ASCII text, possibly with some markup if the 
OCR can truly recognize the precise font used in the original.  OCR programs 
aren't that good, and likely never will be.  Below 300 dpi input they are not 
that accurate, 600 is better, and 1200 is even better.  The images will be 
huge at 1200, but those can be nuked once the OCR is done and edited.  The OCR 
output will quite likely need corrections, so it's going to be very time 
consuming, particularly since many of the code listings published were printed 
by an early Radio Shack dot matrix printer, and those things don't have a 
lowercase el that is distinct from a 1, nor a recognizable font that matches 
anything used in publishing, or even on this screen I'm looking at.
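
One way to cut down the proofreading is to have a script point at every 
character an OCR is likely to have guessed at, so a human only has to check 
those spots against the scan.  This is a minimal sketch, not a real OCR 
post-processor: the set of look-alike characters and the function name are my 
own assumptions, and it makes no attempt to decide which reading is right.

```python
# Characters that are easy to confuse in program listings.
# An illustrative set, not an exhaustive one.
AMBIGUOUS = set("l1I|O0")


def flag_ambiguous(listing: str):
    """Yield (line_number, column, char) for each character worth a manual check.

    Line and column numbers are 1-based.
    """
    for line_no, line in enumerate(listing.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in AMBIGUOUS:
                yield line_no, col, ch


if __name__ == "__main__":
    # A made-up BASIC fragment standing in for OCR output.
    sample = '10 FOR I=1 TO 10\n20 PRINT "HELLO"\n30 NEXT I'
    for line_no, col, ch in flag_ambiguous(sample):
        print(f"line {line_no}, col {col}: check {ch!r}")
```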

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
	"Do you believe in intuition?"
	"No, but I have a strange feeling that someday I will."