A Comparison of Scan Settings for Asian Languages

For the best scanning results, I recommend adjusting the resolution and color settings to fit how you plan to use the resulting PDF.

scan 01e

What do I do?

If I am scanning something in Chinese or Japanese for my research, I generally use 600 dpi with black and white settings, because this gives the best combination of readability, scan speed, and OCR (Optical Character Recognition) results. If I want better readability, especially for a poorly printed text, I’ll use grayscale. The scan is slower and the OCR results are not as good, but I can zoom in and make out more detail than I could with black and white. This is especially nice on an iPad with a Retina display, where I can view the text almost as well as I could with a magnifying lens and the original paper page.
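To get a rough sense of why grayscale scans are slower, it can help to compare how much raw image data each setting produces. The sketch below assumes a JIS B4 page (257 × 364 mm) and uncompressed pixel data at 1 bit per pixel for black and white and 8 bits per pixel for grayscale. These figures are illustrative assumptions; real scanner output is compressed, so actual file sizes will differ.

```python
# Rough, uncompressed data estimate per page.
# The page size and bit depths are assumptions, not measured values.
MM_PER_INCH = 25.4
PAGE_MM = (257, 364)  # assumed JIS B4; ISO B4 (250 x 353 mm) is slightly smaller


def pixel_size(dpi, page_mm=PAGE_MM):
    """Pixel dimensions (width, height) of the page at the given resolution."""
    return tuple(round(side / MM_PER_INCH * dpi) for side in page_mm)


def raw_bytes(dpi, bits_per_pixel):
    """Uncompressed image size in bytes, rounded down."""
    width, height = pixel_size(dpi)
    return width * height * bits_per_pixel // 8


for dpi in (300, 600):
    for label, bits in (("bw", 1), ("gray", 8)):
        width, height = pixel_size(dpi)
        megabytes = raw_bytes(dpi, bits) / 1_000_000
        print(f"{dpi} dpi {label}: {width} x {height} px, about {megabytes:.1f} MB raw")
```

Under these assumptions, a grayscale page carries about eight times the raw data of a black and white page at the same resolution, and doubling the resolution roughly quadruples it.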

OCR Results for PDFs (Evernote)

If you rely on Evernote to OCR your PDFs for you, there are three caveats. You won’t receive a copy of the recognized text, so I cannot directly compare the results with Adobe’s. Recognition is subject to your account’s “recognition” settings on the Web at www.evernote.com (mine are 日本語+English). And some PDFs will not be processed at all because they are ineligible (see http://evernote.com/contact/support/kb/#/article/23169032).

All of the PDFs I uploaded (one-page B4-sized scans) were eventually OCR’d, but only the 300 dpi and 600 dpi grayscale PDFs seem to have worked well with Evernote’s OCR. Evernote found both 平伏 and 両方 on the page, so in this case it performed better than Adobe (see below). I cannot explain why the black and white scans fared so poorly in Evernote.

OCR Results for PDFs (Adobe)

I used Adobe Acrobat Pro X on the Mac to OCR the text with Japanese as the primary language setting. The red-colored characters in brackets mark places where the OCR misread the text: each one gives the correct reading for the character just before it. The scan settings are listed from best to worst according to how well Adobe’s OCR performed. I suspect that the grayscale scans did not do as well because they introduced more “noise” into the scanned images, though Evernote’s OCR did surprisingly well with the 300 dpi grayscale version, so I could be wrong about this.

600 dpi bw (1 error)
こいとりの百姓大に恐れ、平伏してわぴ[び]れ共聞入ず。人立も多きゅへ、水戸様辻番よりも棒を突出、両方の往来をとどめて棒を組合せたり。
300 dpi bw (2 errors)
こいとりの百姓大に恐れ、平伏してわぴ[び]れ共聞入ず。人立も多きゅへ、水戸様辻番よりも棒を突出、雨[両]方の往来をとどめて棒を組合せたり。
300 dpi gray (4 errors)
ζ[こ] いとりの百姓大に恐れ、卒[平]伏してわぴ[び]れ共間入ず。人立も多きゅへ、水戸様辻番よりも棒を突出、雨[両]方の往来をとどめて棒を組合せたり。
600 dpi gray (5 errors)
ζ[こ] いとりの百姓大に恐れ、卒[平]伏してわぴ[び]れ共聞入ず。人立も多きゅへ、水戸様辻番よりも棒を突出、雨[両]方の往来をとに[ど]めて棒を組合せたり。
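
The tallies above can be re-checked mechanically. The following is a minimal sketch, assuming the notation used in the four lines above (a misread character immediately followed by the correct character in brackets). The names `split_marked` and `ADOBE_RESULTS` are made up for illustration. It counts only the bracket-marked misreadings, and it also reports whether a plain-text search for 平伏 or 両方 would match the raw OCR text.

```python
import re

# Assumed notation: one misread character, then the correct one in brackets, e.g. "雨[両]".
MARK = re.compile(r"(.)\[(.)\]")


def split_marked(line):
    """Return (ocr_text, corrected_text, marked_error_count) for one annotated line."""
    ocr = MARK.sub(r"\1", line)        # keep what the OCR actually produced
    corrected = MARK.sub(r"\2", line)  # keep the bracketed correction instead
    return ocr, corrected, len(MARK.findall(line))


# The four annotated lines, copied from the results above.
ADOBE_RESULTS = {
    "600 dpi bw": "こいとりの百姓大に恐れ、平伏してわぴ[び]れ共聞入ず。人立も多きゅへ、水戸様辻番よりも棒を突出、両方の往来をとどめて棒を組合せたり。",
    "300 dpi bw": "こいとりの百姓大に恐れ、平伏してわぴ[び]れ共聞入ず。人立も多きゅへ、水戸様辻番よりも棒を突出、雨[両]方の往来をとどめて棒を組合せたり。",
    "300 dpi gray": "ζ[こ] いとりの百姓大に恐れ、卒[平]伏してわぴ[び]れ共間入ず。人立も多きゅへ、水戸様辻番よりも棒を突出、雨[両]方の往来をとどめて棒を組合せたり。",
    "600 dpi gray": "ζ[こ] いとりの百姓大に恐れ、卒[平]伏してわぴ[び]れ共聞入ず。人立も多きゅへ、水戸様辻番よりも棒を突出、雨[両]方の往来をとに[ど]めて棒を組合せたり。",
}

for setting, line in ADOBE_RESULTS.items():
    ocr, _, errors = split_marked(line)
    # A search term is only findable if it survived OCR intact.
    hits = [term for term in ("平伏", "両方") if term in ocr]
    print(f"{setting}: {errors} marked errors, searchable terms: {hits}")
```

Because the search check runs against the raw OCR text rather than the corrected text, a single misread character is enough to make a two-character word unfindable.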

PDF Scans

Below are screenshots of the scanned PDFs. The grayscale scans would probably have been more legible if I had adjusted them to be darker. I also wonder if better contrast, especially for the 300 dpi grayscale scan, would have improved the OCR results. I will test this at a later date.

scan 01e

Evernote OCR Results for Images (Evernote screenshots)

Evernote’s results with the images (the screenshots on this page) were superior to Adobe’s OCR with the PDFs in some cases. You may notice that the highlighted sections in Evernote don’t line up perfectly, but they don’t line up well in Adobe either, so I think this is par for the course.

A search on 130109 for 平伏 (with Adobe, only the black and white scans produced the correct OCR result for this word)

scan 01e

A search on 130109 for 両方 (with Adobe, only the 600 dpi black and white scan produced the correct OCR result for this word)

scan 01e