Adding an OCR Text Layer to Scanned PDFs - OCRmyPDF

2023-05-13

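Over the years I collected quite a few scanned PDF documents from the web. They have no text layer, which is a real pain. I came across this open-source tool by chance and highly recommend it.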

OCRmyPDF https://github.com/ocrmypdf/OCRmyPDF

Installing on Windows

Install the Python dependencies

conda create -n pdf_env python=3.11
conda activate pdf_env
# This step also installs the third-party packages that ocrmypdf needs at runtime
pip install ocrmypdf

Install OCR and other dependencies

# Run cmd as administrator and execute the following command to install Chocolatey:
@"%SystemRoot%\System32\WindowsPowerShell\v1.0\powershell.exe" -NoProfile -InputFormat None -ExecutionPolicy Bypass -Command "[System.Net.ServicePointManager]::SecurityProtocol = 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))" && SET "PATH=%PATH%;%ALLUSERSPROFILE%\chocolatey\bin"

# Or run PowerShell as administrator and execute the following command:
Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))

choco install --pre tesseract
choco install ghostscript
choco install pngquant
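
After the installs finish, it can help to confirm that the tools are actually reachable from PATH before running OCRmyPDF. The following is a minimal sketch using only the Python standard library. The helper name `find_missing_tools` is hypothetical, and the executable names (in particular `gswin64c` for 64-bit Ghostscript on Windows) are assumptions that may need adjusting for your installation.

```python
import shutil

# Executable names are assumptions: Ghostscript's 64-bit Windows console
# binary is usually gswin64c; on Linux/macOS it is gs.
REQUIRED_TOOLS = ["tesseract", "gswin64c", "pngquant"]


def find_missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

If a tool is reported missing right after installation, opening a new terminal so that the updated PATH is picked up is worth trying first.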

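The default Tesseract installation includes only the English language pack. To support other languages, download the additional language packs from tessdata, extract the files with the .traineddata extension, and copy them to C:\Program Files\Tesseract-OCR\tessdata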
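
The copy step can also be scripted. Below is a small sketch, assuming the downloaded language packs have already been extracted into some local folder; the function name `install_traineddata` and the example download path in the comment are hypothetical, and writing into Program Files typically requires an elevated prompt.

```python
import shutil
from pathlib import Path


def install_traineddata(source_dir, tessdata_dir):
    """Copy every *.traineddata file found under source_dir into tessdata_dir.

    Returns the list of copied file names, sorted alphabetically.
    """
    source = Path(source_dir)
    target = Path(tessdata_dir)
    copied = []
    for model in sorted(source.rglob("*.traineddata")):
        shutil.copy2(model, target / model.name)
        copied.append(model.name)
    return sorted(copied)


# Example call (the first path is a placeholder for wherever you extracted the files):
# install_traineddata(r"C:\path\to\extracted\tessdata",
#                     r"C:\Program Files\Tesseract-OCR\tessdata")
```

Running `tesseract --list-langs` afterwards should show the newly added languages, for example chi_sim.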

Running

ocrmypdf -l chi_sim --pdf-renderer tesseract --output-type pdf source.pdf ocr.pdf
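
For a whole folder of scanned PDFs, the same command can be wrapped in a short script. This is only a sketch: the helper names `build_ocr_command` and `ocr_folder` are hypothetical, it assumes `ocrmypdf` is on PATH, and it leaves out the --pdf-renderer option so that OCRmyPDF picks its default renderer.

```python
import subprocess
from pathlib import Path


def build_ocr_command(source_pdf, output_pdf, language="chi_sim"):
    """Build the ocrmypdf argument list for a single file."""
    return [
        "ocrmypdf",
        "-l", language,
        "--output-type", "pdf",
        str(source_pdf),
        str(output_pdf),
    ]


def ocr_folder(input_dir, output_dir, language="chi_sim"):
    """Run ocrmypdf on every PDF in input_dir, writing results to output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        # check=False: continue with the next file if one fails
        # (for example, because it already contains a text layer).
        subprocess.run(build_ocr_command(pdf, out / pdf.name, language), check=False)
```

Calling `ocr_folder("scans", "ocr_output")` would process each PDF in a scans folder and write the results under ocr_output with the same file names.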

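Debugging the source

Because the upstream repository ignores IDE configuration such as .vscode, I forked it and put the VS Code debug configuration on the learn branch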

git clone https://github.com/YaoXuanZhi/OCRmyPDF
cd OCRmyPDF
git checkout learn

Open the repository root in VS Code and press F5 to start debugging

FAQ

How do I load a PDF in Qt?

# PyQt loads the PDF through pdf.js, which provides the selectable OCR text layer
# pip install PyQt5 PyQtWebEngine
import sys
from PyQt5 import QtCore, QtWidgets
from PyQt5.QtWebEngineWidgets import QWebEngineView

PDFJS = 'file:///D:/OpenSources/ScreenPinKit/src/third_party/pdfjs-4.6.82-dist/web/viewer.html'
# PDF = 'file:///path/to/my/sample.pdf'
PDF = 'file:///D:/OpenSources/ScreenPinKit/src/third_party/pdfjs-4.6.82-dist/web/compressed.tracemonkey-pldi-09.pdf'

class Window(QWebEngineView):
    def __init__(self):
        super().__init__()
        self.load(QtCore.QUrl.fromUserInput('%s?file=%s' % (PDFJS, PDF)))

if __name__ == '__main__':

    app = QtWidgets.QApplication(sys.argv)
    window = Window()
    window.setGeometry(600, 50, 800, 600)
    window.show()
    sys.exit(app.exec_())
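
The hard-coded file:/// strings above can also be derived from ordinary local paths. The sketch below assumes pdf.js reads the document location from the `file` query parameter, as the example above does; the helper name `build_viewer_url` is hypothetical. It percent-encodes the PDF URI so that spaces or non-ASCII characters in the path do not break the query string.

```python
from pathlib import Path
from urllib.parse import quote


def build_viewer_url(viewer_html, pdf_file):
    """Return a pdf.js viewer URL that opens pdf_file.

    Both arguments are local file-system paths; they are converted to
    file:// URIs, and the PDF URI is percent-encoded for use as a query value.
    """
    viewer_uri = Path(viewer_html).resolve().as_uri()
    pdf_uri = Path(pdf_file).resolve().as_uri()
    return f"{viewer_uri}?file={quote(pdf_uri, safe='')}"
```

The result can be passed to `QtCore.QUrl(...)` in place of the string formatted from PDFJS and PDF.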

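Afterword

I wanted to debug the source in order to learn how it adds the OCR text layer. I plan to write up that analysis and share it in a later blog post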