Adding an OCR Text Layer to Scanned PDFs - OCRmyPDF

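Over time I have collected quite a few scanned PDFs from the web that have no text layer, which is a real pain. I stumbled upon this open-source tool and highly recommend it.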

OCRmyPDF https://github.com/ocrmypdf/OCRmyPDF

Installing on Windows

Installing the Python dependencies

conda create -n pdf_env python=3.11
conda activate pdf_env
# This step also installs the third-party packages that ocrmypdf needs at runtime
pip install ocrmypdf

Installing OCR and other dependencies

# Run cmd as administrator and execute the following command to install Chocolatey:
@"%SystemRoot%\System32\WindowsPowerShell\v1.0\powershell.exe" -NoProfile -InputFormat None -ExecutionPolicy Bypass -Command "[System.Net.ServicePointManager]::SecurityProtocol = 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))" && SET "PATH=%PATH%;%ALLUSERSPROFILE%\chocolatey\bin"

# Or run PowerShell as administrator and execute the following command:
Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))

choco install --pre tesseract
choco install ghostscript
choco install pngquant
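
After the installs above, it can be useful to confirm that the executables are actually visible on PATH before running OCRmyPDF. The following is a minimal sketch, not part of OCRmyPDF. The executable names are assumptions (on Windows the Ghostscript console binary is usually `gswin64c`), and `missing_tools` is a hypothetical helper name.

```python
import shutil
from typing import Callable, Iterable, Optional

# Assumed executable names. On Windows, Ghostscript's console binary is
# usually gswin64c; on Linux/macOS it is gs.
REQUIRED_TOOLS = ("tesseract", "gswin64c", "pngquant")


def missing_tools(
    names: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools were found.")
```

If a tool is reported missing right after installation, opening a new terminal so that the updated PATH is picked up is often enough.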

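The default Tesseract installation includes only the English language pack. To support other languages, download the additional language packs from tessdata, extract the files with the .traineddata extension, and copy them to C:\Program Files\Tesseract-OCR\tessdata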
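
To check which language packs ended up in that folder, a small script can list the .traineddata files. This is only a sketch: it assumes the default install path quoted above, and `installed_languages` is a hypothetical helper name.

```python
from pathlib import Path

# Default location quoted in the text above; adjust if Tesseract lives elsewhere.
TESSDATA_DIR = Path(r"C:\Program Files\Tesseract-OCR\tessdata")


def installed_languages(tessdata_dir: Path = TESSDATA_DIR) -> list[str]:
    """Return the language codes (file stems) of the *.traineddata files present."""
    if not tessdata_dir.is_dir():
        return []
    return sorted(p.stem for p in tessdata_dir.glob("*.traineddata"))


if __name__ == "__main__":
    print(installed_languages())
```

Running `tesseract --list-langs` reports the same information from Tesseract itself. The code passed to `-l` (for example chi_sim) should appear in the list.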

Running

ocrmypdf -l chi_sim --pdf-renderer tesseract --output-type pdf source.pdf ocr.pdf
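
When there is a whole folder of scans, the same command can be driven from Python. The sketch below simply shells out to the ocrmypdf CLI with the flags used above. `build_command` and `ocr_folder` are hypothetical helper names, and the renderer value is copied from the command above; which values are accepted may depend on the installed OCRmyPDF version.

```python
import subprocess
from pathlib import Path


def build_command(src: Path, dst: Path, lang: str = "chi_sim",
                  renderer: str = "tesseract", output_type: str = "pdf") -> list[str]:
    """Build the same ocrmypdf command line as above for one input/output pair."""
    return ["ocrmypdf", "-l", lang, "--pdf-renderer", renderer,
            "--output-type", output_type, str(src), str(dst)]


def ocr_folder(in_dir: Path, out_dir: Path, lang: str = "chi_sim") -> list[Path]:
    """Run ocrmypdf on every PDF in in_dir; return the outputs that succeeded."""
    out_dir.mkdir(parents=True, exist_ok=True)
    done = []
    for src in sorted(in_dir.glob("*.pdf")):
        dst = out_dir / src.name
        # A non-zero exit code means ocrmypdf reported an error for this file.
        result = subprocess.run(build_command(src, dst, lang))
        if result.returncode == 0:
            done.append(dst)
    return done
```

For example, `ocr_folder(Path("scans"), Path("scans_ocr"))` would write one OCRed copy per input file, assuming ocrmypdf is on PATH in the active environment.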

Debugging from source

The upstream repository ignores IDE configuration such as .vscode, so I forked it and put the VS Code debug configuration on the learn branch.

git clone https://github.com/YaoXuanZhi/OCRmyPDF
cd OCRmyPDF
git checkout learn

Open the repository root in VS Code and press F5 to start debugging.

Trying an implementation with pdf.js

import sys
from PyQt5 import QtCore, QtWidgets
from PyQt5.QtWebEngineWidgets import QWebEngineView
# pip install PyQtWebEngine

PDFJS = 'file:///D:/OpenSources/ScreenPinKit/src/third_party/pdfjs-4.6.82-dist/web/viewer.html'
# PDFJS = 'file:///path/to/pdfjs-1.9.426-dist/web/viewer.html'
# PDFJS = 'file:///usr/share/pdf.js/web/viewer.html'
# PDF = 'file:///path/to/my/sample.pdf'
PDF = 'file:///D:/OpenSources/ScreenPinKit/src/third_party/pdfjs-4.6.82-dist/web/compressed.tracemonkey-pldi-09.pdf'
# PDF = 'file:///D:/OpenSources/ScreenPinKit/src/third_party/Snipaste_2024-09-18_01-09-43.pdf'

class Window(QWebEngineView):
    def __init__(self):
        super().__init__()
        self.load(QtCore.QUrl.fromUserInput('%s?file=%s' % (PDFJS, PDF)))

if __name__ == '__main__':

    app = QtWidgets.QApplication(sys.argv)
    window = Window()
    window.setGeometry(600, 50, 800, 600)
    window.show()
    sys.exit(app.exec_())
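
The script above joins the viewer URL and the PDF URL with plain string formatting, which works for the paths shown. If a PDF path contains spaces or characters such as & or #, the file parameter may need to be percent-encoded first. A minimal sketch, assuming pdf.js decodes the file query parameter in the usual way (`viewer_url` is a hypothetical helper name):

```python
from urllib.parse import quote


def viewer_url(viewer_html_url: str, pdf_url: str) -> str:
    """Join the pdf.js viewer URL and the PDF URL, escaping the PDF URL for the query string."""
    # safe='' also escapes '/' and ':' so the whole PDF URL stays inside one parameter.
    return f"{viewer_html_url}?file={quote(pdf_url, safe='')}"
```

The result could then be passed to `QtCore.QUrl.fromUserInput(...)` in place of the `'%s?file=%s'` expression.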

Afterword

I wanted to debug the source in order to learn how it adds the OCR text layer. I will write up that analysis in a later blog post.
