A First Look at cnocr

2021-04-08

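I came across a good article on Zhihu, "A lighter Chinese OCR: cnocr-V1.2.2, whose smallest model is only 4.7M", in which the author shares a Chinese/English OCR Python library that is small, fast and accurate on horizontal Chinese text, and simple to deploy and use. I gave it a try, found that it works quite well, and built an OCR tool for my own use on top of it.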

cnocr repository: https://github.com/breezedeus/cnocr

Online installation

1. virtualenv/anaconda3
2. win32OpenSSL

conda create -n cnocr python=3.7
conda activate cnocr
pip install cnocr
pip install cnstd

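Note: after running pip install cnstd, the online installation of shapely is problematic. Download an offline shapely_xxx.whl file (for example Shapely-1.7.1-cp37-cp37m-win_amd64.whl) and install it locally, as shown below: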

pip uninstall Shapely
pip install your_path/Shapely-1.7.1-cp37-cp37m-win_amd64.whl
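
After the steps above, a quick check that the packages are visible to the active environment can save some debugging. The following is a minimal sketch; the helper name missing_modules is made up for illustration, and it only tests whether each module can be located, not whether its native libraries load correctly.

```python
import importlib.util

def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    # Import names of the packages installed in the steps above
    required = ["cnocr", "cnstd", "shapely"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

If shapely shows up as missing, the offline wheel was probably installed into a different environment than the one currently activated.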

Installing from source

cd cnstd/
pip install -r requirements.txt
python setup.py build
python setup.py install

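Offline package deployment

If you would rather not wrestle with the cnocr environment setup, here is an offline toolkit I put together: extract it to a path that contains no Chinese characters or spaces, then run ocr_deploy.bat as administrator.

Download link: cnocr offline toolkit (cnocr_toolkit.zip), extraction code: y80u

The offline package supports 64-bit (x64) Windows only.

Code snippets

Plain-text recognition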

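from io import BytesIO

import mxnet as mx
import numpy as np
from cnocr import CnOcr

# Shared recognizer instance used by both functions
cn_ocr = CnOcr()

# Recognize a plain-text image: convert the PIL image to a numpy array
def ocr_single_png_from_cache(image):
    nd_array = np.asarray(image.convert('RGB'))
    res = cn_ocr.ocr(nd_array)
    # Each recognized line is a list of characters; join them into strings
    res = [''.join(line_p) for line_p in res]
    temp_res = '\n'.join(res)
    return temp_res

# Alternative: encode the PIL image as PNG and decode it with mxnet
def ocr_single_png_from_cache_bak(image):
    img_bytes = BytesIO()
    image.save(img_bytes, format='PNG')
    nd_array = mx.image.imdecode(img_bytes.getvalue())
    res = cn_ocr.ocr(nd_array)
    res = [''.join(line_p) for line_p in res]
    temp_res = '\n'.join(res)
    return temp_res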
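
The joining step in the functions above can be tried without a model. This is a small sketch that assumes, as the snippet's own list comprehension implies, that the recognizer returns one list of characters per text line; the sample data is invented.

```python
def lines_to_text(ocr_lines):
    """Join a list of per-line character lists into newline-separated text."""
    return "\n".join("".join(chars) for chars in ocr_lines)

# Hypothetical result for a two-line image
sample = [["c", "n", "o", "c", "r"], ["O", "C", "R"]]
print(lines_to_text(sample))
```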

Calling cnocr from an external process

import subprocess

# Read the output of a shell command
def popen_wrapper(cmd_line):
    p = subprocess.Popen(cmd_line, shell=True, stdout=subprocess.PIPE)
    # Skip the first line of output
    lines = p.stdout.readlines()[1:]
    temp_str = ""
    for line in lines:
        utf8_str = line.decode("utf8")
        temp_str = temp_str + utf8_str
    return temp_str

# Try to run OCR on an image through the external batch file
def try_ocr_image(image_path):
    cmd_line = "xxx/try_ocr_image.bat %s" % (image_path)
    ocr_str = popen_wrapper(cmd_line).strip()
    return ocr_str
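
A variant of the wrapper above, sketched with subprocess.run and an argument list instead of a shell string, so that an image path containing spaces needs no manual quoting. The function name run_and_capture and the skip_lines parameter are illustrative choices, and the demo uses a child Python process as a stand-in for the batch file.

```python
import subprocess
import sys

def run_and_capture(args, skip_lines=0, encoding="utf8"):
    """Run a command given as an argument list and return its stdout as text.

    skip_lines drops leading output lines (the wrapper above drops the first one).
    """
    completed = subprocess.run(args, stdout=subprocess.PIPE, check=True)
    text = completed.stdout.decode(encoding, errors="replace")
    return "\n".join(text.splitlines()[skip_lines:]).strip()

if __name__ == "__main__":
    # Stand-in for the batch file: a child Python process printing two lines
    demo = [sys.executable, "-c", "print('banner'); print('recognized text')"]
    print(run_and_capture(demo, skip_lines=1))
```

With check=True a non-zero exit code raises CalledProcessError instead of silently returning partial output.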

Clipboard OCR

# -*- coding: utf-8 -*-
import os

import win32.win32clipboard as win32clipboard
from PIL import Image, ImageGrab

# Run OCR on a PIL image
def ocr_image_from_pil(image):
    from cnocr import CnOcr
    import numpy as np
    cn_ocr = CnOcr()
    nd_array = np.asarray(image.convert('RGB'))
    res_lines = cn_ocr.ocr(nd_array)
    res = [''.join(line_p) for line_p in res_lines]
    temp_res = '\n'.join(res)
    return temp_res

# Read image data from the clipboard and run OCR on it
def try_ocr_clipboard():
    im = ImageGrab.grabclipboard()
    # grabclipboard() returns None when there is no image,
    # and a list of file names when files were copied
    if isinstance(im, Image.Image):
        return ocr_image_from_pil(im)
    return ""

# Put the recognized text back on the clipboard
def set_text_to_clipboard(text):
    win32clipboard.OpenClipboard()
    try:
        win32clipboard.EmptyClipboard()
        win32clipboard.SetClipboardText(text, win32clipboard.CF_UNICODETEXT)
    finally:
        win32clipboard.CloseClipboard()

def ocr_clipboard():
    ocr_result = try_ocr_clipboard()
    if len(ocr_result) > 0:
        set_text_to_clipboard(ocr_result)
        print("Clipboard OCR result:\n%s" % (ocr_result))
        os.system("pause")

ocr_clipboard()
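
The script above constructs CnOcr() on every recognition call, which is fine for a one-shot tool. If the function were called repeatedly in one process, a lazy singleton would likely avoid reloading the model each time. The sketch below shows the idea; lazy_singleton, make_engine and get_engine are hypothetical names, and nothing is imported from cnocr until get_engine() is first called.

```python
def lazy_singleton(factory):
    """Return a getter that calls `factory` once and then reuses the result."""
    cache = []

    def get():
        if not cache:
            cache.append(factory())
        return cache[0]

    return get

def make_engine():
    # Import lazily so that start-up stays fast when no OCR is requested
    from cnocr import CnOcr
    return CnOcr()

# The model is loaded on the first get_engine() call only
get_engine = lazy_singleton(make_engine)
```

A caller would then use get_engine().ocr(nd_array) in place of building a new recognizer.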

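Clipboard OCR tool: ocr_clipboard.zip

Note: TIM has a built-in screen OCR feature, invoked with Ctrl+Alt+O, and its recognition accuracy is higher.

Training on a new font

Generating a training set for a new font

First, use the following script to quickly generate a training set for the font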

pre_train_for_font.py

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import os
import argparse
from PIL import Image,ImageFont,ImageDraw
  
# Render the text to an image file
def save_chars_image(text, image_path, font, is_debug=False):
    chars_x, chars_y = 0, 0
    # getbbox() (Pillow >= 8.0) replaces font.getsize(), which was removed in Pillow 10
    _, _, chars_w, chars_h = font.getbbox(text)

    if is_debug:
        chars_w = chars_w + 2
        chars_h = chars_h + 2

    im = Image.new("RGB", (chars_w, chars_h), (255, 255, 255))
    dr = ImageDraw.Draw(im)

    # Draw a border around the text (debug mode only)
    if is_debug:
        coords = [(chars_x+1, chars_y+1), (chars_x+1, chars_y+chars_h-1),
                (chars_x+chars_w-1, chars_y+chars_h-1), (chars_x+chars_w-1,chars_y+1)]
        dr.polygon(coords, outline=(255, 0, 10))

    # Draw the text from the top-left origin (align only affects multi-line text)
    dr.text((chars_x, chars_y), text, font=font, fill=(0,0,0), align='center')
    im.save(image_path)
 
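# Map each character of the text to its 1-based label index
def indexing(standards, new_chars, text):
    res = []
    for ch in text:
        try:
            res.append(standards.index(ch) + 1)
        except ValueError:
            # Unknown character: record it once; its index follows the existing labels
            if ch not in new_chars:
                new_chars.append(ch)
            res.append(len(standards) + new_chars.index(ch) + 1)
    return res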
  
def clear_invalid_chars(char_array):
    for i in range(len(char_array)):
        char_array[i] = char_array[i].strip('\n')

def main():
    parser = argparse.ArgumentParser(description='Generate a dataset for CnOcr training')

    parser.add_argument("-root", "--root_dir",
        default="data",
        type=str,
        help="directory for the pre-training configuration",
    )

    parser.add_argument("-examples", "--examples_dir",
        default="examples",
        type=str,
        help="directory that holds the sample images",
    )

    parser.add_argument("-font", "--font_path",
        default="fonts/卷卷桃心中文字体.ttf",
        type=str,
        help="path of the font to train on",
    )

    parser.add_argument("-font_size", "--font_size",
        default=20,
        type=int,
        help="size of the font to train on",
    )

    parser.add_argument("-label", "--label_path",
        default="label_cn.txt",
        type=str,
        help="source text file",
    )

    parser.add_argument("-train", "--train_name",
        default="train.txt",
        type=str,
        help="name of the training index file",
    )

    parser.add_argument("-test", "--test_name",
        default="test.txt",
        help="name of the test index file",
    )

    parser.add_argument("-is_test", "--is_test",
        action="store_true",
        help="only render a test image",
    )

    parser.add_argument("-test_text", "--test_text",
        default="",
        help="text for the test image",
    )

    args = parser.parse_args()

    root_dir = args.root_dir
    images_dir = args.examples_dir
  
    label_path = args.label_path
    train_path = args.train_name
    test_path = args.test_name
  
    font = ImageFont.truetype(args.font_path, args.font_size)
 
    # In test mode, render one preview image and exit before touching any dataset file
    if args.is_test and len(args.test_text) > 0:
        image_path = "test.png"
        save_chars_image(args.test_text, image_path, font=font)
        return

    # Read the label file (one entry per line) and strip trailing newlines
    with open(label_path, 'r', encoding='utf-8') as label_file:
        standards = label_file.readlines()
    clear_invalid_chars(standards)

    new_chars = []

    train_file = open(os.path.join(root_dir, train_path), 'w', encoding='utf-8')
    test_file = open(os.path.join(root_dir, test_path), 'w', encoding='utf-8')
  
    # Generate the training image set
    for i in range(len(standards)):
        text = standards[i]
        idxes = indexing(standards, new_chars, text)
 
        cnt = "train_%06d.jpg" % (i+1)
        image_path = os.path.join(images_dir, cnt)
        save_chars_image(text, image_path, font=font)
  
        for idx in idxes:
            cnt = cnt + " {}".format(idx)
        train_file.write(cnt+'\n')
    train_file.close()
 
    # Generate the test image set (every 30th entry)
    for i in range(len(standards)):
        if (i+1) % 30 == 0:
            text = standards[i]
            idxes = indexing(standards, new_chars, text)
 
            cnt = "test_%06d.jpg" % (i+1)
            image_path = os.path.join(images_dir, cnt)
            save_chars_image(text, image_path, font=font)
  
            for idx in idxes:
                cnt = cnt + " {}".format(idx)
            test_file.write(cnt+'\n') 
    test_file.close()
 
    # Append newly seen characters to the label file
    label_file = open(label_path, 'a', encoding='utf-8')
    for new_char in new_chars:
        label_file.write(new_char+'\n')
    label_file.close()

if __name__ == '__main__':
    main()
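
Each line the script writes to train.txt or test.txt has the form "image name, then one 1-based label index per character". The sketch below reproduces that format with a dictionary lookup instead of list.index, which would likely be faster for large label files. The helper names and the three-entry label list are made up for illustration.

```python
def build_label_index(labels):
    """Map each label entry to its 1-based position, keeping the first occurrence."""
    index = {}
    for pos, label in enumerate(labels, start=1):
        index.setdefault(label, pos)
    return index

def format_index_line(image_name, text, label_index):
    """Return '<image_name> <id> <id> ...' with one id per character of `text`."""
    ids = [label_index[ch] for ch in text]
    return " ".join([image_name] + [str(i) for i in ids])

labels = ["a", "b", "c"]  # hypothetical label file content, one entry per line
line = format_index_line("train_000001.jpg", "cab", build_label_index(labels))
print(line)
```

Unlike the script, this sketch raises KeyError on an unknown character rather than extending the label list.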

The following batch file generates the training set for a given font

@echo off

call "%~dp0..\set_ocr_env.bat"

cd /d %~dp0

python scripts/pre_train_for_font.py -font_size 40 -font fonts/中国式手写风字体.ttf -root data/sample-data

pause

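Iterative training on top of an existing model

The official cnocr guide on training your own model already covers the basic training workflow, but it leaves out some details of the training process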

train:
    python scripts/cnocr_train.py --gpu 0 --emb_model_type $(EMB_MODEL_TYPE) --seq_model_type $(SEQ_MODEL_TYPE) \
        --optimizer adam --epoch 20 --lr 1e-4 \
        --train_file $(REC_DATA_ROOT_DIR)/sample-data_train --test_file $(REC_DATA_ROOT_DIR)/sample-data_test

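Judging by --epoch 20, this command runs 20 full passes over the training data, and since --load_epoch nums is not specified, training starts from scratch. So how do you continue training an existing model on an incremental training set?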

Use --load_epoch start_times --epoch times: training resumes from the model saved at the specified epoch and iterates on the incremental training set
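
As a rough illustration, the command line for a fresh run and for a resumed run can be assembled as follows. The flags are the ones from the Makefile target above plus --load_epoch; the helper build_train_cmd and the example values are assumptions. Whether --epoch counts total epochs or additional ones is not stated here, so check the training script before relying on either reading.

```python
def build_train_cmd(emb_model_type, seq_model_type, data_prefix,
                    epoch, load_epoch=None, lr="1e-4"):
    """Assemble the cnocr_train.py argument list; add --load_epoch only when resuming."""
    cmd = ["python", "scripts/cnocr_train.py", "--gpu", "0",
           "--emb_model_type", emb_model_type,
           "--seq_model_type", seq_model_type,
           "--optimizer", "adam", "--epoch", str(epoch), "--lr", lr,
           "--train_file", data_prefix + "_train",
           "--test_file", data_prefix + "_test"]
    if load_epoch is not None:
        cmd += ["--load_epoch", str(load_epoch)]
    return cmd

# Hypothetical values: resume from the epoch-20 checkpoint
resume = build_train_cmd("conv-lite", "fc", "data/sample-data-lst/sample-data",
                         epoch=40, load_epoch=20)
print(" ".join(resume))
```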

Running OCR with the model from a specific epoch

Pass the --model-epoch times parameter to run OCR with the model from that epoch; evaluate uses the same parameter
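
To decide which value to pass, it helps to see which epochs actually have saved checkpoints. MXNet conventionally saves checkpoints as prefix-NNNN.params next to prefix-symbol.json; the sketch below assumes the model directory follows that naming, and the function name and demo file names are made up.

```python
import os
import re
import tempfile

def available_epochs(model_dir):
    """List the epoch numbers found in files named '<prefix>-NNNN.params'."""
    pattern = re.compile(r"-(\d{4})\.params$")
    epochs = []
    for name in os.listdir(model_dir):
        match = pattern.search(name)
        if match:
            epochs.append(int(match.group(1)))
    return sorted(epochs)

if __name__ == "__main__":
    # Demo on a temporary directory with made-up checkpoint names
    with tempfile.TemporaryDirectory() as tmp:
        for fname in ("demo-0001.params", "demo-0020.params", "demo-symbol.json"):
            open(os.path.join(tmp, fname), "w").close()
        print(available_epochs(tmp))
```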

To make iterative training convenient on Windows as well, I adapted the official Makefile into a Makefile.bat

Makefile.bat

@echo off

call "%~dp0..\set_ocr_env.bat"

title cnstd-train

set MXNET_CPU_WORKER_NTHREADS=2
set DATA_ROOT_DIR=data/sample-data
set REC_DATA_ROOT_DIR=data/sample-data-lst
set IMAGES_DIR=examples

:: Allowed values for `EMB_MODEL_TYPE`: ['conv', 'conv-lite-rnn', 'densenet', 'densenet-lite']
set EMB_MODEL_TYPE=conv-lite
:: Allowed values for `SEQ_MODEL_TYPE`: ['lstm', 'gru', 'fc']
set SEQ_MODEL_TYPE=fc
set MODEL_NAME=%EMB_MODEL_TYPE%-%SEQ_MODEL_TYPE%

@REM ------------------------------------------------------------------

:do

cls

echo 1: gen_lst

echo 2: gen_rec

echo 3: train

echo 4: evaluate

echo 5: predict

echo 6: package

echo 7: upload

set /p o=

if %o%==1 goto gen_lst

if %o%==2 goto gen_rec

if %o%==3 goto train

if %o%==4 goto evaluate

if %o%==5 goto predict

if %o%==6 goto package

if %o%==7 goto upload

goto end

@REM ------------------------------------------------------------------

:: Generate the *.lst files
:gen_lst
	echo python scripts/im2rec.py --list --num-label 20 --chunks 1 --train-idx-fp %DATA_ROOT_DIR%/train.txt --test-idx-fp %DATA_ROOT_DIR%/test.txt --prefix %REC_DATA_ROOT_DIR%/sample-data
	python scripts/im2rec.py --list --num-label 20 --chunks 1 --train-idx-fp %DATA_ROOT_DIR%/train.txt --test-idx-fp %DATA_ROOT_DIR%/test.txt --prefix %REC_DATA_ROOT_DIR%/sample-data

pause
GOTO do

:: Use the *.lst files to generate the *.idx and *.rec files.
:: The actual image files live in the `examples` directory, which can be set with `--root`.
:gen_rec
	echo python scripts/im2rec.py --pack-label --color 1 --num-thread 1 --prefix %REC_DATA_ROOT_DIR% --root %IMAGES_DIR%
	python scripts/im2rec.py --pack-label --color 1 --num-thread 1 --prefix %REC_DATA_ROOT_DIR% --root %IMAGES_DIR%

pause
GOTO do

:: Train the model
:: To continue training from a given epoch's model, pass `--load_epoch start_times --epoch times`
:train
	python scripts/cnocr_train.py --gpu 0 --emb_model_type %EMB_MODEL_TYPE% --seq_model_type %SEQ_MODEL_TYPE% --optimizer adam --epoch 20 --lr 1e-4 --train_file %REC_DATA_ROOT_DIR%/sample-data_train --test_file %REC_DATA_ROOT_DIR%/sample-data_test

pause
GOTO do

:: Evaluate the model on the test set; details of all bad cases are written to the folder `evaluate/%MODEL_NAME%`
:: To evaluate the model from a specific epoch, pass `--model-epoch 1`
:evaluate
	python scripts/cnocr_evaluate.py --model-name %MODEL_NAME% --model-epoch 1 -v -i %DATA_ROOT_DIR%/test.txt --image-prefix-dir examples --batch-size 128 -o evaluate/%MODEL_NAME%

pause
GOTO do

:predict
	python scripts/cnocr_predict.py --model_name %MODEL_NAME% --file examples/rand_cn1.png

pause
GOTO do

:package
	python setup.py sdist bdist_wheel

pause
GOTO do

:upload
	set VERSION=1.2.2
	python -m twine upload  dist/cnocr-%VERSION%* --verbose

pause
GOTO do

:end
pause
GOTO :eof

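Learning plan

✔ Get cnocr's recognition pipeline running
✔ Build an offline deployment package for quick setup on other machines
✔ Build a few tools for personal use, such as OCR on clipboard images
✔ Get the training pipeline running and keep improving recognition accuracy
- ✔ Provide a pre-training script for new fonts
- ✔ Provide a script suited to Windows: Makefile.bat
✔ Try to improve its text-direction correction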

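Its text-direction correction currently tends to flip text by 180 degrees, so I looked for a way to deal with that. The cause turned out to be a bug in OpenCV 4.5, and downgrading with pip install opencv-python==4.4.0.46 fixes it. Recognition of vertical text, however, is not officially supported yet and needs a separate solution.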
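
A small guard can flag an environment that matches the observation above. This is a sketch only: the 4.5 threshold comes from this post's experience rather than from any official changelog, later OpenCV releases may behave differently, and both helper names are invented.

```python
def version_tuple(version):
    """Turn '4.5.1.48' into (4, 5, 1, 48); non-numeric parts are ignored."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())

def needs_downgrade(version, first_bad=(4, 5)):
    """Return True when major.minor is at or above the first affected release."""
    return version_tuple(version)[:2] >= first_bad

# Typical use (requires opencv-python):
#   import cv2
#   if needs_downgrade(cv2.__version__):
#       print("Consider: pip install opencv-python==4.4.0.46")
```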

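Afterword

I ran into problems deploying under Python 3.8, so I recommend Python 3.7 or earlier; see the official tutorial for details

Comparing several OCR projects myself, I found cnocr very accurate and fairly fast on plain-text images, provided the text orientation is correct: once the text is flipped vertically or horizontally, accuracy drops sharply. It also performs poorly on advertising-style images. Even so, it is far ahead of tesseract

If your needs mostly involve advertising images and other non-plain-text images, consider paddleocr or chineseocr_lite instead, though these OCR projects are somewhat more complex to deploy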

P.S. The mxnet library emits a warning that gets annoying but does not affect usage; you can comment out the warning output at line 925 of site-packages\mxnet\symbol\symbol.py
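
A less invasive alternative may be to filter the warning from your own script instead of editing the installed package. The sketch below assumes the message is emitted through Python's warnings module, which has not been verified here for that line; the prefix "Cannot decide" is a placeholder for the real message text, and silence_warnings is a made-up helper.

```python
import warnings

def silence_warnings(message_prefix):
    """Ignore warnings whose message starts with `message_prefix` (a regex)."""
    warnings.filterwarnings("ignore", message=message_prefix)

# Demonstration with stand-in warnings; the filter is scoped to this block
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    silence_warnings("Cannot decide")                        # placeholder prefix
    warnings.warn("Cannot decide shape for some arguments")  # suppressed
    warnings.warn("Something else")                          # still recorded
print(len(caught))
```

In a real script, silence_warnings would be called once before importing cnocr, and the filter would survive package upgrades that overwrite symbol.py.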

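Building the offline toolkit

Offline toolkit = Python 3 environment (extracted from anaconda3) + scripts that set up the system environment variables + official models

The offline packages of many open-source AI tools found online are built the same way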

References

A lighter Chinese OCR: cnocr-V1.2.2, whose smallest model is only 4.7M
OCRSpace

Supports OCR on very long images