Tesseract

Tesseract is an open-source OCR engine.

tesseract-ocr / tesseract
https://github.com/tesseract-ocr/tesseract

Pre-trained language model repository
tesseract-ocr / tessdata
https://github.com/tesseract-ocr/tessdata

Tesseract

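psm: page segmentation mode

Use the --psm option to choose the page segmentation mode, for example --psm 7.

--psm 7 suits a single line of text, such as a license plate.
--psm 8 suits a single word.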

tesseract --help-psm
Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

Tesseract Page Segmentation Modes (PSMs) Explained: How to Improve Your OCR Accuracy
https://pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/

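oem: OCR engine mode

Use the --oem option to choose the engine mode, for example --oem 1.

0 Legacy engine only
1 LSTM neural network engine only
2 Legacy + LSTM
3 Default (whichever is available)

Models in tessdata_best and tessdata_fast support only the LSTM engine (--oem 1), not the legacy engine (--oem 0). With tess4j, passing --oem 0 together with one of these newer models crashes the process outright (ERROR).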

tesseract --help-oem
OCR Engine modes:
  0    Legacy engine only.
  1    Neural nets LSTM engine only.
  2    Legacy + LSTM engines.
  3    Default, based on what is available.
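
To make the two options concrete, here is a minimal sketch that assembles a tesseract command line with explicit --psm and --oem values and rejects values outside the ranges printed by --help-psm and --help-oem. The class name TesseractCommand and the file names are made up for illustration; only the flags themselves come from the notes above.

```java
import java.util.ArrayList;
import java.util.List;

/** Builds an argument list for the tesseract CLI. Illustrative helper, not part of any library. */
final class TesseractCommand {

    /**
     * @param image   input image path
     * @param outBase output base name (tesseract writes outBase.txt)
     * @param lang    language code such as "eng" or "chi_sim"
     * @param psm     page segmentation mode, 0..13 as listed by tesseract --help-psm
     * @param oem     engine mode, 0..3 as listed by tesseract --help-oem
     */
    static List<String> build(String image, String outBase, String lang, int psm, int oem) {
        if (psm < 0 || psm > 13) {
            throw new IllegalArgumentException("psm must be 0..13, got " + psm);
        }
        if (oem < 0 || oem > 3) {
            throw new IllegalArgumentException("oem must be 0..3, got " + oem);
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);
        cmd.add(outBase);
        cmd.add("-l");
        cmd.add(lang);
        cmd.add("--psm");
        cmd.add(String.valueOf(psm));
        cmd.add("--oem");
        cmd.add(String.valueOf(oem));
        return cmd;
    }
}
```

The returned list can be handed to new ProcessBuilder(...) to run the command, for example TesseractCommand.build("plate.png", "out", "eng", 7, 1) for a single line of text with the LSTM engine.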

The three official pre-trained model repositories

Traineddata Files for Version 4.00 +
https://tesseract-ocr.github.io/tessdoc/Data-Files.html

tessdata_fast: fastest, integer models
https://github.com/tesseract-ocr/tessdata_fast
tessdata_best: highest accuracy, float models
https://github.com/tesseract-ocr/tessdata_best
tessdata: the older models (the only set that works with the legacy engine, --oem 0)
https://github.com/tesseract-ocr/tessdata

In my own use, the models in the tessdata repository were the largest files and gave the best results, even better than those in tessdata_best.

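Running OCR with the tesseract command

1. Download a pre-trained language model
https://tesseract-ocr.github.io/tessdoc/Data-Files.html
Download the Chinese model chi_sim.traineddata and put it in /usr/share/tesseract/4/tessdata, or put it in any directory and pass that directory on the command line: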

tesseract --tessdata-dir / tesseract-test.png outfile -l chi_sim

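--tessdata-dir sets the directory that holds the language model files; the default is /usr/share/tesseract/4/tessdata
tesseract-test.png is the input image
outfile is the output base name; the command writes outfile.txt
-l chi_sim selects the language

The results were good and the accuracy was high.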

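Tesseract performance

With the more accurate tessdata_best models:
a long passage of Chinese is slow, taking nearly 20 seconds to return a result;
a mix of 40 to 50 Chinese and English characters still takes about 10 seconds.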

# time tesseract --tessdata-dir / tesseract-test.png outfile -l chi_sim
Tesseract Open Source OCR Engine v4.1.3 with Leptonica

real    0m18.322s
user    0m41.052s
sys     0m0.276s

Switching to the faster tessdata_fast models helps somewhat, and the results are not much worse:

# time tesseract --tessdata-dir / tesseract-test.png outfile -l chi_sim
Tesseract Open Source OCR Engine v4.1.3 with Leptonica

real    0m14.212s
user    0m43.986s
sys     0m0.107s

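Tesseract best practices

Using Java Graphics2D, I fill a 100 * 40 black rectangle in the bottom-left corner of the image and draw a string of white digits on it.

With Tesseract 4.1.3, the following settings recognized these white-on-black digits with 100% accuracy in my tests:

use the legacy eng language model
set oem to 0, i.e. legacy mode
leave psm at its default
-c tessedit_char_whitelist=0123456789 restricts the whitelist to digits

Oddly, on Tesseract 4.1.3 the legacy eng model with oem=0 works far better than the best eng model with oem=1.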

Installing Tesseract on CentOS 7

Installing Tesseract 4.1.3 on CentOS 7 with yum

https://tesseract-ocr.github.io/tessdoc/Installation.html
https://tesseract-ocr.github.io/tessdoc/InstallationOpenSuse.html

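Installing as the official documentation describes failed with:
Public key for leptonica-1.76.0-2.5.x86_64.rpm is not installed
Following this issue:
Public key for tesseract-4.00~git2686-1.1.x86_64.rpm is not installed
https://github.com/tesseract-ocr/tesseract/issues/1749

add --nogpgcheck to skip the public key check: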

sudo yum-config-manager --add-repo http://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/RHEL_7/
sudo yum update -y
sudo yum install tesseract -y --nogpgcheck

Check the version:

[centos@lightsail lib64]$ tesseract -v
tesseract 4.1.3
 leptonica-1.76.0
  libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libwebp 0.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE

Alexander_Pozdnyakov also provides a yum repository for tesseract5, but it requires CentOS 8
https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov:/tesseract5/

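Before installing, I backed up /usr/lib64 to /matt/lib64/. A diff after the install shows which libs were added. The point of this step is to find out which libs tesseract needs, so they can be packaged into the SpringBoot image for offline use.
Later, repeating the install and comparison on a different Linux version also turned up libpng15.so.15 and libpng15.so.15.13.0.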

# diff -r /usr/lib64/ /matt/lib64/
Only in /usr/lib64/: libgomp.so.1
Only in /usr/lib64/: libgomp.so.1.0.0
Only in /usr/lib64/: libjbig85.so.2.0
Only in /usr/lib64/: libjbig.so.2.0
Only in /usr/lib64/: libjpeg.so.62
Only in /usr/lib64/: libjpeg.so.62.1.0
Only in /usr/lib64/: liblept.so.5
Only in /usr/lib64/: liblept.so.5.0.3
Only in /usr/lib64/: libtesseract.so.4
Only in /usr/lib64/: libtesseract.so.4.0.1
Only in /usr/lib64/: libtiff.so.5
Only in /usr/lib64/: libtiff.so.5.2.0
Only in /usr/lib64/: libtiffxx.so.5
Only in /usr/lib64/: libtiffxx.so.5.2.0
Only in /usr/lib64/: libwebpmux.so.0
Only in /usr/lib64/: libwebpmux.so.0.0.0
Only in /usr/lib64/: libwebp.so.4
Only in /usr/lib64/: libwebp.so.4.0.2

As of 2023.7.1 the latest version installed this way is still tesseract 4.1.3, not tesseract 5.x, but it works fine with tess4j-5.7.0.

Building and installing Tesseract 5.2.0 on CentOS 7

1. Install the build tools
yum install -y gcc gcc-c++ make autoconf automake libtool libjpeg libpng libtiff zlib libjpeg-devel libpng-devel libtiff-devel zlib-devel

2. Upgrade to gcc 8 (building Tesseract 5 requires C++17)
yum install -y centos-release-scl
yum install -y devtoolset-8-gcc*
mv /usr/bin/gcc /usr/bin/gcc-4.8.5
ln -s /opt/rh/devtoolset-8/root/bin/gcc /usr/bin/gcc
mv /usr/bin/g++ /usr/bin/g++-4.8.5
ln -s /opt/rh/devtoolset-8/root/bin/g++ /usr/bin/g++
gcc --version
gcc (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)
g++ --version
g++ (GCC) 8.3.1 20190311 (Red Hat 8.3.1-3)

3. Install leptonica (Tesseract relies on leptonica for image processing)
wget http://www.leptonica.org/source/leptonica-1.82.0.tar.gz
tar zxf leptonica-1.82.0.tar.gz
cd leptonica-1.82.0/
./configure && make && make install

Add these environment variables to /etc/profile, then reload it:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
export LIBLEPT_HEADERSDIR=/usr/local/include
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
source /etc/profile

4. Build and install tesseract 5.2.0
wget https://github.com/tesseract-ocr/tesseract/archive/refs/tags/5.2.0.tar.gz
tar xvf 5.2.0.tar.gz
cd tesseract-5.2.0
./autogen.sh
./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
make && make install

When it finishes, it prints:
Libraries have been installed in:
/usr/local/lib
If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:

add LIBDIR to the `LD_LIBRARY_PATH' environment variable
during execution
add LIBDIR to the `LD_RUN_PATH' environment variable
during linking
use the `-Wl,-rpath -Wl,LIBDIR' linker flag
have your system administrator add LIBDIR to `/etc/ld.so.conf'

Then check the version:

# tesseract -v
tesseract 5.2.0
 leptonica-1.82.0
  libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7
 Found SSE4.1
 Found OpenMP 201511

5. Download a language model
cd /usr/local/share/tessdata/
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
This downloads the legacy English model; the fast or best one works too.

6. Test
tesseract 25.png output
Afterwards output.txt holds the OCR result.

https://www.jianshu.com/p/edfabeaf6ba8
http://www.nanstar.top/p/wiki_1649411481701
https://gist.github.com/zhuth/b75dd8440abb0771e510efa1f410086e

Using tess4j with SpringBoot on CentOS 7

To use tess4j on Linux, tesseract has to be installed first; otherwise OCR fails because the libs below cannot be found:

java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract':
libgomp.so.1: cannot open shared object file: No such file or directory
libgomp.so.1: cannot open shared object file: No such file or directory
Native library (linux-x86-64/libtesseract.so) not found in resource path ([jar:file:/blog-server.jar!/BOOT-INF/classes!/,

java.lang.UnsatisfiedLinkError: Error loading shared library liblept.so.5: No such file or directory (needed by /root/.cache/JNA/temp/jna4202543007498402592.tmp)

java.lang.UnsatisfiedLinkError: libgomp.so.1: cannot open shared object file: No such file or directory

java.lang.UnsatisfiedLinkError: libtiff.so.5: cannot open shared object file: No such file or directory

java.lang.UnsatisfiedLinkError: libjpeg.so.62: cannot open shared object file: No such file or directory

java.lang.UnsatisfiedLinkError: libwebp.so.4: cannot open shared object file: No such file or directory

java.lang.UnsatisfiedLinkError: libjbig.so.2.0: cannot open shared object file: No such file or directory

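If the Linux host is offline, docker can simulate a CentOS 7 environment on a local machine: start a SpringBoot+CentOS7 image locally, install tesseract inside the container, and copy the libs out. Mind the architecture: on an M1 Mac the default may be arm64/aarch64.

1. Install tesseract with yum. The installed version is 4.1.3, which in my tests works fine with tess4j-5.7.0.

2. Copy the relevant lib files out separately. They are the ones below, found by comparing /usr/lib64 before and after installing tesseract.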

# ls /usr/lib64/ |egrep "libgomp|libjbig|libjpeg|liblept|libtess|libtiff|libwebpmux|libwebp" |xargs -i cp /usr/lib64/{} /tesseract-lib-4.1.3

# ls /tesseract-lib-4.1.3
libgomp.so.1      libjbig85.so.2.0  libjpeg.so.62      liblept.so.5      libtesseract.so.4      libtiff.so.5      libtiffxx.so.5      libwebpmux.so.0      libwebp.so.4
libgomp.so.1.0.0  libjbig.so.2.0    libjpeg.so.62.1.0  liblept.so.5.0.3  libtesseract.so.4.0.1  libtiff.so.5.2.0  libtiffxx.so.5.2.0  libwebpmux.so.0.0.0  libwebp.so.4.0.2

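3. Put the lib files above in the Maven project's /src/main/resources/linux-x86-64 directory (note that only on amd64/x86_64 does tess4j look in the linux-x86-64 subdirectory of the classpath; other architectures use a different directory name).
Alternatively, since my SpringBoot service is deployed with docker on a centos7 base image, I package the libs straight into the container's /usr/lib64 directory and set LD_LIBRARY_PATH so that /usr/lib64 is searched.
The key part of the Dockerfile: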

ADD devops/tesseract-lib-4.1.3/* /usr/lib64/

# tesseract lib directory
ENV LD_LIBRARY_PATH $LD_LIBRARY_PATH:/usr/lib64

After that, start the SpringBoot service and tess4j OCR works normally.
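
Since a missing lib only shows up as an UnsatisfiedLinkError on the first OCR call, it can help to check the lib directory at startup. Below is a rough sketch of such a check; the class NativeLibCheck is hypothetical, and the list of names is my own selection from the error messages and the /usr/lib64 diff above for Tesseract 4.1.3, so other versions or distributions will likely need a different list.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Reports which shared libraries needed by Tesseract 4.1.3 are absent from a directory. */
final class NativeLibCheck {

    /** Names taken from the UnsatisfiedLinkError messages and the lib64 diff in this article. */
    static final List<String> REQUIRED = Arrays.asList(
            "libtesseract.so.4", "liblept.so.5", "libgomp.so.1", "libtiff.so.5",
            "libjpeg.so.62", "libwebp.so.4", "libjbig.so.2.0");

    /** Returns the required library names that do not exist as files in dir. */
    static List<String> missingIn(File dir) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!new File(dir, name).isFile()) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

Calling NativeLibCheck.missingIn(new File("/usr/lib64")) when the service starts and logging a non-empty result gives a clearer message than the JNA stack trace.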

Installing Linux and deploying a tess4j project (CentOS 7 as an example)
https://blog.csdn.net/makang110/article/details/122623811

How to support OCR with tess4j in a Linux environment
https://www.jianshu.com/p/134a09c5af9e

Deploying tesseract-ocr on Linux to support tess4j
https://blog.csdn.net/dhx20022889/article/details/122939939

Tess4J 4.0.2 on Linux in practice [solved: Tess4J - Native library (linux-x86-64/libtesseract.so) not found in resource path]
https://www.cnblogs.com/socketqiang/p/10960800.html

Dependency problems and unsuccessful attempts when using Tess4j on Linux
https://blog.desmondcobb.org/archives/671

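Using tess4j on an M1 Mac

1. Install
brew install tesseract
tesseract -v should print the version information

brew list tesseract shows the install path
/opt/homebrew/Cellar/tesseract/5.2.0/lib

2. Copy libtesseract.5.dylib into the Java project's resources folder and rename it libtesseract.dylib (do not copy the libtesseract.dylib in lib directly; it is only a symlink)
Without libtesseract.dylib the following error appears: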

java.lang.UnsatisfiedLinkError: Unable to load library 'tesseract':
dlopen(libtesseract.dylib, 0x0009): tried: '/Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home/bin/./libtesseract.dylib' (no such file), 'libtesseract.dylib' (no such file), '/usr/local/lib/libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file), '/Users/xxx/git/my/spring-boot-masikkk/common/libtesseract.dylib' (no such file), '/usr/local/lib/libtesseract.dylib' (no such file), '/usr/lib/libtesseract.dylib' (no such file)

3. Use the default language model, or download a pre-trained one such as the English eng.traineddata, put it in any directory, and point setDatapath at that directory.

Tess4J

nguyenq / tess4j
https://github.com/nguyenq/tess4j

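The Tess4J jar ships with English language model files

The tess4j jar pulled in by maven contains a tessdata directory with the pre-trained eng.traineddata and osd.traineddata models. The following code selects the models bundled in the jar:

instance.setDatapath(LoadLibs.extractTessResources("tessdata").getAbsolutePath()); // without your own language models, the bundled defaults can be used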

Tess4J usage example

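// Imports assume JUnit 4, commons-io, and Lombok (@Slf4j on the test class provides `log`)
import java.awt.Rectangle;
import java.io.File;
import java.io.InputStream;
import lombok.SneakyThrows;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.util.LoadLibs;
import org.apache.commons.io.FileUtils;
import org.junit.Test;

@Test
@SneakyThrows
public void testTess4j() {
    // Load the image from the classpath and copy it to a temp file
    InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("bg_night.png");
    File imageFile = File.createTempFile("temp", ".png");
    FileUtils.copyInputStreamToFile(inputStream, imageFile);

    // Create the Tesseract instance, then set the language and the model directory
    ITesseract instance = new Tesseract();  // JNA Interface Mapping
//        instance.setLanguage("chi_sim"); // Chinese
    instance.setLanguage("eng"); // English
    // Without your own language models, use the eng.traineddata and osd.traineddata bundled in the tess4j jar
    instance.setDatapath(LoadLibs.extractTessResources("tessdata").getAbsolutePath());
    // Or point at a directory holding your own trained language models
//        instance.setDatapath("/Users/user/git/my/spring-boot-masikkk/common/src/test/resources");

    // OCR over the whole image
    long ts = System.currentTimeMillis();
    String result = instance.doOCR(imageFile);
    log.info("result: {}, elapsed ms: {}", result, System.currentTimeMillis() - ts);

    // OCR over a region: x,y are measured from the top-left corner, width and height extend from x,y
    Rectangle rect = new Rectangle(8, 604, 59, 18);
    ts = System.currentTimeMillis();
    result = instance.doOCR(imageFile, rect);
    log.info("result: {}, elapsed ms: {}", result, System.currentTimeMillis() - ts);

    // Set a character whitelist so that only digits are detected
    ts = System.currentTimeMillis();
    instance.setVariable("tessedit_char_whitelist", "0123456789");
    result = instance.doOCR(imageFile, rect);
    log.info("result: {}, elapsed ms: {}", result, System.currentTimeMillis() - ts);
}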

https://juejin.cn/post/7066642049537146893

https://www.cnblogs.com/pejsidney/p/9487881.html

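tessedit_char_whitelist: setting a character whitelist

If you know which fixed set of characters the image contains, setting the tessedit_char_whitelist whitelist makes the results more accurate.
For example, to detect digits only: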
instance.setVariable("tessedit_char_whitelist", "0123456789");

tess4j Set only to identify numbers and letters
https://stackoverflow.com/questions/42430384/tess4j-set-only-to-identify-numbers-and-letters

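Tess4J does not support concurrent multi-threaded access to one instance

Problem:
Creating a single global ITesseract instance = new Tesseract() and then running OCR on it from several threads at once triggers a low-level C++ error, for example: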

static_cast<unsigned>(id) < this->size():Error:Assert failed:in file src/ccutil/unicharset.cpp, line 283

Solution:
Create a new Tesseract() instance for every OCR call.
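
A variation on this fix is to keep one instance per thread instead of one per call, so that no instance is ever shared. The sketch below shows that pattern with a plain ThreadLocal. It is an alternative I am suggesting, not what the notes above do, and I have not measured whether it is faster with Tess4J. PerThreadEngine is a made-up generic class; it takes a factory so the example does not depend on Tess4J itself.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Gives every thread its own engine so that no instance is shared between threads. */
final class PerThreadEngine<E> {

    private final ThreadLocal<E> engines;

    /** The factory is called once per thread, the first time that thread needs an engine. */
    PerThreadEngine(Supplier<E> factory) {
        this.engines = ThreadLocal.withInitial(factory);
    }

    /** Runs the task against the calling thread's private engine and returns its result. */
    <R> R with(Function<E, R> task) {
        return task.apply(engines.get());
    }
}
```

With Tess4J on the classpath, the factory would create and configure a Tesseract (language, data path), and the task would call doOCR, wrapped so that its checked TesseractException is handled inside the lambda.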

NPE during concurrent thread access of a single tess4j instance
https://stackoverflow.com/questions/28954476/npe-during-concurrent-thread-access-of-a-single-tess4j-instance

Tess4j on Windows 64-bit: exception on multiple threads
https://stackoverflow.com/questions/24799038/tess4j-on-windows-64-bit-exception-on-multiple-threads

How to use multi thread in tess4j
https://github.com/nguyenq/tess4j/issues/46

Multi threading / parallel processing - Java 8 - JVM 64 bit - Tess4J 1.3.0 / 1.4.1
https://sourceforge.net/p/tess4j/discussion/1202293/thread/4562eccb/

Error caused by a JNA version conflict

instance.doOCR fails with:

java.lang.NoSuchMethodError: com.sun.jna.Native.load(Ljava/lang/String;Ljava/lang/Class;)Lcom/sun/jna/Library;
    at net.sourceforge.tess4j.util.LoadLibs.getTessAPIInstance(LoadLibs.java:83)

Cause:
A JNA version conflict. The Dependency Analyzer plugin shows two JNA versions on the classpath: JNA-4.5.1 pulled in by elasticsearch-7.6.2 and JNA-5.13.0 pulled in by tess4j-5.7.0.

Fix:
Exclude JNA from the elasticsearch-7.6.2 dependency.

https://github.com/testcontainers/testcontainers-java/issues/3734

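A model/engine-mode mismatch crashes the Java process outright (Error)

For example, using the eng best language model with ocrEngineMode set to 0 produces the ERROR below and the process crashes: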

Error: Tesseract (legacy) engine requested, but components are not present in /root/apps/da/tesseract-traineddata/eng_best.traineddata!!
Failed loading language 'eng_best'
Tesseract couldn't load any languages!
Warning: Invalid resolution 0 dpi. Using 70 instead.
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f9c06c43737, pid=7, tid=0x00007f9c0e8c9700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_202-b08) (build 1.8.0_202-b08)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.202-b08 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libtesseract.so.4.0.1+0xc9737]  tesseract::Tesseract::recog_all_words(PAGE_RES*, ETEXT_DESC*, TBOX const*, char const*, int)+0x637
#
# Core dump written. Default location: /root/apps/da/core or core.7
#
# An error report file with more information is saved as:
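
Because the crash happens in native code, a Java try/catch cannot recover from it; the only defence is to refuse the bad combination before calling into Tesseract. The following is a minimal sketch of such a guard. EngineModeGuard and its boolean parameter are hypothetical: the caller has to know which repository a model came from, and the guard covers only the oem 0 case described in these notes.

```java
/** Refuses the model/engine combination that this article found to crash the JVM. */
final class EngineModeGuard {

    /**
     * @param supportsLegacy true for models from tessdata, false for tessdata_best or tessdata_fast
     * @param oem            engine mode, 0..3 as listed by tesseract --help-oem
     * @throws IllegalArgumentException if oem is out of range, or is 0 for an LSTM-only model
     */
    static void check(boolean supportsLegacy, int oem) {
        if (oem < 0 || oem > 3) {
            throw new IllegalArgumentException("oem must be 0..3, got " + oem);
        }
        // oem 0 asks for the legacy engine, which best/fast models cannot provide
        if (oem == 0 && !supportsLegacy) {
            throw new IllegalArgumentException(
                    "oem 0 (legacy engine) needs a model from tessdata, not tessdata_best/tessdata_fast");
        }
    }
}
```

Calling EngineModeGuard.check(false, 0) before configuring the instance turns the SIGSEGV into an ordinary exception that the service can log and handle.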
