How to Use pdf2htmlEX

"High-fidelity PDF to HTML conversion"

Posted by Yummy on March 30, 2020

This post draws on:

Pdf2html: High-Fidelity PDF to HTML Conversion
How to use pdf2htmlEX, the open-source pdf2html (PDF to HTML) tool, on Windows

Introduction to pdf2htmlEX

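Traditional pdf2html tools come in two kinds:

  • One is essentially pdf2text plus some weak formatting, and the result is barely different from pdf2text.
  • The other renders everything to images and embeds them in an HTML file. All text information is lost (it cannot be selected or copied), and the generated files are huge.

pdf2htmlEX combines the strengths of both: it preserves the text as well as the formatting. Specifically, it has the following features:

  1. Extracts fonts from the PDF
  2. Keeps rendering accurate while optimizing for the web (including reducing file size, merging text lines, and re-encoding fonts so that text can be selected in HTML)
  3. Displays all other content as images
  4. Single-file output: one HTML file contains everything

Links: pdf2htmlEX open-source home page, detailed introduction, Chinese discussion group, Windows executable download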

Download

P.S.

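The author is Lu Wang; the original project and source code are at pdf2htmlEX.

I only made some modifications and compiled it for Windows; the modified source code is here.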

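Usage

The build process is fairly complicated; for details see: Pdf2html: High-Fidelity PDF to HTML Conversion

Extract

I chose to simply extract the prebuilt package.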

Download: pdf2htmlEX-win32-0.14.6-upx-with-poppler-data.zip

Or: pdf2htmlEX-win32-0.14.6-with-poppler-data.zip

Extract it (the extraction path must not contain any Chinese characters!)
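The path requirement above can be checked before extracting. The following is a minimal Python sketch, not part of pdf2htmlEX. It assumes the slightly stricter rule that the path should be pure ASCII, and the helper name `is_ascii_path` is made up for illustration.

```python
def is_ascii_path(path: str) -> bool:
    """Return True if every character in the path is plain ASCII."""
    return path.isascii()


if __name__ == "__main__":
    # The second path is a made-up example of a directory with Chinese characters.
    for candidate in (r"d:\pdfex", "d:\\工具\\pdfex"):
        verdict = "ok" if is_ascii_path(candidate) else "contains non-ASCII characters"
        print(f"{candidate}: {verdict}")
```

A path that fails this check is a candidate for moving the extracted folder somewhere simpler, such as the d:\pdfex directory used later in this post.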

Place the PDF

Put the PDF file you want to convert into the pdf2htmlEX extraction directory

  • data
  • test
  • abc.pdf
  • AUTHORS
  • ChangeLog
  • LICENSE
  • LICENSE_GPLv3
  • pdf2htmlEx.exe
  • README.md

Run pdf2htmlEX

Open a command prompt and change to the pdf2htmlEX extraction directory

cd d:\pdfex
d:

Run pdf2htmlEX from the command line to do the conversion:

pdf2htmlex --zoom 1.8 abc.pdf

When it finishes, an HTML file with the same name as the PDF is generated in the same directory

D:\pdfex>pdf2htmlex --zoom 1.8 abc.pdf
Preprocessing: 750/750
Working: 3/750

In the end, the directory will contain:

abc.html
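The conversion step above can also be driven from a script. This is a rough Python sketch under a few assumptions: the function names `build_command`, `expected_html_name`, and `convert` are hypothetical, the executable is assumed to be reachable as `pdf2htmlEX` from the working directory, and the output name follows the "same name as the PDF" behavior described in this post.

```python
import subprocess
from pathlib import PureWindowsPath


def build_command(pdf_name: str, zoom: float = 1.8) -> list:
    """Build the argument list for the --zoom invocation used in this post."""
    return ["pdf2htmlEX", "--zoom", str(zoom), pdf_name]


def expected_html_name(pdf_name: str) -> str:
    """Same base name as the PDF, with an .html extension."""
    return PureWindowsPath(pdf_name).with_suffix(".html").name


def convert(pdf_name: str, workdir: str = r"d:\pdfex") -> str:
    """Run the conversion inside workdir and return the expected output name.

    Assumes pdf2htmlEX and the PDF both live in workdir, as in this post.
    """
    subprocess.run(build_command(pdf_name), cwd=workdir, check=True)
    return expected_html_name(pdf_name)
```

Calling `convert("abc.pdf")` would run the same command as above; with `check=True`, a non-zero exit code from pdf2htmlEX raises `subprocess.CalledProcessError` instead of passing silently.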

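A simple script

To make this more convenient, I tried writing a batch script to do the job, so that I don't have to open a CMD window by hand every time.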

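test.bat:

```bat
cd /d %~dp0

@echo off

rem Ask for the PDF to convert (a file name in this directory, or a full path).
set /P htmltname=Please input your file path : 

:: echo "%htmltname%"

set /P c=Choose the mode (1/2) ?  
if /I "%c%" EQU "1" goto one
if /I "%c%" EQU "2" goto two
rem Any other answer: skip the conversion instead of falling through to mode 1.
echo Unknown mode "%c%", nothing converted.
goto end

:one
rem Mode 1: one HTML file, rendered with --zoom 1.8.
pdf2htmlEX.exe --zoom 1.8   "%htmltname%"
echo Generate one finished!
goto end

:two
rem Mode 2: HTML plus separate resource files, written to the out directory.
pdf2htmlEX.exe --embed cfijo --dest-dir out "%htmltname%"
echo  Generate two finished!
:end

@echo Please open file path
rem Open the script directory in Explorer.
start /min "" "%~dp0"

pause
```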

Usage looks like this:

E:\Github\pdf2html>cd /d E:\Github\pdf2html\
Please input your file path : svg_background_with_page_rotation_issue402.pdf
Choose the mode (1/2) ?  1
Preprocessing: 1/1
Working: 1/1

Generate one finished!
Please open file path
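Press any key to continue . . .

  1. Copy the PDF file to E:\Github\pdf2html (adjust this to your own setup)
  2. Launch the script by clicking it, then just paste in the name of the PDF file you want to convert
  3. Choose the mode: 1 generates a single standalone HTML file; 2 generates a folder containing the HTML file and its elements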
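The two modes differ only in the options passed to pdf2htmlEX. As I understand the pdf2htmlEX manual, each lowercase letter in `--embed cfijo` turns embedding off for one resource type (CSS, fonts, images, JavaScript, outline), which is why mode 2 produces several files under `out`. The mapping can be sketched in Python as follows; `MODE_ARGS` and `command_for_mode` are hypothetical names used only for illustration.

```python
# Options per mode, copied from the two pdf2htmlEX.exe lines in test.bat.
MODE_ARGS = {
    "1": ["--zoom", "1.8"],                          # single HTML file
    "2": ["--embed", "cfijo", "--dest-dir", "out"],  # HTML plus separate files in out
}


def command_for_mode(mode: str, pdf_name: str) -> list:
    """Return the argument list for the chosen mode, or raise on an unknown mode."""
    try:
        args = MODE_ARGS[mode.strip()]
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
    return ["pdf2htmlEX.exe", *args, pdf_name]


if __name__ == "__main__":
    for mode in ("1", "2"):
        print(mode, "->", " ".join(command_for_mode(mode, "abc.pdf")))
```

Raising on an unknown mode makes a mistyped choice visible right away rather than silently running one of the two conversions.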

Note: there are still bugs that I have not covered here. The one I ran into is that converting a fairly large PDF prints the following:

E:\Github\pdf2html>cd /d E:\Github\pdf2html\
Please input your file path : test.pdf
Choose the mode (1/2) ?  1
Preprocessing: 189/189
Lookup 'mark' Mark Positioning in Arabic lookup 1 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
Lookup 'mark' Mark Positioning in Arabic lookup 0 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
Lookup 'mark' Mark Positioning in Arabic lookup 1 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
Lookup 'mark' Mark Positioning in Arabic lookup 0 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
Lookup 'mark' Mark Positioning in Arabic lookup 1 has an
offset bigger than 65535 bytes. This means
FontForge must use an extension lookup to output it.
Not all applications support extension lookups.
^CTerminate batch job (Y/N)?

I will look into this more closely when I get the chance.