Extracting Table Data from PDFs in Java with tabula

Date: 2020-08-07. Tags: java, tabula, pdf, table data

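Problem: how do you extract the data of a specific table from a PDF file?

I tried two libraries, pdfbox and tabula, and ended up choosing tabula.

Comparing the two tools

pdfbox

pdfbox can extract the content of a PDF directly into a String. Code snippet: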

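import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Assumes PDFBox 2.x; PDFBox 3.x replaces PDDocument.load with Loader.loadPDF.
public static void readPdf(String path) {
    // try-with-resources closes the document even if extraction throws
    try (PDDocument document = PDDocument.load(new File(path))) {
        PDFTextStripper textStripper = new PDFTextStripper();
        // Order the extracted text by its position on the page
        textStripper.setSortByPosition(true);
        String text = textStripper.getText(document);
        System.out.println(text);
    } catch (IOException e) {
        e.printStackTrace();
    }
}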

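However, with table data like the following, formatting is lost. No matter how many empty cells lie between two values, they are all converted into a single tab character (\t).

[Figure: input1.pdf]

After conversion to a String, it looks like this:

Advantages of pdfbox: quick and simple to use. Once the Maven dependency is added, PDFTextStripper.getText() is all it takes to extract the text.

Disadvantage of pdfbox: formatting is lost when extracting table data that contains consecutive empty cells.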
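To see why a single tab per gap is a problem, here is a small sketch using made-up row data (the cell values are hypothetical, not taken from input1.pdf). A row with two empty cells in the middle keeps its four columns when every cell boundary is preserved, as in CSV, but shrinks to two columns once the empty cells collapse into one tab, so the last value can no longer be matched to its column.

```java
import java.util.Arrays;

class CollapsedCellsDemo {
    // Split a row on a delimiter; limit -1 keeps empty cells, including trailing ones.
    static String[] cells(String row, String delimiter) {
        return row.split(delimiter, -1);
    }

    public static void main(String[] args) {
        // Hypothetical row: a name, two empty cells, then a value.
        String csvRow = "apple,,,42";       // every cell boundary preserved
        String collapsedRow = "apple\t42";  // empty cells collapsed into one tab

        System.out.println(Arrays.toString(cells(csvRow, ",")));
        System.out.println(Arrays.toString(cells(collapsedRow, "\t")));
    }
}
```

The first split yields four cells and the second only two, which is the information loss described above.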

tabula

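The focus here is tabula. It is also built on pdfbox underneath, but the wrapping it adds makes it better suited to extracting tables with complex layouts.

The same PDF table, converted to CSV, looks like this:

[Figure: output1.csv]

The table is reproduced essentially perfectly.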

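Next I tried converting tables with other layouts.

[Figure: input2.pdf]

[Figure: output2.csv]

[Figure: input3.pdf]

[Figure: output3.csv]

Test results: input1 and input2 are reproduced almost exactly. input3 shows some differences, but the values read back through a BufferedReader largely match the PDF.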
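As a rough sketch of that check, the CSV can be read back line by line with a BufferedReader. The class name CsvReadBack and the naive comma split are assumptions for illustration. The split ignores CSV quoting rules, so a proper CSV parser would be safer if any cell contains a comma.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class CsvReadBack {
    // Read all rows from a CSV source; each row becomes an array of cell strings.
    static List<String[]> readRows(Reader source) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Naive split: keeps empty cells but does not understand quoted fields.
                rows.add(line.split(",", -1));
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Inline sample data stands in for a file such as output1.csv;
        // for a real file, pass new FileReader(path) instead.
        String sample = "name,q1,q2\napple,,42\n";
        for (String[] row : readRows(new StringReader(sample))) {
            System.out.println(row.length + " cells: " + String.join("|", row));
        }
    }
}
```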

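Using tabula

1. Getting tabula

1.1 Getting the source

Download tabula-java-master.zip from https://github.com/tabulapdf/tabula-java, use Eclipse to build tabula into a jar, and then reference that jar from your own project. You can also download tabula-1.0.2-jar-with-dependencies.jar directly.

1.2 Getting the Windows client

Download tabula-win-1.2.0.zip from https://tabula.technology, unzip it, and run tabula.exe.

2. Usage

2.1 Reading the README.md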

## Usage Examples
`tabula-java` provides a command line application:
$ java -jar target/tabula-1.0.2-jar-with-dependencies.jar --help

usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-d] [-f
       <FORMAT>] [-g] [-h] [-i] [-l] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r]
       [-s <PASSWORD>] [-t] [-u] [-v]
Tabula helps you extract tables from PDFs

 -a,--area <AREA>           Portion of the page to analyze. Accepts top,
                            left,bottom,right.
                            Example: --area 269.875,12.75,790.5,561.
                            If all values are between 0-100 (inclusive)
                            and preceded by '%', input will be taken as
                            % of actual height or width of the page.
                            Example: --area %0,0,100,50.
                            To specify multiple areas, -a option should 
                            be repeated. Default is entire page
 -b,--batch <DIRECTORY>     Convert all .pdfs in the provided directory.
 -c,--columns <COLUMNS>     X coordinates of column boundaries. Example
                            --columns 10.1,20.2,30.3
 -d,--debug                 Print detected table areas instead of
                            processing.
 -f,--format <FORMAT>       Output format: (CSV,TSV,JSON). Default: CSV
 -g,--guess                 Guess the portion of the page to analyze per
                            page.
 -h,--help                  Print this help text.
 -i,--silent                Suppress all stderr output.
 -l,--lattice               Force PDF to be extracted using lattice-mode
                            extraction (if there are ruling lines
                            separating each cell, as in a PDF of an Excel
                            spreadsheet)
 -n,--no-spreadsheet        [Deprecated in favor of -t/--stream] Force PDF
                            not to be extracted using spreadsheet-style
                            extraction (if there are no ruling lines
                            separating each cell)
 -o,--outfile <OUTFILE>     Write output to <file> instead of STDOUT.
                            Default: -
 -p,--pages <PAGES>         Comma separated list of ranges, or all.
                            Examples: --pages 1-3,5-7, --pages 3 or
                            --pages all. Default is --pages 1
 -r,--spreadsheet           [Deprecated in favor of -l/--lattice] Force
                            PDF to be extracted using spreadsheet-style
                            extraction (if there are ruling lines
                            separating each cell, as in a PDF of an Excel
                            spreadsheet)
 -s,--password <PASSWORD>   Password to decrypt document. Default is empty
 -t,--stream                Force PDF to be extracted using stream-mode
                            extraction (if there are no ruling lines
                            separating each cell)
 -u,--use-line-returns      Use embedded line returns in cells. (Only in
                            spreadsheet mode.)
 -v,--version               Print version and exit.

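Some of these optional arguments are worth using depending on the situation.

-a: specifies a rectangular area; only that area is parsed, similar to PDFTextStripperByArea.addRegion() in pdfbox. -a takes four comma-separated values, which are, in order:

the distance (or percentage) from the top edge of the page to the top edge of the area

the distance (or percentage) from the left edge of the page to the left edge of the area

the distance (or percentage) from the top edge of the page to the bottom edge of the area

the distance (or percentage) from the left edge of the page to the right edge of the area

A leading % means the values are percentages, for example -a %10,0,90,100.

-o: writes the result to a file; followed by the file path.

-p: extracts the specified pages; followed by a page number, a comma-separated list of ranges, or all. Defaults to page 1 if not specified.

-t: extracts in stream mode. The help text describes this mode as intended for tables with no ruling lines between cells; I use it when the table contains merged cells.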
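Because -a expects top,left,bottom,right while a region is often described as x, y, width, height, a small conversion helper can be handy. The sketch below assumes both descriptions use the top-left corner of the page as the origin and the same units; the class and method names are made up for this example.

```java
class AreaArg {
    // Convert an x/y/width/height rectangle into tabula's "top,left,bottom,right" form.
    static String fromRect(double x, double y, double width, double height) {
        double top = y;
        double left = x;
        double bottom = y + height; // still measured from the top edge of the page
        double right = x + width;   // still measured from the left edge of the page
        return top + "," + left + "," + bottom + "," + right;
    }

    // Percentage form: a leading '%' marks the four values as percentages of the page.
    static String fromPercent(int top, int left, int bottom, int right) {
        return "%" + top + "," + left + "," + bottom + "," + right;
    }

    public static void main(String[] args) {
        System.out.println("-a " + fromRect(12.75, 269.875, 548.25, 520.625));
        System.out.println("-a " + fromPercent(10, 0, 90, 100));
    }
}
```

With the sample numbers, the first call reproduces the area from the help text, with the last value written as 561.0 rather than 561.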

2.2 Running from the command line

Run the jar directly from the cmd command prompt:

java -jar tabula-1.0.2.jar E:\tmp\input\input1.pdf -o E:\tmp\output\output1.csv

2.3 Calling from a program

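// Backslashes must be escaped inside a Java string literal.
String cmd = "java -jar tabula-1.0.2.jar E:\\tmp\\input\\input1.pdf -o E:\\tmp\\output\\output1.csv";
// exec throws IOException; waitFor blocks until tabula has finished writing the CSV.
Runtime.getRuntime().exec(cmd).waitFor();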
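An alternative sketch for calling the jar from code uses ProcessBuilder with an argument list, which avoids splitting a single command string and keeps paths containing spaces intact. The jar name and paths are the ones used above; the class name TabulaRunner and the choice to expose only -p and -t are assumptions for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class TabulaRunner {
    // Build the argument list for the tabula command line application.
    static List<String> buildCommand(String jar, String input, String output,
                                     String pages, boolean stream) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jar);
        cmd.add(input);
        cmd.add("-o");
        cmd.add(output);
        if (pages != null) {  // e.g. "1-3,5-7" or "all"; tabula defaults to page 1
            cmd.add("-p");
            cmd.add(pages);
        }
        if (stream) {         // -t forces stream-mode extraction
            cmd.add("-t");
        }
        return cmd;
    }

    // Run the command and return its exit code; inheritIO forwards the child's output to this console.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = buildCommand("tabula-1.0.2.jar",
                "E:\\tmp\\input\\input1.pdf", "E:\\tmp\\output\\output1.csv", "all", false);
        System.out.println(String.join(" ", cmd));
        // Uncomment once the jar and the input file exist on this machine:
        // System.out.println("exit code: " + run(cmd));
    }
}
```

Checking the exit code returned by run is a simple way to notice when tabula failed before trying to read the output file.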

Reposted from: https://www.cnblogs.com/kong90hou/p/9138219.html