Problem: how can the data of a specific table be extracted from a PDF file?
Libraries I tried: pdfbox and tabula. I ended up choosing tabula.
Comparing the two tools
- pdfbox
pdfbox can extract the content of a PDF directly into a String. Code snippet:
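import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper; // package name as in PDFBox 2.x

public static void readPdf(String path) {
    // try-with-resources closes the document even if extraction fails
    try (PDDocument document = PDDocument.load(new File(path))) {
        PDFTextStripper textStripper = new PDFTextStripper();
        textStripper.setSortByPosition(true); // order text by page position
        String text = textStripper.getText(document);
        System.out.println(text);
    } catch (IOException e) {
        e.printStackTrace();
    }
}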
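However, with table data like the example below, the layout is lost. No matter how many empty cells sit between two values, they are all converted into a single tab character (\t).
input1.pdf
After conversion to a String it looks like this:
pdfbox advantages: quick and simple to use. Once the Maven dependency is added, PDFTextStripper.getText() is all that is needed to extract the text.
pdfbox drawback: when the table contains consecutive empty cells, the column layout is lost during extraction.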
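To see why this collapse matters, here is a minimal sketch in plain Java with no PDF library involved. The class name, the method names, and the sample row are all hypothetical and only serve as an illustration. With every tab preserved, the empty cells keep their positions; once a run of tabs has been collapsed into one, the value slides into the wrong column and the original column count cannot be recovered.

```java
import java.util.Arrays;

class TabCollapseDemo {
    // Split a tab-separated row; the limit -1 keeps trailing empty cells.
    static String[] columns(String row) {
        return row.split("\t", -1);
    }

    // Mimic the loss described above: any run of tabs becomes a single tab.
    static String collapseTabs(String row) {
        return row.replaceAll("\t+", "\t");
    }

    public static void main(String[] args) {
        // Hypothetical row: two empty cells between A and D.
        String original = "A\t\t\tD";
        String collapsed = collapseTabs(original);
        System.out.println(Arrays.toString(columns(original)));  // [A, , , D]
        System.out.println(Arrays.toString(columns(collapsed))); // [A, D]
    }
}
```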
- tabula
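The focus here is tabula. It is also built on pdfbox underneath, but the wrapper it adds is better suited to extracting tables with complex layouts.
The same PDF table, converted to CSV, looks like this: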
output1.csv
The table is reproduced essentially perfectly.
Next I tried converting tables with other layouts.
input2.pdf
output2.csv
input3.pdf
output3.csv
Test results: input1 and input2 are reproduced almost exactly. input3 shows some differences, but the values read back with BufferedReader basically match the PDF.
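Reading the generated CSV back with BufferedReader, as mentioned above, might look like the following sketch. It assumes a simple CSV in which no cell contains a comma, a quote, or a line break. The class name CsvPeek and the sample rows in main are hypothetical. If the real output contains quoted cells, a proper CSV parser is the safer choice.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CsvPeek {
    // Read every line and split it on commas.
    // Assumption: no quoted cells. The limit -1 keeps empty trailing cells.
    static List<String[]> readRows(Reader source) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(line.split(",", -1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Invented sample data; for a real file, pass a Reader opened on the CSV instead.
        String sample = "name,,amount\nfoo,,\n";
        for (String[] row : readRows(new StringReader(sample))) {
            System.out.println(row.length + " cells: " + Arrays.toString(row));
        }
    }
}
```

Comparing the cell count of each row against the number of columns in the PDF is a quick way to spot rows where the conversion differs.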
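Using tabula
1. Getting it
1.1 Getting the source code
Download tabula-java-master.zip from https://github.com/tabulapdf/tabula-java, use Eclipse to package tabula as a jar, and then reference that jar from your own project. Alternatively, download tabula-1.0.2-jar-with-dependencies.jar directly.
1.2 Getting the Windows client
Download tabula-win-1.2.0.zip from https://tabula.technology, unzip it, and run tabula.exe.
2. Usage
2.1 Reading README.md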
## Usage Examples

`tabula-java` provides a command line application:

$ java -jar target/tabula-1.0.2-jar-with-dependencies.jar --help
usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-d] [-f <FORMAT>]
       [-g] [-h] [-i] [-l] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r]
       [-s <PASSWORD>] [-t] [-u] [-v]

Tabula helps you extract tables from PDFs

 -a,--area <AREA>           Portion of the page to analyze. Accepts top,
                            left,bottom,right. Example: --area
                            269.875,12.75,790.5,561. If all values are
                            between 0-100 (inclusive) and preceded by '%',
                            input will be taken as % of actual height or
                            width of the page. Example: --area %0,0,100,50.
                            To specify multiple areas, -a option should be
                            repeated. Default is entire page
 -b,--batch <DIRECTORY>     Convert all .pdfs in the provided directory.
 -c,--columns <COLUMNS>     X coordinates of column boundaries. Example
                            --columns 10.1,20.2,30.3
 -d,--debug                 Print detected table areas instead of processing.
 -f,--format <FORMAT>       Output format: (CSV,TSV,JSON). Default: CSV
 -g,--guess                 Guess the portion of the page to analyze per page.
 -h,--help                  Print this help text.
 -i,--silent                Suppress all stderr output.
 -l,--lattice               Force PDF to be extracted using lattice-mode
                            extraction (if there are ruling lines separating
                            each cell, as in a PDF of an Excel spreadsheet)
 -n,--no-spreadsheet        [Deprecated in favor of -t/--stream] Force PDF
                            not to be extracted using spreadsheet-style
                            extraction (if there are no ruling lines
                            separating each cell)
 -o,--outfile <OUTFILE>     Write output to <file> instead of STDOUT.
                            Default: -
 -p,--pages <PAGES>         Comma separated list of ranges, or all. Examples:
                            --pages 1-3,5-7, --pages 3 or --pages all.
                            Default is --pages 1
 -r,--spreadsheet           [Deprecated in favor of -l/--lattice] Force PDF
                            to be extracted using spreadsheet-style
                            extraction (if there are ruling lines separating
                            each cell, as in a PDF of an Excel spreadsheet)
 -s,--password <PASSWORD>   Password to decrypt document. Default is empty
 -t,--stream                Force PDF to be extracted using stream-mode
                            extraction (if there are no ruling lines
                            separating each cell)
 -u,--use-line-returns      Use embedded line returns in cells. (Only in
                            spreadsheet mode.)
 -v,--version               Print version and exit.
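Some of these optional parameters can be used as the situation requires.
-a: specifies a rectangular area. The program parses only this region, similar to pdfbox's PDFTextStripperByArea.addRegion(). -a is followed by four comma-separated values, which are, in order:
the distance (or percentage) from the top edge of the page to the top edge of the area
the distance (or percentage) from the left edge of the page to the left edge of the area
the distance (or percentage) from the top edge of the page to the bottom edge of the area
the distance (or percentage) from the left edge of the page to the right edge of the area
A leading % means the values are percentages, for example -a %10,0,90,100.
-o: writes the result to a file; it is followed by the file path.
-p: selects the pages to extract. It is followed by a page number, a comma-separated list of ranges, or all; if omitted, the default is 1.
-t: extracts in stream mode. The help text recommends it when no ruling lines separate the cells; I also use it when the table has merged cells.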
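The order of the four -a values is easy to get wrong, so here is a small sketch that assembles the argument from named parameters. The class TabulaArea and its methods are hypothetical helpers, not part of tabula. They only build the string that follows -a on the command line, in the top,left,bottom,right order described above. The check that bottom exceeds top and right exceeds left is my own sanity check, based on both pairs being measured from the same page edge.

```java
import java.math.BigDecimal;

class TabulaArea {
    // Format a number without a trailing ".0", so 10.0 becomes "10".
    private static String plain(double v) {
        return new BigDecimal(String.valueOf(v)).stripTrailingZeros().toPlainString();
    }

    // Build the value for -a in the order top,left,bottom,right.
    // With percent == true the result is prefixed with '%' and every
    // value must lie within 0-100, as the help text requires.
    static String areaArg(double top, double left, double bottom, double right, boolean percent) {
        if (bottom <= top || right <= left) {
            throw new IllegalArgumentException("bottom must exceed top, and right must exceed left");
        }
        if (percent) {
            for (double v : new double[] {top, left, bottom, right}) {
                if (v < 0 || v > 100) {
                    throw new IllegalArgumentException("percentage out of range: " + v);
                }
            }
        }
        String values = plain(top) + "," + plain(left) + "," + plain(bottom) + "," + plain(right);
        return percent ? "%" + values : values;
    }

    public static void main(String[] args) {
        System.out.println("-a " + areaArg(10, 0, 90, 100, true));              // -a %10,0,90,100
        System.out.println("-a " + areaArg(269.875, 12.75, 790.5, 561, false)); // -a 269.875,12.75,790.5,561
    }
}
```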
2.2 Running from the command line
Run the jar directly from the cmd command prompt:
java -jar tabula-1.0.2.jar E:\tmp\input\input1.pdf -o E:\tmp\output\output1.csv
2.3 Calling it from a program
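// Backslashes must be escaped in a Java string literal
String cmd = "java -jar tabula-1.0.2.jar E:\\tmp\\input\\input1.pdf -o E:\\tmp\\output\\output1.csv";
// Pass the command to exec(); it throws IOException
Runtime.getRuntime().exec(cmd);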
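Runtime.exec returns as soon as the process has started, so the caller neither waits for tabula to finish nor sees its error output. One possible alternative is ProcessBuilder with the arguments passed as a list, which also avoids quoting problems when a path contains spaces. In the sketch below the class TabulaRunner and its methods are hypothetical, and the jar name and paths are simply the ones used in this post.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class TabulaRunner {
    // Assemble the same command as in section 2.2, one argument per list element.
    static List<String> buildCommand(String jar, String inputPdf, String outputCsv) {
        return Arrays.asList("java", "-jar", jar, inputPdf, "-o", outputCsv);
    }

    // Start the process, forward its output to this console, and wait for the exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        List<String> command = buildCommand(
                "tabula-1.0.2.jar",
                "E:\\tmp\\input\\input1.pdf",
                "E:\\tmp\\output\\output1.csv");
        try {
            int exitCode = run(command);
            System.out.println("tabula exited with code " + exitCode);
        } catch (IOException e) {
            System.err.println("Could not start the process: " + e.getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A non-zero exit code means the conversion failed, so the CSV should only be read when run returns 0.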