XML has brought us many tools for all kinds of tasks. They make a lot of work easy, but they make developers lazy too. It is getting harder and harder to find really skilled people these days. New developers haven’t learned from doing things the hard way, even though doing things the hard way still often produces the best results. Here is an example of converting PDF to XML using XSLT 2.0 as a cheap solution, done the ‘hard’ way.
One of our customers approached us (I am speaking on behalf of my company, Daidalos, here). They had PDF documents with pretty clear layouts, and specific parts of these had to be converted to XML on a regular basis. Obviously, they didn’t want to spend much on it, so we had to look for a cheap solution.
The task seemed simple. Visually identifying the interesting parts and classifying the information was easy enough. We were hoping to automate the process, though. This meant we needed a conversion from PDF to XML that preserved sufficient layout detail to distinguish and classify the interesting bits.
There are plenty of tools readily available that help with converting PDF to XML. Most of them cost money, though, and the good ones cost a lot of money. We didn’t have that much to spend, so we had to settle for one of the cheapest that could do the trick. Most of the cheaper ones are cheap for a good reason: they are either meant for text extraction only (with a bare minimum of layout), or are good at some aspects of the conversion, but not at what we needed: preservation of sufficient layout detail.
One of my colleagues, though, stumbled upon a very small executable called pdf2html, freely available on the internet. It is actually not very good at converting PDF to XML (you need the -xml option), as it occasionally produces ill-formed XML. Nor does it recognize much in the PDF. But it does two things pretty well:
1. identify all text runs within the PDF (with a bit of layout information),
2. add accurate positioning information to the text runs.
It extracts enough information to produce HTML and CSS that mimic the exact textual layout of the PDF. That accurate positioning information is all we need to identify and classify the relevant information. It is this tool that we employed for this task.
This bare conversion to XML was only part of the job, though; most of the work still had to be done. Just to give an impression, let me show you a sample of its output.
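A minimal sketch of what such output looks like (the element and attribute names follow the tool’s -xml output format; the text content is invented for illustration):

    <pdf2xml>
      <page number="1" width="918" height="1188">
        <fontspec id="0" size="20" family="Times" color="#000000"/>
        <text top="86" left="108" width="223" height="24" font="0"><b>1 Introduction</b></text>
        <text top="132" left="108" width="680" height="17" font="1">Continuous text gets chun-</text>
        <text top="152" left="108" width="655" height="17" font="1">ked into separate lines.</text>
      </page>
    </pdf2xml>

Every text run carries its exact position on the page, and that is the asset everything below builds on.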
The output is actually quite messy. The largest issues are:
· Continuous text is chunked into separate lines
· Chunked lines cause hyphenated words to get cut in half
· Lines can be chunked into separate text runs as well
· No hierarchy in the text structure
· No advanced formatting, like tables and lists
· No separation of headers, footers, and margin texts
· No images or other drawn layout, not even table borders
In short: it looks like a literal translation of the PDF text floats to XML, and text floats only. Making sense of this mess is a real challenge. And I went for it.
As said, the output has one really valuable asset: the positioning information. This positioning information helps in coping with most of the above problems; it just needs a careful approach. By applying general enhancements you can get a more logical separation of the information. Once there, you can add meaning to the information.
I roughly took the following steps, in the order mentioned, to apply general enhancements to the raw XML output:
1. Isolating text runs in different zones (like header and footer)
2. Gathering text runs on the same ‘line’ (ignoring columns here)
3. Translating indentation to hierarchy (helps find lists, provides bare table/column handling)
4. Merging lines to build paragraphs
For the first step you need to manually ‘measure’ the heights of the header and footer zones. But once they are known, you can separate text runs belonging to headers from those belonging to footers, and from those belonging to the body. I simply wrapped them in new elements named ‘header’, ‘footer’, and ‘body’.
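A minimal sketch of this first step, assuming the -xml output shown above. The thresholds of 100 and 1100 are invented values that you would measure by hand; in practice each step runs as its own pass or in its own mode (the stylesheets linked in the comments below use modes):

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

      <!-- Split each page into header, body, and footer zones,
           based on manually measured top positions -->
      <xsl:template match="/pdf2xml">
        <document>
          <xsl:apply-templates select="page"/>
        </document>
      </xsl:template>

      <xsl:template match="page">
        <page number="{@number}">
          <header><xsl:copy-of select="text[@top &lt; 100]"/></header>
          <body><xsl:copy-of select="text[@top &gt;= 100 and @top &lt;= 1100]"/></body>
          <footer><xsl:copy-of select="text[@top &gt; 1100]"/></footer>
        </page>
      </xsl:template>

    </xsl:stylesheet>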
The second step is not very difficult either: search within each page for text runs that have the same top position, and wrap them in a ‘line’ element. A simple xsl:for-each-group on the ‘top’ attribute did the trick.
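Sketched as a pass over the result of the previous step (grouping strictly on @top; runs that are off by a pixel would need a small tolerance in practice):

    <!-- Gather text runs sharing the same vertical position into lines,
         sorted from left to right -->
    <xsl:template match="header | body | footer">
      <xsl:copy>
        <xsl:for-each-group select="text" group-by="@top">
          <line top="{current-grouping-key()}">
            <xsl:for-each select="current-group()">
              <xsl:sort select="number(@left)"/>
              <xsl:copy-of select="."/>
            </xsl:for-each>
          </line>
        </xsl:for-each-group>
      </xsl:copy>
    </xsl:template>

    <!-- The usual identity template copies everything else unchanged -->
    <xsl:template match="@* | node()">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
    </xsl:template>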
The third step is a bit similar to the second, but instead of looking at the top position, you look at the ‘left’ attribute of each line. Lines starting further to the right than the ones above them are wrapped in nested ‘block’ elements. Having separated header and footer earlier makes sure their contents don’t get mixed with text from the body.
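One way to sketch this is with a small recursive function: the lines at the smallest remaining indent each start a ‘block’, and the more deeply indented lines below them are nested by recursion (the function name and namespace are invented for this sketch; declare xmlns:my="urn:example:pdf" on the stylesheet):

    <!-- Recursively nest lines into blocks based on their left position -->
    <xsl:function name="my:nest">
      <xsl:param name="lines" as="element(line)*"/>
      <xsl:variable name="min"
          select="min(for $l in $lines return number($l/text[1]/@left))"/>
      <xsl:for-each-group select="$lines"
          group-starting-with="line[number(text[1]/@left) = $min]">
        <block left="{$min}">
          <xsl:copy-of select="current-group()[1]"/>
          <xsl:sequence select="my:nest(current-group()[position() gt 1])"/>
        </block>
      </xsl:for-each-group>
    </xsl:function>

Calling my:nest(line) on the lines of a body, header, or footer then yields the nested block structure.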
At this point the output has already improved a lot. Roughly sketched (with invented content, continuing the sample above), it more or less looks like this:
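    <body>
      <block left="108">
        <line top="132"><text top="132" left="108">Continuous text gets chun-</text></line>
        <line top="152"><text top="152" left="108">ked into separate lines.</text></line>
        <block left="144">
          <line top="192"><text top="192" left="144">An indented line, one level deeper.</text></line>
        </block>
      </block>
    </body>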
The fourth step requires a bit more effort, but involves little more than joining the contents of adjacent ‘line’ elements. The ‘block’ elements help to identify where paragraphs should start and end. This is also a good opportunity to resolve hyphenation of words: take any hyphen at the end of a line, cut it off, and concatenate with the following line without adding an extra space in between. Not entirely fail-safe, but the loss is marginal.
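A sketch of that merge, ignoring nested blocks for brevity; the hyphen handling is exactly the cut-and-concatenate just described:

    <!-- Merge the lines of a block into a single paragraph, joining
         hyphenated line ends without a space and other lines with one -->
    <xsl:template match="block">
      <para>
        <xsl:for-each select="line">
          <xsl:variable name="t"
              select="normalize-space(string-join(text, ' '))"/>
          <xsl:choose>
            <xsl:when test="ends-with($t, '-') and position() ne last()">
              <xsl:value-of select="substring($t, 1, string-length($t) - 1)"/>
            </xsl:when>
            <xsl:otherwise>
              <xsl:value-of select="$t"/>
              <xsl:if test="position() ne last()">
                <xsl:text> </xsl:text>
              </xsl:if>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:for-each>
      </para>
    </xsl:template>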
With just four small steps, the output can be greatly enhanced, making it much easier to isolate the interesting information and add meaning to it. From there it took only about 20 lines of real XSLT code to get to the end result.
In the end I needed only about 800 lines of XSLT code to get to this result, and nearly half of them are empty lines and comments. With roughly 250 more lines of actual code I was able to produce two more XML extracts, one of which involved interpreting complex tabular information.
I’m not saying this approach can crack all your PDF to XML problems. But I do hope to have shown that doing something the ‘hard’ way doesn’t have to be all that hard. And it can produce high-quality results, perhaps even higher than using a more advanced PDF to XML tool.
Hi, this is an interesting post, but I have some doubts:
1. What XSLT processor are you using?
2. Please, could you post your XSLT template?
Thanks!
I used Saxon 8+ in this case. Most interesting code is available on Github: https://github.com/grtjn/utilities/blob/master/enrich-pdf2html.xsl
Cheers
Hi,
What is the pdf2html tool you are referring to? Can you post a link?
Thanks!
It was a while ago, but I am pretty certain it was http://pdftohtml.sourceforge.net/
Unfortunately it wasn’t kept up to date very well, so you might have some issues with recent PDFs.
Thanks for getting back! I got to the third step using
https://github.com/grtjn/utilities/blob/master/enrich-pdf2html.xsl
But is there a different xsl to get to step 4?
Thanks!
Good point. I thought I had integrated that in the enrich XSLT, but apparently not. I have pushed two more XSLTs to the same GitHub project:
https://github.com/grtjn/utilities
All templates in them have a mode attribute, so you need to start in that mode to use them directly, or wrap them in another to switch to their modes.
I used them as includes from yet other XSLTs, but those were not worth sharing. Too specific, and not very exciting.
Cheers