Is Portable Document Format obsolete?

I spent two years on a project which aimed to create highly sophisticated PDF reports for teachers. These automatically generated reports contained bar charts, pie charts, line charts, a lot of tables and everything else. It worked: end users were very happy with the new reporting system. But still, something keeps bothering me.

Making PDFs can be tough. Do we still need them?

There are many ways to generate PDF files. I started with a Java class which calculated the dimensions of every table cell, the position of every text line, etc. Then I used PHP libraries like FPDF and TCPDF. The latter had a very simple HTML parsing engine. Then I came across Zend_Pdf and kept using it until I stumbled upon wkhtmltopdf. Yeah, that was a blast! Someone had put together a WebKit rendering engine and a PDF printer.

But even wkhtmltopdf was not prepared for all the challenges that my client had for me. Dynamic headers and footers. Sophisticated charts which I prepared as SVG files. Errors when splitting tables across multiple pages (with repeated table headers, of course). Lack of dynamic font size adjustment depending on the amount of text. Trouble with applying custom fonts.

Chrome 59 introduced printing to PDF in headless mode. Having a recent Chrome rendering engine is great, but I lost some features available in wkhtmltopdf (like those crazy page headers and footers).

You can already see that generating proper PDF files might require a lot of tweaking and workarounds. What if we don’t need PDFs at all? In the era of portable devices coming in all shapes and sizes, do we still have to confine our documents to A4 or any other paper format?

Is PDF always the right choice for our project?

Let’s think about the business needs for a while. The system I mentioned was designed for teachers who were used to printing everything on a daily basis. I’m not sure how it works in other countries, but Polish teachers print a lot of stuff. They download a PDF, print it and put it on their desks to discuss with pupils and their parents. That’s their habit.

I worked on a similar reporting system designed specifically for parents. The initial business decision was to copy the same well-known PDF system and make some adjustments. I asked my client: hey, do we really need to spend a lot of time fine-tuning PDFs again? Do our end users need PDFs? They have different habits, they have smartphones and tablets. They are not going to print our reports. They want a modern, responsive web app.

That’s how I convinced my client to drop PDFs in the new system. It made development a lot faster and easier. We built a modern frontend layer using ReactJS and everyone was happy with the result.

Do we still need static PDF layouts for invoices, order confirmations, tickets, magazines, official documents, reports?

In Poland, we have electronic train tickets that can be downloaded to a smartphone. Such a ticket is an ordinary PDF file with all the ticket details. But all we have to do is show a QR code during ticket inspection. The same applies to concert tickets. Why use PDF if a simple HTML page would be enough?

Further reading

Will PDF files become obsolete in 5-10 years?

Picking a PHP tool to read and manipulate PDF files

In the previous article I described several tools that can be used together with PHP to create PDF files. Back then, the choice was not easy and we had a lot of criteria to consider while picking the best tool. Today we will explore ways to read and edit existing PDF files.

Native PHP libraries

Again, we will start by checking whether there are any PHP libraries that manipulate PDF files without depending on external binary tools.

pdfparser

There is an interesting library called smalot/pdfparser. It has almost 1000 stars on GitHub. It utilizes TCPDF Parser to parse a PDF file into an array of document objects which is further processed to get what we need.

The library is convenient as it supports parsing both an existing file and a string with PDF data. It allows you to extract metadata and plain text from a document. You can test the library at its demo page.

The problem is that Sebastien’s library is based on the old parser from TCPDF version 6, which is going to be replaced some day by a newer rewrite called tc-lib-pdf-parser. However, that new parser is still under development, and Sebastien is aware of its existence.

smalot/pdfparser has commercial support from Actualys.

FPDI

I got familiar with this library when I received a bug report for a watermarking module in some e-book system. The module received a PDF, parsed it using FPDI, generated a watermark with FPDF and stamped it over all pages.

The problem is that the free version of FPDI supports only PDF version 1.4 and below. To support higher document versions, you have to buy the full library. And that’s what the bug report was about. We decided to switch to another tool, pdftk, which is described below.

Command-line tools

The first command-line tool I played with was pdftk. I used it to join separate documents into one, apply watermarks and extract basic data, like the number of pages. Unlike the free version of FPDI, it supports all PDF versions. The only thing that’s missing is a text extraction feature.
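To make these three tasks a bit more concrete, here is a minimal sketch that builds the pdftk command lines and reads the page count from the text printed by the dump_data operation. It is written in Java only for illustration; the same commands can be run from PHP. The class and method names are made up for this example, and it assumes pdftk is installed and available on the PATH.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: only the pdftk operations (cat, stamp, dump_data) are real.
class PdftkSketch {

    // Builds: pdftk a.pdf b.pdf cat output merged.pdf
    static List<String> joinCommand(List<String> inputs, String output) {
        List<String> cmd = new ArrayList<>();
        cmd.add("pdftk");
        cmd.addAll(inputs);
        cmd.add("cat");
        cmd.add("output");
        cmd.add(output);
        return cmd;
    }

    // Builds: pdftk in.pdf stamp watermark.pdf output stamped.pdf
    // (stamp puts the first page of watermark.pdf on top of every input page)
    static List<String> stampCommand(String input, String watermark, String output) {
        return List.of("pdftk", input, "stamp", watermark, "output", output);
    }

    // Reads N from the "NumberOfPages: N" line printed by: pdftk in.pdf dump_data
    // Returns -1 when the line is missing.
    static int parsePageCount(String dumpData) {
        Matcher m = Pattern.compile("(?m)^NumberOfPages:\\s*(\\d+)\\s*$").matcher(dumpData);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        System.out.println(joinCommand(List.of("a.pdf", "b.pdf"), "merged.pdf"));
        System.out.println(stampCommand("in.pdf", "watermark.pdf", "stamped.pdf"));
        System.out.println(parsePageCount("InfoBegin\nNumberOfPages: 12\n"));
    }
}
```

The returned lists can be passed straight to a process launcher; keeping the arguments as a list rather than one shell string avoids quoting problems with file names.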

The need to extract plain text from a document led me to the Apache PDFBox library. It is written in Java and, as I described before, it offers some very nice features. However, in the PHP world we can only access a CLI wrapper for that library which has a limited set of options.

Later I discovered the Poppler library, which is said to fully support the ISO 32000-1 standard for PDF. This C++ library can be accessed via dedicated CLI tools – poppler-utils, which we can run from PHP. For example, the pdftotext tool gives a lot of control over the plain text dump – you can even preserve a proper document layout while rendering, or crop the document to a specified region. Also, pdfinfo provides comprehensive information about a file, like page format, encryption type etc. You can use it to extract JavaScript too.
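As a rough sketch of how these two tools might be driven from code, the example below builds a pdftotext command that keeps the layout and crops to a region, and parses the "Key: value" lines printed by pdfinfo. Java is used only for illustration; the class and method names are hypothetical, while the options (-layout, -x, -y, -W, -H, and "-" for standard output) are the ones pdftotext documents.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper around the poppler-utils command line tools.
class PopplerSketch {

    // Builds: pdftotext -layout -x X -y Y -W WIDTH -H HEIGHT input.pdf -
    // The trailing "-" asks pdftotext to write the text to standard output.
    static List<String> pdftotextCommand(String input, int x, int y, int width, int height) {
        List<String> cmd = new ArrayList<>(List.of("pdftotext", "-layout"));
        cmd.addAll(List.of(
            "-x", String.valueOf(x), "-y", String.valueOf(y),
            "-W", String.valueOf(width), "-H", String.valueOf(height)));
        cmd.add(input);
        cmd.add("-");
        return cmd;
    }

    // Turns the "Key:   value" lines printed by pdfinfo into an ordered map.
    // Only the first colon splits a line, so values such as dates stay intact.
    static Map<String, String> parsePdfinfo(String output) {
        Map<String, String> info = new LinkedHashMap<>();
        for (String line : output.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                info.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
            }
        }
        return info;
    }

    public static void main(String[] args) {
        System.out.println(pdftotextCommand("invoice.pdf", 0, 0, 300, 200));
        System.out.println(parsePdfinfo("Pages:          3\nEncrypted:      no\n"));
    }
}
```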

Sometimes you might want to create a PNG or JPEG screenshot of a document. You can do it with pdftocairo from Poppler, or use ImageMagick’s convert.

Wrappers

For pdftk, check out this library: mikehaertl/php-pdftk.

PDFBox CLI can be accessed via schmengler/PdfBox.

ImageMagick and Ghostscript are the basis for the spatie/pdf-to-image wrapper.

Poppler has several PHP wrapper libraries:

  • spatie/pdf-to-text only allows you to extract text from a PDF. It requires the input PDF to exist in the file system. The library does not wrap additional input arguments, so you have to specify them manually.
  • ncjoes/poppler-php: a library supposed to wrap all poppler-utils, but at the moment pdftotext is still unsupported. Also, this library is not very convenient as it forces you to choose an output directory for a file (it does not return processed data as string).

In fact, these two libraries are wrappers to a wrapper, since poppler-utils are just a collection of CLI wrappers for the Poppler C++ library 😉

Which to pick? Native or CLI?

There are a couple of basic considerations.

Native PHP libraries should work independently of the host environment. They are a lot easier to set up and update. The only dependency tool you use is Composer.

CLI tools, especially those written in C/C++, might be faster and use less memory. However, I don’t have hard evidence at the moment. Maybe all the optimizations that came with PHP 7 will make this point obsolete. Also, I believe that C/C++ tools have a wider audience and thus might receive more community support.

You should pick the tool that’s best for your specific requirements. Most tools will do a decent job of simply rendering an unencrypted PDF to an image or some plain text. But if you need more control over the output file structure or you want to process encrypted documents, poppler-utils will be a good choice.

Sometimes it occurs to me that many developers are just reinventing the wheel, especially when it comes to the multitude of PDF processing libraries for PHP. The Portable Document Format has almost seven hundred pages of specification. We are all struggling with the same processing issues. That’s why I prefer to choose the best tools from different technologies and connect them with interfaces rather than doggedly sticking to a single technology.

Check out the List of PDF software at Wikipedia.

Picking a PHP tool to generate PDFs

I spent a lot of time working with different tools to generate PDF files, mainly invoices and reports. Some of these documents were really sophisticated, including multi-page tables, colorful charts, headers and footers.

I know how hard it is to choose between a multitude of libraries and tools, especially when we need to do a non-trivial job. There is no silver bullet; some tools are better for certain jobs and not so good for other jobs. I will try to sum up what I’ve learned through the years.

Two ways of creating a PDF file

A PDF file contains a set of objects which make a document, like pieces of text, images, lines, forms, fonts, and so on. So creating a PDF is an act of putting all these pieces together in a proper order and layout. Most objects utilize a subset of PostScript commands, so you can even write them in your text editor.
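To illustrate how small such a hand-written file can be, here is a minimal sketch that assembles a one-page PDF as plain text: a catalog, a page tree, a page, a content stream and a font, followed by the cross-reference table that records the byte offset of every object. The class name is made up, and the sketch assumes ASCII-only text without parentheses or backslashes, so that string length equals byte length and no escaping is needed.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical example class: writes a one-page A4 PDF without any library.
class MinimalPdf {

    static byte[] build(String text) {
        // Content stream: select font F1 at 24 pt, move to (72, 770), show the text
        String stream = "BT /F1 24 Tf 72 770 Td (" + text + ") Tj ET";

        String[] objects = {
            "<< /Type /Catalog /Pages 2 0 R >>",
            "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
            "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] /Contents 4 0 R"
                + " /Resources << /Font << /F1 5 0 R >> >> >>",
            "<< /Length " + stream.length() + " >>\nstream\n" + stream + "\nendstream",
            "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>"
        };

        StringBuilder pdf = new StringBuilder("%PDF-1.4\n");
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < objects.length; i++) {
            offsets.add(pdf.length()); // byte offset of "N 0 obj" (ASCII only)
            pdf.append(i + 1).append(" 0 obj\n").append(objects[i]).append("\nendobj\n");
        }

        // Cross-reference table: one 20-byte entry per object, plus the free entry 0
        int xrefOffset = pdf.length();
        pdf.append("xref\n0 ").append(objects.length + 1).append('\n');
        pdf.append("0000000000 65535 f \n");
        for (int offset : offsets) {
            pdf.append(String.format(Locale.ROOT, "%010d 00000 n \n", offset));
        }
        pdf.append("trailer\n<< /Size ").append(objects.length + 1).append(" /Root 1 0 R >>\n");
        pdf.append("startxref\n").append(xrefOffset).append("\n%%EOF\n");

        return pdf.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws Exception {
        Files.write(Paths.get("hello.pdf"), build("Hello, PDF!"));
    }
}
```

The line inside the content stream (BT ... ET) is the part that resembles PostScript; everything else is bookkeeping that tells a reader where the objects are.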

One way is to create these objects “by hand”: we add every line of text separately. We draw all tables manually, calculating cell widths and spacings on our own. We must know when to split longer content across multiple pages. This approach requires a lot of manual work and very good programming skills, so we don’t end up with spaghetti code where it is hard to find any meaningful logic between all the drawing commands.
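The kind of arithmetic this involves can be sketched in a few lines. The example below computes the left edge of each table column and splits rows into pages, reserving room for a repeated header row on every page. It is a simplified illustration with hypothetical names and fixed row heights; real reports also have to deal with wrapped text and rows of different heights.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical layout helper: all sizes are in points.
class ManualLayoutSketch {

    // Left edge of each column, given the left margin and the column widths
    static double[] columnOffsets(double left, double[] widths) {
        double[] xs = new double[widths.length];
        double x = left;
        for (int i = 0; i < widths.length; i++) {
            xs[i] = x;
            x += widths[i];
        }
        return xs;
    }

    // Splits row indexes into pages; each page keeps headerHeight free for the table header
    static List<List<Integer>> paginate(int rowCount, double rowHeight,
                                        double headerHeight, double usableHeight) {
        int rowsPerPage = (int) Math.floor((usableHeight - headerHeight) / rowHeight);
        if (rowsPerPage < 1) {
            throw new IllegalArgumentException("page too small for a single row");
        }
        List<List<Integer>> pages = new ArrayList<>();
        for (int start = 0; start < rowCount; start += rowsPerPage) {
            List<Integer> page = new ArrayList<>();
            for (int r = start; r < Math.min(start + rowsPerPage, rowCount); r++) {
                page.add(r);
            }
            pages.add(page);
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(columnOffsets(36, new double[] {100, 50, 200})));
        System.out.println(paginate(5, 20, 20, 60));
    }
}
```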

Another way is to convert an existing document, for example HTML, LaTeX or PostScript, into PDF.

We used LaTeX for an education app which allowed composing tests for students from existing exercises prepared by professionals. Since LaTeX was a primary tool for our editors, it was natural for us to convert their scripts straight to PDF.

Converting HTML to PDF is way more complex, as today’s web standards keep gaining features such as CSS Flexbox and Grid layouts. Let’s see what we can do.

Native PHP libraries

My first experience was with native PHP libraries where you had to do most things by hand, like placing text in proper positions line by line, drawing rectangles, calculating table cells, and so on. It was quite fun at the time, but creating more robust documents turned out to be very hard. We used FPDF and ZendPdf libraries (the latter is discontinued).

At some point, I ended up maintaining multiple-page, sophisticated school reports with tables and charts rendered by ZendPdf. Business wanted to add even more types of reports. I decided to rewrite all reports as HTML documents with stylesheets and then try to make PDFs from that.

There are three PHP libraries capable of parsing HTML/CSS and transforming it to PDF: TCPDF, mPDF and Dompdf.

Rendering HTML and CSS certainly isn’t easy. Modern browser engines are huge projects and I can’t imagine a fully functional rendering engine written in pure PHP. So you cannot expect these libraries to provide the same output you’re seeing in Firefox or Chrome. However, for simple layouts and formatting they should be enough. The plus is that you still do not depend on any external tools – just plain PHP!

To give you some idea of what to expect from the above libraries, I compiled a comparison of invoice renderings. All pictures are made from the same HTML5 source, which utilizes CSS Flexbox to position the “Seller” and “Buyer” sections next to each other. It also has some table formatting:

[Images: Google Chrome (reference image), TCPDF, mPDF, Dompdf]

As you can see, none of the PHP libraries understood CSS Flexbox. mPDF and TCPDF had some problems with painting the table. Dompdf performed the best and I’m pretty sure that building the “Seller” and “Buyer” sections the old-school way, with float or <table>, would be enough to get a proper result.

External CLI tools

Native PHP solutions were not enough for me, so I decided to use an external tool backed by a fully functional WebKit rendering engine. My employer was already using wkhtmltopdf, which supports everything I needed: SVG images, multi-page tables, headers and footers with page numbers and section names, automatic bookmarks. With the old reports rewritten in HTML and CSS, I was able to implement all the new features requested by the business.

wkhtmltopdf certainly isn’t bug-free; for example, I had some issues with repeating table headers on consecutive pages. Also, upgrading from 0.12.3 to 0.12.4 broke my document layout which used dynamic headers and footers, so I had to go back to the old version.

Then I got familiar with PhantomJS, which was used mainly to conduct automatic browser tests in a headless mode (without the browser window). It could also capture PNG and PDF screenshots. PhantomJS used a newer version of the WebKit engine. However, the project is suspended now.

Almost a year before the suspension of PhantomJS, Google announced that Chrome can run in headless mode from version 59. This means you can utilize the latest Blink rendering engine to convert HTML/CSS to PDF from your command line. This is perfect for rendering really complex documents that use the latest web standards. The document looks exactly the same in your browser and in the final PDF file, which makes development a lot easier.
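A minimal sketch of such a command line call is shown below, in Java for illustration. It uses the --headless and --print-to-pdf switches; the name of the Chrome binary differs between systems (google-chrome, chromium, chrome.exe and so on), so it is passed in as a parameter. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical wrapper around a locally installed Chrome or Chromium binary.
class HeadlessChromeSketch {

    // Builds: <chrome> --headless --disable-gpu --print-to-pdf=/abs/out.pdf file:///abs/in.html
    static List<String> printToPdfCommand(String chromeBinary, Path html, Path pdf) {
        return List.of(
            chromeBinary,
            "--headless",
            "--disable-gpu",
            "--print-to-pdf=" + pdf.toAbsolutePath(),
            html.toAbsolutePath().toUri().toString());
    }

    // Runs Chrome and returns its exit code; Chrome's own messages go to our console
    static int render(String chromeBinary, Path html, Path pdf)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(printToPdfCommand(chromeBinary, html, pdf))
            .inheritIO()
            .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; call render(...) to actually launch Chrome
        System.out.println(printToPdfCommand("google-chrome",
            Paths.get("invoice.html"), Paths.get("invoice.pdf")));
    }
}
```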

Connecting PHP with external tools

The easiest way would be to execute an external tool as a shell command. You can do it with PHP functions like shell_exec or proc_open, but it’s not very convenient.

I recommend using the symfony/process library and utilizing standard streams whenever applicable. A process should accept input HTML through STDIN and send the resulting PDF via STDOUT. It can also report errors through STDERR. It might turn out that you won’t need any temporary files to do the job.
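The stream-based approach is not specific to PHP. As an illustration, here is a minimal Java sketch of the same idea using ProcessBuilder: the input is written to the child process’s STDIN, the result is read from its STDOUT, and STDERR is passed through. The class and method names are made up, and the main method uses cat as a harmless stand-in for a real converter.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper: pipes bytes through an external command without temporary files.
class PipeSketch {

    static byte[] pipe(List<String> command, byte[] input)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
            .redirectError(ProcessBuilder.Redirect.INHERIT) // show the tool's STDERR as ours
            .start();

        // Feed STDIN from a separate thread so a full pipe buffer cannot deadlock us
        Thread writer = new Thread(() -> {
            try (OutputStream stdin = process.getOutputStream()) {
                stdin.write(input);
            } catch (IOException ignored) {
                // the process may exit before reading all of its input
            }
        });
        writer.start();

        byte[] output;
        try (InputStream stdout = process.getInputStream()) {
            output = stdout.readAllBytes();
        }
        writer.join();

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command failed with exit code " + exitCode);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        // "cat" simply echoes its input; a real converter would return PDF bytes
        byte[] result = pipe(List.of("cat"), "<p>Hello</p>".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(result, StandardCharsets.UTF_8));
    }
}
```

With wkhtmltopdf, for example, passing "-" as both the input and the output argument should make it read HTML from STDIN and write the PDF to STDOUT, so the command list would be wkhtmltopdf, -, -.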

There are also several wrapper libraries, like phpwkhtmltopdf or KnpLabs/snappy.

For Chrome, consider using Browserless. You can choose between a free Docker image with a pre-configured Chrome and its dependencies, or a paid SaaS platform to convert your HTML documents to PDF. With the Docker image, it is really easy to send HTML and receive PDF via HTTP.

Conclusion

There is a wide choice of PHP libraries and external tools which can be used to dynamically create PDF files. You should choose a combination which suits your business needs. For simple documents, you don’t need a complex rendering engine. Save disk space, CPU and RAM!

Please also remember that many tools are developed by the Open Source community and receive little commercial support. They can be abandoned at any time, or they might not support the newest PHP version from day one (which can impede migrating the rest of your app). And your dependencies have dependencies too, so take a look at composer.json when picking a library.

And if your favorite Open Source tool does not do everything you need properly – maybe try contributing? It’s a community, after all.

Testing PDF documents

I’ve been wondering for some time if PDF is still a valid format. It’s “portable”, of course, but not in today’s meaning – it’s clearly not responsive. Like a fixed piece of paper transformed into a file. However, PDF still has many important use cases like storing invoices, reports or tickets. I spent a couple of years working on sophisticated PDF reports, and this year I even tried to test a process of generating invoices in some ad exchange system. I really wanted this system to be rock solid.

Of course, there is no point in comparing a binary PDF file to an expected value. You can’t pinpoint the exact differences in case of an error. I could create a PNG screenshot and compare it to a template, but I was a bit worried about the readability of such a diff. A third way would be to verify the source HTML document used to render the PDF – but in fact, I was not interested in the markup, but in the output data that landed inside the PDF.

Following my friend’s advice, I used another tool called Apache PDFBox. This robust library allows performing different operations on PDF documents: creating, merging, splitting, signing, filling forms etc. We decided to extract plain text from a file. It’s as if we used the Select all command, then copied and pasted the text into Notepad.

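// PDFBox 2.x API (in 3.x, PDDocument.load was replaced by Loader.loadPDF)
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

String plainText;
try (PDDocument pdf = PDDocument.load(content)) {   // closed even if extraction throws
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setAddMoreFormatting(true);
    stripper.setSortByPosition(true);               // order text by its position on the page
    stripper.setStartPage(1);                       // page numbers are 1-based here
    stripper.setEndPage(pdf.getNumberOfPages());
    plainText = stripper.getText(pdf);
}

assertThat(plainText).isEqualTo(expected);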

A PDF document consists of blocks which can be stored in an order that we did not really expect. Anyone who has tried copying a table from a PDF to Notepad has experienced this. Luckily, PDFBox tries to help us by organizing the blocks and formatting the plain text dump.
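To show why the stored order matters, here is a toy illustration of ordering text blocks by position before dumping them as plain text. It is not PDFBox’s actual algorithm, only a simplified sketch with made-up names: blocks are sorted from top to bottom and then from left to right, and blocks with the same vertical position end up on one line.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model: a block of text with the coordinates of its starting point.
class ReadingOrderSketch {

    static class Block {
        final String text;
        final double x;
        final double y;

        Block(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    static String toPlainText(List<Block> blocks) {
        List<Block> sorted = new ArrayList<>(blocks);
        // The origin is assumed to be the bottom-left corner, so a larger y is higher on the page
        sorted.sort(Comparator.comparingDouble((Block b) -> -b.y)
                              .thenComparingDouble(b -> b.x));

        StringBuilder out = new StringBuilder();
        Double lastY = null;
        for (Block b : sorted) {
            if (lastY != null) {
                out.append(b.y == lastY ? " " : "\n"); // same baseline: same line
            }
            out.append(b.text);
            lastY = b.y;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Blocks deliberately listed out of reading order
        System.out.println(toPlainText(List.of(
            new Block("Total", 72, 100), new Block("Name", 72, 700),
            new Block("42", 300, 100), new Block("Qty", 300, 700))));
    }
}
```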

We made a lot of test scenarios using the above solution and they did a really good job catching all the little bugs in the data. It was crucial to detect any mistakes because our system was preparing financial documents. Moreover, the test reports were very readable.

The only problem with the above method is that it does not test the correctness of the layout. To achieve that, we could extract only specified regions from the document. In that case we assume that a rectangle with x1,y1,x2,y2 coordinates contains, for example, the customer’s data:

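import java.awt.geom.Rectangle2D;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripperByArea;

// Rectangle2D.Double takes x, y, width, height
Rectangle2D region = new Rectangle2D.Double(x1, y1, x2 - x1, y2 - y1);
String plainText;
try (PDDocument pdf = PDDocument.load(content)) {   // closed even if extraction throws
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.addRegion("Contact region", region);
    stripper.extractRegions(pdf.getPage(0));        // getPage uses a 0-based index
    plainText = stripper.getTextForRegion("Contact region");
}

assertThat(plainText).isEqualTo(expected);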