Testing PDF documents

I’ve been wondering for some time if PDF is still a valid format. It’s “portable”, of course, but not in today’s meaning – it’s clearly not responsive. Like a fixed piece of paper transformed into a file. However, PDF still has many important use cases like storing invoices, reports or tickets. I spent a couple of years working on sophisticated PDF reports, and this year I even tried to test a process of generating invoices in some ad exchange system. I really wanted this system to be rock solid.

Of course there is no point in comparing a binary PDF file to an expected value. You can’t catch the exact differences in case of an error. I could create a PNG screenshot and compare it to the template, but I was a bit worried about the readability of such diff. A third way would be to verify the source HTML document used to render a PDF – but in fact, I was not interested in markup, but in an output data that landed inside a PDF.

Following my friend’s advice, I used another tool called Apache PDFBox. This robust library allows performing different operations on PDF documents: creating, merging, splitting, signing, filling forms etc. We decided to extract plain text from a file. It’s like we used a Select all command, copy and paste the text into Notepad.

PDDocument pdf = PDDocument.load(content);
PDFTextStripper stripper = new PDFTextStripper();
stripper.setAddMoreFormatting(true);
stripper.setSortByPosition(true);
stripper.setStartPage(0);
stripper.setEndPage(pdf.getNumberOfPages());
String plainText = stripper.getText(pdf);
pdf.close();

assertThat(plainText).isEqualTo(expected);

A PDF document consists of blocks which can be ordered in a way that we did not really expect. Anyone who tried copying a table from PDF to Notepad experienced this. Luckily, PDFBox tries to help us organizing the blocks and formatting the plain text dump.

We made a lot of test scenarios using the above solution and they did a really good job catching all the little bugs in the data. It was crucial to detect any mistakes because our system was preparing financial documents. Moreover, the test reports were very readable.

The only problem with the above method is that it does not test the layout correctness. To achieve that, we could extract only specified regions from the document. In such case we assume that a rectangle with x1,y1,x2,y2 coordinates contain, for example, customer’s data:

Rectangle2D region = new Rectangle2D.Double(x1, y1, x2 - x1, y2 - y1);
PDDocument pdf = PDDocument.load(content);
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.addRegion("Contact region", region);
stripper.extractRegions(pdf.getPage(0));
String plainText = stripper.getTextForRegion("Contact region");
pdf.close();

assertThat(plainText).isEqualTo(expected);

Why we adopted a coding standard and how we enforce it

Everyone might have their own code formatting preferences. Problems arise when a team consisting of a couple of individuals work on a common code base and every developer has different preferences. It’s hard to maintain pull requests in this situation.

The most ridiculous and unproductive quarrels in my career were about whether we should use tabs or spaces; should we place a curly bracket in the same line or another; should we leave a blank line at the end of file; etc. It’s mostly a matter of a personal preference, however both sides had some interesting arguments.

Once I told my team that we should adopt common code formatting rules for PHP. It was clear to us that we had a mess in our repos. I asked if we wanted to waste our time endlessly discussing every formatting detail. This way we adopted PSR-2 standard.

Of course we couldn’t just reformat all our repos at once. Every one of seven PHP developers already worked on their branches. We had new things merged to develop or master a couple of times per day. This is why we were reformatting our code piece by piece. We were choosing the best moments to do a global reformat in PhpStorm. It was easiest when we knew that only one developer at a time was working on a certain repo or module.

It is important to place a code reformatting operation in a separate commit. You shouldn’t mix formatting changes with functional changes because it makes browsing pull requests difficult. Some Git GUIs can hide whitespace changes but they cannot hide syntax changes like array() to [].

Automatic code checks and analysis

To ensure that our coding standards are met, we set up PHP Code Sniffer in some repos. It verifies that the code complies to the PSR-2 coding standard. After every push to the central repository we get a message about formatting errors. Our testing and analysis tools are automatically launched by the CI (Continuous Integration) tool, for example GitLab, Jenkins, Travis, CircleCI, Bamboo.

You should notice that phpcs by default enforces that no classes remain in the global namespace. It’s good because, believe me, code can grow really fast and it becomes more and more cumbersome to understand the sophisticated code base without proper namespace structure – best done according to PSR-4.

We can try some other static code analysis tools as well: PHPStan, PHPMetrics etc. They do very good job of finding basic and most common code flaws. During the pull requests we can focus on checking advanced business logic because we know that the code formatting and basic code smells were already checked automatically. It’s important especially for dynamically-typed, interpreted languages like PHP where there is no compilation process. Also, PHPStan helps us prepare the code for new PHP versions.

How unit tests help changing existing code

You should have some tests. Why? Because developers are afraid to make changes in a code they don’t understand. This is a common problem not only with fresh employees, but with everyone who stands in front of a huge, complex, legacy system. They are afraid to add a new if not to break the other ones. They are afraid to erase code that seems obsolete.

Instead of refactoring, people tend to add classes and methods with suffixes like _new_2 and so on. I saw the same fear of changes in QA and op teams. When I try to refactor things, I often hear: Oh, maybe not now in case if something breaks, maybe later…

Overcoming the fear of changing an unknown code

I had a problem changing a piece of complex code adding symbols to online transactions. The symbols depended on product types, legal issues, invoices etc. The sales dep coined some weird terms which developers didn’t understand, and those terms were very important for them. I was told to add yet another weird symbol on top of that. The original code looked like this:

public function getDescription()
{
if (\in_array($this->getItemsType(), [self::ORDER_MATERIAL, self::ORDER_MIXED], true)) {
$type = 'DW';
} elseif (self::ORDER_COURSE === $this->getItemsType()) {
$type = 'SzO';
} else {
$type = $this->hasInvoice() ? 'M' : 'P';
}

return sprintf('Order #%u / %s', $this->getId(), $type);
}

After analyzing the original code, I wrote a unit test for all cases:

use PHPUnit\Framework\TestCase;
use Piotr\Blog\Entity\Order;

final class OrderTest extends TestCase
{
/**
* Test data will be provided by ordersProvider().
* @dataProvider ordersProvider
*/

public function testDescription(int $orderId, int $itemsType, bool $hasInvoice, string $expected)
{
$order = new Order($orderId);
$order->setItemsType($itemsType)->setHasInvoice($hasInvoice);
$this->assertEquals($expected, $order->getDescription());
}

public function ordersProvider()
{
return [
[123, Order::ORDER_MULTIMEDIA, false, 'Order #123 / P'],
[123, Order::ORDER_MATERIAL, false, 'Order #123 / DW'],
[123, Order::ORDER_MIXED, false, 'Order #123 / DW'],
[123, Order::ORDER_COURSE, false, 'Order #123 / SzO'],
[123, Order::ORDER_MULTIMEDIA, true, 'Order #123 / M'],
[123, Order::ORDER_MATERIAL, true, 'Order #123 / DW'],
[123, Order::ORDER_MIXED, true, 'Order #123 / DW'],
[123, Order::ORDER_COURSE, true, 'Order #123 / SzO'],
];
}
}

I checked code coverage for this method, by curiosity. It was 100% which made me confident that all lines are executed during the test (however, it might not ensure that all cases are checked). Now I was ready to add another condition to the getDescription() method:

public function getDescription()
{
if (self::ORDER_VIDEO === $this->getItemsType) {
$type = $this->hasInvoice() ? 'W' : 'WP';
}
/* ... */
}

I ran my unit test and received no errors. Success – I didn’t break anything! Since I added some extra lines to my class, the code coverage dropped. I needed to add new test cases:

public function ordersProvider()
{
return [
/* ... */
[123, Order::ORDER_VIDEO, false, 'Order #123 / WP'],
[123, Order::ORDER_VIDEO, true, 'Order #123 / W'],
];
}

Test is passing and my code coverage is 100% again. Now I know that another developer who takes this code over will have a trustful test checking all conditions.

Make sure your business works!

The above example is easy. Unfortunately, in every day job writing unit tests is hard if we need to deal with a poor system architecture, tight coupling and… managers refusing the team to spend extra time writing tests. Moreover, unit tests alone might not be sufficient – you would want integration and UI tests also, and it takes time. But, as Robert C. Martin ironically pointed out once:

Can you imagine telling your users: You know, I don’t write tests. I just write the code. Sometimes it even works. And I ship it to you, and if there are bugs, you’re going to tell me, aren’t you?

Your company might not need a 100% code coverage – often it’s very difficult and even unprofitable (some people claim it’s dangerous). From the business perspective, we should write tests that protect the key business issues first – and that’s usually the domain logic. We want to make sure that none of the business rules that we agreed upon will break after deploying new features. We want to make sure that users will still be able do place orders, and all the orders will be properly accounted.

However if you want to write tests for business rules, then those rules have to be clearly written in the code. This is often cumbersome – and I’m going to tell you about it, some day.