Defensive coding: Avoiding mutability and side effects

Are you tired of fixing the same bugs over and over in a huge system developed by a multitude of developers? I guess it’s time to introduce some practices of Defensive Programming. It is an approach to improving software quality by reducing the number of possible bugs and mistakes and making the code more predictable and comprehensible.

Before we continue, I advise that you get familiar with NASA’s Power of 10 Rules, designed for developing safety-critical code.

You can also read about poka-yoke, an approach to designing things in a way that prevents possible misuse. A real-world example is a SIM card that can be inserted in only one position. Now think: how many times, as a developer, have you been misled by someone else’s code? How many times have you used the wrong function, wrong variable, wrong argument, wrong format?

Writing reliable code is the art of avoiding mistakes. Let’s see how to do that.

Avoid mutability and side effects

How many variables are there in your code? How many of them are not used anymore? How many of them have misleading names?

Every variable can affect the behavior of your program (well, that’s what they’re designed for). Variables are meant to hold different values at runtime. Are you sure you’re controlling all of them properly?

Here are some basic rules for variable handling:

  1. Do you really need a variable? Maybe a constant will be enough? In JavaScript, use const instead of let. In Java, use final to indicate the value can be assigned only once.
  2. Use the strictest scope possible. Avoid global variables.
  3. Avoid modifying object state and global state unless necessary. A function that modifies anything outside its own scope creates a side effect. Side effects often cause unpredictable bugs and make testing more difficult. Even if you just poke the objects passed as function arguments, it is a side effect. At least be aware of the consequences.
  4. For simple values like date/time, use immutable value objects. For example, if you want to add 2 months to an existing date, create a new result object instead of modifying the existing one.
  5. Use precise and meaningful naming, not just i, j, k or temp. How can another developer know which variable is responsible for what? On the other hand, take context into account. Avoid lengthy names with unnecessary prefixes, like bigFancyModuleUser, bigFancyModuleProduct, etc. If these variables exist only in the scope of the BigFancyModule package, skip the prefix.
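
The rules above can be sketched in JavaScript. This is only a minimal illustration, not a prescription: the names DISCOUNT_PERCENT, withDiscount, addMonths and the order object are hypothetical and exist only for this example.

```javascript
"use strict";

// Rule 1: prefer const, so the binding cannot be reassigned.
const DISCOUNT_PERCENT = 10;

// Rule 3: return a new object instead of modifying the argument,
// so the caller sees no side effect.
function withDiscount(order, percent) {
  return { ...order, total: order.total - (order.total * percent) / 100 };
}

// Rule 4: treat dates as immutable values and return a new Date.
function addMonths(date, months) {
  const result = new Date(date.getTime()); // copy, leave the input untouched
  result.setMonth(result.getMonth() + months);
  return result;
}

// Object.freeze makes accidental mutation throw in strict mode.
const order = Object.freeze({ id: 1, total: 200 });
const discounted = withDiscount(order, DISCOUNT_PERCENT);

const start = new Date(2020, 0, 15, 12); // 15 January 2020, noon, local time
const later = addMonths(start, 2);       // a new object; start is unchanged
```

Keep in mind that setMonth rolls over when the target month is shorter (31 January plus one month lands in early March), so a real project would probably pick a dedicated date library or define the rounding rule explicitly.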

How I optimized a process from 35 to 5 hours

Most of my day job isn’t fascinating. Yet another controller, service, test, and so on. I spend a lot of time doing repetitive tasks and slowly gaining more knowledge about the system I’m working on. However, a slow and steady pace can eventually reward you with an opportunity to make a really great improvement. It happened to me, twice.

I spent four years maintaining a project with a 65 GB MySQL database and hundreds of millions of records. In the beginning, the system seemed to be very complex. It contained lots of legacy code and many classes turned out to be obsolete. I needed some time to build confidence in this project. After two years, I had reduced the database size to 15 gigabytes without any data loss. Of course, during my work the database gained millions more records, but that didn’t stop me from doing a stunning optimization.

It wasn’t a single database migration, but a series of small ones. It took me many months to come up with all the optimizations, some of them subtle, but with 300 million records, every byte counted. Database schema changes also required application code changes, and I did not want to make PRs that were too big. Moreover, I couldn’t just execute a migration on a 60 GB table whenever I wanted. I had to agree on downtime with the Product Owner. And, of course, I had to prepare backups and a rollback strategy.

Then I jumped to work on an advertising platform which had a complex invoicing system. Every night, a cron job ran to create and send PDF invoices. The process was supposed to finish by the morning, but it didn’t. People did not get their documents in time. I discovered that a single run could take up to 35 hours, even if just a few documents were made.

Again, I had to do several boring maintenance tasks before I had the courage to optimize the complex invoicing process. After gaining some basic system knowledge I noticed that the cron job did not have any tests. Every change required manual tests. So I spent some time writing unit and integration tests which helped me understand the process even more.

When I was ready to introduce changes, I talked to the Product Owner and he agreed to include that work in the upcoming sprint. I needed two weeks to do the necessary measurements and experiments. In the end, I successfully deployed my changes and the process shrank from 35 to 7 hours. I removed a lot of redundant database queries by simply verifying the boolean logic and the control flow. Are you aware of the way boolean expressions are evaluated? Knowing such nuts and bolts might really reward you.
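
To illustrate why evaluation order matters: in JavaScript (as in PHP, Java and most C-like languages), && and || short-circuit, so the right-hand operand is evaluated only when the left-hand one does not already decide the result. The sketch below is not the code from the project described above; isInvoiceDue and hasUnpaidItems are hypothetical functions, the latter standing in for a database query.

```javascript
"use strict";

let queryCount = 0; // counts simulated database queries

// Hypothetical cheap check: uses data that is already in memory.
function isInvoiceDue(customer) {
  return customer.invoiceDue;
}

// Hypothetical expensive check: stands in for a database query.
function hasUnpaidItems(customer) {
  queryCount += 1;
  return customer.unpaidItems > 0;
}

const customers = [
  { invoiceDue: false, unpaidItems: 3 },
  { invoiceDue: true, unpaidItems: 0 },
  { invoiceDue: true, unpaidItems: 2 },
];

// Cheap check first: the "query" runs only when an invoice is due.
const toInvoice = customers.filter(
  (customer) => isInvoiceDue(customer) && hasUnpaidItems(customer)
);
```

With the operands swapped, the simulated query would run for every customer, although the result would be the same.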

How to conduct a successful process optimization

  1. Know the details of the business logic and code details of the project you are working on. To do so, maintain a steady workflow. Don’t be afraid of boring tasks – you have to start somewhere.
  2. Test the current behavior that you plan to optimize. Write automated tests or at least prepare manual scenarios to cover as many cases as possible (not just happy paths). This will expand your project knowledge even further.
  3. Measure all steps of the current process. Which step is the slowest and why? How much time does it take on average to process a single entity? You need measurements to know if you’re making any progress.
  4. Introduce changes in the code and verify them on your local or test environment.
  5. Gather feedback during code review. Maybe someone will notice dangerous changes you overlooked.
  6. Analyze what is needed for deployment. Will it require downtime? Any database migrations? Are there any other applications depending on your module/service/app?
  7. Practice deployment in a test or pre-production environment.
  8. Discuss the deployment time with the team and stakeholders.
  9. Good luck! Go with the deployment 🙂
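
The measuring step can start as simply as timing each stage of the process. Here is a minimal sketch in JavaScript (Node.js); measureSteps and the two step names are hypothetical, and the step bodies are placeholders for real work.

```javascript
"use strict";
const { performance } = require("node:perf_hooks");

// Runs each named step in order and records how long it took (in ms).
function measureSteps(steps) {
  const timings = [];
  for (const [name, run] of Object.entries(steps)) {
    const startedAt = performance.now();
    run();
    timings.push({ name, ms: performance.now() - startedAt });
  }
  // Slowest step first, so the bottleneck is at the top.
  return timings.sort((a, b) => b.ms - a.ms);
}

// Placeholder steps; a real process would load data, render documents, etc.
const timings = measureSteps({
  loadOrders: () => {
    let sum = 0;
    for (let i = 0; i < 100000; i += 1) sum += i;
    return sum;
  },
  buildInvoices: () => "x".repeat(1000).split("").join("-"),
});
```

Logging such timings before and after a change gives you a baseline to compare against.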

Mixing office and remote workers in one organization

Remote work is challenging. People working remotely need perfect communication skills and discipline because no one is watching over their shoulder. However, the real struggle starts when we try to combine office and remote employees. What problems are we going to face and how can we improve the situation?

I was tired of working in an office, so switching to a remote job was a big relief to me. However, I still had to cooperate with people sitting in an office and it turned out to be quite a challenge for all of us. Luckily, with some understanding from everyone in the team, things improved quickly.

Let others know where and who you are

Your coworkers should know if you’re available or not. Set a status which says “In the office” or “Working remotely 9-5”. Use “Away” when you have a break and “Do not disturb” when you need some peace.

At some point, my organization forced everyone to set their photos as profile pictures on Slack. All the funny cats as avatars are now gone. It’s a good way to integrate people, especially when remote guys visit the office from time to time.

Improve conference calls between office and remote people

We often have calls where one group of people is sitting at the office and another group is connecting remotely. The biggest challenge is to create equal participation opportunities for everyone in the team.

Both office and remote people must have a good connection and a good microphone, so everyone can understand what other people are saying. Remote folks usually have headsets; please check their sound quality upfront! If they sound like an old telephone, buy something better.

The best table setup for a call with remote people. Every person maintains the same distance from a microphone. A webcam captures everyone at the table, so that remote participants see exactly what’s happening.

The office group can have a shared microphone on the table. You can find some good products with an omnidirectional mic and an integrated speaker for around $100. Quality matters even more because people are going to sit at some distance from the microphone, so remote guys will hear more room reflections. Sit around the microphone at an equal distance, so everyone can be heard equally loud.

When the office team joins a meeting, they share one user account. Remote people do not know who exactly is present in the room. The solution is simple: turn the camera on! The best solution is to have an external camera with an overall view of the conference room. If you don’t have it, just rotate a laptop whenever someone else is starting to speak.

Any new people should introduce themselves, like “Hi, I’m Mark, I’m responsible for X and I joined the meeting because …”

It’s good to know who’s sitting with us and why, and it’s nice to see people smiling, so remote people can turn on their cameras too.

Share anything valuable hanging on the office walls

Sometimes people at the office find it convenient to draw something on the wall, or stick some cards here and there. Remote workers do not see these walls. You need to at least share a picture of any diagrams you made on that wall. Make sure remote folks are somehow able to contribute to those drawings.

The same goes for any printed announcements, like “Hey, we’re having a party tomorrow”, as long as remote people can relate to them. You don’t have to share information about a broken coffee machine, for instance.

Meet in person from time to time

You should meet and have some fun together. The team’s mood is much better when you share memories from trips and parties. An organization can facilitate this by organizing different events, like trainings, conferences, lightning talks, etc. Of course you can also have your own initiatives, even just having pizza and beer.

Mixing office and remote workers can bring a lot of fun. It increases diversity because a company does not limit itself to hiring only the people preferring a specific location for work. However, it takes some practice to do it right and get rid of any communication obstacles.

Offboarding: How to quit the job gracefully

Recently I shared my thoughts on preparing an efficient onboarding process. Unfortunately, sooner or later people quit. This is also something our dev team should prepare for.

Most of the time, employees and contractors have a notice period which usually lasts from two weeks to three months. This might seem like enough time to transfer knowledge and duties, but my experience shows that it’s never enough.

Prepare for the offboarding period

When an experienced person hands in their resignation letter, panic mode often starts within the team. Suddenly everyone wants something and the person who quits has a really busy time.

Sometimes, ambition strikes. Knowing that the days of our current job are numbered, we can fall into a mania of fixing everything. This can bring too much chaos into the team and the resulting value is not worth it. If you suddenly decide to update all libraries in the system, you can break a lot of things and drag people away from their current tasks.

There is a simple way to avoid such an “offboarding rush”. Your team should perform regular documentation updates, automate manual tasks, write tests and not wait too long with refactoring bad code. This should be a routine, not an exception.

You should avoid the so-called knowledge silos. It’s inevitable that developers in your team will specialize in different parts of the system, and that’s ok. However, you should facilitate information exchange when needed. This includes a proper knowledge base, commit messages, branch names, issue descriptions, chat groups/channels, etc.

If you create a hack or workaround somewhere, under time pressure, then at least describe it in a visible place. Other people will be aware of that hack’s existence when you’re not around. I’ve witnessed a lot of situations where experienced people left a lot of hacks even on production servers and then just quit. Such hacks break at the least appropriate time, like the peak of a sales season.

The same goes for every manual task done by a particular person, like manual deployment, setting up repositories and so on. These should be at least documented, so that someone can take over such duties. The best way is to automate as much as possible in a clean and descriptive manner.

Easy to say, huh? But developers often postpone maintenance tasks which are not a part of the client’s requirements. If stakeholders don’t yell at you when you don’t ship unit tests on time, then… you postpone them. They won’t complain about your internal documentation either. Stakeholders don’t care whether your workplace is clean or dirty as long as they get a working product. But things can break soon if you don’t clean up your mess!

Make good use of the time that’s left

When the date of somebody’s leave is clear, the whole team should decide how to spend that time. You can organize knowledge sharing meetings or pair programming sessions. You can sit and discuss refactoring ideas that could be realized in the future. An experienced developer might share thoughts about further product development and possible problems.

Focus on the most important and valuable activities. As I said before, don’t try to fix everything at once. Also, a person that is about to quit should not take on any more tasks on his or her own. The offboarding time should be spent on helping the team.

Be honest but polite

Quitting a job is often preceded by months or even years of frustration. There can be a lot of reasons for that, and when you finally make the decision to leave, you probably think it’s the only solution left.

In the first years of my career I made the mistake of not being honest about my work expectations, like salary or duties. Employment is all about two parties getting along with each other. You shouldn’t be afraid to talk honestly to your boss.

However, don’t burn bridges. You never know what happens in the future. Maybe you’ll meet your boss again in another company? Maybe some day you’ll receive an interesting offer from your boss?

Also, be polite and do not spread your frustration everywhere. Your team might still contain new, enthusiastic people. Don’t break their motivation.

Be responsible

The way people quit a job says a lot about their social skills and emotional intelligence. Also, the way a team deals with people leaving the workplace says a lot about its practices. We all need to be responsible, mature and polite, so we can avoid sudden stress caused by someone quitting.

6 Steps to Effective Developer’s Onboarding

Who’s responsible for building a development team? A team leader? Scrum master? Senior engineers? I strongly believe that every team member contributes to its work culture. I also believe that one of the best benchmarks of a team’s performance is the way new developers are introduced.

Hiring a new developer is a big investment. Once a candidate passes all the recruitment steps and walks into our room, we want him or her to start proper work as soon as possible, so the investment starts paying off. Most of the time we already know why we want that person even before he or she is hired. Perhaps we have certain expectations. Maybe an experienced team member just left, or we need more resources to deliver new features and bugfixes?

On the other hand, sometimes we are the ones who change jobs and need to dive into an unknown environment. We want to show off and bring value as early as possible to prove we are worth the time and money spent on hiring us.

Let’s see what we can do to improve the onboarding process in both scenarios.

Know the new employee (or employer)

A good onboarding process should use all the data gathered during recruitment. Our candidate already answered a lot of questions about his or her experience, favorite tools, career goals, and so on. Maybe he or she already solved a technical task or wrote a test prepared by our company? All of this data should be an input to the individual onboarding process.

If you’re joining a company, you should also gather as much information as possible. Help improve the onboarding by asking important questions during recruitment: what tools do you use, what is your work culture, what does an average day look like, and so on.

If the company’s HR department or agency did not provide you enough details, ask for them. You should know as much as possible about your future coworker or workplace.

Choose a mentor

It’s good to have a friend in a new workplace. Most people cannot memorize the names and functions of everyone around on the first day. Onboarding is easier when a new employee has one person to ask most questions, like “where is this or that in the office” or “who should I ping about my Git access”.

But a mentor is more than that. A mentor helps set career goals and evaluate them. A mentor tries to reconcile company goals with the individual goals of an employee. A mentor shares experience and helps the new employee develop a position in the organization. Of course mentoring takes some time and attention away from pure coding. But programming is a team sport, right?

A mentor does not have to be that one senior developer who’s been here for ages. Even regular developers who have spent only a couple of months in the team can guide new folks. A mentor should not pretend to know everything. It’s all about being willing to work together and help out.

It’s good to spread the mentoring duties across different developers, so we don’t end up with one senior trying to guide the whole team. A bottleneck forms when all the questions and decisions go through a single person. Your team should learn self-organization.

If you’re about to join a team, ask who is going to help you in your first days. Maybe you could meet that person earlier and have a chat?

Prepare an onboarding plan

Don’t improvise. Every developer needs some basic stuff to work: a computer, a desk, a comfortable chair, internet and/or intranet connection, software licenses, e-mail account, repository and knowledge base accesses, and so on. Arrange them in advance.

Plan at least the first day, which will probably consist of introducing the new member to the team, meeting some people, talking about the project and team habits, and maybe visiting the HR department for some additional paperwork.

Think about the first tasks a new member could be assigned to. A simple label change or a trivial bugfix would be the best. In the beginning, everyone has to know the workflow and team practices. A fresh developer will surely ask these questions:

  • How do I clone repositories? Do we use SSH keys?
  • How do I create my branch? How do I name it?
  • Should I include an issue number in the commit message?
  • Should I rebase my branch before pushing?
  • How do I launch tests and check their results?
  • Who should I ask for a code review?

Onboarding will take some time for sure, so remember to reserve it in your team’s calendar.

Review your documentation and communication practices

Everything that makes the work of experienced developers easier will also help the newcomers. It’s good to have some basic piece of documentation we can refer to. Check if it’s up-to-date and if it contains at least basic information which can help fresh people.

I’m not a fan of extensive, official company documents. Often a simple glossary of terms and some architecture diagrams should be enough for newcomers to understand everyday conversations.

Consider improving your team’s communication. How do people communicate? Are there knowledge silos? Do you use public channels on Slack, or are all decisions secretly made in direct messages? How do you use an issue tracker? Do issues have precise descriptions?

Introduce a new member to the team, the project and the company

Most people cannot memorize more than a few names on the first day, but you can make it a bit easier. I once joined a team where everyone had different nicknames and avatars across Slack and GitHub. It’s good to have them unified. Later, my employer decided that everyone should put his or her photo as an avatar, instead of funny cats and other weird stuff. This made recognizing people in the hallway a lot easier.

A great way to increase people’s motivation is to let them know the details of the business they work for. Tell them a bit about the company and product history. Provide a business context and describe the reason this team exists. Maybe a new person could talk to someone other than just fellow developers? Do a product demo.

If you just joined a new team, be curious! Don’t hesitate to ask questions.

Establish goals and evaluate them

Our work should serve a purpose and every team member should feel it. Set individual SMART goals for the upcoming 3-6 months, then provide support and inspect progress.

I believe that everyone can contribute to the organization from the first day. Discuss what specific value a person can bring in the beginning. I can think of:

  • Update documentation, especially the part about setting the development environment.
  • Write tests. In one project I started by writing numerous unit and integration tests for a module refactored by my experienced colleague. We cooperated closely and soon I started finding bugs in his code.
  • Review pull requests. Why not? Good questions from a newbie might help find weak spots in PRs made by experienced developers. It can also be a nice variation of rubber duck debugging.
  • Pair programming sessions whenever it’s feasible to split work for two people.

These are some basic goals, but you should also agree on something more related to the specific product(s) your team is developing.

Wrapping up

Onboarding is an exciting process and has a massive effect on the team’s and each individual’s performance. It should be well-planned and conducted in a friendly, calm atmosphere. This is the time when people make new connections which will be crucial for their later work. It is also a great moment to review and improve the team’s practices.

Picking a PHP tool to read and manipulate PDF files

Last updated: January 18th, 2020

TL;DR For simple PDF text and metadata extraction, use pdfparser. For advanced options, try pdftotext and pdfinfo from Poppler. To join or split PDF files, encrypt them or apply watermarks, use pdftk. To make a JPEG or PNG screenshot of a PDF, use ImageMagick or pdftocairo.

In the previous article I described several tools that can be used together with PHP to create PDF files. Back then, the choice was not easy and we had a lot of criteria to consider while picking the best tool. Today we will browse possibilities to read and edit existing PDF files.

Native PHP libraries

Again, we will start from checking if there are any PHP libraries to manipulate PDF files without depending on external binary tools.

pdfparser

There is an interesting library called smalot/pdfparser. It has almost 1000 stars on GitHub. It utilizes TCPDF Parser to parse a PDF file into an array of document objects which is further processed to get what we need.

The library is convenient as it supports parsing both an existing file and a string with PDF data. It allows you to extract metadata and plain text from a document. You can test the library at its demo page.

The problem is that Sebastien’s library is based on the old TCPDF version 6 parser, which is going to be replaced some day by a newer rewrite called tc-lib-pdf-parser. However, that new parser is still under development, and Sebastien is aware of its existence.

smalot/pdfparser has commercial support from Actualys.

FPDI

I got familiar with this library when I received a bug report for a watermarking module in some e-book system. The module received a PDF, parsed it using FPDI, generated a watermark with FPDF and stamped it over all pages.

The problem is that the free version of FPDI supports only PDF version 1.4 and below. To support higher document versions, you have to buy a full library. And that’s what the bug report was about. We decided to switch to another tool, pdftk, which is described below.

Command-line tools

The first command-line tool I played with was pdftk. I used it to join separate documents into one, apply watermarks and extract basic metadata, like the number of pages. Unlike the FPDI library, it supports all PDF versions. The only thing that’s missing is a text extraction feature.

The need to extract plain text from a document led me to the Apache PDFBox library. It is written in Java and, as I described before, it offers some very nice features. However, in the PHP world we can only access a CLI wrapper for that library which has a limited set of options.

Later I discovered the Poppler library, which is said to fully support the ISO 32000-1 standard for PDF. This C++ library can be accessed via dedicated CLI tools – poppler-utils, which we can run from PHP. For example, the pdftotext tool gives a lot of control over the plain text dump – you can even preserve a proper document layout while rendering, or crop the document to a specified region. Also, pdfinfo provides comprehensive information about a file, like page format, encryption type etc. You can use it to extract JavaScript too.
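
To give an idea of what calling pdftotext looks like, here is a minimal sketch. It is written in JavaScript (Node.js) purely for illustration; from PHP you would pass the same argument list to your process runner or wrapper of choice. The helper names pdftotextArgs and extractText and the file name invoice.pdf are hypothetical. The flags are standard pdftotext options: -layout keeps the physical layout, -x, -y, -W and -H define the crop area, and - as the output file sends the text to STDOUT.

```javascript
"use strict";
const { execFileSync } = require("node:child_process");

// Builds the argument list for pdftotext.
// crop is an optional { x, y, width, height } region in pixels.
function pdftotextArgs(pdfPath, { layout = false, crop = null } = {}) {
  const args = [];
  if (layout) args.push("-layout");
  if (crop) {
    args.push(
      "-x", String(crop.x),
      "-y", String(crop.y),
      "-W", String(crop.width),
      "-H", String(crop.height)
    );
  }
  args.push(pdfPath, "-"); // "-" means: write the text to STDOUT
  return args;
}

// Runs pdftotext (poppler-utils must be installed) and returns the text.
function extractText(pdfPath, options) {
  return execFileSync("pdftotext", pdftotextArgs(pdfPath, options), {
    encoding: "utf8",
  });
}

const exampleArgs = pdftotextArgs("invoice.pdf", { layout: true });
```

Passing the arguments as an array instead of building a shell string avoids quoting problems with file names.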

Sometimes you might want to create a PNG or JPEG screenshot of a document. You can do it with pdftocairo from Poppler, or use ImageMagick’s convert.

Wrappers

For pdftk, check out this library: mikehaertl/php-pdftk.

PDFBox CLI can be accessed via schmengler/PdfBox.

ImageMagick and Ghostscript are the basis for the spatie/pdf-to-image wrapper.

Poppler has several PHP wrapper libraries:

  • spatie/pdf-to-text only allows extracting text from a PDF. It requires an input PDF to exist in the file system. The library does not wrap additional input arguments, so you have to specify them manually.
  • ncjoes/poppler-php: a library supposed to wrap all poppler-utils, but at the moment pdftotext is still unsupported. Also, this library is not very convenient as it forces you to choose an output directory for a file (it does not return processed data as string).

In fact, these two libraries are wrappers to a wrapper, since poppler-utils are just a collection of CLI wrappers for the Poppler C++ library 😉

Which to pick? Native or CLI?

There are a couple of basic considerations.

Native PHP libraries should work independently from the host environment. They are a lot easier to set up and update. The only dependency tool you use is Composer.

CLI tools, especially those written in C/C++, might be faster and use less memory. However, I don’t have hard evidence at the moment. Maybe all the optimizations that came with PHP 7 will make this point obsolete. Also, I believe that C/C++ tools have a wider audience and thus might receive more community support.

You should pick a tool that’s best for your specific requirements. Most tools will do a decent job while simply rendering an unencrypted PDF to an image or some plain text. But if you need to have more control over the output file structure or you want to process encrypted documents, poppler-utils will be a good choice.

Sometimes it occurs to me that many developers are just reinventing the wheel, especially when it comes to the multitude of PDF processing libraries for PHP. The Portable Document Format has almost seven hundred pages of specification. We are all struggling with the same processing issues. That’s why I prefer to choose the best tools in different technologies and connect them with interfaces rather than doggedly sticking to a single technology.

Check out the List of PDF software at Wikipedia.

Picking a PHP tool to generate PDFs (2020 update)

Last updated: January 31st, 2020

TL;DR For HTML to PDF conversion, use the Dompdf library if you don’t need CSS Flexbox or Grid layouts. None of Dompdf, mPDF, TCPDF or wkhtmltopdf supports Flexbox or Grid. Use Google Chrome in headless mode if you need modern CSS rules. Consider browserless.

I spent a lot of time working with different tools to generate PDF files, mainly invoices and reports. Some of these documents were really sophisticated, including multi-page tables, colorful charts, headers and footers. I tried generating documents by hand and converting HTML to PDF, or even LaTeX to PDF.

I know how hard it is to choose between a multitude of libraries and tools, especially when we need to do a non-trivial job. There is no silver bullet; some tools are better for certain jobs and not so good for other jobs. I will try to sum up what I’ve learned through the years.

Two ways of creating a PDF file

A PDF file contains a set of objects which make a document, like pieces of text, images, lines, forms, fonts, and so on. So creating a PDF is an act of putting all these pieces together in a proper order and layout. Most objects utilize a subset of PostScript commands, so you can even write them in your text editor.

One way is to create these objects “by hand”: we add every line of text separately, we draw all tables manually, calculating cell widths and spacings on our own. We must know when to split longer contents into multiple pages. This approach requires a lot of manual work and very good programming skills, so we don’t end up with spaghetti code where it is hard to find any meaningful logic between all the drawing commands.

Another way is to convert an existing document, for example HTML, LaTeX or PostScript, into PDF.

We used LaTeX for an education app which allowed composing tests for students from existing exercises prepared by professionals. Since LaTeX was a primary tool for our editors, it was natural for us to convert their scripts straight to PDF.

Converting HTML to PDF is way more complex, as today’s web standards keep gaining features, such as CSS Flexbox or Grid layouts. Let’s see what we can do.

Native PHP libraries

My first experience was with native PHP libraries where you had to do most things by hand, like placing text in proper positions line by line, drawing rectangles, calculating table cells, and so on. It was quite fun at the time, but creating more complex documents turned out to be very hard. We used the FPDF and ZendPdf libraries (the latter is discontinued).

At some point, I ended up maintaining multiple-page, sophisticated school reports with tables and charts rendered by ZendPdf. Business wanted to add even more types of reports. I decided to rewrite all reports as HTML documents with stylesheets and then try to make PDFs from that.

There are three PHP libraries capable of parsing HTML/CSS and transforming it to PDF:

  • TCPDF
  • mPDF
  • Dompdf

Rendering HTML and CSS certainly isn’t easy, so you cannot expect these libraries to provide the same output you’re seeing in Firefox or Chrome. However, for simple layouts and formatting they should be enough. A plus is that you still do not depend on any external tools, just plain PHP!

To give you some idea of what to expect from the above libraries, I compiled a comparison of invoice renderings. The pictures are made from the same HTML 5 source which utilizes CSS Flexbox to position “Seller” and “Buyer” sections next to each other. It also has some table formatting:

[Images: the same invoice rendered by Google Chrome (reference image), TCPDF, mPDF and Dompdf]

As you can see, none of the PHP libraries understood CSS Flexbox. mPDF and TCPDF had some problems with painting the table. Dompdf performed the best and I’m pretty sure that making the “Seller” and “Buyer” sections the old-school way, with float or <table>, would be enough to get a proper result.

External CLI tools

Native PHP solutions were not enough for me, so I decided to use an external tool backed by a fully functional WebKit rendering engine. My employer was already using wkhtmltopdf, which supports everything I needed: SVG images, multi-page tables, headers and footers with page numbers and section names, and automatic bookmarks. With the old reports rewritten to HTML and CSS, I was able to implement all the new features requested by the business.

wkhtmltopdf certainly isn’t bug-free; for example, I had some issues with repeating table headers on consecutive pages. Also, upgrading from 0.12.3 to 0.12.4 broke my document layout which used dynamic headers and footers, so I had to go back to the old version.

Then I got familiar with PhantomJS, which was used mainly to conduct automatic browser tests in a headless mode (without the browser window). It could also capture PNG and PDF screenshots. PhantomJS used a newer version of the WebKit engine. However, the project is suspended now.

Almost a year before the suspension of PhantomJS, Google announced that Chrome can run in a headless mode. This means you can utilize the latest Blink rendering engine to convert HTML/CSS to PDF from your command line. This is perfect for rendering really complex documents that utilize the latest web standards. The document looks exactly the same in your browser and in the final PDF file, which makes development a lot easier.

Connecting PHP with external tools

The easiest way would be to execute an external tool as a shell command. You can do it with PHP functions like shell_exec or proc_open, but it’s not very convenient.

I recommend using the symfony/process library and utilizing standard streams whenever applicable. A process should accept input HTML through STDIN and send the resulting PDF via STDOUT. It can also report errors through STDERR. It might turn out that you won’t need any temporary files to do the job.

There are also several wrapper libraries, like phpwkhtmltopdf or KnpLabs/snappy.

For Chrome, consider using Browserless. You can choose between a free Docker image with pre-configured Chrome with dependencies, or a paid SaaS platform to convert your HTMLs to PDF. With the Docker image, it is really easy to send HTML and receive PDF via HTTP.

Conclusion

There is a wide choice of PHP libraries and external tools which can be used to dynamically create PDF files. You should choose a combination which suits your business needs. For simple documents, you don’t need a complex rendering engine. Save disk space, CPU and RAM!

Please also remember that many tools are developed by the Open Source community and receive little commercial support. They can be abandoned at any time, or they might not support the newest PHP version from day one (which can impede migrating the rest of your app). And your dependencies have dependencies too, so take a look at composer.json when picking a library.

And if your favorite Open Source tool does not do everything you need properly – maybe try contributing? It’s a community, after all.

Testing PDF documents

I’ve been wondering for some time if PDF is still a valid format. It’s “portable”, of course, but not in today’s meaning – it’s clearly not responsive. Like a fixed piece of paper transformed into a file. However, PDF still has many important use cases like storing invoices, reports or tickets. I spent a couple of years working on sophisticated PDF reports, and this year I even tried to test a process of generating invoices in some ad exchange system. I really wanted this system to be rock solid.

Of course there is no point in comparing a binary PDF file to an expected value. You can’t pinpoint the exact differences in case of an error. I could create a PNG screenshot and compare it to a template, but I was a bit worried about the readability of such a diff. A third way would be to verify the source HTML document used to render the PDF, but in fact I was not interested in the markup, only in the output data that landed inside the PDF.

Following my friend’s advice, I used another tool called Apache PDFBox. This robust library allows performing different operations on PDF documents: creating, merging, splitting, signing, filling forms etc. We decided to extract plain text from a file. It’s as if we used the Select all command, then copied and pasted the text into Notepad.

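// PDDocument.load is the PDFBox 2.x entry point; content is the generated PDF
PDDocument pdf = PDDocument.load(content);
PDFTextStripper stripper = new PDFTextStripper();
stripper.setAddMoreFormatting(true);
// Order text by its position on the page rather than by its order in the file
stripper.setSortByPosition(true);
// Page numbers are 1-based and the end page is inclusive
stripper.setStartPage(1);
stripper.setEndPage(pdf.getNumberOfPages());
String plainText = stripper.getText(pdf);
pdf.close();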

assertThat(plainText).isEqualTo(expected);

A PDF document consists of blocks which can be ordered in a way that we did not really expect. Anyone who has tried copying a table from a PDF to Notepad has experienced this. Luckily, PDFBox tries to help us by organizing the blocks and formatting the plain text dump.

We made a lot of test scenarios using the above solution and they did a really good job catching all the little bugs in the data. It was crucial to detect any mistakes because our system was preparing financial documents. Moreover, the test reports were very readable.

The only problem with the above method is that it does not test the layout correctness. To achieve that, we could extract only specified regions from the document. In such a case we assume that a rectangle with x1,y1,x2,y2 coordinates contains, for example, the customer’s data:

Rectangle2D region = new Rectangle2D.Double(x1, y1, x2 - x1, y2 - y1);
PDDocument pdf = PDDocument.load(content);
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.addRegion("Contact region", region);
stripper.extractRegions(pdf.getPage(0));
String plainText = stripper.getTextForRegion("Contact region");
pdf.close();

assertThat(plainText).isEqualTo(expected);

Java: Integer or int?

When I first saw primitive types like int or boolean mixed up with classes, I was very tempted to convert all primitives into Integer, Boolean and so on to maintain a clear coding style. But reading articles on the Internet and IntelliJ hints stopped me from doing such a stupid thing.

I read this wonderful rule of thumb:

Don’t create unnecessary objects.

Primitives always occupy less memory than objects, which was clearly described in this article on primitives. Maybe we don’t usually care about memory usage, but if we process large amounts of data, every byte can count. Moreover, accessing primitives is usually faster: a local primitive lives directly on the stack, while a wrapper is a separate object on the heap that has to be dereferenced.

It’s also important to remember that a variable pointing to an Integer or Boolean object can be null. A primitive can’t. This difference matters, for example, when you retrieve data from an SQL database which allows NULLs. If a null wrapper gets unboxed into a primitive, you’ll get a Null Pointer Exception at runtime. That’s a bug I noticed and fixed in a production system. So I always try to restrict the possibility of using NULLs in the database and in the code unless null has a real business meaning.
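To make this failure mode concrete, here is a minimal sketch of what happens when a null wrapper is unboxed. The class and method names (NullUnboxingDemo, unsafeUnbox, unboxOrDefault) are made up for illustration, and the nullable Integer merely stands in for a value read from a nullable database column.

```java
class NullUnboxingDemo {

    // Hypothetical helper: returns the wrapper as a primitive.
    // The compiler inserts value.intValue(), so a null argument throws NPE here.
    static int unsafeUnbox(Integer value) {
        return value;
    }

    // Hypothetical helper: checks for null before unboxing and falls back
    // to a default value instead.
    static int unboxOrDefault(Integer value, int fallback) {
        return value != null ? value : fallback;
    }

    public static void main(String[] args) {
        Integer fromDatabase = null; // stands in for a NULL column value

        System.out.println(unboxOrDefault(fromDatabase, 0));

        try {
            unsafeUnbox(fromDatabase);
        } catch (NullPointerException e) {
            System.out.println("NPE while unboxing a null Integer");
        }
    }
}
```

Whether a default value is acceptable depends on the business meaning of the column. If "unknown" and "zero" are different things, keeping the wrapper type and handling null explicitly is probably the safer choice.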

NPE: Converting a list of objects to a map

Like every Java developer, I had a fair amount of Null Pointer Exceptions in my career. That’s why I decided to describe real-world examples I had to fix.

Today I will show you how to get an NPE while retrieving records from a database and creating a map based on these records, using streams and collectors introduced in Java 8.

Let’s say we have a Setting class:

public class Setting {
    private final int accountId;
    private final String type;
    private final String value;

    public Setting(int accountId, String type, String value) {
        this.accountId = accountId;
        this.type = type;
        this.value = value;
    }

    public String type() {
        return type;
    }

    public String value() {
        return value;
    }
}

Now, in another part of the system we will retrieve Setting objects from a database. We haven’t noticed that the SQL table allows NULL in the value column. Let’s pretend our list of settings looks like this:

List<Setting> settings = new ArrayList<>();
settings.add(new Setting(1, "site_title", "My blog"));
settings.add(new Setting(1, "site_description", "A place with fresh ideas"));
settings.add(new Setting(1, "site_copyright", null));

We want to easily access settings by name, so we create a map using a stream:

Map<String, String> settingsMap = settings
    .stream()
    .collect(Collectors.toMap(Setting::type, Setting::value));

Bang! We’ve got an NPE here because one of the values used to create the map is null. It turns out that Collectors.toMap() does not accept null values.

Let’s think: do we really need a NULL in this case? I guess not. A NULL value means “no data” or “unknown value”, which is the same as if the setting did not exist. We should set the value column as NOT NULL. We can additionally filter the settings list:

Map<String, String> settingsMap = settings
    .stream()
    .filter(setting -> null != setting.value())
    .collect(Collectors.toMap(Setting::type, Setting::value));

A NULL would make sense, for example, if we wanted to create a map between people’s names and their ages. If we don’t know someone’s age, then a null seems reasonable. We would then have to create the map in a different way.
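One possible "different way" is sketched below. It is only an illustration, not necessarily the approach from the linked resources: the Person class and the agesByName method are hypothetical names, and the trick is to collect into a HashMap through put(), which accepts null values, instead of going through Collectors.toMap().

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NullableValuesMapDemo {

    // Hypothetical class for the name/age example.
    static final class Person {
        final String name;
        final Integer age; // null means "age unknown"

        Person(String name, Integer age) {
            this.name = name;
            this.age = age;
        }
    }

    // Uses the three-argument collect(): supplier, accumulator, combiner.
    // HashMap.put() stores null values, so no NPE is thrown.
    static Map<String, Integer> agesByName(List<Person> people) {
        return people.stream().collect(
                HashMap::new,
                (map, person) -> map.put(person.name, person.age),
                HashMap::putAll);
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
                new Person("Alice", 30),
                new Person("Bob", null));

        Map<String, Integer> ages = agesByName(people);
        System.out.println(ages.containsKey("Bob"));
        System.out.println(ages.get("Bob"));
    }
}
```

Keep in mind one behavioral difference: with put(), a duplicate key silently overwrites the earlier entry, whereas the two-argument Collectors.toMap() throws an IllegalStateException on duplicates.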

You can read more about this on Stack Overflow. Alternative ways to convert a list to a map can be found on Baeldung.