Why I didn’t end up being a database engineer

Once upon a time, looking for new career opportunities, I considered becoming a database engineer or architect. I had been pretty successful at optimizing MySQL databases that grew to 65 GB, and I liked it. However, I noticed almost no one was hiring MySQL database engineers – apparently it’s too simple. Real challenges await those familiar with Oracle and big systems in finance, aviation and so on. So, I bought a thick book about Oracle and… never used it.

The only profit I got from buying an Oracle guide was another book offered in the bundle: “SQL Antipatterns: Avoiding the Pitfalls of Database Programming” by Bill Karwin. It gave me a lot of valuable tips which I still eagerly share with my coworkers.

Is the database the center of every app?

Anyway, I used to believe that the database is at the center of every application. Most applications process data, right? So I started every project by conceiving a database model: relations, data types, queries, and so on. I spent a lot of time optimizing data structures.

My approach had several advantages. First of all, it was intuitive for other developers: a good database model let them immediately understand the purpose of the system and navigate easily through the growing codebase. Secondly, a good model let datasets grow significantly (to hundreds of millions of records) without performance loss. Moreover, with correct foreign keys and other constraints, the data remained consistent.

I have a funny story about a team of developers who gave up on foreign key constraints. They had used the ON DELETE CASCADE clause everywhere, and one day they accidentally wiped out half of the test database. It was such a shock that they removed all the foreign keys. Soon, the data became terribly inconsistent and a lot of bugs surfaced in the apps depending on that database.

Carving a SQL giant

Optimizing databases was cool, but there was one bad practice I really disliked in some DBAs’ work. After I read about the SOLID principles, Domain-Driven Design and other such ideas, I started to hate SQL stored procedures, especially long ones.

I saw some MS SQL devs trying to handle all the business logic with SQL scripts. These systems already had many bugs, and my fellow devs just kept adding more CASE and IF blocks in response to requests from the business people. After some time, it was nearly impossible to add or change anything without breaking something.

Such situations made me realize that you have to choose the right tool for the job, not bend the job to fit the tool. Of course data is a huge asset for every company, but behavior is important too. Is SQL the right tool to model complex business behavior?

Getting to know the big picture

For me, data has never been the biggest challenge. It is always the sophisticated business logic that needs to be modelled very carefully. It evolves over time, and we have to make sure all developers (especially new ones) grasp it quickly and flawlessly.

I decided to improve my process modelling skills with OOP and FP, and to learn SOLID, DDD and other widely adopted principles and design patterns. I decided to share business knowledge with all my teammates, to make sure they understand the purpose of their work and the nature of the surrounding business. Of course SQL knowledge is valuable too, but it’s not enough.

How I optimized a process from 35 to 7 hours

Most of my day job isn’t fascinating: yet another controller, service, test, and so on. I spent a lot of time doing repetitive tasks and slowly gaining more knowledge about the system I was working on. However, a slow and steady pace can eventually reward you with an opportunity to make a really great improvement. It happened to me twice.

I spent four years maintaining a project with a 65 GB MySQL database and hundreds of millions of records. In the beginning, the system seemed very complex. It contained lots of legacy code, and many classes turned out to be obsolete. I needed some time to gain confidence with this project. After two years, I had reduced the database size to 15 gigabytes without any data loss. Of course, during my work the database gained another few million records, but that didn’t stop me from achieving a stunning optimization.

It wasn’t a single database migration, but a series of small ones. It took me many months to come up with all the optimizations, some of them subtle – but with 300 million records, every byte counted. Schema changes also required application code changes, and I did not want to create overly large PRs. Moreover, I couldn’t just execute a migration on a 60 GB table whenever I wanted. I had to agree on a downtime window with the Product Owner. And, of course, I had to prepare backups and a rollback strategy.

Then I moved on to an advertising platform which had a complex invoicing system. Every night, a cron job ran to create and send PDF invoices. The process was supposed to finish by morning, but it didn’t, and people did not get their documents in time. I discovered that a single run could take up to 35 hours, even if only a few documents were generated.

Again, I had to do several boring maintenance tasks before I had the courage to optimize the complex invoicing process. After gaining some basic system knowledge, I noticed that the cron job did not have any tests, so every change required manual testing. I spent some time writing unit and integration tests, which helped me understand the process even better.

When I was ready to introduce changes, I talked to the Product Owner and he agreed to include that work in the upcoming sprint. I needed two weeks to do the necessary measurements and experiments. In the end, I successfully deployed my changes and the process shrank from 35 to 7 hours. I removed a lot of redundant database queries simply by reviewing the boolean logic and the control flow. Are you aware of how boolean expressions are evaluated? Knowing such nuts and bolts can really pay off.
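To make that concrete, here is a minimal sketch (all the names are hypothetical) of the kind of reordering I mean. In Java, && evaluates left to right and stops as soon as the result is known, so putting the cheap check first can skip an expensive query entirely:

public class ShortCircuitDemo {

    // Stands in for a method that hides an expensive SQL query.
    static boolean hasUnpaidInvoices(long userId) {
        System.out.println("running expensive query for user " + userId);
        return true;
    }

    public static void main(String[] args) {
        boolean userIsActive = false;

        // Bad order: the expensive call always runs, even though
        // userIsActive alone already decides the outcome.
        boolean before = hasUnpaidInvoices(42L) && userIsActive;

        // Good order: && short-circuits, so when userIsActive is false,
        // the right-hand side is never evaluated and the query never runs.
        boolean after = userIsActive && hasUnpaidInvoices(42L);

        System.out.println(before + " " + after);
    }
}

The same reasoning applies to ||: once the left-hand side is true, the right-hand side is never evaluated.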

How to conduct a successful process optimization

  1. Know the business logic and the code of the project you are working on. To get there, maintain a steady workflow. Don’t be afraid of boring tasks – you have to start somewhere.
  2. Test the current behavior that you plan to optimize. Write automated tests, or at least prepare manual scenarios, covering as many cases as possible (not just the happy paths). This will expand your project knowledge even further.
  3. Measure all steps of the current process (see the timing sketch below this list). Which step is the slowest and why? How much time does it take, on average, to process a single entity? You need measurements to know whether you’re making any progress.
  4. Introduce changes in the code and verify them on your local or test environment.
  5. Gather feedback during code review. Maybe someone will notice dangerous changes you overlooked.
  6. Analyze what is needed for deployment. Will it require downtime? Any database migrations? Are there any other applications depending on your module/service/app?
  7. Practice deployment in a test or pre-production environment.
  8. Discuss the deployment time with the team and stakeholders.
  9. Good luck! Go with the deployment 🙂
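As promised in step 3, here is a minimal timing sketch (the stage names are hypothetical): wrap each stage of the process in a timer, so you can compare averages before and after a change.

import java.time.Duration;
import java.time.Instant;

public class StepTimer {

    // Runs one stage of the process and prints how long it took.
    static void time(String label, Runnable step) {
        Instant start = Instant.now();
        step.run();
        long ms = Duration.between(start, Instant.now()).toMillis();
        System.out.println(label + " took " + ms + " ms");
    }

    public static void main(String[] args) {
        // Hypothetical stages standing in for the real process.
        time("fetch invoice data", () -> sleep(120));
        time("render PDF files", () -> sleep(480));
        time("send e-mails", () -> sleep(200));
    }

    static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}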

Java: Integer or int?

When I first saw primitive types like int or boolean mixed up with classes, I was very tempted to convert all primitives into Integer, Boolean and so on, to maintain a consistent coding style. But articles on the Internet and IntelliJ’s hints stopped me from doing such a stupid thing.

I read this wonderful rule of thumb:

Don’t create unnecessary objects.

Primitives always occupy less memory than objects, which was clearly described in this article on primitives. We don’t usually care about memory usage, but if we process big amounts of data, every byte can count. Moreover, accessing primitives is faster: a local primitive lives on the stack, while an object lives on the heap.
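A classic way to see the cost (a minimal sketch, not a proper benchmark): summing into a boxed Long allocates a new object on almost every iteration, while a primitive long performs plain arithmetic.

public class BoxingCost {

    public static void main(String[] args) {
        // Boxed: each += unboxes, adds, and boxes a brand new Long
        // (the small-value cache only covers -128 to 127).
        Long boxedSum = 0L;
        for (long i = 0; i < 1_000_000; i++) {
            boxedSum += i;
        }

        // Primitive: plain arithmetic, no allocations.
        long primitiveSum = 0L;
        for (long i = 0; i < 1_000_000; i++) {
            primitiveSum += i;
        }

        System.out.println(boxedSum + " " + primitiveSum);
    }
}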

It’s also important to remember that a variable pointing to an Integer or Boolean object can be null, while a primitive can’t. This difference matters, for example, when you retrieve data from an SQL database that allows NULLs. If you try to assign a null wrapper to a primitive, unboxing will throw a NullPointerException. That’s a bug I noticed and fixed in a production system. So I always try to restrict the use of NULLs, both in the database and in the code, unless null has a real business meaning.
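A minimal sketch of that bug: reading a nullable column into a wrapper type is fine, but unboxing it into a primitive blows up the moment the value is null (with JDBC, for instance, ResultSet.getObject returns null for a NULL column).

public class NullUnboxing {

    public static void main(String[] args) {
        // What a NULL city_id column might map to after loading a row.
        Integer cityId = null;

        // Unboxing a null wrapper throws NullPointerException at runtime.
        int id = cityId;

        System.out.println(id); // never reached
    }
}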

When your SQL database is missing foreign keys

…then sooner or later, you’re going to have a bad time. Bugs in your app or users’ recklessness will cause your database to be inconsistent.

An example from my job: a system had users and users_categories tables. While registering their accounts, new users entered not only an e-mail address and a password, but also selected categories, like teacher, student or parent. The data was immediately inserted into a MySQL database, but the account had to be activated via e-mail. A script was executed every day to purge inactive accounts. It wiped records only from the users table, not from users_categories, and there were no foreign keys to block that behavior.

Every company has some erroneous legacy code here and there. I have seen databases reaching 60-70 gigabytes in size, with hundreds of millions of records, that had no foreign key constraints because someone… was afraid of them. A long time ago, one database in the company had ON DELETE CASCADE foreign keys, which means deleting one record caused cascading deletions of related records. My colleague destroyed almost half of a test database this way, so he decided to remove the foreign key constraints. That was a classic misunderstanding of the technology we were using.

How does a foreign key constraint work?

In a relational database, foreign keys are meant to secure the references between entities. For example, in the users_categories table we have two relations: with the users and categories tables. No row in users_categories can reference a non-existent user or category.

The only exception is if a row contains NULL values. This is a way to create optional relations.

Creating a foreign key usually looks like this (using the InnoDB engine in MySQL):

ALTER TABLE users_categories
ADD CONSTRAINT fk_users_categories_user_id
FOREIGN KEY (user_id) REFERENCES users (id)
ON DELETE RESTRICT;

We create a constraint with an explicit name (we don’t have to, but a name comes in handy when reverting the migration later, e.g. with ALTER TABLE … DROP FOREIGN KEY). We specify the column to be restricted, and the destination table and column we want to refer to. Finally, we decide what happens when someone tries to delete referenced data. The default behavior is to restrict such a query and raise an error – for example, when we try to remove a user without removing his or her category entries first. We could also choose cascading deletion or setting NULLs, but I always pick the safest option, which is RESTRICT.

Foreign keys require that:

  • both columns (the one we create the constraint for and the one we refer to) must have exactly the same type; a SMALLINT column cannot refer to an INT one,
  • there must be an index on the column in the source table; it can be a multi-column index starting with that column; if no matching index exists, a new one will be created,
  • the constraint must have a unique name within the schema.

Introducing foreign keys in an inconsistent database

If you have existing data without foreign keys, you need to clean it up first. That’s a challenge, because sooner or later a database without foreign keys becomes a mess. MySQL will not allow us to create a foreign key on wrong data – unless we issue a SET FOREIGN_KEY_CHECKS=0 query before adding constraints (which is what mysqldump does by default). However, we want not only to keep the data consistent from now on, but also to fix what we already have. We need to somehow untangle the existing spaghetti.

I decided to analyze existing data to know:

  • how many records have wrong references,
  • which strategy will best fix these particular records,
  • whether there are differences in column types.

On every table I executed a query showing all records with wrong references:

SELECT * FROM users_categories uc WHERE NOT EXISTS (
SELECT * FROM users u WHERE u.id = uc.user_id
);

In the example above, if the user_id column allowed NULL values, we should add the condition uc.user_id IS NOT NULL to the outer WHERE clause.

I noticed a funny trend among developers who don’t use foreign keys: they define optional references as, for example, INT NOT NULL DEFAULT 0. This is not correct, because in referenced tables with AUTO_INCREMENT or SERIAL primary keys there are usually no records with id = 0, so introducing a foreign key on such data will not work. I had to modify the table schema and then change all the 0s to NULLs. Let’s say that a user may have an optional reference to a city:

ALTER TABLE users MODIFY city_id INT DEFAULT NULL;

UPDATE users u SET u.city_id = NULL WHERE NOT EXISTS
(SELECT * FROM cities c WHERE c.id = u.city_id);

Here we can see the strategy of setting NULL whenever we cannot find the destination entity. We don’t want to remove a user’s record just because it refers to a non-existent city. NULL means a value is unknown or uncertain (it’s not the same as an empty value or 0). If we have an incorrect city_id value, we cannot tell which exact city it refers to – so we set it to NULL.

A different strategy can be applied to many-to-many join tables. In the users_categories table I mentioned, if a record has an invalid user_id, an invalid category_id, or both, the whole record can be removed:

DELETE uc FROM users_categories uc WHERE NOT EXISTS (
SELECT * FROM users u WHERE u.id = uc.user_id
) OR NOT EXISTS (
SELECT * FROM categories c WHERE c.id = uc.category_id
);

Before you pick a particular data cleanup strategy, you need to know the business well. Maybe the wrong data feeds some reporting systems, and if you suddenly fix the stats, you will confuse people who have relied on those numbers for a long time. They might perceive your ingenious fix… as a mistake 🙂

Fixing complex cases

Sometimes, relations between entities form a long chain: for example, a user places an order which consists of several items, every order item refers to a product, and every product… has been added by some admin user. First check the most generic entities (like products), then dig deeper into orders and order items. Take a look at the following steps to see which strategy I pick in different scenarios:

  1. Check if there are products added by non-existent users. Information about a removed user is not necessary for the system to work, so a wrong user_id can be set to NULL.
  2. Check if there are orders with a wrong user_id. Order history is essential for legal, tax and reporting purposes. We cannot remove any orders, so we just set user_id = NULL when the user does not exist (maybe they were not logged in?).
  3. Check if there are any order items related to non-existent orders or products. Items contain important invoicing data like net and gross prices. But if a report requires a proper reference from orders_items to the orders table, and we have queries like SELECT … FROM orders JOIN orders_items …, then the JOIN clause eliminates wrong records anyway. So we can remove items that refer to wrong orders. The same applies to product_id: we won’t prepare sales reports for products which do not exist.

Of course, before executing such dangerous queries, we need to have a backup. And while we analyze the details of the system we’re trying to clean up, we should consult as many people with business knowledge as possible.

I know it would be nice to have an automated script that cleans up all the data before adding constraints. But there is no silver bullet, because every case is different. You need to use your creativity and intuition! It’s not easy, but in the end, the satisfaction will be great.

Some interesting further external reading: