“Going Big” With a Website, Part 3: Databases that Scale
Editor’s Note: This week, we complete our guest series with Larry Ullman’s third installment in his discussion about scaling a website. If you’re just joining us, be sure to check out Part 1: Infrastructure that Scales and Part 2: Code that Scales.
As a web developer, writer, and public speaker, I often interact with people of various skill levels, talents, and interests. This is one of the joys of my career: I’m fortunate enough to bear witness to the thoughts and experiences of other programmers, “idea” people, and just plain dreamers.
One of the common topics that comes up, or that I am directly asked about, is how one “goes big” with a website. In this three-part series, I explain everything I believe and know (or think I know) when it comes to this subject. In Part 1, I covered the myths of going big, what infrastructure you’ll eventually need, and how one should start developing a new project. In Part 2, I discussed how one writes code that can scale well should your site “go big”. And here, in Part 3, I turn to designing databases that can handle the traffic of a “big” site.
Selecting a Database Application
The first step you’ll need to take (from a database perspective) when designing any new site is choosing the right database application. MySQL has been the de facto king in this arena for some time, at least for beginners and in the open source community. This is largely still the case, even though:
- PostgreSQL is a fine alternative
- There are variations on MySQL, such as MariaDB
- Many excellent non-relational applications exist, like MongoDB
- There are non-database storage solutions such as Hadoop
Of the many goals I have for this series, getting into a debate as to the best database application to use is not one of them. Rightly or wrongly, there are other factors that go into the decision, including with which applications you are familiar, have access to, and so forth. For the sake of the rest of this article, I’m going to focus solely on MySQL, for two reasons:
- It’s an extremely common choice.
- It’s known to have scalability issues that some of the others (e.g., MongoDB) deliberately do not.
Tuning and Indexing a Database
You’ll get the best performance out of your database if you tune the database application and make proper use of indexes in the database itself. Doing both correctly requires understanding in some detail how the database application works.
As for tuning, this may not be something you have the authority to do, unless you have a VPS or dedicated server. But it is something you should investigate, as the default settings for MySQL are too generic to be optimal for all cases. Working with a hosting company like ServInt that provides excellent customer service (and technical service) will be a tremendous asset in this area. Also, become friends with the MySQL manual (for your version of the database application) to see what options exist and what appropriate values should be. Use lots of benchmarks to confirm that you’re making the correct choices.
Looking at the database itself, performance is impacted greatly by your use of indexes. Simply put, creating the proper indexes for your tables improves the performance of your database (well, for SELECT queries anyway, which are the most important). To start, you should index columns that are:
- The primary key
- Frequently used in WHERE clauses
- Frequently used in ORDER BY clauses
- Frequently used as the basis for a JOIN
You should not index columns that:
- Allow NULL values
- Have a very limited range of values (such as Y/N or 1/0)
The other thing to keep in mind is how the index is defined. For example, if queries might sort by last name, you could index just the first few characters of the last name column. By doing so, you get the performance benefit of treating the column as if it were of fixed-length. You can also create indexes on more than one column at once. For example, you might create an index on the combination of the email address and password (for logging in purposes). If the index was created with the columns named in that order, you would also effectively be creating an index on the email address by itself, which could be beneficial. The same goes for indexing last_name, first_name: that creates an index applicable to ORDER BY last_name, first_name as well as one that applies to just ORDER BY last_name.
To confirm that you’ve got the right indexes for your database, use the EXPLAIN command to verify how well MySQL is using the indexes you’ve created for the queries your site executes.
A second factor for performance is how your columns are defined. Databases (and computers) work with numbers, particularly integers, faster than they do strings, so prefer numeric types whenever possible.
Changing Your Queries and Design
If you want to improve the performance of your database, also take a look at the code that’s using the database. Be absolutely certain to only retrieve (e.g., in PHP) the columns and rows you’ll absolutely use. There’s no point in selecting every column or every row if all of that data won’t be used. Also try to limit uses of:
These are among the most expensive queries you’ll run (in terms of time to process). This brings me to a debate you’ll need to have: normalization versus performance (in much the same way that your code has to strike a middle ground between being optimal and being flexible, discussed in Part 2).
In my books, I always argue for normalizing a database first and foremost. Normalization protects the integrity of your data, which is critical, but it does so at the cost of performance. Normalization often requires complex queries, like JOINs, UNIONs, subqueries, and groupings, to retrieve the information needed. These complex queries are the hardest for the database to execute, and therefore take the most time. For most sites, the extra effort (and the performance hit) is not a problem. But when sites get big, most sites drop the single complex query for multiple, faster queries.
On a less busy site, the act of running a query will be the most time consuming, so one more complex, slower query is better. On a very busy site, especially when you have multiple database servers, individual queries can be executed faster. Therefore, breaking complex queries into smaller, faster queries, is preferred.
The fact is that as sites get bigger, relational databases are used less and less as originally intended (i.e., as relational). The switch that many active sites are making to non-relational (aka NoSQL or non-SQL) databases isn’t so much a change from one database type to another as it is a change from using a relational database in a non-relational manner to using an actual non-relational database.
A similar trick that big sites use to improve performance is to break up what could be single tables into multiple. For example, you might have a users table that stores everything about a user. Each time a user logs in, or any time any query references that table, any extraneous information stored in the table – the user’s address, gender, whether the user likes watermelon or not, or whatever – impacts the performance of that query, even when that information is not referenced by the query itself. The fix in such cases is to put the bare minimum required information in the primary users table (e.g., username, email address, password, user ID, and maybe registration date), and move everything else to another table. Again, multiple queries will be required to retrieve all the information, but multiple smaller queries may execute faster than single larger queries. You’ll also find that all queries will run better if large text and binary columns are moved to their own separate tables. If you implement this approach properly, the first table could have SELECT queries performed using any column (e.g., the email address or the username), but the secondary tables would always use SELECT queries off of foreign keys: the FK related to the PK from the original table.
Finally, the best way to optimize your database and database interactions is to not query the database at all. This can be accomplished by making smart use of caching.
Those are my recommendations for tuning, indexing, and designing a database that can handle “going big”. If you do these things poorly, you’ll need to throw more resources at your database application earlier than you should. But if you do these things well, you’ll be able to handle a fair amount of traffic for quite some time, and only have to switch to more hardware when it’s absolutely necessary (thereby saving you money). If your site really takes off, you’ll have to be prepared to later change the database design to less-relational structure, in order to keep up with more exponential growth. That, though, is a good problem to have.
I hope you’ve enjoyed this series on “going big”, and maybe even learned something new. Most importantly, I hope I’ve been able to clear up some of the misunderstandings about “going big” with a website, and how you should, and shouldn’t, allocate your precious resources to pull that off.