Scaling a SaaS Backend System with Shards
Optify's digital marketing software collects clickstream data for our clients: digital marketing agencies and B2B marketers. While it is easy for our clients to use Optify's software, managing the collected data gets harder and harder as time goes by. Admittedly, this is a good problem to have (customers love Optify and keep the data flowing), and it is fun to scale the system to give our clients a responsive application.
Back in 2010, our database backend server's performance was slowly degrading as the data grew. While it would have been simple to buy a bigger server, we couldn't keep throwing money at the datacenter, and we might well have maxed out the server configuration already. So instead of scaling up the one server, we decided to scale horizontally.
“Horizontal partitioning is a database design principle whereby rows of a database table are held separately, rather than being split into columns (which is what normalization and vertical partitioning do, to differing extents). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.” — Wikipedia
The idea of horizontal partitioning is well documented; I won't bore you with the details in this post. However, I will share a few strategies we used to ensure a successful migration.
What to shard
Normally in a webapp, some data grows slowly (e.g. the list of users' reports) and some grows much faster (e.g. clickstream pageviews). Originally, all tables were stored in one database. We decided not to replicate all tables across all shards, and instead moved only the fast-growing tables to the shard databases. This way we wouldn't need to replicate the core tables against many shard databases, and we only had to migrate a fraction of the tables. Granted, the application then needs both the core and the shard databases to be up and running, but if the core database goes down we have a much bigger problem anyway.
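To make the split concrete, here is a sketch of the kind of table placement this implies (the table definitions are illustrative, not our actual schema):

-- stays in the core database: slow-growing, shared data
create table report (
    id          bigint primary key,
    customer_id integer not null,
    name        text
);

-- moves to a shard database: fast-growing clickstream data
create table visitor (
    id          bigint primary key,
    customer_id integer not null,
    last_seen   timestamptz
);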
We decided to partition the data by customer, i.e. a shard holds the data for one or more customers. We also decided that a customer's shard assignment would be dynamic: we use an admin tool to change a customer's shard assignment if its data grows too fast. This way we don't have to restart any backend process, which makes migrating customers a lot easier since we don't need to coordinate any downtime.
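A minimal sketch of how such a dynamic mapping could be stored, assuming a customer_shard table in the core database (the table and column names are hypothetical):

-- core database: one row per customer, maintained by the admin tool
create table customer_shard (
    customer_id integer primary key,
    shard_id    integer not null  -- which shard database holds this customer's data
);

-- reassign a fast-growing customer to a less loaded shard
update customer_shard set shard_id = 7 where customer_id = 42;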
How many shards
Migrating from one database to many means a lot of records are carried to a new destination, which is time-consuming and error-prone. To split the existing data once and, hopefully, never again, we migrated into 15+ shard databases from the start. Multiple databases are hosted on one server to cut costs. If some day we decide that one shard needs more processing power, we will simply move that database to another server.
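Since PostgreSQL can host many databases on a single server, the initial layout can be as simple as this (shard names are illustrative):

-- all shard databases start on one PostgreSQL server; any of them
-- can later be dumped and restored onto its own server when it
-- outgrows the shared hardware
create database shard_01;
create database shard_02;
-- ...
create database shard_16;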
Primary key sequence manipulation
Data that lives in different shard databases might need to be reunited some day, e.g. in a business intelligence database for some serious data crunching. Since we have data from 15+ shard databases, we need to prevent records from colliding on the primary key.
To do so, we manipulated the id sequence of each table to start at a shard-specific offset and increment by 10000. Every id generated in shard 13 is then congruent to 13 modulo 10000, so as long as each shard gets a distinct offset, ids from different shards can never collide (and the increment leaves room for up to 10000 shards). For example, we might run this for a table in shard 13:
alter sequence visitor_id_seq restart with 300000013 increment by 10000;
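With that in place, successive ids drawn from shard 13 look like this:

select nextval('visitor_id_seq');  -- 300000013
select nextval('visitor_id_seq');  -- 300010013
select nextval('visitor_id_seq');  -- 300020013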
Remove database joins
In the old world, all tables were in one database; now we were moving some of them to other databases. You cannot execute a single query against two different databases (in PostgreSQL, anyway), so naturally quite a few queries needed to be changed. We needed to locate them quickly, without first moving the tables away in the development environment. Luckily, PostgreSQL lets us change the permissions of runtime users on selected tables. When the software connected to the core database as the original user, it could no longer access the to-be-sharded tables because of the permission change, so the bad queries failed and we knew exactly what to change:
revoke select, insert, update, delete on visitor from optify_user;
grant select, insert, update, delete on visitor to optify_shard_user;
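Once a query surfaced this way, the cross-database join had to be replaced with one query per database, stitched together in application code. A sketch of the pattern (the customer table and its columns are hypothetical):

-- before: a single join across a core table and a to-be-sharded table
select c.name, v.last_seen
from customer c
join visitor v on v.customer_id = c.id
where c.id = 42;

-- after: two queries against two databases, combined by the application
select id, name from customer where id = 42;          -- core database
select last_seen from visitor where customer_id = 42; -- shard database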
The migration had a lot of hiccups (that's a story for another blog post), but in the end we have more database servers hosting smaller tables, and the performance of the Optify application improved at least tenfold.