All you need to know about Database Partitioning as a developer


What is database partitioning?

Database partitioning means breaking a large table into smaller pieces (partitions) so queries can reach the data they need quickly, without touching the whole table.

For example, say we have a table with 1 million rows, with ids running from 0 to a million, and we need the row with id 700001. A naive approach would be to query the database as

`SELECT name FROM customers WHERE id = 700001`

Without partitioning, the database has to scan the whole table to find this one row. That is not what we want, because it's computationally expensive.

To avoid this, we create partitions: say ids 0 to 200k, 200k to 400k, 400k to 600k, and so on. A lookup then only needs to search the single partition its id falls into.
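
As a minimal sketch of what that looks like in Postgres (the table name, partition names, and ranges here are illustrative, mirroring the example above):

-- illustrative only: a range-partitioned version of the customers example
create table customers (id int not null, name text not null) partition by range (id);

-- each partition owns one slice of ids (the upper bound is exclusive)
create table customers_p0 partition of customers for values from (0) to (200000);
create table customers_p1 partition of customers for values from (200000) to (400000);
create table customers_p2 partition of customers for values from (400000) to (600000);
create table customers_p3 partition of customers for values from (600000) to (800000);
create table customers_p4 partition of customers for values from (800000) to (1000001);

-- this lookup now only searches the 600k-800k partition
select name from customers where id = 700001;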

What is horizontal and vertical partitioning?

  • Horizontal partitioning splits rows into partitions

    • Range or List
  • Vertical partitioning splits columns into partitions

    • A large, rarely-read column (e.g. a blob) can be stored in its own tablespace on a slow-access drive, as sketched below
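
A rough sketch of the vertical idea (assuming Postgres; the table names are hypothetical, and the tablespace comment shows where the slow drive would come in):

-- keep the hot, small columns in the main table
create table users (id serial primary key, email text not null);

-- move the large, rarely-read blob into its own table, which could be
-- placed in a tablespace on cheaper, slower storage
-- (e.g. create tablespace slow_disk location '/mnt/slow';
--  then append: tablespace slow_disk)
create table user_avatars (
    user_id int primary key references users (id),
    avatar  bytea
);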

What are the types of partitioning?

  • By Range

    • Dates, ids (e.g. by log date or customer id)
  • By List

    • Discrete values (e.g. states CA, AL, etc.) or zip codes
  • By Hash

    • Hash functions (e.g. consistent hashing, as used by Cassandra); a sketch of the list and hash variants follows below
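
For reference, here are hedged one-liners of the list and hash variants in Postgres syntax (table and partition names are illustrative):

-- list: one partition per discrete value
create table orders (id serial, state text not null) partition by list (state);
create table orders_ca partition of orders for values in ('CA');
create table orders_al partition of orders for values in ('AL');

-- hash: rows are spread evenly across a fixed number of buckets
create table events (id serial, payload text) partition by hash (id);
create table events_h0 partition of events for values with (modulus 2, remainder 0);
create table events_h1 partition of events for values with (modulus 2, remainder 1);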

Horizontal Partitioning vs Sharding

  • Horizontal Partitioning splits a big table into multiple tables in the same database, and the client is agnostic to the split.

  • Sharding splits the big table into multiple tables on multiple DB servers.

  • Horizontal Partitioning changes table names (or schema)

  • In sharding everything is the same (schema, table names); just the server changes, as shown in the sketch below.
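
To make the contrast concrete, here is a hedged illustration (shard1 and shard2 are hypothetical servers; the DDL is identical on both, and only the host and the key range it owns differ):

-- on shard1 (e.g. psql -h shard1): owns customer ids 0..499999
create table customers (id int primary key, name text);

-- on shard2 (e.g. psql -h shard2): owns customer ids 500000..999999
create table customers (id int primary key, name text);

-- the application (or a routing layer) picks the server from the id;
-- each database knows nothing about the other shards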

Demo of Horizontal Partitioning - Example with Postgres

  • For this example, let's spin up a Postgres instance with Docker. If you don't have Docker installed, you can get it from the Docker website.
docker run --name pgmain -d -e POSTGRES_PASSWORD=postgres postgres
docker exec -it pgmain bash
psql -U postgres

Now you will see a prompt like this:

postgres=#

Here we can query our Postgres database.

create table grades_org ( id serial not null, g int not null);

This query creates the original grades table (grades_org). This is the table we will partition.

insert into grades_org(g) select floor(random()*100) from generate_series (0, 10000000);

This query generates ten million random grades (0 to 99) and inserts them into the table. It might take some time depending on your system, so be patient.

create index grades_org_index on grades_org(g);

After we have successfully inserted ten million rows, let's index the table so we can query it faster.

\d grades_org;

Let's describe the table using \d

explain analyze select count(*) from grades_org where g = 30;

This query counts all records with a grade (g) of 30; explain analyze shows you the query plan and timing. Note the time taken for this query to execute.

explain analyze select count(*) from grades_org where g between 30 and 35;

This query counts all records with grades between 30 and 35. Notice that it uses a range rather than a single value. Again, note the time taken for this query to execute.

Now we have seen how querying an indexed table of ten million rows performs.

Let's try to partition this table.

To do so, let's create a new table, grades_parts. We will set the old table aside for a while.

create table grades_parts (id serial not null, g int not null) partition by range(g);

Here we are creating partitions by range, as specified in the query. We have to define the partitions for our data ourselves.

Here we create 4 partitions (note that in Postgres the range's lower bound is inclusive and the upper bound exclusive):

  • g0035 - for grades 0 to 35,

  • g3560 - for grades 35 to 60,

  • g6080 - for grades 60 to 80,

  • and g80100 - for grades 80 to 100.

create table g0035 (like grades_parts including indexes); 
create table g3560 (like grades_parts including indexes);
create table g6080 (like grades_parts including indexes);
create table g80100 (like grades_parts including indexes);

Note that grades_parts has no indexes yet, so the `including indexes` clause copies nothing right now. We will create the index later on, and that's fine.

Going ahead: we have created 4 partition tables, so let's attach them to our leader table, i.e. grades_parts.

alter table grades_parts attach partition g0035 for values from (0) to (35);
alter table grades_parts attach partition g3560 for values from (35) to (60);
alter table grades_parts attach partition g6080 for values from (60) to (80);
alter table grades_parts attach partition g80100 for values from (80) to (100);

Remember, these tables are still empty, and they have no indexes yet!

Let's describe one of the partitions to get some info.

\d g80100;

insert into grades_parts select * from grades_org;

The above query copies all rows from the original table (grades_org) into grades_parts. For each row, Postgres looks at the value of g and inserts it into the corresponding partition: if g is 30, for example, it locates the 0-to-35 partition (g0035) and puts the row there, and so on.
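
You can verify the routing yourself: tableoid::regclass reports which partition a row physically lives in.

-- show the physical partition for a few sample rows
select tableoid::regclass as partition, id, g
from grades_parts
where g in (30, 90)
limit 10;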

select count(*) from grades_parts;

Check the row count to verify that the data has been copied into the grades_parts table; it should match grades_org.

select max(g) from grades_parts;
select max(g) from g0035;
select count(*) from g0035;
select max(g) from g6080;

These sanity checks confirm the routing: max(g) on the whole table should be 99, while each partition only holds values from its own range (max(g) from g0035 should be 34, because the upper bound of 35 is exclusive).

Now, as we know, neither the leader table nor its partitions have any indexes.

Let's create an index on the leader table (i.e. grades_parts); Postgres will create matching indexes on its partitions as well.

create index grades_parts_idx on grades_parts(g);
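
If you want to confirm the cascade (this relies on Postgres 11+ behavior, where an index created on a partitioned table propagates to every partition; the auto-generated index name may differ):

-- describe a partition again; it should now list a btree index on (g)
\d g0035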

Execute the queries below and compare the time taken for each. The grades_parts query may be only slightly faster, or not faster at all, because we are operating on a small amount of data, and Docker can use as many host resources as it needs (we haven't set any limit).

But on an actual server, where memory is a constraint, partitioning will show its performance benefit.

explain analyze select count(*) from grades_org where g = 30;
explain analyze select count(*) from grades_parts where g = 30;

Now you can see that the database used an index-only scan on a single partition instead of scanning the whole dataset.

select pg_relation_size(oid), relname from pg_class order by pg_relation_size(oid) desc;

The above query gives the size of each table in the Postgres database. Note the size of the partitions, and how partitioning saves the database from scanning ten million rows.

show ENABLE_PARTITION_PRUNING;

A quick tip: make sure this is enabled, or else partitioning is useless.

If partition pruning is off, the database will query the indexes of all partitions 😕.
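
To see what pruning buys you, toggle it per session and compare the plans (a quick sketch against the demo table above):

-- with pruning off, the plan scans every partition
set enable_partition_pruning = off;
explain select count(*) from grades_parts where g = 30;

-- with pruning on, only g0035 appears in the plan
set enable_partition_pruning = on;
explain select count(*) from grades_parts where g = 30;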

Pros of Partitioning

  • improves query performance when accessing a single partition

  • a sequential scan of one partition can beat a scattered index scan across a huge table

  • easy bulk loading (attach partition)

  • archive old data that is barely accessed into cheap storage

Cons of Partitioning

  • updates that move rows from one partition to another are slow and can sometimes fail (see the sketch after this list)

  • Inefficient queries could accidentally scan all partitions resulting in slower performance

  • Schema changes can be challenging (DBMS could manage it though)
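
For example, the update below has to delete each matching row from g0035 and re-insert it into g80100 (supported since Postgres 11; Postgres 10 raised an error instead of moving rows), which is why cross-partition updates are slow:

-- assuming the demo tables above: moving rows across a partition
-- boundary is a delete + insert under the hood
update grades_parts set g = 90 where g = 30;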

Summary

So, is it a perfect solution? No. Software engineering is all about tradeoffs. But partitioning can be a powerful weapon if used effectively.

Reference: Database Partitioning Crash Course

Thank you, don't forget to leave a 👍 if you liked it.
