A Journey Down Recursive CTEs : dliff's blog

Check this post out on the Arctype blog!

We are fortunate to be able to experience the joys of being confronted with and solving problems with software. Often we leverage features of tools like SQL engines to help us implement our higher level solution. This is the story of a problem whose solution was a perfect fit for a specific SQL tool.

Postgres' common table expressions, or CTEs, can be a useful structuring tool to craft a SQL query in a more readable and friendly way. They begin as a with statement and allow you execute a query whose results will be available later to subsequent query. If you have a query that relies on the results of another query, a reference of sorts, the CTE will allow Postgres to materialize the results of the reference query, ensuring that the query is only run once and its results are readily available to other queries farther down the line. (Note that this can be a double-edged sword, as it prevents the Postgres query planner from running optimization outside of the reference query’s SQL.)

A CTE used in this scenario can replace an ungainly subquery, derived table, or temporary table. Repeatedly nested subqueries or derived tables can be more difficult to read (especially as their depth increases) versus a sequential list of blocks whose dependencies are always above themselves.

Derived Table

select
  a.name,
  a.cost
  from (
    select
      p.name,
      max(p.cost) as cost
      from
          part p
          group by p.name
  ) a

Temporary Table

This will require two separate queries.

create temporary table a as
select
  p.name,
  max(p.cost) as cost
  from
  part p
  group by p.name;

select
  a.name,
  a.cost
  from a;

CTE

with a as (
  select
    p.name,
    max(p.cost) as cost
    from
        part p
   group by p.name)
select
  a.name,
  a.cost
  from a;

Most people wouldn't consider the CTE to be more readable than the derived table in this example, but as a query gets more and more complex, organization becomes key to reducing errors. Note that the query depends on a, which is listed first with the CTE, but as a derived table it is listed inside the query. CTEs allow you to logically order queries that are dependent on others.

Recursive CTEs are a lot more interesting! Their function exceeds simple syntactic sugar. Once you add the recursive keyword, Postgres will now allow you to reference your CTE from within itself! This allows you to “generate” rows based on recursion.

Here’s an example using Postgres CTEs to implement the Fibonacci sequence:

with recursive fib(a, b) as (
  values (0, 1)
  union all
  select b, a + b from fib
)
select a as fib from fib limit 10;

Running this, we’d get the first 10:

It's probably not too often we find ourselves in a situation where we need Fibonacci numbers in SQL. Let’s continue on and take a look at a real problem I solved using this Postgres feature. In the manufacturing industry, Wikipedia defines a Bill of materials (or BoM) as a

list of the raw materials, sub-assemblies, intermediate assemblies, sub-components, parts, and the quantities of each needed to manufacture an end product … BOMs are of hierarchical nature, with the top level representing the finished product which may be a sub-assembly or a completed item. BOMs that describe the sub-assemblies are referred to as modular BOMs

More simply, a BoM is a list of parts that a product is assembled from. Each part may consist of other parts, creating hierarchy. A car is a final product that is composed of many parts. A door is an example of a part of a car, its subparts being among a window, handle, various switches, and so forth. Recursive CTEs are great for hierarchical data! Let’s take a look at how we could model this hierarchy and use CTEs to generate a modular BoM sheet.

part table stores our parts.

create table part (
 id serial primary key,
 name varchar,
 cost numeric
);

part_hierarchy table stores a simple parent to child reference between parts, as well as a quantity if the part happens to need more than 1 of the child part for assembly. The check constraints prevent a part from referencing itself circularly.

create table part_hierarchy (
  parent_id int not null references part(id) check (child_id != parent_id),
  child_id int not null references part(id) check (child_id != parent_id),
  quantity int not null default 1
);

Let's insert some parts

insert into part (name, cost) values
('Car', null),
('Engine', 1000.0),
('Door', null),
('Handle', 10.0),
('Window', 5.0);

And some relations, approximately resembling the car example mentioned above

insert into part_hierarchy (parent_id, child_id, quantity)
values
(1, 2, 1), -- Car needs engine
(1, 3, 1), -- Car needs door
(3, 4, 2), -- Car needs 2 handles
(3, 5, 1); -- Car needs a windows

In our database we now have 5 parts and some relationships between them. If we wanted to see how many direct subassemblies a door has, we could query something like this, though it would only show us direct descendents:

select count(1) from part_hierarchy where parent_id = (select id from part where name='Door' limit 1);

Finally, let's now construct a query to list all of the parts and their subassemblies of our car.

with recursive sub_parts as (
  select
    p.id as part_id,
    p.name as part_name,
    1 as quantity,
    0 as "level",
    p.cost
    from part p
   where
       p.id = 1
  union all
  select
    p.id as part_id,
    p.name as part_name,
    ph.quantity,
    sp.level + 1,
    p.cost
    from
        sub_parts sp
        join part_hierarchy ph on ph.parent_id = sp.part_id
        join part p on p.id = ph.child_id
)
select
  lpad(part_name, "level" + length(part_name), ' ') as name,
  "level",
  cost from sub_parts;

The union all is the key part to this CTE. The first part of this UNION selects the base part, the second half selects any child parts by joining to the part_hierarchy table and calling itself (inducing recursion), then combines those results with the parent via the union. Without recursion, you'd have to add a join for every level of the hierarchy you needed to support.

The level value is introduced here to track how deep the recursion has run. It is hard coded at 0 to represent the parent. As the query "unwraps" and it calls itself, this increases by 1 to indicate how many generations far from the parent part the child part is.

The very last part that selects from the CTE pads the name with spaces based on the level, which is a quick and dirty way to visualize the hierarchy.

   name   | level |  cost   
----------+-------+---------
 Car      |     0 |        
  Engine  |     1 | 1000.00
  Door    |     1 |        
   Handle |     2 |   10.00
   Window |     2 |    5.00

From here, you could use aggregate functions and rollup to calculate the costs of specific assemblies.

This example is definitely simplified from what I ended up using to actually solve this problem. Supporting part versioning and branching and other business requirements quickly complicates what this solution would look like in the real world. Recursive CTEs in SQL are not the only way to store and query hierarchical data, but for certain data types, sizes, and speed requirements, they are a quick and convenient feature to have in your wheelhouse.