Archive for July, 2013
Filling out timelines in SQL
In this post I’ll show a method to fill out timelines in SQL. We’ll define a timeline as a set of periods without gaps from ‘0001-01-01′ to ‘9999-12′. I was reading Filling out and slicing timelines in one or more tables. It also describes a pattern to fill the gaps in a timeline. I will describe a different pattern to fill the timeline.
Let’s pick one of the example tables used, the customer_shipping_address:
Create a table and insert records into it. I’m using PostgreSQL as database platform here and used pgModeler to create the table.
CREATE TABLE public.customer_shipping_address ( customer_id smallint , valid_from date , valid_to date , address varchar(255) , CONSTRAINT pk_customer_s PRIMARY KEY (customer_id,valid_from) ); INSERT INTO public.customer_shipping_address ( customer_id , valid_from , valid_to , address ) VALUES ( 1 , TO_DATE('2004-01-01','YYYY-MM-DD'), TO_DATE('2005-01-01','YYYY-MM-DD'), 'West Avenue 1'), ( 1 , TO_DATE('2005-01-01','YYYY-MM-DD'), TO_DATE('9999-12-31','YYYY-MM-DD'), 'The Docks 4'), ( 2 , TO_DATE('2004-05-01','YYYY-MM-DD'), TO_DATE('2005-09-01','YYYY-MM-DD'), 'East Avenue 2'), ( 2 , TO_DATE('2006-01-11','YYYY-MM-DD'), TO_DATE('2006-05-01','YYYY-MM-DD'), 'Arlington Road 7');
It has several gaps in the timeline. The timeline of customer with id 2 does not start at ‘0001-01-01′ and it does not end at ‘9999-12-31′. The two periods defined for customer id 2 also have a gap between them.
The pattern I’m going to show uses three queries in a union all construction to fill the timeline. The fist one is easy, just get the all the records.
-- get the original timelines select customer_id , address , valid_from , valid_to , false as filled from customer_shipping_address
The second part is a query that will fill the timeline from ‘0001-01-01′ to the first valid_from date we can find for a customer id.
-- fill the first part of the timeline -- 0001-01-01 -> first valid_from select customer_id , null as address , cast('0001-01-01' as date) , min(valid_from) , true as filled from customer_shipping_address group by customer_id -- if the first valid_from is not equal to 0001-01-01 we have to fill the first part of the timeline having min(valid_from) <> cast('0001-01-01' as date)
The last query detect if there is a gap after a period and fill it.
-- fill the gaps that occur after a period, including -- last valid-to -> 9999-12-31 select customer_id , null as address , valid_to as valid_from , expected_valid_to as valid_to , true as filled from (select customer_id , valid_from , valid_to , lead(valid_from,1,cast('9999-12-31' as date)) over(partition by customer_id order by valid_from) expected_valid_to from customer_shipping_address ) check_gaps -- if the expected valid_to is not equal to the real valid_to we have a gap to fill where expected_valid_to <> valid_to
Now we’ll combine the three queries into one with a union all construction and order the result by customer_id and valid_from to get a nice ordered result set.
-- get the original timelines select customer_id , address , valid_from , valid_to , false as filled from customer_shipping_address union all -- fill the first part of the timeline -- 0001-01-01 -> first valid_from select customer_id , null as address , cast('0001-01-01' as date) , min(valid_from) , true as filled from customer_shipping_address group by customer_id -- if the first valid_from is not equal to 0001-01-01 we have to fill the first part of the timeline having min(valid_from) <> cast('0001-01-01' as date) union all -- fill the gaps that occur after a period, including -- last valid-to -> 9999-12-31 select customer_id , null as address , valid_to as valid_from , expected_valid_to as valid_to , true as filled from (select customer_id , valid_from , valid_to , lead(valid_from,1,cast('9999-12-31' as date)) over(partition by customer_id order by valid_from) expected_valid_to from customer_shipping_address ) check_gaps -- if the expected valid_to is not equal to the real valid_to we have a gap to fill where expected_valid_to <> valid_to -- order the periods in the timelines order by 1,3
Now let’s check the result and see what happend.
customer_id | address | valid_from | valid_to | filled |
---|---|---|---|---|
1 | January 1, 0001 | January 1, 2004 | true | |
1 | West Avenue 1 | January 1, 2004 | January 1, 2005 | false |
1 | The Docks 4 | January 1, 2005 | December 31, 9999 | false |
2 | January 1, 0001 | May 1, 2004 | true | |
2 | East Avenue 2 | May 1, 2004 | September 1, 2005 | false |
2 | September 1, 2005 | January 11, 2006 | true | |
2 | Arlington Road 7 | January 11, 2006 | May 1, 2006 | false |
2 | May 1, 2006 | December 31, 9999 | true |
Cool all the gaps are filled and we have a complete timeline that makes joining over timelines easier
I’ve created the query as a view in the database, os now I can reuse the query. The HTML export was made via DBeaver a nice universal query tool.
Update: I’ve had a discussion on Linkedin in the Temporal Data group. We came up with a variant of my query that does not use a union all but a 3 record table to generate the same result, but only with one scan of the table:
select customer_id , case when rec_type = 1 then address else null end as address , case rec_type when 1 then valid_from when 2 then valid_to else cast('0001-01-01' as date) end as valid_from , case rec_type when 1 then valid_to when 2 then next_valid_from else min_valid_from end as valid_to , case when rec_type = 1 then false else true end as filled from (select customer_id , address , valid_from , valid_to , lead(valid_from,1,cast('9999-12-31' as date)) over(partition by customer_id order by valid_from) next_valid_from , min(valid_from) over(partition by customer_id) min_valid_from from customer_shipping_address ) check_gaps -- rec_type 1 is the original table -- rec_type 2 fills the gaps after the periods -- rec_type 3 fills the first part of the timeline cross join generate_series(1,3) as three_record_table(rec_type) -- if the expected valid_to is not equal to the real valid_to we have a gap to fill where (rec_type = 1) or (rec_type = 2 and valid_to < next_valid_from) or (rec_type = 3 and valid_from = min_valid_from and min_valid_from <> cast('0001-01-01' as date)) -- order the periods in the timelines order by 1,3