De los Tilos sabiduría

bases += patrones += relaciones += datos

Archive for July, 2013

Filling out timelines in SQL

without comments

In this post I’ll show a method to fill out timelines in SQL. We’ll define a timeline as a set of periods without gaps from ‘0001-01-01′ to ‘9999-12′. I was reading Filling out and slicing timelines in one or more tables. It also describes a pattern to fill the gaps in a timeline. I will describe a different pattern to fill the timeline.
Let’s pick one of the example tables used, the customer_shipping_address:

Create a table and insert records into it. I’m using PostgreSQL as database platform here and used pgModeler to create the table.

CREATE TABLE public.customer_shipping_address
( customer_id smallint
, valid_from date
, valid_to date
, address varchar(255)
, CONSTRAINT pk_customer_s PRIMARY KEY (customer_id,valid_from)
);

INSERT INTO public.customer_shipping_address
( customer_id , valid_from                        , valid_to                          , address ) VALUES
( 1           , TO_DATE('2004-01-01','YYYY-MM-DD'), TO_DATE('2005-01-01','YYYY-MM-DD'), 'West Avenue 1'),
( 1           , TO_DATE('2005-01-01','YYYY-MM-DD'), TO_DATE('9999-12-31','YYYY-MM-DD'), 'The Docks 4'),
( 2           , TO_DATE('2004-05-01','YYYY-MM-DD'), TO_DATE('2005-09-01','YYYY-MM-DD'), 'East Avenue 2'),
( 2           , TO_DATE('2006-01-11','YYYY-MM-DD'), TO_DATE('2006-05-01','YYYY-MM-DD'), 'Arlington Road 7');

It has several gaps in the timeline. The timeline of customer with id 2 does not start at ‘0001-01-01′ and it does not end at ‘9999-12-31′. The two periods defined for customer id 2 also have a gap between them.
The pattern I’m going to show uses three queries in a union all construction to fill the timeline. The fist one is easy, just get the all the records.

-- get the original timelines
select customer_id
     , address
     , valid_from
     , valid_to
     , false as filled
  from customer_shipping_address

The second part is a query that will fill the timeline from ‘0001-01-01′ to the first valid_from date we can find for a customer id.

-- fill the first part of the timeline
-- 0001-01-01 -> first valid_from
select customer_id
     , null as address
     , cast('0001-01-01' as date)
     , min(valid_from)
     , true as filled
  from customer_shipping_address
 group
    by customer_id
    -- if the first valid_from is not equal to 0001-01-01 we have to fill the first part of the timeline
having min(valid_from) <> cast('0001-01-01' as date)

The last query detect if there is a gap after a period and fill it.

-- fill the gaps that occur after a period, including
-- last valid-to -> 9999-12-31
select customer_id
     , null as address
     , valid_to as valid_from
     , expected_valid_to as valid_to
     , true as filled
  from (select customer_id
             , valid_from
             , valid_to
             , lead(valid_from,1,cast('9999-12-31' as date))
               over(partition by customer_id
                        order by valid_from) expected_valid_to
          from customer_shipping_address
       ) check_gaps
    -- if the expected valid_to is not equal to the real valid_to we have a gap to fill
 where expected_valid_to <> valid_to

Now we’ll combine the three queries into one with a union all construction and order the result by customer_id and valid_from to get a nice ordered result set.

-- get the original timelines
select customer_id
     , address
     , valid_from
     , valid_to
     , false as filled
  from customer_shipping_address
 union all
-- fill the first part of the timeline
-- 0001-01-01 -> first valid_from
select customer_id
     , null as address
     , cast('0001-01-01' as date)
     , min(valid_from)
     , true as filled
  from customer_shipping_address
 group
    by customer_id
    -- if the first valid_from is not equal to 0001-01-01 we have to fill the first part of the timeline
having min(valid_from) <> cast('0001-01-01' as date)
 union all
-- fill the gaps that occur after a period, including
-- last valid-to -> 9999-12-31
select customer_id
     , null as address
     , valid_to as valid_from
     , expected_valid_to as valid_to
     , true as filled
  from (select customer_id
             , valid_from
             , valid_to
             , lead(valid_from,1,cast('9999-12-31' as date))
               over(partition by customer_id
                        order by valid_from) expected_valid_to
          from customer_shipping_address
       ) check_gaps
    -- if the expected valid_to is not equal to the real valid_to we have a gap to fill
 where expected_valid_to <> valid_to
-- order the periods in the timelines
 order by 1,3

Now let’s check the result and see what happend.

customer_id address valid_from valid_to filled
1 January 1, 0001 January 1, 2004 true
1 West Avenue 1 January 1, 2004 January 1, 2005 false
1 The Docks 4 January 1, 2005 December 31, 9999 false
2 January 1, 0001 May 1, 2004 true
2 East Avenue 2 May 1, 2004 September 1, 2005 false
2 September 1, 2005 January 11, 2006 true
2 Arlington Road 7 January 11, 2006 May 1, 2006 false
2 May 1, 2006 December 31, 9999 true

Cool all the gaps are filled and we have a complete timeline that makes joining over timelines easier :)
I’ve created the query as a view in the database, os now I can reuse the query. The HTML export was made via DBeaver a nice universal query tool.

Update: I’ve had a discussion on Linkedin in the Temporal Data group. We came up with a variant of my query that does not use a union all but a 3 record table to generate the same result, but only with one scan of the table:

select customer_id
     , case when rec_type = 1 
            then address 
            else null 
       end as address
     , case rec_type when 1 then valid_from
                     when 2 then valid_to   
                     else cast('0001-01-01' as date) 
       end as valid_from
     , case rec_type when 1 then valid_to
                     when 2 then next_valid_from 
                     else min_valid_from 
       end as valid_to
     , case when rec_type = 1 
            then false 
            else true 
       end as filled
  from (select customer_id
             , address
             , valid_from
             , valid_to
             , lead(valid_from,1,cast('9999-12-31' as date))
               over(partition by customer_id
                        order by valid_from) next_valid_from             
             , min(valid_from)
               over(partition by customer_id) min_valid_from
          from customer_shipping_address 
       ) check_gaps
-- rec_type 1 is the original table
-- rec_type 2 fills the gaps after the periods
-- rec_type 3 fills the first part of the timeline
 cross 
  join generate_series(1,3) as three_record_table(rec_type)    
-- if the expected valid_to is not equal to the real valid_to we have a gap to fill
 where (rec_type = 1)
    or (rec_type = 2 and valid_to < next_valid_from)
    or (rec_type = 3 and valid_from = min_valid_from and min_valid_from <> cast('0001-01-01' as date))
   -- order the periods in the timelines
 order by 1,3 

Written by delostilos

July 2nd, 2013 at 12:05 am

Posted in SQL

Tagged with