The Art of SQL Report Writing

Part 4 of 5

An Example

Here we are going to walk through a real world example. In fact, this is exactly what we have done in updating our blog module for Kaya CMS which is used to power our web site and blog. This example doesn't show everything (that probably would require us to come up with some contrived fake example) but it will show the basic principles used when building good clean reports.

First let's go back and look at the ERD we showed in Part 2.



For this example, we want to list out all the blog posts in our blog. There is a small catch though. We are in the process of adding the table "blog_series" which will combine multiple blog posts into a single series of posts in a longer article. For the posts in a series, we want to only show the first post in that series.

What we first have to be careful of is to combine ALL blog posts that have no series (that foreign key is nullable) with only the first post for any series.

We start by looking at the report that we will need to create. It will list out the blog ID, blog name, blog series ID (if applicable), blog series name (if applicable), blog post ID (first in series), blog post name (first in series), list of blog topics the post is assigned to.

Blog ID Blog Name Blog Series ID Blog Series Name Blog Post ID Blog Post Name Blog Topics
1 Code Elixir 1 Usability - The Opportunities Everyone Has Missed General Topics
1 Code Elixir 1 Software Start Up on a Tight Budget 2 Introduction Business
1 Code Elixir 2 The Art of SQL Report Writing 3 Introduction Database, Software Engineering


I really wanted to include in the report the number of posts in each series. I tried this at first and found it to over complicate this example and so discarded it from the example. In reality I will actually take this example and then expand upon it to include the post count per series AFTER this query is written. When a report becomes too complex, you must break it down into simpler parts and then build the query up to what you really want as you get individual parts to work.

I will leave it as an exercise for you readers to determine how to add the post count in a series. There are a number of ways to accomplish this. However, making it efficient is the hard part. It would be tempting to use correlated sub-queries or add a sub-select in the list or returned values (in the column list of the select). Both of these would perform very badly in a system with a large amount of data and I would avoid them.


Now that we know WHAT we are trying to produce, we need to highlight the primary tables in our ERD. I have highlighted the primary tables in green.



Then highlight the foreign key path between the primary tables. I have highlighted the path in dark blue.



Next, we highlight any secondary tables. I have highlighted the one secondary table for our example in blue.



Now we find the focal point of the report. While it may be tempting to select the blog series since we are folding the posts to have a series only appear once in the list, this is actually not a good choice. This is because some lines in the report have no series at all. Each row in the report must map to a record in our focal point table. For this reason, we select the blog post. Blog will appear in multiple rows, as will blog topic. In fact, blog topic will have one row in the report map to multiple records in the database which is also not desirable in the focal table.

Now we need to make sure that our paths radiate outward from the focal table. This is not the case. Here we have two potentials for problems.

The first problem is that we have blog_post mapping to blog_series, then blog and back around to blog_post. This is easy to resolve. We simple will join blog_post to blog and blog_post to blog_series. We will thus drop the path from blog_series to blog. It is possible, from simply a database structure perspective, that blog_series and blog_post could map to different blogs. We will leave it up to the user interface to enforce that does not happen. If we are lucky, the database developer may have even put in a trigger that enforces that restriction too.

Then next problem is that blog_post goes down to blog_post_blog_topic (that is OK) and then that goes up to blog_topic (this is not good). To resolve this I am going to collapse blog_topic and blog_post_blog_topic into a sub-query where we have, in this sub-query, the blog_post_blog_topic table is our focal point. This will help us collapse these two tables into a one row in the sub-query maps to a single row in the blog_post table thus eliminating the blog_topic and blog_post_blog_topic from our outer query. Sub-queries are a great way to eliminate these problematic tables that violate our "radiate outward" rule.

NOTE: I will assume you have some aggregate function that joins multiple values into a delimited string. Each database will have different implementations of this and in some databases, you may have to write your own function to achieve this result.

Here I will assume you have a function like join(column_name, delimiter) so that join(blog_topic_name, ', ') will combine "General" and "Database" into "General, Database".


SELECT
    blog_post_blog_topic.blog_post_id,
    join(blog_topic.blog_topic_name, ', ')
FROM blog_post_blog_topic
INNER JOIN blog_topic ON (
    blog_topic.blog_topic_id = blog_post_blog_topic.blog_topic_id
)
GROUP BY blog_post_blog_topic.blog_post_id


NOTE: I will leave it as an exercise for you to walk through the steps in creating a report to create the above sub-query. If you use blog_post_blog_topic as your focal table and only have blog_post_blog_topic and blog_topic as your primary tables this should be pretty straight forward to do.


Now we are ready to start building our query. Let's start by querying our focal table.

SELECT
    ____
FROM blog_post
;


Next join on the tables above the focal table.

SELECT
    ____
FROM blog_post
-- Joins up the ERD
INNER JOIN blog ON (blog.blog_id = blog_post.blog_id)
LEFT OUTER JOIN blog_series ON (blog_series.blog_series_id = blog.blog_series_id)
;


Now join on the tables below our focal table. This is where we use the sub-query instead of the actual table it joins to.

SELECT
    ____
FROM blog_post
-- Joins up the ERD
INNER JOIN blog ON (blog.blog_id = blog_post.blog_id)
LEFT OUTER JOIN blog_series ON (blog_series.blog_series_id = blog.blog_series_id)
-- Joins down the ERD
LEFT OUTER JOIN (
    SELECT
        blog_post_blog_topic.blog_post_id AS blog_post_id,
        join(blog_topic.blog_topic_name, ', ') AS blog_topics
    FROM blog_post_blog_topic
    INNER JOIN blog_topic ON (
        blog_topic.blog_topic_id = blog_post_blog_topic.blog_topic_id
    )
) AS blog_topic_list ON (
    blog_topic_list.blog_post_id = blog_post.blog_post_id
)
;


Now we add our list of columns.

SELECT
    blog.blog_id,
    blog.blog_name,
    blog_series.blog_series_id,
    blog_series.blog_series_name,
    blog_post.blog_post_id,
    blog_post.blog_post_name,
    blog_topic_list.blog_topics
FROM blog_post
-- Joins up the ERD
INNER JOIN blog ON (blog.blog_id = blog_post.blog_id)
LEFT OUTER JOIN blog_series ON (blog_series.blog_series_id = blog.blog_series_id)
-- Joins down the ERD
LEFT OUTER JOIN (
    SELECT
        blog_post_blog_topic.blog_post_id AS blog_post_id,
        join(blog_topic.blog_topic_name, ', ') AS blog_topics
    FROM blog_post_blog_topic
    INNER JOIN blog_topic ON (
        blog_topic.blog_topic_id = blog_post_blog_topic.blog_topic_id
    )
    GROUP BY blog_post_blog_topic.blog_post_id
) AS blog_topic_list ON (
    blog_topic_list.blog_post_id = blog_post.blog_post_id
)
;


That was easy enough. Our guide in Part 3 said we need to next add our GROUP BY clause. Since we already collapsed the only tables below blog_post (our focal table) into one record per record in blog_post this is not necessary.

The last step is to add our filtering. So far our query is pulling out the data we need and joining the tables cleanly. The problem is we are getting all blog posts in a series and we only want the first one.

Let's set this query aside and try to determine how to get only the blog_post records that either have no series or are the first in the series.

Getting blog_post records that have no series is easy.

SELECT *
FROM blog_post
WHERE blog_series_id IS NULL


Getting the first blog_post in a single series is a little more complex. I am going to show you how to do it. This query shows you how to get the first record in any group of records and I use it a lot for other types of queries that need to do it. I suggest tucking this away in your arsenal of useful SQL techniques (kind of a nice cut and paste SQL). Note that here I am assuming no two blog_post records can have the same blog_post_dt in a series. Hopefully the UI and / or the database will enforce this. If not, getting a single blog_post per series may have to be done by create_dt or its blog_post_id.

SELECT *
FROM blog_post
INNER JOIN (
    SELECT
        blog_series_id,
        min(blog_post_dt) AS min_blog_post_dt
    FROM blog_post
    GROUP BY blog_series_id
) AS blog_posts_to_keep ON (
    blog_post.blog_series_id = blog_posts_to_keep.blog_series_id
    AND
    blog_post.blog_post_dt = blog_posts_to_keep.min_blog_post_dt
)


Now we combine these two to get a list of blog_post records that either have no series or are the first in the series.

SELECT *
FROM blog_post
WHERE blog_series_id IS NULL
UNION
SELECT *
FROM blog_post
INNER JOIN (
    SELECT
        blog_series_id,
        min(blog_post_dt) AS min_blog_post_dt
    FROM blog_post
    GROUP BY blog_series_id
) AS blog_posts_to_keep ON (
    blog_post.blog_series_id = blog_posts_to_keep.blog_series_id
    AND
    blog_post.blog_post_dt = blog_posts_to_keep.min_blog_post_dt
)


What we want to do with this new query is to use it to filter out the blog_post records. So we will change the * in the selects to blog_post_id and then filter by it in the WHERE clause.

SELECT
    blog.blog_id,
    blog.blog_name,
    blog_series.blog_series_id,
    blog_series.blog_series_name,
    blog_post.blog_post_id,
    blog_post.blog_post_name,
    blog_topic_list.blog_topics
FROM blog_post
-- Joins up the ERD
INNER JOIN blog ON (blog.blog_id = blog_post.blog_id)
LEFT OUTER JOIN blog_series ON (blog_series.blog_series_id = blog.blog_series_id)
-- Joins down the ERD
LEFT OUTER JOIN (
    SELECT
        blog_post_blog_topic.blog_post_id AS blog_post_id,
        join(blog_topic.blog_topic_name, ', ') AS blog_topics
    FROM blog_post_blog_topic
    INNER JOIN blog_topic ON (
        blog_topic.blog_topic_id = blog_post_blog_topic.blog_topic_id
    )
    GROUP BY blog_post_blog_topic.blog_post_id
) AS blog_topic_list ON (
    blog_topic_list.blog_post_id = blog_post.blog_post_id
)
WHERE blog_post.blog_post_id IN (
    SELECT blog_post.blog_post_id
    FROM blog_post
    WHERE blog_series_id IS NULL
    
    UNION
    
    SELECT blog_post.blog_post_id
    FROM blog_post
    INNER JOIN (
        SELECT
            blog_series_id,
            min(blog_post_dt) AS min_blog_post_dt
        FROM blog_post
        GROUP BY blog_series_id
    ) AS blog_posts_to_keep ON (
        blog_post.blog_series_id = blog_posts_to_keep.blog_series_id
        AND
        blog_post.blog_post_dt = blog_posts_to_keep.min_blog_post_dt
    )
)
;


At this point if you are not experienced in SQL report writing your head may be spinning. Yes, I made this look simple but it isn't. In all honesty this report took me about an hour to work through and what I have showed you above wasn't the first, second or even third attempt at the query. But the steps in Part 3 helped keep me focused and really sped up the process. Not only that but the query itself is pretty easy to read (as complex queries like this go). That WHERE clause was a royal pain to come up with and my initial approaches had me trying things like replacing the blog_post in the FROM clause (that did NOT work well) and trying to play around with the blog_series by turning it into a sub-query. Once I was done I realized my difficulty came up when I forgot to follow my own advice!