The Art of SQL Report Writing

Part 5 of 5

Query Optimization

While the topic of query optimization could fill a book and is as much of an art as it is a science, I did want to finish up some pointers to help you get your report running quickly.

Every database is different when getting your queries to run quickly. There are, however some more or less universal ways to first determine what the problem is and then fix it. I will cover a number of the most common approaches to solving your performance problems here.

Also, never perform the any of the below on a production database that is currently being used unless absolutely necessary. Do your optimization on a development or test database, preferably one that mirrors your production environment.

Explain Plans

The first thing you should do when addressing performance issues on your database is to run an explain plan. The exact syntax for your database may vary but usually you just add the word "explain" in front of your query. This will give you a dump of how the query optimizer in your database will execute the query. You can also put an "explain analyze" in front on some databases. Using "analyze" will not only run the explain plan but will actually run the whole query on the database. If your query is taking hours to run you will NOT want to do this because you will have to actually wait hours to get the report. I generally start with an "explain" and if that is not enough, add the "analyze".

EXPLAIN ANALYZE
SELECT
    blog.blog_id,
    blog.blog_name,
    blog_series.blog_series_id,
    blog_series.blog_series_name,
    blog_post.blog_post_id,
    blog_post.blog_post_name,
    blog_topic_list.blog_topics
FROM blog_post
-- Joins up the ERD
INNER JOIN blog ON (blog.blog_id = blog_post.blog_id)
LEFT OUTER JOIN blog_series ON (blog_series.blog_series_id = blog.blog_series_id)
-- Joins down the ERD
LEFT OUTER JOIN (
    SELECT
        blog_post_blog_topic.blog_post_id AS blog_post_id,
        join(blog_topic.blog_topic_name, ', ') AS blog_topics
    FROM blog_post_blog_topic
    INNER JOIN blog_topic ON (
        blog_topic.blog_topic_id = blog_post_blog_topic.blog_topic_id
    )
    GROUP BY blog_post_blog_topic.blog_post_id
) AS blog_topic_list ON (
    blog_topic_list.blog_post_id = blog_post.blog_post_id
)
WHERE blog_post.blog_post_id IN (
    SELECT blog_post.blog_post_id
    FROM blog_post
    WHERE blog_series_id IS NULL
    
    UNION
    
    SELECT blog_post.blog_post_id
    FROM blog_post
    INNER JOIN (
        SELECT
            blog_series_id,
            min(blog_post_dt) AS min_blog_post_dt
        FROM blog_post
        GROUP BY blog_series_id
    ) AS blog_posts_to_keep ON (
        blog_post.blog_series_id = blog_posts_to_keep.blog_series_id
        AND
        blog_post.blog_post_dt = blog_posts_to_keep.min_blog_post_dt
    )
)
;


The output from this varies by database. However, you will want to look for the "cost" in your report. Look for places where the cost suddenly jumps from a low value to a high value. This will tell you where in the run of your query the database is having issues.

Also, your explain plan will typically change as the amount of data in your database grows. When you are testing your query for what will happen in the future, you will need to add test data to the database so your explain plan will reflect what will happen as your database grows.

You can also run the explain plan on sub-queries. I usually resort to this if the explain plan is rather large and I am having difficulty optimizing the whole thing. This can be a rabbit trail however. If your query will require a major restructuring, you may have to discard your sub-queries for more optimized ones or even merge their function with other sub-queries, joins or filtering in the query.

Indexing

The first thing we tackle when having performance concerns is to start indexing columns. Your explain plan will show you where you need indexes. The most common places to add an index is on columns that are in a WHERE clause, in an ORDER BY, in a GROUP BY or in a join. You should always index columns that are used for sorting. Even after indexing a column, check your explain plan and make sure the query optimizer is using your index.

Temp or Cache Tables

If you find that your performance problems are on derived column values, you may need to break your report up into mini reports that do a "SELECT .... INTO ...." which run a piece of your query and pump the data into a temporary or cache table. Then you can add indexes on the temporary or cache table so your subsequent queries will report off the derived values after they have been properly indexed. We have taken reports that take an hour or more to run, broken them down into a series of several cache tables and brought the run time down to seconds.

Reporting Database

If you find that your reports do not need to display real time data, you can easily create a reporting database on a separate server that runs off nightly, weekly or monthly backups. Most management reports for large businesses work well this way. It is rare for managers to need real time / live data. If it is less than 24 hours old it is good enough to make business decisions.

An offshoot of using a reporting database would be to run the reports themselves nightly and save off the results. This way when someone wants to see the report, they don't actually query the database. Managers love this as their report comes up instantly no matter what. It also allows you to monitor performance more predictably as you simple check how long the nightly reports took each morning and if they exceed a certain threshold you deal with it then. Usually these types of reports are set up to run off a reporting database and not the live database.

Reorganizing Queries

If you like puzzles you are likely to love this part of the report writing job. We have found it bizzar just how certain reshuffling of a query can cause the run time to change.

This is where the art of reporting writing comes into play. You have to try lots of different things, some of which may seem ridiculous and see what happens. However, there are some things you can keep in mind to try that might help you get started here.

One of the things we like to do is try to take parts of the WHERE and HAVING clauses out of those parts of the query and move them into sub-queries. This usually requires you to turn a table into a sub-query with its own WHERE clause. The reason this helps is that is sizes down your data before combining it with other data in other tables. The biggest thing that will kill performance on a query is having too much data to churn through. If you can size that data down early, it will help immensely. This may mean you will have to take a single filter from your WHERE or HAVING clause and repeat it multiple times in your query and can make it messy but it also may be the only way to get the performance you need.

Another thing you can try is to combine several of the tables joined into a sub-query. Again this is an attempt to reduce the overall amount of data you are processing in the outer query by reducing the data in sub-queries.

Of course each database has its own way of optimizing queries. Often a rearrangement will have no effect, especially if the query optimizer is well built because it will be a step ahead of you and see the two variations as the same query. This is just something you will have to play around with and see what happens.

Archiving Data

Performance problems come up when you are reporting off of too much data. What "too much data" means depends on the complexity of your query and how well you applied the techniques above. However, more data will always cause your queries to slow down and your performance will drop exponentially. Once performance issues start to surface they will get worse fast.

If your tables have data that you need to keep for historical or other reporting purposes but the bulk of your queries that have performance problems don't need, an option could be to archive off the data from the main table into an archive table. This will keep the amount of data in your main active table down while keeping the older data for the rare report that would need it.

You can archive data by adding a trigger that copies data to a mirrored archive table and then delete the old records from the more active table in a cron job / scheduled task. You can also create a background process that picks up old records and moves them on off-peak times.

Difference Between Development and Production

We have found that often times reports will run quickly on a development environment but when put on production they grind to an almost halt taking hours or longer. This can happen if you are querying tables that have a lot of updates and / or inserts going on.

If you find yourself in this situation you may need to resort to using cache tables or a reporting database. You need to get your queries to get on and off those active tables as quickly as possible or to run on a database that is not so active. These reports can also weigh down other operations on your database and while running cause your application to sputter to a crawl.

Summary

The real work in writing reports lies in first extracting your requirements from your customer and then optimizing your queries so they run in acceptable times. Be realistic about the size of your database and even after you are done, monitor your report times to make sure you are fine tuning your reports as your database grows.