Complex queries can take a very long runtime and may cost expensive computing resources, therefore data analysts or any others who want to retrieve data from databases should be familiar with methods for optimizing queries. In this short guide I will describe common and effective methods for optimizing queries. These methods rely on writing better queries in SQL and intelligent use of table indexes.
Optimize queries by using smaller tables
Sometimes we perform expensive database queries on large tables instead of performing the same queries on filtered tables.
Here are two generic examples in which these cases may occur:
Filter data before aggregation
Request for data retrieval – Create a table with a monthly trend of all purchases in 2021.
Not Optimized solution – Group all purchases by month of purchase and then filter the query with HAVING clause the months of the year 2021. In this approach, the database engine will aggregate all the data in the table and then remove the irrelevant data.
Optimized query solution – Filter the months of 2021 with WHERE clause before the aggregation and then perform the aggregation.
Why does the optimized query give a better solution?
This query is more efficient because when filtering the data before the aggregation the database engine will have to aggregation only the data of the year 2021 and not the entire table.
Joining subqueries instead of full tables
Request for data retrieval – Analyze the characteristics of customers who purchased in January 2022 by their cities of residence as they appear in the customer table in CRM.
Not Optimized solution – Join the purchase table to the customer table in the CRM and then filter with WHERE clause the customers who purchased in January 2022.
Optimized query solution – For more optimized query we can create a subquery with only the customers who purchased in January 2022 and then join the customer table from the CRM.
Why does the optimized query give a better solution?
The order in which the Database engine performs SQL is JOIN clause before WHERE clause, therefore at the query the engine will join the customer details from the CRM table to all the customers while in the optimize query we create a small sub-table where only January 2022 customers will be joined.
Note – In most cases, when we join a table that was created by a subquery there will be a significant improvement in runtime than if we would have joined the whole table. However, in some cases when we join subqueries the performance might be worse than joining the whole table because the subquery does not have an index, In the next section I will talk a bit about indexes.
Using indexes to optimize queries
Table Index is extra information that the database engine stories about one one or more fields in the table. The index can later help the database to fetch the data faster. You may think of indexes as library cards that can help the librarant find immediately the book you need instead of searching the whole library’s book one by one.
Example – In customers table in which many queries filter the data by customer_name field, it would be worthwhile to create an index on the customer_nmae field. If for instance we want to find the customers whose first letter is K, the engine will use a small table he had already created (Hash table) which will direct it to the row in order to find the customers that begin with the letter K. Without the index the database engine would have to scan the whole table to find these customers.
There are many methods on how to create indexes on tables, but these methods are beyond the scope of this guide.
Technically it is very easy to create an index. It can be created through the database management software or using this SQL statement:
CREATE INDEX AAA_idx ON BBB (AAA);
When:
AAA – The name of the field on which we will create the index.
BBB – The name of the table in which the field is located.
Of course, if you want to create more complex indexes, then the command will also be more complex.
Most tables already have indexes that were created when the table was designed and do not need to be re-indexed (most of the time primary keys are also an index). But, if there is a field in the table on which we perform many operations like filtering, searching or joining and there is no index on it, creating an index will greatly improve the performance of these operations.
Note that there are some DHW types that do not support indexes (eg snowflake). In these DWH in most cases you can use clusters instead of indexes.
How to optimize complex queries with an indexed table?
Let’s look again at the task from the previous section:
If the runtime of the optimized query we created at the previous section has not improved (or even increased) because the subquery lost the index, we have an option to create a sub-table that is derived from the main table and add index to it. This is a more complex option but sometimes we have no choice but to use it to improve our query.
In order to retrieve the requested data we will create a new table that contains only the customers who purchased in January (You may even create a temporary table that will only be used for this analysis). On the sub table we created we will add an index on the field of customer_name and join the customer characteristics from the CRM tables.
Note that in order to create a new table and add an index on it, you must make sure that you have the appropriate permissions in the database. Also, if you did not create a temporary table, don’t forget to delete the table at the end of use to free up space from the database and not leave unnecessary junk tables.
Working with fields that have an index
If we already have fields that have indexes, we should carefully use these fields and not manipulate them when executing expensive commands because when we manipulate fields with index the database would treat them as new fields and won’t use the index.
For example – instead of filtering the purchases made in 2021 using the year function on purchase_date field we would prefer to use the functions between to filter out the purchase in 2021
year(purch_date) = 2021
purch_date between ‘2021-01-01’ and ‘2021-21-31’
Optimize queries by using aggregate tables instead of granularity tables
Request data retrieval – Extract the amount of income from the purchase table, group it by departments (the department field exists in the purchase table) and add the name of the departments manager. The manager name of the department is located in another table that contains the department details.
Not Optimized solution – Join the table of department details to the purchase table (by using the department name field) and group the purchase by department name and manager name.
Optimized solution – Create a subquery that groups the purchase table with departments and their amount of income and to that subquery join the names of the managers from the department details table.
Why does the optimized query give a better solution?
The optimized query is more optimized because the table we created in the subquery is a grouped table and has one row to each department and therefore it is much smaller than the table of all purchases, Joining tables with a smaller row quantity will be more efficient.
Writing queries to filter rows inside IN function
In writing SQL queries there is a practice in which subqueries are written inside the IN function.
For example, to filter the customers who live in N.Y. from the purchase table one can use the IN function at the WHERE clause section and write a subquery that returns the customer numbers who live in N.Y. from the customer table.
The problem of using a query within the IN function is that subqueries lose the table index. Instead of using an in function we should use a regular JOIN clause and join the tables on the city_name field, in this way we can enjoy the benefit of the indexes in the table.
Yuval Marnin,
Data analyst.