Modern Business Intelligence tools offer capabilities to developers and end-users to abstract table relationship by offering metadata capabilities with in the tool. Administrators of the tool are responsible to setting up connections and tables definitions and relationships (certain advanced tools).
Users would then go into the tool and pick necessary tables/fields from a catalog or library to create their queries. They do not have to know physical location or connection details for accessing data.
While this is a great feature, I have often come across several shops where inefficient metadata design and poor queries cause a great deal of performance issues. My analysis is applicable to several tools available in the market place today. So I’m writing this blog at a conceptual level where readers can translate this into their respective tools and environments.
No matter how many promises are made by BI vendors, you will achieve optimal performance when majority of the work is lifted by database engines. Below are the best and case scenario’s when it comes performance. Keep these in mind before designing your metadata layer, logical models and queries.
Let us examine below example to understand how performance is impacted by where work gets done.
Scenario 1: Single table query
Select col1, sum(col2) as amount from d1_table where col1 = ‘x’ group by col1;
BI Server recognizes that query belongs to one connection. Query gets pushed to database and BI server receives final results.
Performance: As good as DB performance
Scenario 2: Query with more than one table from same database connection
Select t2.col1, sum(t1.col2) as amount
from d1_table1 t1,
d1_table2 t2
where
t1.col1 = ‘x’
and t1.col1 = t2.col1
group by t1.col1;
BI Server recognizes that query belongs to one connection. Query gets pushed to database and BI server receives final results.
Performance: As good as DB performance
Scenario 3: Query with more than one table from different database connections
Select t2.col1, sum(t1.col2) as amount
from d1_table1 t1,
d2_table2 t2
where
t1.col1 = ‘x’
and t1.col1 = t2.col1
group by t1.col1;
BI server recognizes that query is using tables from different databases. So, BI server fetches individual table data from each connection and performs the join and aggregation on BI server.
This is very bad for performance especially if tables involved are large.
Increases stress on network.
Increases stress on BI server.
Performance: Bad
Scenario 4: Query between table and file
Similar concept like scenario3. BI server recognizes that join cannot be pushed to database. So, it bring all necessary data on to BI server to perform necessary join, aggregation and sort operations.
Recommendations:
I have discussed a common problem across different tools. Exact solution is different for each tool. However, conceptually they are the same. Make sure that your join and aggregation is performed by databases as much as possible. This is the secret sauce for high performing BI architecture.