Archive for the ‘Performance Tuning’ Category

This is second article in the series of Cassandra Data Modeling best practices for efficient Spark SQL Joins. You can find previous posts here  and here.

Dimensional data modeling principles are still applicable when joining Cassandra tables using Spark SQL. However, we have to apply certain additional techniques due to the nature how Cassandra database works.

Since there is abundance of information on typical dimensional modeling approaches, I will be focusing more on the Cassandra and Spark SQL aspects of the modeling process. If you are new to dimensional modeling you can visit kimball site or do a quick search on dimensional modeling.


Principle #1: Minimize number of tables in Star Schema

Avoid snow flakes through denormalization. Keep the number of tables in as minimum as possible. As you have seen in Spark SQL article referenced above, having more tables in Join operation will add more shuffling and increases overall time of query execution.

Principle #2:  Avoid outer join relationships in the Star Schema

Use denormalization to remove outer join relationships. Spark SQL doesn’t pass filters to outer joining table (at least in version 1.4). So, outer-joining table will always become a full table scan.

Cassandra Specific Data Modeling Aspects – Partition Keys and Cluster Keys

Principle #3: Pick right partition keys to minimize number of partitions in Cassandra tables and apply filters in Cassandra as much as possible

Selecting right partition keys is the most important aspect of the data modeling process. In general, good Cassandra table partition key/keys meet following criteria.

Low but not too Low Cardinality: These keys enable you to store more rows in single partition. Most often these columns are used in Group By’s of your queries.

Used in Report filters: These keys are most common mandatory filters in your queries. Spark SQL queries much more efficient when partition predicates are pushed down to Cassandra database. This will help minimize the amount data flowing between Cassandra and Spark. Hence, more efficient compute resource utilization.

Non updatable: These columns are not updated.

Principle #4: Limit your max partition size to 1 GB

Cassandra read performance will suffer once your partition sizes start to grow beyond 1 GB.

Principle #5: Make sure partition growth is contained. Add computed columns to shard partitions if necessary

Depending on your business process and data model, you need to take preventive measures to ensure data in partitions cannot grow beyond limits. This may require adding extra columns to your partition keys to contain growth.

Ex: Adding a month as partition key on a transaction table ensure that new partitions are created every month.

Principle #6: Optimize number of partitions in your table

Continue to tweak your partition key list until your max partition size reaches 1 GB limit. This will ensure that your table will have optimal number of partitions. Having fewer partitions will lead to fewer Spark tasks during SQL execution.

It may not be feasible to have all partitions in table to have 1 GB size. This will not happen in 99% of data modeling situations. Your model will be fine as long as max partition in table is around 1 GB limit.

In some situations it may not be feasible to achieve 1 GB max partition size. This is OK. Main point here is to maximize the number of rows you can fit in a partition as much as possible.

Clustering Keys

Principle #7: Choose additional columns that make up primary keys to uniquely identify rows in a partition

Principle #8: Avoid picking updatable columns as cluster keys

Just like in relational tables, Cassandra primary keys are not updatable. Under the covers, Cassandra update is same as insert. So, updating keys will create a new row. Design appropriate application logic to implement updates if necessary.

Principle #9: Leverage predicate pushdown capability on cluster key columns

Ensure to keep most frequently filtered columns in the cluster key list at the top. Some time if your table has non-updatable columns that are used to filter rows most often, you can include them as part of cluster keys even if they are not required to define uniqueness of rows.


Read Full Post »

Modern Business Intelligence tools offer capabilities to developers and end-users to abstract table relationship by offering metadata capabilities with in the tool. Administrators of the tool are responsible to setting up connections and tables definitions and relationships (certain advanced tools).

Users would then go into the tool and pick necessary tables/fields from a catalog or library to create their queries. They do not have to know physical location or connection details for accessing data.

While this is a great feature, I have often come across several shops where inefficient metadata design and poor queries cause a great deal of performance issues. My analysis is applicable to several tools available in the market place today. So I’m writing this blog at a conceptual level where readers can translate this into their respective tools and environments.




No matter how many promises are made by BI vendors, you will achieve optimal performance when majority of the work is lifted by database engines. Below are the best and case scenario’s when it comes performance. Keep these in mind before designing your metadata layer, logical models and queries.


db vs bi server



Let us examine below example to understand how performance is impacted by where work gets done.


Scenario 1: Single table query

Select col1, sum(col2) as amount from d1_table where col1 = ‘x’ group by col1;


BI Server recognizes that query belongs to one connection. Query gets pushed to database and BI server receives final results.

Performance: As good as DB performance

Scenario 2: Query with more than one table from same database connection


Select t2.col1, sum(t1.col2) as amount

from d1_table1 t1,

d1_table2 t2


t1.col1 = ‘x’

and t1.col1 = t2.col1

group by t1.col1;




BI Server recognizes that query belongs to one connection. Query gets pushed to database and BI server receives final results.

Performance: As good as DB performance

Scenario 3: Query with more than one table from different database connections


Select t2.col1, sum(t1.col2) as amount

from d1_table1 t1,

d2_table2 t2


t1.col1 = ‘x’

and t1.col1 = t2.col1

group by t1.col1;

BI server recognizes that query is using tables from different databases. So, BI server fetches individual table data from each connection and performs the join and aggregation on BI server.

This is very bad for performance especially if tables involved are large.

Increases stress on network.

Increases stress on BI server.


Performance: Bad

Scenario 4: Query between table and file

Similar concept like scenario3. BI server recognizes that join cannot be pushed to database. So, it bring all necessary data on to BI server to perform necessary join, aggregation and sort operations.



I have discussed a common problem across different tools.  Exact solution is different for each tool. However, conceptually they are the same. Make sure that your join and aggregation is performed by databases as much as possible. This is the secret sauce for high performing BI architecture.

Read Full Post »

Over the years I have noticed this practices at several companies. In order to help developers avoid errors or to avoid time spent on proper training and knowledge transfer, complex queries are converted into views and handed off to development teams.

These views are great and they do help cut down developer errors and cut down Knowledge transfers. However, there is huge hidden cost behind using views for developing reports and analytics.

For example, lets review below view syntax.


Create or replace view myview as

Select t1.col1, t2.col2,t3.col3, t4.col4, t5.col5, t6.col6

From table1 t1,

        table2 t2,

        table3 t3,

        table4 t4,

        table5 t5,

        table6 t6

Where t1.col1 = t2.col2



Above view has 6 tables involved in a join. When using this view for reporting or querying purposes database is going to perform JOIN operations on all tables involved whether query is actually referring to individual table columns are not.


Users will see a significant slowness in performance due to this JOIN operation on all tables. The effect is negligible if tables are small and results are rendered quickly. However, the performance will worsen as table sizes increase.


Below are few alternative options. My advise may sound counter intuitive, but worth considering if you want to keep your users happy.

Option 1: For shops with BI tools

If you have any good business intelligence tools (Business Objects, OBIEE, Web Focus and others)  in your shop, model table relationships in the tool.

Proper tables will be brought into join relationship based on column selection.

Some of these tools allow you to pull in implicit columns if you must enforce certain joins whether or not columns selected by users

Option 2: Writing queries directly on database without any BI tools

Avoid using complex views for simplicity during development. Train your developers on building proper queries based on business needs.

Limit views to those situations where data must be restricted to satisfy all necessary join conditions due to business logic.

Read Full Post »