Recently, my Data Analytics team had to produce a cumulative downloads and revenue report for every mobile app in every store, starting from the apps’ launch dates. Even though it is very easy to finish this task using any data analytics framework in Python or other languages, we wanted to try it in MySQL and see if it is feasible. As it turned out, calculating cumulative summary in MySQL (or any other SQL system) is possible, yet challenging and fun.
Let’s cut the chase, here is the sample structure of our database table named app_daily
:
app_name | store | country | start_date | downloads | revenue |
---|---|---|---|---|---|
Hello World | App Store | VN | 2016-01-01 | 10 | 16.5 |
Lorem Ipsum | Google Play | US | 2016-01-01 | 9 | 22.67 |
Donor | Amazon | UK | 2016-01-01 | 1 | 0 |
Hello World | App Store | US | 2016-01-02 | 19 | 31.35 |
Lorem Ipsum | Google Play | UK | 2016-01-02 | 12 | 30.23 |
Donor | Amazon | VN | 2016-01-02 | 18 | 53.82 |
The question here — again — is:
Find the cumulative downloads and revenue of every app, in every store, starting from launch date (the first date the app appeared). We don’t need to care about the country.
Here is the expected result which we want to have:
store | app_name | start_date | cumulative_downloads | cumulative_revenue |
---|---|---|---|---|
Amazon | Donor | 2016-01-01 | 1 | 0 |
Amazon | Donor | 2016-01-02 | 19 | 53.82 |
App Store | Hello World | 2016-01-01 | 10 | 16.5 |
App Store | Hello World | 2016-01-02 | 29 | 47.85 |
Google Play | Lorem Ipsum | 2016-01-01 | 9 | 22.67 |
Google Play | Lorem Ipsum | 2016-01-02 | 21 | 52.90 |
How to do it in pure SQL?
SQL has many ways to calculate cumulative summary. We will utilize View and Correlated query in our case:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 | -- Creates a custom View as a shortcut for our initial query create or replace view ad as ( select store, app_name, start_date, sum(downloads) as downloads, sum(revenue) as revenue from app_daily group by store, app_name, start_date ); -- Calculates Cumulative summary select t.store, t.app_name, t.start_date, ( select sum(x.downloads) from ad x where x.start_date <= t.start_date and x.app_name = t.app_name and x.store = t.store ) as cumulative_downloads, ( select sum(x.revenue) from ad x where x.start_date <= t.start_date and x.app_name = t.app_name and x.store = t.store ) as cumulative_revenue from ad t order by store, app_name, start_date; |
Pros:
- Portable in every SQL system.
Cons:
- Have to create a View. May need to clean up that View when everything is done.
- Have to do correlated query twice.
- Take a long time to run if the table is very big because each correlated query had to calculate again and again.
How to do it in MySQL?
MySQL has a very powerful variable system. This query below will make good use of that.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | select ad.store, ad.app_name, ad.start_date, -- Checks if the current row is still for the same app and store -- If not, resets the cumulative_sum @downloads_total := if( @prev_app = ad.app_name and @prev_store = ad.store, @downloads_total, 0 ) + ad.downloads as cumulative_downloads, @revenue_total := if( @prev_app = ad.app_name and @prev_store = ad.store, @revenue_total, 0 ) + ad.revenue as cumulative_revenue, -- Assigns the previous app and store again. @prev_app := ad.app_name, @prev_store := ad.store from ( select app_name, store, start_date, sum(downloads) as downloads, sum(revenue) as revenue from app_daily group by app_name, store, start_date ) ad join ( select @prev_app := '', @prev_store := '', @downloads_total := 0, @revenue_total := 0 ) vars order by store, app_name, start_date; |
Pros:
-
Faster than the pure SQL way, because we can reuse the variables’ values and don’t have to recalculate every time like how the correlated query did.
In our production table which has 2,981,651 rows — at the writing time — the running time is 2.87 times faster than the pure SQL method.
-
Only one query to do the trick.
Cons:
- Only applicable in MySQL. Incompatible with other SQL systems.
- Have two extra columns which need to be deleted after having the resulting table.
Till next time ;)
Luan Nguyen Thanh
01:02 AM