vignettes/gravitas_vignette.Rmd
gravitas_vignette.Rmd
Temporal data are available at various resolutions depending on the context. Often in time series analysis, data collected at finer scales needs to be explored also at coarser temporal scales. This approach requires deconstructing time in various possible ways leading to linear (e.g. days, weeks), circular (e.g. day-of-week) or aperiodic (e.g. day-of-month) time granularities. The hierarchical structure of time also creates a natural nested ordering resulting in single or multiple order-up granularities. For a conventional Gregorian calendar, examples of single-order-up granularity include hour-of-day and minute-of-hour, while examples of multiple-order-up granularity may include hour-of-week or minute-of-day. All of these granularities might be useful in throwing light on the periodicity of the data.
Exploration of large quantities of temporal data across so many different deconstructions of time becomes clumsy without a systematic approach.
This vignette describes the tools in the package gravitas
for systematically exploring large quantities of temporal data across different time deconstructions by visualizing the probability distributions using a range of graphics in ggplot2
.
The smart_meter10
data set in the package is a tsibble providing the energy consumption for ten households from customer trials. It is hard to derive insights when we try to visualize periodicity in the data for all households. For example, if we plot the energy consumption across each hour of the day, the structure of the data suggests that there are several data points for each hour of the day for just one household. Hence, for several households, there will be a blob of points for each hour of the day.
smart_meter10 #> # A tsibble: 293,874 x 3 [30m] <UTC> #> # Key: customer_id [10] #> customer_id reading_datetime general_supply_kwh #> <chr> <dttm> <dbl> #> 1 10006414 2012-02-10 08:00:00 0.141 #> 2 10006414 2012-02-10 08:30:00 0.088 #> 3 10006414 2012-02-10 09:00:00 0.078 #> 4 10006414 2012-02-10 09:30:00 0.151 #> 5 10006414 2012-02-10 10:00:00 0.146 #> 6 10006414 2012-02-10 10:30:00 0.077 #> 7 10006414 2012-02-10 11:00:00 0.052 #> 8 10006414 2012-02-10 11:30:00 0.055 #> 9 10006414 2012-02-10 12:00:00 0.055 #> 10 10006414 2012-02-10 12:30:00 0.252 #> # … with 293,864 more rows
So the question is how to explore systematically multiple perspectives of this smart meter data across different time granularities to find regular patterns or anomalies in behaviors? One way can be by visualizing the probability distributions across these granularities.
Let us see how we can explore the energy behavior of a couple of households by making allowances for the following aspects of time granularities: Computation, interaction and visualization.
search_gran
Typically, the first thing we should have at our disposal for examining periodicities of energy behavior across time granularities is to know the number of time granularities we can look at exhaustively. If we consider conventional time deconstructions for a Gregorian calendar (second, minute, half-hour, hour, day, week, fortnight, month, quarter, semester, year), the following time granularities can be considered for this analysis.
tsibble::interval(smart_meter10) #> <interval[1]> #> [1] 30m smart_meter10 %>% search_gran() #> [1] "hhour_hour" "hhour_day" "hhour_week" #> [4] "hhour_fortnight" "hhour_month" "hhour_quarter" #> [7] "hhour_semester" "hhour_year" "hour_day" #> [10] "hour_week" "hour_fortnight" "hour_month" #> [13] "hour_quarter" "hour_semester" "hour_year" #> [16] "day_week" "day_fortnight" "day_month" #> [19] "day_quarter" "day_semester" "day_year" #> [22] "week_fortnight" "week_month" "week_quarter" #> [25] "week_semester" "week_year" "fortnight_month" #> [28] "fortnight_quarter" "fortnight_semester" "fortnight_year" #> [31] "month_quarter" "month_semester" "month_year" #> [34] "quarter_semester" "quarter_year" "semester_year"
The interval of this tsibble is 30 minutes, and hence the default for search_gran
in this case, provides temporal granularities ranging from half-hour to year. If these options are considered too many, the default options can be modified to limit the possibilities. For example, the most coarse temporal unit can be set to be a “month”.
smart_meter10 %>% search_gran(highest_unit = "month") #> [1] "hhour_hour" "hhour_day" "hhour_week" "hhour_fortnight" #> [5] "hhour_month" "hour_day" "hour_week" "hour_fortnight" #> [9] "hour_month" "day_week" "day_fortnight" "day_month" #> [13] "week_fortnight" "week_month" "fortnight_month"
This looks better. However, some intermediate temporal units might not be pertinent to the analysis and we might want to remove them from the list of granularities that we want to examine using the argument filter_out
in search_gran
.
smart_meter10 %>% search_gran(highest_unit = "month", filter_out = c("hhour", "fortnight") ) #> [1] "hour_day" "hour_week" "hour_month" "day_week" "day_month" #> [6] "week_month"
create_gran()
After we have the set of granularities to look at, we should be able to compute any granularity using create_gran
. These can be be used for exploring the distribution of the time series across univariate temporal granularity or computing summary statistics across these categorizations.
The data is filtered for two customers and a new granularity “day_fortnight” (day of the fortnight) is created and visualized to understand if the electricity consumption is different across days of the fortnight. From the letter value plot, we can not clearly observe any pattern across days of the fortnight for each household.
The difference in the distribution of energy consumption for these two households can be observed in the following plot. Let us see if we can derive any other insights than the otherwise obvious observation that in most cases the second household has much less consumption than the first one. If we consider letter value F as a regular behavior and letter values beyond F as not-so-regular behavior, we can conclude that the regular behavior of the first household is more stable than the second household. However, the distribution of tail of the first household is more variable, observed through distinct letter values, implying that their not-so-regular behavior is quite extreme.
library(lvplot) library(ggplot2) library(dplyr) library(tibble) smart_meter10 %>% filter(customer_id %in% c("10006704", "10017936")) %>% create_gran("day_fortnight") %>% ggplot2::ggplot(aes( x = as.factor(day_fortnight), y = general_supply_kwh)) + xlab("day_fortnight") + geom_lv( outlier.colour = "red", aes(fill = ..LV..), k = 5) + facet_wrap(~customer_id) + scale_fill_lv() + theme_bw() + scale_y_sqrt()
We can visualize the distribution of energy across univariate time granularities for deriving insights on periodic behavior. From the search list, we found that we can look at six granularities, that amounts to analyzing six graphics. However, what happens if we want to see the distribution of energy across two granularities at a time? This is equivalent to looking at the distribution of energy consumption across one granularity conditional on another one. One way can be to plot one of the granularities on the x-axis and another on the facet. Different perspectives of the data can be derived depending on where the granularities are places.
So, what is the number of pairs of granularities we can look at? It is equivalent to taking 2 granularities from 6, which essentially means we need to examine 30 plots. The good news is, not all time granularities can be plotted together and we do not have to analyze so many plots!
Harmony/clash can be identified to considerably reduce the number of visualizations that can aid exploratory analysis.
smart_meter10 %>% is_harmony(gran1 = "hour_day", gran2 = "day_week") #> [1] "TRUE" smart_meter10 %>% is_harmony(gran1 = "hour_day", gran2 = "day_week", facet_h = 14) #> [1] "FALSE" smart_meter10 %>% is_harmony(gran1 = "day_month", gran2 = "week_month") #> [1] "FALSE"
Let us now look at all the harmonies that we can examine. Fortunately, we are left with only 13 out of 30 visualizations. We can also play with the argument facet_h
, which represents the maximum number of levels allowed for the facet variable, to reduce or increase the number of harmonies.
smart_meter10 %>% harmony( ugran = "month", filter_out = c("hhour", "fortnight") ) #> # A tibble: 10 x 4 #> facet_variable x_variable facet_levels x_levels #> <chr> <chr> <int> <int> #> 1 day_week hour_day 7 24 #> 2 day_month hour_day 31 24 #> 3 week_month hour_day 5 24 #> 4 hour_day day_week 24 7 #> 5 day_month day_week 31 7 #> 6 week_month day_week 5 7 #> 7 hour_day day_month 24 31 #> 8 day_week day_month 7 31 #> 9 hour_day week_month 24 5 #> 10 day_week week_month 7 5
Now, we know the pairs of granularities that we need to look at using harmonies. Next, we want to visualize the distribution of the measured variable general_supply_kwh
across these harmonies. This is possible through prob_plot
. We have several options for visualizing statistical distributions and each comes with some pros and cons which we need to consider while choosing a good one for our context. Traditional methods of plotting distributions include boxplots and violin plots, while more recent methods include ridge plots, quantile plots or letter-value plots.
gran_advice
provides checkpoints before plotting a distribution. It includes recommended distribution plots depending on the levels of the two granularities plotted. Assumptions are made to ensure the display is not too cluttered by the space occupied by various kinds of distribution plots. Moreover, it also provides advice on the number of observations to use for drawing the distribution plot so that one can ensure that they are enough before jumping on to plot a distribution. Furthermore, it advises if the number of observations varies significantly across facets or within facets.
smart_meter10 %>% gran_advice( "week_month", "hour_day" ) #> The chosen granularities are harmonies #> #> Recommended plots are: violin lv quantile boxplot #> #> Number of observations are homogenous across facets #> #> Number of observations are homogenous within facets #> #> Cross tabulation of granularities : #> #> # A tibble: 24 x 6 #> hour_day `1` `2` `3` `4` `5` #> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 0 2827 2810 2825 2806 969 #> 2 1 2827 2816 2828 2806 968 #> 3 2 2825 2815 2830 2805 969 #> 4 3 2826 2814 2832 2804 970 #> 5 4 2826 2814 2828 2807 970 #> 6 5 2826 2814 2824 2811 970 #> 7 6 2827 2813 2825 2806 970 #> 8 7 2828 2812 2826 2802 970 #> 9 8 2828 2816 2827 2802 972 #> 10 9 2826 2818 2828 2802 974 #> # … with 14 more rows
The function gran_obs
can be employed to see if the number of observations per category is large enough to draw a distribution plot. In this case, the number of observations is much less on weekends compared to weekdays, but enough to have reliable estimates of distribution.
smart_meter10 %>% gran_obs( "week_month", "wknd_wday" ) #> # A tibble: 2 x 6 #> wknd_wday `1` `2` `3` `4` `5` #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 Weekday 48450 48224 48502 47965 16431 #> 2 Weekend 19454 19252 19335 19360 6901
Through prob_plot
we can choose one from the list of recommended plots. Warnings are generated if users try to plot clashes, the number of facet variables is too high or the number of observations used to compute statistical summaries for distribution is not large enough. Currently, we can choose from “boxplot”, “violin”, “ridge”, “lv” or “quantile” to visualize distributions.
We visualize the harmony pair (wknd_wday, hour_day) through a box plot. Boxplot of energy consumption is shown across wknd_wday (facet) and hour-day (x-axis) for the same two households. For the second household, outliers are less prominent implying their regular behavior is more stable. For the first household, energy behavior is not significantly different between weekdays and weekends. For the second household, median energy consumption for the early morning hours is extremely high for weekends compared to weekdays.
cust1 <- smart_meter10 %>% filter(customer_id %in% c(10006704)) %>% prob_plot("wknd_wday", "hour_day", response = "general_supply_kwh", plot_type = "boxplot") + scale_y_sqrt() + ggtitle("Energy consumption distribution for customer id: 10006704") #> Joining, by = c("wknd_wday", "hour_day") #> Joining, by = c("wknd_wday", "hour_day")
cust2 <- smart_meter10 %>% filter(customer_id %in% c("10017936"))%>% prob_plot("wknd_wday", "hour_day", response = "general_supply_kwh", plot_type = "boxplot") + scale_y_sqrt() + ggtitle("Energy consumption distribution for customer id: 10017936") #> Joining, by = c("wknd_wday", "hour_day") #> Joining, by = c("wknd_wday", "hour_day")
Similarly, we can plot the other harmony pairs and decide if they are useful in deriving insights about the energy behavior of the customers.
To develop a framework so that these individual distributions are utilized to group subjects into similar pockets of behavior.
Make gravitas
work with big data tools like sparklyr
Any periodic/aperiodic temporal granularities computation given a hierarchical structure/calendar.
Thanks to PhD supervisors Prof. Rob J Hyndman, Prof. Dianne Cook and Google Summer of Code 2019 mentor Prof. Antony Unwin for their support and always leading by example. The fine balance of encouraging me to work on my ideas and stepping in to help when I need has made the development of this package a great learning experience for me.
Moreover, I want to thank my cohort at NUMBATS, Monash University, especially Mitchell O’Hara-Wild and Nicholas Spyrison for always lending an ear and sharing their wisdom and experience of developing R packages, with such kindness.