5.3 Future ideas
Non-stationary time series
Because the time series are observed over a short period (1-3 years), they are assumed to be stationary. However, distributions can change over time, even within a short period. In the smart meter example, the distribution for a customer who moves to a different house or changes electrical equipment can change drastically. To detect such changes, non-stationarity has to be incorporated both when visualizing distributions and when computing distances between two non-stationary time series, or between a stationary and a non-stationary one.
Structured missingness
It is possible that data for an entire category of a cyclic granularity is unavailable, or that there are too few observations to compute distributions. For example, a customer may have no data for a particular day of the week or month throughout their observation period. When visualizing probability distributions across categories in Chapters 2 and 3, this can be indicated by displaying dot plots instead of distributional summaries. But the distances in Chapter 4 cannot handle observations that are missing in this structured way. More research is needed to design a distance that can accommodate customers with structured missingness, and to understand the implications of such missingness when characterizing customers visually.
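As an illustration of how structured missingness might be detected before any distance is computed, the following is a minimal sketch rather than part of the thesis methodology. The function name `sparse_categories`, the threshold `min_obs`, and the example customer are all hypothetical; the sketch counts observations per category of one cyclic granularity (day of week) and reports the categories that fall below the threshold.

```python
from collections import Counter
from datetime import datetime, timedelta


def sparse_categories(timestamps, category_of, all_categories, min_obs=30):
    """Return {category: count} for categories with fewer than min_obs observations.

    category_of maps a timestamp to its category of a cyclic granularity;
    min_obs is an arbitrary illustrative threshold, not a recommended value.
    """
    counts = Counter(category_of(t) for t in timestamps)
    return {c: counts[c] for c in all_categories if counts[c] < min_obs}


# Hypothetical customer: hourly readings for four weeks, never recorded on Sundays.
start = datetime(2021, 1, 4)  # a Monday
readings = [start + timedelta(hours=h) for h in range(24 * 28)]
readings = [t for t in readings if t.weekday() != 6]  # weekday() == 6 is Sunday

flagged = sparse_categories(readings, lambda t: t.weekday(), range(7))
print(flagged)  # {6: 0}
```

Customers with a non-empty result could then be displayed with dot plots for the flagged categories, or set aside until a distance that accommodates such gaps is available.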
Anomalies
Characterizing clusters that contain varied or outlying customers can result in patterns that do not represent the group. Moreover, combining heterogeneous customers may produce final clusters that look visually identical and are therefore of little use. An appropriate anomaly detection approach could be applied to filter out the anomalies; a decision then has to be made on whether the anomalies should form a group of their own or be assigned to the closest group through a classification method.
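One way the filter-then-assign idea could look is sketched below. This is only an assumption about a possible approach, not a method proposed or evaluated in the thesis: anomalies are flagged with a simple median plus k times MAD rule on each customer's typical distance to the others, and a flagged customer is then assigned to the group with the smallest mean distance. The functions `flag_anomalies` and `closest_group`, the cut-off `k`, and the toy positions are hypothetical.

```python
from statistics import median


def flag_anomalies(dist, k=3.0):
    """Flag items whose median distance to all other items is unusually large.

    Uses a median + k * MAD cut-off. If the MAD is zero, every item above
    the median is flagged, so the rule needs care on very homogeneous data.
    """
    n = len(dist)
    typical = [median(dist[i][j] for j in range(n) if j != i) for i in range(n)]
    centre = median(typical)
    mad = median(abs(t - centre) for t in typical)
    return [i for i, t in enumerate(typical) if t > centre + k * mad]


def closest_group(i, dist, labels):
    """Assign item i to the group with the smallest mean distance to i."""
    groups = {}
    for j, g in labels.items():
        groups.setdefault(g, []).append(dist[i][j])
    return min(groups, key=lambda g: sum(groups[g]) / len(groups[g]))


# Toy example: seven customers placed on a line, with one far from the rest.
positions = [0, 1, 2, 10, 11, 12, 50]
dist = [[abs(a - b) for b in positions] for a in positions]

anomalies = flag_anomalies(dist)
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}  # groups of the rest
print(anomalies)                           # [6]
print(closest_group(6, dist, labels))      # B
```

In practice the distance matrix would come from the distances of Chapter 4, and whether to keep the flagged customers as a separate group or reassign them remains the open decision described above.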
Inferential methods for the clusters
It is important to determine whether detected clusters are statistically significant. In Chapter 4, we used a permutation approach to assess whether patterns across one or a pair of granularities are genuine. A similar approach could be used to assess the significance of clusters, but the permutations used for comparisons must account for the fact that different temporal granularities have distinct conditional distributions.
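For readers unfamiliar with permutation testing, the general idea can be sketched as follows. This is a generic illustration under simplifying assumptions, not the procedure of Chapter 4: the test statistic (the largest gap between category medians), the function names, and the toy data are hypothetical, and the labels are shuffled freely, which ignores the adjustment for differing conditional distributions discussed above.

```python
import random
from statistics import median


def max_median_gap(values, labels):
    """Largest difference between the medians of any two categories."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    meds = [median(vs) for vs in groups.values()]
    return max(meds) - min(meds)


def permutation_p_value(values, labels, n_perm=199, seed=0):
    """Estimate a p-value by shuffling category labels n_perm times."""
    rng = random.Random(seed)
    observed = max_median_gap(values, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between values and categories
        if max_median_gap(values, shuffled) >= observed:
            exceed += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (1 + exceed) / (1 + n_perm)


# Toy data: two categories with clearly separated values.
values = list(range(10)) + list(range(100, 110))
labels = ["weekday"] * 10 + ["weekend"] * 10
p = permutation_p_value(values, labels)
print(p < 0.05)  # True
```

Extending this to clusters would require replacing the statistic with a measure of cluster separation and restricting the shuffles so that the structure of each granularity is preserved.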
Scaling up for larger uncertainty and computational efficiency
Larger data sets with more uncertainty complicate matters and may raise problems beyond those described above. More research is needed to understand what other problems may arise with the current methodology and how they could be addressed. All of this must be done with computational efficiency in mind, which is critical when scaling up to the analysis of large data sets.
Scaling up the methods for multivariate time series data
Another possible extension would be to create a similar framework for visualizing and analyzing multivariate time series data. With multiple time series available for each observation, efficient exploration and visualization become considerably more complex. In this case, conditional distributions involve not only temporal dependency but also the variables and the dependencies among them. This adds to the already high-dimensional data structures that result from studying distributions. The problem can be tackled by first incorporating the inherent characteristics of time when visualizing one or a few multivariate time series. Unsupervised clustering can then be used to group multiple time series across multiple time granularities and variables, similar to the approach used in this thesis for univariate time series.