Time Series Forecast with DeepAR
In this notebook we will use SageMaker DeepAR to perform time series prediction. The data, provided by Kaggle, is a global household electric power consumption data set collected from 2006 to 2010. A large data set like this allows us to make time series predictions over long periods of time, such as weeks or months.
Data Exploration
Let's get started by exploring the data to see what the data set contains.
! wget https://s3.amazonaws.com/video.udacity-data.com/topher/2019/March/5c88a3f1_household-electric-power-consumption/household-electric-power-consumption.zip
--2020-04-22 01:54:44-- https://s3.amazonaws.com/video.udacity-data.com/topher/2019/March/5c88a3f1_household-electric-power-consumption/household-electric-power-consumption.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.142.102
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.142.102|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20805339 (20M) [application/zip]
Saving to: ‘household-electric-power-consumption.zip’
household-electric- 100%[===================>] 19.84M 8.19MB/s in 2.4s
2020-04-22 01:54:47 (8.19 MB/s) - ‘household-electric-power-consumption.zip’ saved [20805339/20805339]
! unzip household-electric-power-consumption
Archive: household-electric-power-consumption.zip
inflating: household_power_consumption.txt
with open('household_power_consumption.txt') as file:
    for line in range(10):
        print(next(file))
We've downloaded a text file which has a similar format to that of CSV except it is separated by ;.
Data Preprocessing
The text file has the following attributes:
Each data point has the date and time of recording.
Each feature is separated by ;.
Some values are either NaN or ?; we'll treat both as NaN in the DataFrame.
For NaN values, instead of dropping them, we want to fill them with the mean value of that column. This is to ensure our time series is nice and smooth. It's not a terrible assumption to make that if a record is missing, it's likely that the record has a mean value of energy consumption, given that we don't have that many missing values.
Load Text Data into Data Frame
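A minimal sketch of the loading step is given here. It assumes the file has a header row with separate Date (dd/mm/yyyy) and Time (HH:MM:SS) columns; the function name load_power_data is hypothetical and not taken from the original notebook.

```python
import pandas as pd

def load_power_data(path):
    """Load the semicolon-separated text file into a DataFrame indexed by timestamp."""
    df = pd.read_csv(
        path,
        sep=";",
        na_values=["?"],   # treat "?" as missing; "nan"/"NaN" are recognised by default
        low_memory=False,
    )
    # Assumed layout: separate Date (dd/mm/yyyy) and Time (HH:MM:SS) columns.
    df["Date-Time"] = pd.to_datetime(
        df["Date"] + " " + df["Time"], format="%d/%m/%Y %H:%M:%S"
    )
    df = df.drop(columns=["Date", "Time"]).set_index("Date-Time")
    # Every remaining column is a numeric measurement.
    return df.astype(float)
```

Calling load_power_data('household_power_consumption.txt') should then give one row per minute, with ? entries already converted to NaN.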
Replace NaN with Mean
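One possible way to do the replacement is sketched here. The helper name fill_nan_with_mean and the three-row example frame are illustrative only.

```python
import numpy as np
import pandas as pd

def fill_nan_with_mean(df):
    """Replace every NaN with the mean of its column (NaNs are skipped when averaging)."""
    return df.fillna(df.mean())

# Tiny synthetic example: the missing middle value becomes the column mean, 3.0.
example = pd.DataFrame({"Global_active_power": [4.0, np.nan, 2.0]})
filled = fill_nan_with_mean(example)
```

Because df.mean() ignores NaN by default, the fill value is the mean of the observed readings only.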
Date-Time            Global_active_power  Global_reactive_power  Voltage  Global_intensity  Sub_metering_1  Sub_metering_2  Sub_metering_3
2006-12-16 17:24:00  4.216                0.418                  234.84   18.4              0.0             1.0             17.0
2006-12-16 17:25:00  5.360                0.436                  233.63   23.0              0.0             1.0             16.0
2006-12-16 17:26:00  5.374                0.498                  233.29   23.0              0.0             2.0             17.0
2006-12-16 17:27:00  5.388                0.502                  233.74   23.0              0.0             1.0             17.0
2006-12-16 17:28:00  3.666                0.528                  235.68   15.8              0.0             1.0             17.0
Display Global Active Power
For this demonstration, we will predict global active power. We can ignore the other columns.

The data are recorded every minute. Let's zoom in on one day's worth of data and see what it looks like.
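As a rough sketch, a single day can be selected with partial-string indexing on the DatetimeIndex. The series below is synthetic and only stands in for the real global active power column; the chosen date is arbitrary.

```python
import numpy as np
import pandas as pd

# Stand-in for the real column: two days of one-minute readings (synthetic values).
index = pd.date_range("2007-01-01", periods=2 * 24 * 60, freq="min")
active_power = pd.Series(np.random.default_rng(0).random(len(index)), index=index)

# Partial-string indexing on a DatetimeIndex selects every reading of that day.
one_day = active_power.loc["2007-01-01"]
print(len(one_day))  # 1440 one-minute readings
```

With matplotlib installed, one_day.plot() would draw the daily profile.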

Hourly vs Daily
With this amount of data, there are many interesting approaches to this problem.
Create many short time series, predict the energy consumption over hours or days.
Create fewer but longer time series, predict the energy consumption over seasons.
For the purpose of demonstrating pandas resampling, we will go with the latter. We need to convert the minute data points into hourly or daily data points. Pandas' time series tools allow us to easily resample time series data by frequency, e.g. hourly (H) or daily (D).
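A small sketch of daily resampling on synthetic minute data follows. Aggregating with the mean is an assumption made here; the text does not say which aggregation the notebook applies.

```python
import pandas as pd

# Synthetic minute-level series: the first day reads 2.0, the second day 1.0.
index = pd.date_range("2007-01-01", periods=2 * 24 * 60, freq="min")
minute_power = pd.Series(1.0, index=index)
minute_power.iloc[: 24 * 60] = 2.0

# Resample to daily frequency; each day's value is the mean of its 1440 minutes.
daily_power = minute_power.resample("D").mean()
print(daily_power)
```

The result has one data point per calendar day, which is the granularity used for the rest of the walkthrough.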

Create Time Series Training Data
The objective is to train a model on 3 years of data and use the 4th year as the test set to predict the power usage in the first few months of 2010. There will be 3 year-long time series, from the years 2007, 2008, and 2009.
Now we can plot the time series and see that there are 3 series, each of length 365 or 366, depending on whether it is a leap year.
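The year-long series can be cut out of a daily series roughly as follows. The helper name make_yearly_series and the synthetic daily values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def make_yearly_series(daily, years):
    """Return one Series per calendar year, selected by partial-string indexing."""
    return [daily.loc[str(year)] for year in years]

# Synthetic daily series standing in for the resampled data.
index = pd.date_range("2006-12-16", "2010-11-26", freq="D")
daily = pd.Series(np.random.default_rng(0).random(len(index)), index=index)

yearly = make_yearly_series(daily, [2007, 2008, 2009])
print([len(ts) for ts in yearly])  # [365, 366, 365]
```

2008 is a leap year, which is why the middle series has 366 points.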

Training Feature/Label Split in Time
This is supervised learning, so we need to provide our training set with labels or targets. One simple way to think about it is to split each year-long time series into two chunks. The first chunk is the training input, while the second chunk is the label. We are training a model to accept an input time series and return a prediction time series. Let's call the length of the prediction time series prediction_length.
For example, suppose we have 365 days of data and want a prediction length of one month, or 30 days. The input time series would have 335 data points, while the label or target time series would have the remaining 30. This split must occur in time, though: the target must be the last 30 days, not 30 days chosen at random out of the 365.
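The split described above can be sketched as a simple slice at the end of the series. The function name split_in_time is hypothetical, and prediction_length is assumed to be a positive integer smaller than the series length.

```python
import numpy as np
import pandas as pd

def split_in_time(ts, prediction_length):
    """Split a series into (input, target), where target is the last prediction_length points."""
    return ts.iloc[:-prediction_length], ts.iloc[-prediction_length:]

# Synthetic 365-day series for a non-leap year.
index = pd.date_range("2007-01-01", periods=365, freq="D")
year_series = pd.Series(np.arange(365, dtype=float), index=index)

train_input, target = split_in_time(year_series, prediction_length=30)
print(len(train_input), len(target))  # 335 30
```

Every timestamp in the input precedes every timestamp in the target, so no future information leaks into the input.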
Now let's visualize the split.

DeepAR
Save as JSON
Before we run DeepAR on SageMaker, we need one final data preparation step: converting the data frames into the JSON format that DeepAR accepts.
DeepAR expects to see input training data in the following JSON fields.
start: a string that defines the starting date of the time series, in the format YYYY-MM-DD HH:MM:SS
target: a list of numerical values that represent the time series
cat: optional, a numerical array of categorical features that can be used to encode the groups that the record belongs to. This is useful for finding models per class of item.
For example,
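A sketch of the conversion, using only the start and target fields, is shown here. The helper names series_to_json_obj and write_json_dataset and the three sample values are illustrative. The file is written in JSON Lines form, one object per time series per line.

```python
import json
import pandas as pd

def series_to_json_obj(ts):
    """Convert a pandas Series with a DatetimeIndex into a DeepAR-style record."""
    return {"start": str(ts.index[0]), "target": ts.tolist()}

def write_json_dataset(time_series, filename):
    """Write one JSON object per line (JSON Lines), one line per time series."""
    with open(filename, "w") as f:
        for ts in time_series:
            f.write(json.dumps(series_to_json_obj(ts)) + "\n")

ts = pd.Series([1.5, 2.0, 2.5], index=pd.date_range("2007-01-01", periods=3, freq="D"))
print(json.dumps(series_to_json_obj(ts)))
# {"start": "2007-01-01 00:00:00", "target": [1.5, 2.0, 2.5]}
```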
Upload to S3
As with any other built-in model, SageMaker expects the JSON data to be in an S3 bucket for training and inference jobs.
DeepAR Estimator
Instantiate an estimator
There are a couple of hyperparameters we need to set.
epochs: The maximum number of times to pass over the data when training.
time_freq: The granularity of time series in the dataset, e.g. D for daily.
prediction_length: The number of time steps that the model is trained to predict.
context_length: The number of data points that the model gets to see before making a prediction.
More information can be found in the DeepAR documentation.
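As an illustration, the four hyperparameters could be collected in a dictionary like the one below. The values for epochs and context_length are placeholders chosen for this sketch, not the settings used in the original notebook. The prediction length of 30 follows the one-month example given earlier.

```python
# Illustrative values only; tune them for a real training job.
prediction_length = 30

hyperparameters = {
    "epochs": "50",                             # assumed value
    "time_freq": "D",                           # daily data points
    "prediction_length": str(prediction_length),
    "context_length": str(prediction_length),   # assumed: same window as the prediction
}
```

With the SageMaker Python SDK, a dictionary like this would typically be passed to the estimator through set_hyperparameters(**hyperparameters).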
When we provide inputs to the fit function, if we include a test dataset, DeepAR will calculate accuracy metrics for the trained model. This is done by predicting the last prediction_length points of each time series in the test set and comparing them to the actual values of the time series. The computed error metrics will be included as part of the log output.
Deploy it and make it ready for inference.
Model Evaluation
Generate Predictions
The DeepAR predictor expects JSON input with the following keys.
instances: A list of JSON-formatted time series
configuration (optional): A dictionary of configuration information for the response
    num_samples
    output_types
    quantiles
More information is available in the DeepAR Inference Formats documentation.
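A request with these keys might be assembled as in the sketch below. The helper name build_request, the default of 50 samples, and the sample instance are assumptions made for illustration.

```python
import json

def build_request(instances, num_samples=50, quantiles=("0.1", "0.5", "0.9")):
    """Build a UTF-8 encoded JSON request body from a list of time series records."""
    configuration = {
        "num_samples": num_samples,
        "output_types": ["quantiles"],
        "quantiles": list(quantiles),
    }
    request = {"instances": instances, "configuration": configuration}
    return json.dumps(request).encode("utf-8")

# One illustrative instance with made-up values.
instances = [{"start": "2009-01-01 00:00:00", "target": [1.2, 1.4, 1.1]}]
payload = build_request(instances)
```

The resulting bytes are what would be sent to the deployed endpoint through the predictor.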
The prediction JSON would look something like the following. We need to decode the string into a JSON object and then load the data into a DataFrame.
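The decoding step can be sketched as follows. The sample response is hand-written with made-up numbers and assumes that only quantiles were requested as the output type; decode_prediction is a hypothetical helper name.

```python
import json
import pandas as pd

def decode_prediction(prediction, encoding="utf-8"):
    """Decode the response bytes and return one DataFrame of quantiles per time series."""
    data = json.loads(prediction.decode(encoding))
    return [pd.DataFrame(p["quantiles"]) for p in data["predictions"]]

# Hand-written stand-in for an endpoint response (values are made up).
sample_response = json.dumps({
    "predictions": [
        {"quantiles": {"0.1": [0.9, 1.0], "0.5": [1.1, 1.2], "0.9": [1.4, 1.5]}}
    ]
}).encode("utf-8")

frames = decode_prediction(sample_response)
print(frames[0])
```

Each DataFrame has one column per requested quantile and one row per predicted time step.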
Visualize the Results
Quantiles 0.1 and 0.9 represent the lower and upper bounds for the predicted values.
Quantile 0.5 represents the median of all sample predictions.



Predicting the Future
Now that we have verified that the predictor works and captures patterns fairly well, we can use it to predict the future, i.e. the first months of 2010. We will leave target empty and reserve the 2010 data for testing. Alternatively, we could provide the historical data as target and let the model predict the future from it.
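A request for the future period, with target left empty as described, might look like the sketch below. The start date and configuration values are assumptions chosen for illustration.

```python
import json

# Assumed start of the period to predict; target is left empty on purpose.
future_request = {
    "instances": [{"start": "2010-01-01 00:00:00", "target": []}],
    "configuration": {
        "num_samples": 50,
        "output_types": ["quantiles"],
        "quantiles": ["0.1", "0.5", "0.9"],
    },
}
future_payload = json.dumps(future_request).encode("utf-8")
```

The predicted quantiles can then be compared against the held-out 2010 readings.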

The result came out to be not too bad! Now it's time to clean up.