In this notebook we will use SageMaker DeepAR to perform time series prediction. The data we will be using is provided by Kaggle: a household electric power consumption data set collected from 2006 to 2010. A large data set like this allows us to make time series predictions over long periods of time, like weeks or months.
Data Exploration
Let's get started by exploring the data and seeing what's contained within the data set.
We've downloaded a text file which has a format similar to CSV, except that the fields are separated by ;.
Data Preprocessing
The text file has the following attributes:
Each data point has the date and time of recording
Each feature is separated by ;
Some values are either NaN or ?; we'll treat both as NaN in the DataFrame (see the loading sketch after this list)
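As a minimal loading sketch (the file name, date format, and index name are assumptions based on the Kaggle download), we can read the file with pandas:

import pandas as pd

# Read the ;-separated file, treat ? as missing, and combine the Date and
# Time columns into a single datetime index named 'Date-Time'
df = pd.read_csv('household_power_consumption.txt',
                 sep=';',
                 na_values=['?'],
                 parse_dates={'Date-Time': ['Date', 'Time']},
                 dayfirst=True,          # assumed dd/mm/yyyy date format
                 index_col='Date-Time',
                 low_memory=False)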
For NaN values, instead of dropping them, we want to fill them with the mean value of that column. This keeps our time series continuous and smooth. It's not a terrible assumption that a missing record would have roughly the mean value of energy consumption, given that we don't have that many missing values.
print('Number of missing values per column')
df.isnull().sum()
Number of missing values per column
Global_active_power 25979
Global_reactive_power 25979
Voltage 25979
Global_intensity 25979
Sub_metering_1 25979
Sub_metering_2 25979
Sub_metering_3 25979
dtype: int64
print('Number of values per column')
df.count()
Number of values per column
Global_active_power 2049280
Global_reactive_power 2049280
Voltage 2049280
Global_intensity 2049280
Sub_metering_1 2049280
Sub_metering_2 2049280
Sub_metering_3 2049280
dtype: int64
Replace NaN with Mean
num_cols = len(list(df.columns.values))
for col in range(num_cols):
    df.iloc[:, col] = df.iloc[:, col].fillna(df.iloc[:, col].mean())

print('Number of missing values per column')
df.isnull().sum()
Number of missing values per column
Global_active_power 0
Global_reactive_power 0
Voltage 0
Global_intensity 0
Sub_metering_1 0
Sub_metering_2 0
Sub_metering_3 0
dtype: int64
df.head()
                     Global_active_power  Global_reactive_power  Voltage  Global_intensity  Sub_metering_1  Sub_metering_2  Sub_metering_3
Date-Time
2006-12-16 17:24:00                4.216                  0.418   234.84              18.4             0.0             1.0            17.0
2006-12-16 17:25:00                5.360                  0.436   233.63              23.0             0.0             1.0            16.0
2006-12-16 17:26:00                5.374                  0.498   233.29              23.0             0.0             2.0            17.0
2006-12-16 17:27:00                5.388                  0.502   233.74              23.0             0.0             1.0            17.0
2006-12-16 17:28:00                3.666                  0.528   235.68              15.8             0.0             1.0            17.0
Display Global Active Power
For this demonstration, we will predict global active power. We can ignore the other columns.
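The column selection itself is not shown in this section; a minimal sketch (the variable name active_power_df is what the later cells use) would be:

# Keep only the Global_active_power column as a Series for forecasting
active_power_df = df['Global_active_power']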
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(12,6))
active_power_df.plot(title='Global Active Power', color='green')
plt.show()
The data are recorded every minute; let's zoom in on one day's worth of data and see what it looks like.
# There are 1440 minutes in a day
plt.figure(figsize=(12,6))
active_power_df[0:1440].plot(title='Global Active Power Over 1 Day', color='green')
plt.show()
Hourly vs Daily
With this amount of data, there are many interesting approaches to this problem.
Create many short time series and predict the energy consumption over hours or days.
Create fewer but longer time series and predict the energy consumption over seasons.
For the purpose of demonstrating pandas resampling, we will go with the latter. We need to convert the minute-level data points into hourly or daily data points. Pandas' time series tools allow us to easily resample time series data by frequency, e.g. hourly (H) or daily (D).
# Set frequency to be daily
freq = 'D'
mean_active_power_df = active_power_df.resample(freq).mean()

plt.figure(figsize=(12,6))
mean_active_power_df.plot(title='Global Active Power Mean per Day', color='green')
plt.show()
Create Time Series Training Data
The objective is to train a model on 3 years of data and use the 4th year as the test set, predicting the power usage in the first few months of 2010. There will be 3 year-long time series, from the years 2007, 2008, and 2009.
def create_time_series_list_by_years(df, years, freq='D', start_idx=0):
    """Creates time series for each supplied year in the years list."""
    # We should account for all leap years, but for the purpose of this demo, 2008 is enough
    leap = '2008'
    time_series_list = []
    for i in range(len(years)):
        if years[i] == leap:
            end_idx = start_idx + 366
        else:
            end_idx = start_idx + 365
        index = pd.date_range(start=years[i] + '-01-01', end=years[i] + '-12-31', freq=freq)
        time_series_list.append(pd.Series(data=df[start_idx:end_idx], index=index))
        start_idx = end_idx
    return time_series_list
Now we can plot the time series and see that there are 3 series, each of length 365 or 366, depending on whether the year is a leap year.
time_series_list = create_time_series_list_by_years(mean_active_power_df, ['2007', '2008', '2009'], start_idx=16)

plt.figure(figsize=(12,6))
for ts in time_series_list:
    ts.plot()
plt.show()
Training Feature/Label Split in Time
This is supervised learning, so we need to provide our training set with labels or targets. One simple way to think about it is to split each year-long time series into two chunks: the first chunk is the training input, while the second chunk is the label. We are training a model to accept an input time series and return a predicted time series. Let's call the length of the predicted time series prediction_length.
For example, suppose I have 365 days of data and I want my prediction length to be a month, or 30 days. The input time series would have 335 data points, while the label or target time series would have 30 data points. This split must occur in time, though: we cannot randomly choose 30 days out of the 365.
prediction_length = 30  # Days

training_list = []
for ts in time_series_list:
    training_list.append(ts[:-prediction_length])

for ts in training_list:
    print('Training set has shape {} after truncating {} days'.format(ts.shape, prediction_length))
Training set has shape (335,) after truncating 30 days
Training set has shape (336,) after truncating 30 days
Training set has shape (335,) after truncating 30 days
Before we run DeepAR on SageMaker, we need to do one final data preparation step: converting the time series into the JSON format that DeepAR accepts.
DeepAR expects each record of the input training data to be a JSON object with the following fields (writing the files is sketched after this list).
start: a string that defines the starting timestamp of the time series, in the format YYYY-MM-DD HH:MM:SS
target: a list of numerical values that represent the time series
cat: optional, a numerical array of categorical features that can be used to encode the groups that the record belongs to. This is useful for finding models per class of item.
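As a minimal sketch of this conversion (the local_data_dir location and file names are assumptions that match the upload cell below), we can write one JSON object per line:

import os
import json

local_data_dir = 'data'  # assumed scratch directory for the JSON files
os.makedirs(local_data_dir, exist_ok=True)

def write_json_dataset(series_list, path):
    """Writes each time series as one JSON object per line with start and target fields."""
    with open(path, 'w') as f:
        for ts in series_list:
            record = {
                'start': str(ts.index[0]),   # e.g. '2007-01-01 00:00:00'
                'target': ts.tolist()
            }
            f.write(json.dumps(record) + '\n')

# Training series are truncated by prediction_length; test series are the full years
write_json_dataset(training_list, os.path.join(local_data_dir, 'train.json'))
write_json_dataset(time_series_list, os.path.join(local_data_dir, 'test.json'))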
Just as with any other built-in model, SageMaker expects the JSON data to be in an S3 bucket during training and inference jobs.
import boto3
import sagemaker

session = sagemaker.Session(default_bucket='machine-learning-case-studies')
role = sagemaker.get_execution_role()
s3_bucket = session.default_bucket()
s3_prefix = 'deepar-energy-consumption'
print('Instantiated session with default bucket {}'.format(s3_bucket))

train_path = session.upload_data(os.path.join(local_data_dir, 'train.json'),
                                 bucket=s3_bucket, key_prefix=s3_prefix)
test_path = session.upload_data(os.path.join(local_data_dir, 'test.json'),
                                bucket=s3_bucket, key_prefix=s3_prefix)
print('Training data are stored in {}'.format(train_path))
print('Test data are stored in {}'.format(test_path))
Instantiated session with default bucket machine-learning-case-studies
Training data are stored in s3://machine-learning-case-studies/deepar-energy-consumption/train.json
Test data are stored in s3://machine-learning-case-studies/deepar-energy-consumption/test.json
When we provide inputs to the fit function, if we include a test dataset, DeepAR will calculate accuracy metrics for the trained model. This is done by predicting the last prediction_length points of each time series in the test set and comparing them to the actual values of the time series. The computed error metrics will be included as part of the log output.
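For context, here is a minimal sketch of what the training call might look like with the SageMaker Python SDK v2; the instance type and hyperparameter values (e.g. epochs) are illustrative assumptions, not tuned settings:

from sagemaker import image_uris

# Resolve the built-in DeepAR container image for the current region
image_uri = image_uris.retrieve('forecasting-deepar', session.boto_region_name)

estimator = sagemaker.estimator.Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type='ml.c4.xlarge',   # illustrative choice
    output_path='s3://{}/{}/output'.format(s3_bucket, s3_prefix),
    sagemaker_session=session)

estimator.set_hyperparameters(
    time_freq=freq,                            # 'D' for daily data
    context_length=str(prediction_length),     # how much history the model sees
    prediction_length=str(prediction_length),
    epochs='50')                               # illustrative value

# Including a test channel makes DeepAR report accuracy metrics in the training log
estimator.fit({'train': train_path, 'test': test_path})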
Now that we have verified that the predictor works and can capture patterns fairly well, we can use it to predict the future, i.e. the first months of 2010. We will leave target empty and reserve the 2010 data for testing. In fact, we could also provide the historical data as target and let the model predict the future from it.
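The predictor and the decode_prediction helper used below are not shown in this section; a minimal sketch of what they might look like (the response format follows DeepAR's documented quantiles output) is:

# Deploy the trained model to a real-time endpoint (instance type is illustrative)
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

def decode_prediction(response_bytes, encoding='utf-8'):
    """Decodes the endpoint's JSON response into a DataFrame with one column per quantile."""
    response = json.loads(response_bytes.decode(encoding))
    return pd.DataFrame(response['predictions'][0]['quantiles'])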
start_date = '2010-01-01'  # We want to predict the first 30 days in 2010
timestamp = '00:00:00'
request_data = {
    'instances': [
        {
            'start': '{} {}'.format(start_date, timestamp),
            'target': []
        }
    ],
    'configuration': {
        'num_samples': 50,
        'output_types': ['quantiles'],
        'quantiles': ['0.1', '0.5', '0.9']
    }
}
predictions_2010 = decode_prediction(predictor.predict(json.dumps(request_data).encode('utf-8')))
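As a usage sketch (assuming decode_prediction returns a DataFrame with one column per quantile, as above), we can plot the median forecast with its 80% confidence band:

plt.figure(figsize=(12, 6))
predictions_2010['0.5'].plot(label='median prediction', color='green')
plt.fill_between(predictions_2010.index,
                 predictions_2010['0.1'],
                 predictions_2010['0.9'],
                 color='green', alpha=0.2, label='80% confidence interval')
plt.legend()
plt.show()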