We are deploying two unsupervised algorithms to perform population segmentation on US census data.
Using principal component analysis (PCA) to reduce the dimensionality of the original census data. Then apply k-means clustering to assign each US county to a particular cluster based on where a county lies in component space. This allows us to observe counties that are similiar to each other in socialeconomic terms.
import pandas as pdimport numpy as npimport osimport ioimport matplotlib.pyplot as pltimport matplotlib%matplotlib inlineimport boto3import sagemaker
Step 1 Load Data from S3
data_bucket ='aws-ml-blog-sagemaker-census-segmentation's3_client = boto3.client('s3')obj_list=s3_client.list_objects(Bucket=data_bucket)keys=[]for contents in obj_list['Contents']: keys.append(contents['Key'])# We should only get one key, which is the CSV file we want.iflen(keys)!=1:raiseRuntimeError('received unexpected number of keys from {}'.format(data_bucket))data_object = s3_client.get_object(Bucket=data_bucket, Key=keys[0])data_body = data_object["Body"].read()# in Bytesdata_stream = io.BytesIO(data_body)census_df = pd.read_csv(data_stream, header=0, delimiter=",")display(census_df.head())
CensusId
State
County
TotalPop
Men
Women
Hispanic
White
Black
Native
...
Walk
OtherTransp
WorkAtHome
MeanCommute
Employed
PrivateWork
PublicWork
SelfEmployed
FamilyWork
Unemployment
0
1001
Alabama
Autauga
55221
26745
28476
2.6
75.8
18.5
0.4
...
0.5
1.3
1.8
26.5
23986
73.6
20.9
5.5
0.0
7.6
1
1003
Alabama
Baldwin
195121
95314
99807
4.5
83.1
9.5
0.6
...
1.0
1.4
3.9
26.4
85953
81.5
12.3
5.8
0.4
7.5
2
1005
Alabama
Barbour
26932
14497
12435
4.6
46.2
46.7
0.2
...
1.8
1.5
1.6
24.1
8597
71.8
20.8
7.3
0.1
17.6
3
1007
Alabama
Bibb
22604
12073
10531
2.2
74.5
21.4
0.4
...
0.6
1.5
0.7
28.8
8294
76.8
16.1
6.7
0.4
8.3
4
1009
Alabama
Blount
57710
28512
29198
8.6
87.9
1.5
0.3
...
0.9
0.4
2.3
34.9
22189
82.0
13.5
4.2
0.4
7.7
5 rows × 37 columns
Step 2 Explore & Clean Data
# Drop rows with missing datacensus_df = census_df.dropna(axis=0, how='any')# Re-indexcensus_df.index = census_df['State']+'-'+ census_df['County']# Drop useless columnscensus_df = census_df.drop(columns=['CensusId', 'State', 'County'])display(census_df.head())
prefix ='population-segmentation'model_output_path='s3://{}/{}/'.format(session_bucket, prefix)print('Training artifcats will be uploaded to {}'.format(model_output_path))
Training artifcats will be uploaded to s3://machine-learning-case-studies/population-segmentation/
3.2 Train the Model
%%timefrom sagemaker import PCANUM_COMPONENTS =len(feature_list)-1# Define the modelpca =PCA(role=role, train_instance_count=1, train_instance_type='ml.c4.xlarge', output_path=model_output_path, num_components=NUM_COMPONENTS, sagemaker_session=session)# Prepare data by converting into RecordSet, this will be done by the# model.train_data_np = norm_census_df.values.astype('float32')train_data_record_set = pca.record_set(train_data_np)# Train the modelfrom time import localtime, strftimejob_name =strftime("population-segmentation-pca-%Y-%m-%d-%H-%M-%S", localtime())pca.fit(train_data_record_set, job_name=job_name)
2020-04-06 21:56:07 Starting - Starting the training job...
2020-04-06 21:56:08 Starting - Launching requested ML instances...
2020-04-06 21:57:06 Starting - Preparing the instances for training......
2020-04-06 21:58:01 Downloading - Downloading input data...
2020-04-06 21:58:28 Training - Downloading the training image..
...
Training seconds: 58
Billable seconds: 58
CPU times: user 498 ms, sys: 37.3 ms, total: 535 ms
Wall time: 3min 12s
Loading artifacts from population-segmentation/population-segmentation-pca-2020-04-06-21-56-07/output/model.tar.gz
2304
Many of the Amazon SageMaker algorithms use MXNet for computational speed, including PCA, and so the model artifacts are stored as an array. After the model is unzipped and decompressed, we can load the array using MXNet.
import mxnet as mximport pprintpca_model_params = mx.ndarray.load('model_algo-1')pprint.pprint(pca_model_params)
Three types of model attributes are contained within the PCA model.
mean: The mean that was subtracted from a component in order to center it.
v: The makeup of the principal components; (same as ‘components_’ in an sklearn PCA model).
s: The singular values of the components for the PCA transformation.
The singular values do not exactly give the % variance from the original feature space, but can give the % variance from the projected feature space.
Explained Variance
From s, we can get an approximation of the data variance that is covered in the first n principal components. The approximate explained variance is given by the formula: the sum of squared s values for all top n components over the sum over squared s values for all components:
From v, we can learn more about the combinations of original features that make up each principal component.
# Get selected paramss=pd.DataFrame(pca_model_params['s'].asnumpy())v=pd.DataFrame(pca_model_params['v'].asnumpy())# The top principal components are actually at the end of the data frame.# Ordered by least significance to most sigificance.# Let's look at top 5display(s.iloc[NUM_COMPONENTS-5:, :])
0
28
7.991313
29
10.180052
30
11.718245
31
13.035975
32
19.592180
defexplained_variance(s,n_top_components):''' :param s: A dataframe of singular values for top components; the top value is in the last row. :param n_top_components: An integer, the number of top components to use. :return: The expected data variance covered by the n_top_components. ''' start_idx =len(s)- n_top_components denom = np.square(s).sum()[0] numer = np.square(s.iloc[start_idx:, :]).sum()[0]return numer/denomfor n inrange(1, 10):print('Explained variance with {} top components: {}'.format(n, explained_variance(s, n)))
Explained variance with 1 top components: 0.32098713517189026
Explained variance with 2 top components: 0.46309205889701843
Explained variance with 3 top components: 0.5779199004173279
Explained variance with 4 top components: 0.6645806431770325
Explained variance with 5 top components: 0.7179827094078064
Explained variance with 6 top components: 0.7675008773803711
Explained variance with 7 top components: 0.8016724586486816
Explained variance with 8 top components: 0.8315857648849487
Explained variance with 9 top components: 0.8576101660728455
3.5 Examine Component Makeup
We can now examine the makeup of each PCA component based on the weights of the original features that are included in the component.
import seaborndefdisplay_component_makeup(v,feature_list,component_idx,num_weights=10): component = v.iloc[:, NUM_COMPONENTS-component_idx]# Reversed component_values = np.squeeze(component.values)# Remove redundant arrays component_df = pd.DataFrame(data=list(zip(component_values, feature_list)), columns=['weights', 'features'])# Create a new column for absolute weight, so we can sort it component_df['abs_weights']= component_df['weights'].apply(lambdax: np.abs(x)) sorted_weight_data = component_df.sort_values('abs_weights', ascending=False).head(num_weights) ax = plt.subplots(figsize=(10, 6)) ax = seaborn.barplot(data=sorted_weight_data, x='weights', y='features') ax.set_title('PCA Component Makeup of Component {}'.format(component_idx)) plt.show()
Each component is a linearly independent vector. The component represents a new basis in a projected component space. When we compare the feature weights, we are effectively asking, for this new basis vector, what is its corrrelation to the original feature space? For example, component 1 is negatively correlated with White percentage in the census data. Component 2 is positvely correlated with PrivateWork percentage in census data.
for idx inrange(1, 3):display_component_makeup(v, feature_list=norm_census_df.columns.values, component_idx=idx, num_weights=33)
Step 4 Deploy PCA with SageMaker
Now we can deploy the PCA model and use it as an endpoint without having to dig into the model params to transform an input.
-------------!CPU times: user 238 ms, sys: 15.3 ms, total: 254 ms
Wall time: 6min 33s
We don't need to use RecordSet anymore once the model is deployed. It will simply accept a numpy array to yield principal components.
pca_result = pca_predictor.predict(train_data_np)
SageMaker PCA returns a list of protobuf Record message, same length as the training data which is 3218 in this case. The protobuf Record message has the following format.
message Record {// Map from the name of the feature to the value.//// For vectors and libsvm-like datasets,// a single feature with the name `values`// should be specified.map<string, Value> features =1;// An optional set of labels for this record.// Similar to the features field above, the key used for// generic scalar / vector labels should be 'values'.map<string, Value> label =2;// A unique identifier for this record in the dataset.//// Whilst not necessary, this allows better// debugging where there are data issues.//// This is not used by the algorithm directly.optionalstring uid =3;// Textual metadata describing the record.//// This may include JSON-serialized information// about the source of the record.//// This is not used by the algorithm directly.optionalstring metadata =4;// An optional serialized JSON object that allows per-record// hyper-parameters/configuration/other information to be set.//// The meaning/interpretation of this field is defined by// the algorithm author and may not be supported.//// This is used to pass additional inference configuration// when batch inference is used (e.g. types of scores to return).optionalstring configuration =5;}
Essentially each data point is now projected onto a new component space. We can retrieve the projection by using the following syntax.
Now we can transform the PCA result into a DataFrame that we can work with.
defcreate_pca_transformed_df(original_data,pca_transformed_data,num_top_components):""" Return a DataFrame of data points with component features. The DF should be indexed by State-County and contain component values. """ df = pd.DataFrame()for record in pca_transformed_data: projection_values = record.label['projection'].float32_tensor.values df = df.append([projection_values]) df.index = original_data.index df = df.iloc[:, NUM_COMPONENTS-num_top_components:]return df.iloc[:,::-1]# Reverse the ordering, such that most important component is shown first.
2020-04-06 23:43:44 Starting - Starting the training job...
2020-04-06 23:43:45 Starting - Launching requested ML instances......
2020-04-06 23:44:45 Starting - Preparing the instances for training......
2020-04-06 23:46:05 Downloading - Downloading input data...
2020-04-06 23:46:40 Training - Training image download completed. Training in progress..
...
2020-04-06 23:46:52 Uploading - Uploading generated training model
2020-04-06 23:46:52 Completed - Training job completed
Training seconds: 47
Billable seconds: 47
%%time# Also deploy itkmeans_predictor = kmeans.deploy(initial_instance_count=1, instance_type='ml.t2.medium')
-----------------------------!CPU times: user 459 ms, sys: 20.4 ms, total: 479 ms
Wall time: 14min 36s
Step 6 Clustering
Now we can perform clustering and explore the result of clustering.
Let's see which cluster is each data point assigned to. We can simply randomly select few data points by indices and check their cluster information using the same indices.
import randomfor i inrange(3): data_index = random.randint(0,len(train_data_np))print('County: {}\n'.format(pca_transformed_census_df.index[data_index]))print(cluster_info[data_index])
We can't visualize a 7-dimensional centroid in Cartesian space, but we can plot a heatmap of the centroids and their location in transformed feature space.
plt.figure(figsize=(12,9))ax = seaborn.heatmap(centroid_df.T, cmap='YlGnBu')ax.set_xlabel('Cluster Number')ax.set_title('Attribute Value by Centroid')plt.yticks(fontsize=16)plt.xticks(fontsize=16)plt.show()
6.4 Examine the Grouping
Finally we should map the labels back to the census transformed DataFrame to understand the grouping of different counties.