Links

Population Segmentation with PCA and KMeans

We are deploying two unsupervised algorithms to perform population segmentation on US census data.
Using principal component analysis (PCA) to reduce the dimensionality of the original census data. Then apply k-means clustering to assign each US county to a particular cluster based on where a county lies in component space. This allows us to observe counties that are similiar to each other in socialeconomic terms.
import pandas as pd
import numpy as np
import os
import io
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
import boto3
import sagemaker

Step 1 Load Data from S3

data_bucket = 'aws-ml-blog-sagemaker-census-segmentation'
s3_client = boto3.client('s3')
obj_list=s3_client.list_objects(Bucket=data_bucket)
keys=[]
for contents in obj_list['Contents']:
keys.append(contents['Key'])
# We should only get one key, which is the CSV file we want.
if len(keys) != 1:
raise RuntimeError('received unexpected number of keys from {}'.format(data_bucket))
data_object = s3_client.get_object(Bucket=data_bucket, Key=keys[0])
data_body = data_object["Body"].read() # in Bytes
data_stream = io.BytesIO(data_body)
census_df = pd.read_csv(data_stream, header=0, delimiter=",")
display(census_df.head())
Text
CensusId
State
County
TotalPop
Men
Women
Hispanic
White
Black
Native
...
Walk
OtherTransp
WorkAtHome
MeanCommute
Employed
PrivateWork
PublicWork
SelfEmployed
FamilyWork
Unemployment
0
1001
Alabama
Autauga
55221
26745
28476
2.6
75.8
18.5
0.4
...
0.5
1.3
1.8
26.5
23986
73.6
20.9
5.5
0.0
7.6
1
1003
Alabama
Baldwin
195121
95314
99807
4.5
83.1
9.5
0.6
...
1.0
1.4
3.9
26.4
85953
81.5
12.3
5.8
0.4
7.5
2
1005
Alabama
Barbour
26932
14497
12435
4.6
46.2
46.7
0.2
...
1.8
1.5
1.6
24.1
8597
71.8
20.8
7.3
0.1
17.6
3
1007
Alabama
Bibb
22604
12073
10531
2.2
74.5
21.4
0.4
...
0.6
1.5
0.7
28.8
8294
76.8
16.1
6.7
0.4
8.3
4
1009
Alabama
Blount
57710
28512
29198
8.6
87.9
1.5
0.3
...
0.9
0.4
2.3
34.9
22189
82.0
13.5
4.2
0.4
7.7
5 rows × 37 columns

Step 2 Explore & Clean Data

# Drop rows with missing data
census_df = census_df.dropna(axis=0, how='any')
# Re-index
census_df.index = census_df['State'] + '-' + census_df['County']
# Drop useless columns
census_df = census_df.drop(columns=['CensusId', 'State', 'County'])
display(census_df.head())
Text
TotalPop
Men
Women
Hispanic
White
Black
Native
Asian
Pacific
Citizen
...
Walk
OtherTransp
WorkAtHome
MeanCommute
Employed
PrivateWork
PublicWork
SelfEmployed
FamilyWork
Unemployment
Alabama-Autauga
55221
26745
28476
2.6
75.8
18.5
0.4
1.0
0.0
40725
...
0.5
1.3
1.8
26.5
23986
73.6
20.9
5.5
0.0
7.6
Alabama-Baldwin
195121
95314
99807
4.5
83.1
9.5
0.6
0.7
0.0
147695
...
1.0
1.4
3.9
26.4
85953
81.5
12.3
5.8
0.4
7.5
Alabama-Barbour
26932
14497
12435
4.6
46.2
46.7
0.2
0.4
0.0
20714
...
1.8
1.5
1.6
24.1
8597
71.8
20.8
7.3
0.1
17.6
Alabama-Bibb
22604
12073
10531
2.2
74.5
21.4
0.4
0.1
0.0
17495
...
0.6
1.5
0.7
28.8
8294
76.8
16.1
6.7
0.4
8.3
Alabama-Blount
57710
28512
29198
8.6
87.9
1.5
0.3
0.1
0.0
42345
...
0.9
0.4
2.3
34.9
22189
82.0
13.5
4.2
0.4
7.7
5 rows × 34 columns
feature_list = census_df.columns.values
print(feature_list)
['TotalPop' 'Men' 'Women' 'Hispanic' 'White' 'Black' 'Native' 'Asian'
'Pacific' 'Citizen' 'Income' 'IncomeErr' 'IncomePerCap' 'IncomePerCapErr'
'Poverty' 'ChildPoverty' 'Professional' 'Service' 'Office' 'Construction'
'Production' 'Drive' 'Carpool' 'Transit' 'Walk' 'OtherTransp'
'WorkAtHome' 'MeanCommute' 'Employed' 'PrivateWork' 'PublicWork'
'SelfEmployed' 'FamilyWork' 'Unemployment']

2.1 Visualize the Data

Use Histogram to plot the distribution of data by features.
histogram_features = ['Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr']
for column_name in histogram_features:
ax=plt.subplots(figsize=(6,3))
ax = plt.hist(census_df[column_name], bins=40)
plt.title('Histogram of {}'.format(column_name), fontsize=12)
plt.show()
png
png
png
png

2.2 Normalize the Data

To get a fair comparison between values, we need to normalize the data to [0, 1].
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
norm_census_df = pd.DataFrame(scaler.fit_transform(census_df.astype(float)))
norm_census_df.columns = census_df.columns
norm_census_df.index = census_df.index
display(norm_census_df.head())
Text
TotalPop
Men
Women
Hispanic
White
Black
Native
Asian
Pacific
Citizen
...
Walk
OtherTransp
WorkAtHome
MeanCommute
Employed
PrivateWork
PublicWork
SelfEmployed
FamilyWork
Unemployment
Alabama-Autauga
0.005475
0.005381
0.005566
0.026026
0.759519
0.215367
0.004343
0.024038
0.0
0.006702
...
0.007022
0.033248
0.048387
0.552430
0.005139
0.750000
0.250000
0.150273
0.000000
0.208219
Alabama-Baldwin
0.019411
0.019246
0.019572
0.045045
0.832665
0.110594
0.006515
0.016827
0.0
0.024393
...
0.014045
0.035806
0.104839
0.549872
0.018507
0.884354
0.107616
0.158470
0.040816
0.205479
Alabama-Barbour
0.002656
0.002904
0.002416
0.046046
0.462926
0.543655
0.002172
0.009615
0.0
0.003393
...
0.025281
0.038363
0.043011
0.491049
0.001819
0.719388
0.248344
0.199454
0.010204
0.482192
Alabama-Bibb
0.002225
0.002414
0.002042
0.022022
0.746493
0.249127
0.004343
0.002404
0.0
0.002860
...
0.008427
0.038363
0.018817
0.611253
0.001754
0.804422
0.170530
0.183060
0.040816
0.227397
Alabama-Blount
0.005722
0.005738
0.005707
0.086086
0.880762
0.017462
0.003257
0.002404
0.0
0.006970
...
0.012640
0.010230
0.061828
0.767263
0.004751
0.892857
0.127483
0.114754
0.040816
0.210959
5 rows × 34 columns

Step 3 Train PCA with SageMaker

Now we can apply PCA to perform dimensionality reduction. We will use SageMaker's builtin PCA algorithm.

3.1 Prepare the Model

from sagemaker import get_execution_role
session = sagemaker.Session(default_bucket='machine-learning-case-studies')
role = get_execution_role()
session_bucket = session.default_bucket()
print('Execution role: {} \nSession Bucket: {}'.format(role, session_bucket))
Execution role: arn:aws:iam::171758673694:role/service-role/AmazonSageMaker-ExecutionRole-20200315T122350
Session Bucket: machine-learning-case-studies
prefix = 'population-segmentation'
model_output_path='s3://{}/{}/'.format(session_bucket, prefix)
print('Training artifcats will be uploaded to {}'.format(model_output_path))
Training artifcats will be uploaded to s3://machine-learning-case-studies/population-segmentation/

3.2 Train the Model

%%time
from sagemaker import PCA
NUM_COMPONENTS = len(feature_list) - 1
# Define the model
pca = PCA(role=role,
train_instance_count=1,
train_instance_type='ml.c4.xlarge',
output_path=model_output_path,
num_components=NUM_COMPONENTS,
sagemaker_session=session)
# Prepare data by converting into RecordSet, this will be done by the
# model.
train_data_np = norm_census_df.values.astype('float32')
train_data_record_set = pca.record_set(train_data_np)
# Train the model
from time import localtime, strftime
job_name = strftime("population-segmentation-pca-%Y-%m-%d-%H-%M-%S", localtime())
pca.fit(train_data_record_set, job_name=job_name)
2020-04-06 21:56:07 Starting - Starting the training job...
2020-04-06 21:56:08 Starting - Launching requested ML instances...
2020-04-06 21:57:06 Starting - Preparing the instances for training......
2020-04-06 21:58:01 Downloading - Downloading input data...
2020-04-06 21:58:28 Training - Downloading the training image..
...
Training seconds: 58
Billable seconds: 58
CPU times: user 498 ms, sys: 37.3 ms, total: 535 ms
Wall time: 3min 12s

3.3 Load Model Artifacts (w/o using Predict)

model_key = os.path.join(prefix, job_name, 'output/model.tar.gz')
print('Loading artifacts from {}'.format(model_key))
boto3.resource('s3').Bucket(session_bucket).download_file(model_key, 'model.tar.gz')
os.system('tar -zxvf model.tar.gz')
os.system('unzip model_algo-1')
Loading artifacts from population-segmentation/population-segmentation-pca-2020-04-06-21-56-07/output/model.tar.gz
2304
Many of the Amazon SageMaker algorithms use MXNet for computational speed, including PCA, and so the model artifacts are stored as an array. After the model is unzipped and decompressed, we can load the array using MXNet.
import mxnet as mx
import pprint
pca_model_params = mx.ndarray.load('model_algo-1')
pprint.pprint(pca_model_params)
{'mean':
[[0.00988273 0.00986636 0.00989863 0.11017046 0.7560245 0.10094159
0.0186819 0.02940491 0.0064698 0.01154038 0.31539047 0.1222766
0.3030056 0.08220861 0.256217 0.2964254 0.28914267 0.40191284
0.57868284 0.2854676 0.28294644 0.82774544 0.34378946 0.01576072
0.04649627 0.04115358 0.12442778 0.47014 0.00980645 0.7608103
0.19442631 0.21674445 0.0294168 0.22177474]]
<NDArray 1x34 @cpu(0)>,
's':
[1.7896362e-02 3.0864021e-02 3.2130770e-02 3.5486195e-02 9.4831578e-02
1.2699370e-01 4.0288666e-01 1.4084760e+00 1.5100485e+00 1.5957943e+00
1.7783760e+00 2.1662524e+00 2.2966361e+00 2.3856051e+00 2.6954880e+00
2.8067985e+00 3.0175958e+00 3.3952675e+00 3.5731301e+00 3.6966958e+00
4.1890211e+00 4.3457499e+00 4.5410376e+00 5.0189657e+00 5.5786467e+00
5.9809699e+00 6.3925138e+00 7.6952214e+00 7.9913125e+00 1.0180052e+01
1.1718245e+01 1.3035975e+01 1.9592180e+01]
<NDArray 33 @cpu(0)>,
'v':
[[ 2.46869749e-03 2.56468095e-02 2.50773830e-03 ... -7.63925165e-02
1.59879066e-02 5.04589686e-03]
[-2.80601848e-02 -6.86634064e-01 -1.96283013e-02 ... -7.59587288e-02
1.57304872e-02 4.95312130e-03]
[ 3.25766727e-02 7.17300594e-01 2.40726061e-02 ... -7.68136829e-02
1.62378680e-02 5.13597298e-03]
...
[ 1.12151138e-01 -1.17030945e-02 -2.88011521e-01 ... 1.39890045e-01
-3.09406728e-01 -6.34506866e-02]
[ 2.99992133e-02 -3.13433539e-03 -7.63589665e-02 ... 4.17341813e-02
-7.06735924e-02 -1.42857227e-02]
[ 7.33537527e-05 3.01008171e-04 -8.00925500e-06 ... 6.97060227e-02
1.20169498e-01 2.33626723e-01]]
<NDArray 34x33 @cpu(0)>}

3.4 PCA Model Attributes

Three types of model attributes are contained within the PCA model.
  • mean: The mean that was subtracted from a component in order to center it.
  • v: The makeup of the principal components; (same as ‘components_’ in an sklearn PCA model).
  • s: The singular values of the components for the PCA transformation.
The singular values do not exactly give the % variance from the original feature space, but can give the % variance from the projected feature space.

Explained Variance

From s, we can get an approximation of the data variance that is covered in the first n principal components. The approximate explained variance is given by the formula: the sum of squared s values for all top n components over the sum over squared s values for all components:
\begin{equation*} \frac{\sum_{n}^{ } s_n^2}{\sum s^2} \end{equation*}
From v, we can learn more about the combinations of original features that make up each principal component.
# Get selected params
s=pd.DataFrame(pca_model_params['s'].asnumpy())
v=pd.DataFrame(pca_model_params['v'].asnumpy())
# The top principal components are actually at the end of the data frame.
# Ordered by least significance to most sigificance.
# Let's look at top 5
display(s.iloc[NUM_COMPONENTS-5:, :])
Text
0
28
7.991313
29
10.180052
30
11.718245
31
13.035975
32
19.592180
def explained_variance(s, n_top_components):
'''
:param s: A dataframe of singular values for top components;
the top value is in the last row.
:param n_top_components: An integer, the number of top components to use.
:return: The expected data variance covered by the n_top_components.
'''
start_idx = len(s) - n_top_components
denom = np.square(s).sum()[0]
numer = np.square(s.iloc[start_idx:, :]).sum()[0]
return numer/denom
for n in range(1, 10):
print('Explained variance with {} top components: {}'.format(n, explained_variance(s, n)))
Explained variance with 1 top components: 0.32098713517189026
Explained variance with 2 top components: 0.46309205889701843
Explained variance with 3 top components: 0.5779199004173279
Explained variance with 4 top components: 0.6645806431770325
Explained variance with 5 top components: 0.7179827094078064
Explained variance with 6 top components: 0.7675008773803711
Explained variance with 7 top components: 0.8016724586486816
Explained variance with 8 top components: 0.8315857648849487
Explained variance with 9 top components: 0.8576101660728455

3.5 Examine Component Makeup

We can now examine the makeup of each PCA component based on the weights of the original features that are included in the component.
import seaborn
def display_component_makeup(v, feature_list, component_idx, num_weights=10):
component = v.iloc[:, NUM_COMPONENTS-component_idx] # Reversed
component_values = np.squeeze(component.values) # Remove redundant arrays
component_df = pd.DataFrame(data=list(zip(component_values, feature_list)),
columns=['weights', 'features'])
# Create a new column for absolute weight, so we can sort it
component_df['abs_weights'] = component_df['weights'].apply(lambda x: np.abs(x))
sorted_weight_data = component_df.sort_values('abs_weights', ascending=False).head(num_weights)
ax = plt.subplots(figsize=(10, 6))
ax = seaborn.barplot(data=sorted_weight_data,
x='weights',
y='features')
ax.set_title('PCA Component Makeup of Component {}'.format(component_idx))
plt.show()
Each component is a linearly independent vector. The component represents a new basis in a projected component space. When we compare the feature weights, we are effectively asking, for this new basis vector, what is its corrrelation to the original feature space? For example, component 1 is negatively correlated with White percentage in the census data. Component 2 is positvely correlated with PrivateWork percentage in census data.
for idx in range(1, 3):
display_component_makeup(v, feature_list=norm_census_df.columns.values,
component_idx=idx,
num_weights=33)
png
png

Step 4 Deploy PCA with SageMaker

Now we can deploy the PCA model and use it as an endpoint without having to dig into the model params to transform an input.
%%time
pca_predictor = pca.deploy(initial_instance_count=1,
instance_type='ml.t2.medium')
-------------!CPU times: user 238 ms, sys: 15.3 ms, total: 254 ms
Wall time: 6min 33s
We don't need to use RecordSet anymore once the model is deployed. It will simply accept a numpy array to yield principal components.
pca_result = pca_predictor.predict(train_data_np)
SageMaker PCA returns a list of protobuf Record message, same length as the training data which is 3218 in this case. The protobuf Record message has the following format.
message Record {
// Map from the name of the feature to the value.
//
// For vectors and libsvm-like datasets,
// a single feature with the name `values`
// should be specified.
map<string, Value> features = 1;
// An optional set of labels for this record.
// Similar to the features field above, the key used for
// generic scalar / vector labels should be 'values'.
map<string, Value> label = 2;
// A unique identifier for this record in the dataset.
//
// Whilst not necessary, this allows better
// debugging where there are data issues.
//
// This is not used by the algorithm directly.
optional string uid = 3;
// Textual metadata describing the record.
//
// This may include JSON-serialized information
// about the source of the record.
//
// This is not used by the algorithm directly.
optional string metadata = 4;
// An optional serialized JSON object that allows per-record
// hyper-parameters/configuration/other information to be set.
//
// The meaning/interpretation of this field is defined by
// the algorithm author and may not be supported.
//
// This is used to pass additional inference configuration
// when batch inference is used (e.g. types of scores to return).
optional string configuration = 5;
}
Essentially each data point is now projected onto a new component space. We can retrieve the projection by using the following syntax.
pca_result[0].label['projection'].float32_tensor.values
[0.0002009272575378418, 0.0002455431967973709, -0.0005782842636108398, -0.0007815659046173096, -0.00041911262087523937, -0.0005133943632245064, -0.0011316537857055664, 0.0017268601804971695, -0.005361668765544891, -0.009066537022590637, -0.008141040802001953, -0.004735097289085388, -0.00716288760304451, 0.0003725700080394745, -0.01208949089050293, 0.02134685218334198, 0.0009293854236602783, 0.002417147159576416, -0.0034637749195098877, 0.01794189214706421, -0.01639425754547119, 0.06260128319263458, 0.06637358665466309, 0.002479255199432373, 0.10011336207389832, -0.1136140376329422, 0.02589476853609085, 0.04045158624649048, -0.01082391943782568, 0.1204797774553299, -0.0883558839559555, 0.16052711009979248, -0.06027412414550781]
Now we can transform the PCA result into a DataFrame that we can work with.
def create_pca_transformed_df(original_data, pca_transformed_data, num_top_components):
"""
Return a DataFrame of data points with component features. The DF should be indexed by State-County and
contain component values.
"""
df = pd.DataFrame()
for record in pca_transformed_data:
projection_values = record.label['projection'].float32_tensor.values
df = df.append([projection_values])
df.index = original_data.index
df = df.iloc[:, NUM_COMPONENTS-num_top_components:]
return df.iloc[:, ::-1] # Reverse the ordering, such that most important component is shown first.
# Manually pick top 7 components.
pca_transformed_census_df = create_pca_transformed_df(norm_census_df, pca_result, 7)
pca_transformed_census_df.columns = ['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7']
pca_transformed_census_df.head()
Text
c1
c2
c3
c4
c5
c6