Population Segmentation with PCA and KMeans
We will use two unsupervised algorithms to perform population segmentation on US census data.
First, we use principal component analysis (PCA) to reduce the dimensionality of the original census data. Then we apply k-means clustering to assign each US county to a cluster based on where the county lies in component space. This lets us identify counties that are similar to each other in socioeconomic terms.
import pandas as pd
import numpy as np
import os
import io
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
import boto3
import sagemaker
data_bucket = 'aws-ml-blog-sagemaker-census-segmentation'
s3_client = boto3.client('s3')
obj_list=s3_client.list_objects(Bucket=data_bucket)
keys=[]
for contents in obj_list['Contents']:
    keys.append(contents['Key'])
# We should only get one key, which is the CSV file we want.
if len(keys) != 1:
    raise RuntimeError('received unexpected number of keys from {}'.format(data_bucket))
data_object = s3_client.get_object(Bucket=data_bucket, Key=keys[0])
data_body = data_object["Body"].read() # in Bytes
data_stream = io.BytesIO(data_body)
census_df = pd.read_csv(data_stream, header=0, delimiter=",")
display(census_df.head())
 | CensusId | State | County | TotalPop | Men | Women | Hispanic | White | Black | Native | ... | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1001 | Alabama | Autauga | 55221 | 26745 | 28476 | 2.6 | 75.8 | 18.5 | 0.4 | ... | 0.5 | 1.3 | 1.8 | 26.5 | 23986 | 73.6 | 20.9 | 5.5 | 0.0 | 7.6 |
1 | 1003 | Alabama | Baldwin | 195121 | 95314 | 99807 | 4.5 | 83.1 | 9.5 | 0.6 | ... | 1.0 | 1.4 | 3.9 | 26.4 | 85953 | 81.5 | 12.3 | 5.8 | 0.4 | 7.5 |
2 | 1005 | Alabama | Barbour | 26932 | 14497 | 12435 | 4.6 | 46.2 | 46.7 | 0.2 | ... | 1.8 | 1.5 | 1.6 | 24.1 | 8597 | 71.8 | 20.8 | 7.3 | 0.1 | 17.6 |
3 | 1007 | Alabama | Bibb | 22604 | 12073 | 10531 | 2.2 | 74.5 | 21.4 | 0.4 | ... | 0.6 | 1.5 | 0.7 | 28.8 | 8294 | 76.8 | 16.1 | 6.7 | 0.4 | 8.3 |
4 | 1009 | Alabama | Blount | 57710 | 28512 | 29198 | 8.6 | 87.9 | 1.5 | 0.3 | ... | 0.9 | 0.4 | 2.3 | 34.9 | 22189 | 82.0 | 13.5 | 4.2 | 0.4 | 7.7 |
5 rows × 37 columns
# Drop rows with missing data
census_df = census_df.dropna(axis=0, how='any')
# Re-index
census_df.index = census_df['State'] + '-' + census_df['County']
# Drop identifier columns that will not be used as features
census_df = census_df.drop(columns=['CensusId', 'State', 'County'])
display(census_df.head())
 | TotalPop | Men | Women | Hispanic | White | Black | Native | Asian | Pacific | Citizen | ... | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Alabama-Autauga | 55221 | 26745 | 28476 | 2.6 | 75.8 | 18.5 | 0.4 | 1.0 | 0.0 | 40725 | ... | 0.5 | 1.3 | 1.8 | 26.5 | 23986 | 73.6 | 20.9 | 5.5 | 0.0 | 7.6 |
Alabama-Baldwin | 195121 | 95314 | 99807 | 4.5 | 83.1 | 9.5 | 0.6 | 0.7 | 0.0 | 147695 | ... | 1.0 | 1.4 | 3.9 | 26.4 | 85953 | 81.5 | 12.3 | 5.8 | 0.4 | 7.5 |
Alabama-Barbour | 26932 | 14497 | 12435 | 4.6 | 46.2 | 46.7 | 0.2 | 0.4 | 0.0 | 20714 | ... | 1.8 | 1.5 | 1.6 | 24.1 | 8597 | 71.8 | 20.8 | 7.3 | 0.1 | 17.6 |
Alabama-Bibb | 22604 | 12073 | 10531 | 2.2 | 74.5 | 21.4 | 0.4 | 0.1 | 0.0 | 17495 | ... | 0.6 | 1.5 | 0.7 | 28.8 | 8294 | 76.8 | 16.1 | 6.7 | 0.4 | 8.3 |
Alabama-Blount | 57710 | 28512 | 29198 | 8.6 | 87.9 | 1.5 | 0.3 | 0.1 | 0.0 | 42345 | ... | 0.9 | 0.4 | 2.3 | 34.9 | 22189 | 82.0 | 13.5 | 4.2 | 0.4 | 7.7 |
5 rows × 34 columns
feature_list = census_df.columns.values
print(feature_list)
['TotalPop' 'Men' 'Women' 'Hispanic' 'White' 'Black' 'Native' 'Asian'
'Pacific' 'Citizen' 'Income' 'IncomeErr' 'IncomePerCap' 'IncomePerCapErr'
'Poverty' 'ChildPoverty' 'Professional' 'Service' 'Office' 'Construction'
'Production' 'Drive' 'Carpool' 'Transit' 'Walk' 'OtherTransp'
'WorkAtHome' 'MeanCommute' 'Employed' 'PrivateWork' 'PublicWork'
'SelfEmployed' 'FamilyWork' 'Unemployment']
Use histograms to plot the distribution of the data for selected features.
histogram_features = ['Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr']
for column_name in histogram_features:
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.hist(census_df[column_name], bins=40)
    ax.set_title('Histogram of {}'.format(column_name), fontsize=12)
    plt.show()

(Output: four histograms showing the distributions of Income, IncomeErr, IncomePerCap, and IncomePerCapErr.)
To compare features on a common scale, we normalize each feature to the range [0, 1].
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
norm_census_df = pd.DataFrame(scaler.fit_transform(census_df.astype(float)))
norm_census_df.columns = census_df.columns
norm_census_df.index = census_df.index
display(norm_census_df.head())
 | TotalPop | Men | Women | Hispanic | White | Black | Native | Asian | Pacific | Citizen | ... | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Alabama-Autauga | 0.005475 | 0.005381 | 0.005566 | 0.026026 | 0.759519 | 0.215367 | 0.004343 | 0.024038 | 0.0 | 0.006702 | ... | 0.007022 | 0.033248 | 0.048387 | 0.552430 | 0.005139 | 0.750000 | 0.250000 | 0.150273 | 0.000000 | 0.208219 |
Alabama-Baldwin | 0.019411 | 0.019246 | 0.019572 | 0.045045 | 0.832665 | 0.110594 | 0.006515 | 0.016827 | 0.0 | 0.024393 | ... | 0.014045 | 0.035806 | 0.104839 | 0.549872 | 0.018507 | 0.884354 | 0.107616 | 0.158470 | 0.040816 | 0.205479 |
Alabama-Barbour | 0.002656 | 0.002904 | 0.002416 | 0.046046 | 0.462926 | 0.543655 | 0.002172 | 0.009615 | 0.0 | 0.003393 | ... | 0.025281 | 0.038363 | 0.043011 | 0.491049 | 0.001819 | 0.719388 | 0.248344 | 0.199454 | 0.010204 | 0.482192 |
Alabama-Bibb | 0.002225 | 0.002414 | 0.002042 | 0.022022 | 0.746493 | 0.249127 | 0.004343 | 0.002404 | 0.0 | 0.002860 | ... | 0.008427 | 0.038363 | 0.018817 | 0.611253 | 0.001754 | 0.804422 | 0.170530 | 0.183060 | 0.040816 | 0.227397 |
Alabama-Blount | 0.005722 | 0.005738 | 0.005707 | 0.086086 | 0.880762 | 0.017462 | 0.003257 | 0.002404 | 0.0 | 0.006970 | ... | 0.012640 | 0.010230 | 0.061828 | 0.767263 | 0.004751 | 0.892857 | 0.127483 | 0.114754 | 0.040816 | 0.210959 |
5 rows × 34 columns
Now we can apply PCA to perform dimensionality reduction. We will use SageMaker's built-in PCA algorithm.
from sagemaker import get_execution_role
session = sagemaker.Session(default_bucket='machine-learning-case-studies')
role = get_execution_role()
session_bucket = session.default_bucket()
print('Execution role: {} \nSession Bucket: {}'.format(role, session_bucket))
Execution role: arn:aws:iam::171758673694:role/service-role/AmazonSageMaker-ExecutionRole-20200315T122350
Session Bucket: machine-learning-case-studies
prefix = 'population-segmentation'
model_output_path='s3://{}/{}/'.format(session_bucket, prefix)
print('Training artifacts will be uploaded to {}'.format(model_output_path))
Training artifacts will be uploaded to s3://machine-learning-case-studies/population-segmentation/
%%time
from sagemaker import PCA
NUM_COMPONENTS = len(feature_list) - 1
# Define the model
pca = PCA(role=role,
          train_instance_count=1,
          train_instance_type='ml.c4.xlarge',
          output_path=model_output_path,
          num_components=NUM_COMPONENTS,
          sagemaker_session=session)
# Prepare the data by converting it into a RecordSet; the model object
# provides a helper method for this.
train_data_np = norm_census_df.values.astype('float32')
train_data_record_set = pca.record_set(train_data_np)
# Train the model
from time import localtime, strftime
job_name = strftime("population-segmentation-pca-%Y-%m-%d-%H-%M-%S", localtime())
pca.fit(train_data_record_set, job_name=job_name)
2020-04-06 21:56:07 Starting - Starting the training job...
2020-04-06 21:56:08 Starting - Launching requested ML instances...
2020-04-06 21:57:06 Starting - Preparing the instances for training......
2020-04-06 21:58:01 Downloading - Downloading input data...
2020-04-06 21:58:28 Training - Downloading the training image..
...
Training seconds: 58
Billable seconds: 58
CPU times: user 498 ms, sys: 37.3 ms, total: 535 ms
Wall time: 3min 12s
model_key = os.path.join(prefix, job_name, 'output/model.tar.gz')
print('Loading artifacts from {}'.format(model_key))
boto3.resource('s3').Bucket(session_bucket).download_file(model_key, 'model.tar.gz')
os.system('tar -zxvf model.tar.gz')
os.system('unzip model_algo-1')
Loading artifacts from population-segmentation/population-segmentation-pca-2020-04-06-21-56-07/output/model.tar.gz
2304
Many of the built-in Amazon SageMaker algorithms, including PCA, use MXNet for computational speed, so the model artifacts are stored as MXNet arrays. After the archive is downloaded and decompressed, we can load the arrays using MXNet.
import mxnet as mx
import pprint
pca_model_params = mx.ndarray.load('model_algo-1')
pprint.pprint(pca_model_params)
{'mean':
[[0.00988273 0.00986636 0.00989863 0.11017046 0.7560245 0.10094159
0.0186819 0.02940491 0.0064698 0.01154038 0.31539047 0.1222766
0.3030056 0.08220861 0.256217 0.2964254 0.28914267 0.40191284
0.57868284 0.2854676 0.28294644 0.82774544 0.34378946 0.01576072
0.04649627 0.04115358 0.12442778 0.47014 0.00980645 0.7608103
0.19442631 0.21674445 0.0294168 0.22177474]]
<NDArray 1x34 @cpu(0)>,
's':
[1.7896362e-02 3.0864021e-02 3.2130770e-02 3.5486195e-02 9.4831578e-02
1.2699370e-01 4.0288666e-01 1.4084760e+00 1.5100485e+00 1.5957943e+00
1.7783760e+00 2.1662524e+00 2.2966361e+00 2.3856051e+00 2.6954880e+00
2.8067985e+00 3.0175958e+00 3.3952675e+00 3.5731301e+00 3.6966958e+00
4.1890211e+00 4.3457499e+00 4.5410376e+00 5.0189657e+00 5.5786467e+00
5.9809699e+00 6.3925138e+00 7.6952214e+00 7.9913125e+00 1.0180052e+01
1.1718245e+01 1.3035975e+01 1.9592180e+01]
<NDArray 33 @cpu(0)>,
'v':
[[ 2.46869749e-03 2.56468095e-02 2.50773830e-03 ... -7.63925165e-02
1.59879066e-02 5.04589686e-03]
[-2.80601848e-02 -6.86634064e-01 -1.96283013e-02 ... -7.59587288e-02
1.57304872e-02 4.95312130e-03]
[ 3.25766727e-02 7.17300594e-01 2.40726061e-02 ... -7.68136829e-02
1.62378680e-02 5.13597298e-03]
...
[ 1.12151138e-01 -1.17030945e-02 -2.88011521e-01 ... 1.39890045e-01
-3.09406728e-01 -6.34506866e-02]
[ 2.99992133e-02 -3.13433539e-03 -7.63589665e-02 ... 4.17341813e-02
-7.06735924e-02 -1.42857227e-02]
[ 7.33537527e-05 3.01008171e-04 -8.00925500e-06 ... 6.97060227e-02
1.20169498e-01 2.33626723e-01]]
<NDArray 34x33 @cpu(0)>}
Three types of model attributes are contained within the PCA model.
- mean: The per-feature mean that was subtracted from the data to center it.
- v: The makeup of the principal components (same as 'components_' in an sklearn PCA model).
- s: The singular values of the components for the PCA transformation.
The singular values do not exactly give the % variance from the original feature space, but can give the % variance from the projected feature space.
From s, we can get an approximation of the data variance that is covered in the first n principal components. The approximate explained variance is given by the sum of the squared s values for the top n components over the sum of the squared s values for all components:
\begin{equation*} \frac{\sum_{\text{top } n} s_i^2}{\sum_{i} s_i^2} \end{equation*}
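As a quick sanity check, here is a minimal sketch of this calculation using plain NumPy on the pca_model_params loaded above (the helper name approx_explained_variance and the choice of n=5 are illustrative, not part of the original notebook). It relies on the singular values in s being sorted in ascending order, so the top components sit at the end of the array.
# Illustrative helper: approximate fraction of variance covered by the top-n components.
s_np = pca_model_params['s'].asnumpy()
def approx_explained_variance(s_np, n):
    # Sum of squared singular values of the top-n components over the
    # sum of squared singular values of all components.
    return (s_np[-n:] ** 2).sum() / (s_np ** 2).sum()
print(approx_explained_variance(s_np, n=5))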
From v, we can learn more about the combinations of original features that make up each principal component.
# Get selected params
s=pd.DataFrame(pca_model_params['s'].asnumpy())
v=pd.DataFrame(pca_model_params['v'].asnumpy())
# The top principal components are actually at the end of the data frame,
# ordered from least significant to most significant.
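For instance, here is a minimal sketch of how one might pull the top components out of v and see which original features carry the most weight in the most significant component. The variable names and the choice of N_TOP_COMPONENTS = 7 are illustrative, and it assumes the rows of v follow the same feature order as norm_census_df.columns.
N_TOP_COMPONENTS = 7  # illustrative choice
# The components are ordered from least to most significant, so the top
# components are the last N_TOP_COMPONENTS columns of v.
top_components = v.iloc[:, -N_TOP_COMPONENTS:].copy()
top_components.index = norm_census_df.columns  # label rows with the original feature names
# Inspect the most significant component: features sorted by absolute weight.
first_component = top_components.iloc[:, -1]
print(first_component.reindex(first_component.abs().sort_values(ascending=False).index).head(10))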