Population Segmentation with PCA and KMeans
We are deploying two unsupervised algorithms to perform population segmentation on US census data.
First, we use principal component analysis (PCA) to reduce the dimensionality of the original census data. Then we apply k-means clustering to assign each US county to a cluster based on where the county lies in component space. This allows us to identify counties that are similar to each other in socioeconomic terms.
Step 1 Load Data from S3
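A minimal sketch of pulling the raw CSV from S3 into a pandas DataFrame; the bucket and object key here are assumptions, so substitute your own locations:

```python
import boto3
import pandas as pd

# Assumed bucket and key for the census CSV; replace with your own.
s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket='aws-ml-blog-sagemaker-census-segmentation',
                           Key='Census_Data_with_Anomalies.csv')

# Stream the object body straight into pandas.
counties = pd.read_csv(obj['Body'])
counties.head()
```

The first five rows of the resulting DataFrame: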
 | CensusId | State | County | TotalPop | Men | Women | Hispanic | White | Black | Native | ... | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1001 | Alabama | Autauga | 55221 | 26745 | 28476 | 2.6 | 75.8 | 18.5 | 0.4 | ... | 0.5 | 1.3 | 1.8 | 26.5 | 23986 | 73.6 | 20.9 | 5.5 | 0.0 | 7.6 |
1 | 1003 | Alabama | Baldwin | 195121 | 95314 | 99807 | 4.5 | 83.1 | 9.5 | 0.6 | ... | 1.0 | 1.4 | 3.9 | 26.4 | 85953 | 81.5 | 12.3 | 5.8 | 0.4 | 7.5 |
2 | 1005 | Alabama | Barbour | 26932 | 14497 | 12435 | 4.6 | 46.2 | 46.7 | 0.2 | ... | 1.8 | 1.5 | 1.6 | 24.1 | 8597 | 71.8 | 20.8 | 7.3 | 0.1 | 17.6 |
3 | 1007 | Alabama | Bibb | 22604 | 12073 | 10531 | 2.2 | 74.5 | 21.4 | 0.4 | ... | 0.6 | 1.5 | 0.7 | 28.8 | 8294 | 76.8 | 16.1 | 6.7 | 0.4 | 8.3 |
4 | 1009 | Alabama | Blount | 57710 | 28512 | 29198 | 8.6 | 87.9 | 1.5 | 0.3 | ... | 0.9 | 0.4 | 2.3 | 34.9 | 22189 | 82.0 | 13.5 | 4.2 | 0.4 | 7.7 |
5 rows × 37 columns
Step 2 Explore & Clean Data
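A minimal cleaning sketch, assuming the DataFrame from Step 1 is named `counties`: drop rows with missing values, index each row by a unique `State-County` string, and drop the non-numeric identifier columns.

```python
# Drop rows with missing values.
counties.dropna(inplace=True)

# Index each county by a unique "State-County" string, then drop the
# identifier columns so only the 34 numeric features remain.
counties.index = counties['State'] + '-' + counties['County']
clean_counties = counties.drop(columns=['CensusId', 'State', 'County'])
clean_counties.head()
```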
 | TotalPop | Men | Women | Hispanic | White | Black | Native | Asian | Pacific | Citizen | ... | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Alabama-Autauga | 55221 | 26745 | 28476 | 2.6 | 75.8 | 18.5 | 0.4 | 1.0 | 0.0 | 40725 | ... | 0.5 | 1.3 | 1.8 | 26.5 | 23986 | 73.6 | 20.9 | 5.5 | 0.0 | 7.6 |
Alabama-Baldwin | 195121 | 95314 | 99807 | 4.5 | 83.1 | 9.5 | 0.6 | 0.7 | 0.0 | 147695 | ... | 1.0 | 1.4 | 3.9 | 26.4 | 85953 | 81.5 | 12.3 | 5.8 | 0.4 | 7.5 |
Alabama-Barbour | 26932 | 14497 | 12435 | 4.6 | 46.2 | 46.7 | 0.2 | 0.4 | 0.0 | 20714 | ... | 1.8 | 1.5 | 1.6 | 24.1 | 8597 | 71.8 | 20.8 | 7.3 | 0.1 | 17.6 |
Alabama-Bibb | 22604 | 12073 | 10531 | 2.2 | 74.5 | 21.4 | 0.4 | 0.1 | 0.0 | 17495 | ... | 0.6 | 1.5 | 0.7 | 28.8 | 8294 | 76.8 | 16.1 | 6.7 | 0.4 | 8.3 |
Alabama-Blount | 57710 | 28512 | 29198 | 8.6 | 87.9 | 1.5 | 0.3 | 0.1 | 0.0 | 42345 | ... | 0.9 | 0.4 | 2.3 | 34.9 | 22189 | 82.0 | 13.5 | 4.2 | 0.4 | 7.7 |
5 rows × 34 columns
2.1 Visualize the Data
Use histograms to plot the distribution of the data, feature by feature.
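A quick sketch with matplotlib; the features chosen here are just examples from the census columns:

```python
import matplotlib.pyplot as plt

# Plot the distribution of a few example features.
for feature in ['MeanCommute', 'Unemployment', 'PrivateWork']:
    clean_counties[feature].hist(bins=30)
    plt.title(f'Distribution of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Number of counties')
    plt.show()
```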
2.2 Normalize the Data
To compare features fairly, we normalize each column to the range [0, 1].
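One way to do this is sklearn's `MinMaxScaler`; a minimal sketch:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Scale every feature to [0, 1] so large-magnitude columns such as
# TotalPop do not dominate the variance- and distance-based methods.
scaler = MinMaxScaler()
counties_scaled = pd.DataFrame(scaler.fit_transform(clean_counties),
                               columns=clean_counties.columns,
                               index=clean_counties.index)
counties_scaled.head()
```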
 | TotalPop | Men | Women | Hispanic | White | Black | Native | Asian | Pacific | Citizen | ... | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Alabama-Autauga | 0.005475 | 0.005381 | 0.005566 | 0.026026 | 0.759519 | 0.215367 | 0.004343 | 0.024038 | 0.0 | 0.006702 | ... | 0.007022 | 0.033248 | 0.048387 | 0.552430 | 0.005139 | 0.750000 | 0.250000 | 0.150273 | 0.000000 | 0.208219 |
Alabama-Baldwin | 0.019411 | 0.019246 | 0.019572 | 0.045045 | 0.832665 | 0.110594 | 0.006515 | 0.016827 | 0.0 | 0.024393 | ... | 0.014045 | 0.035806 | 0.104839 | 0.549872 | 0.018507 | 0.884354 | 0.107616 | 0.158470 | 0.040816 | 0.205479 |
Alabama-Barbour | 0.002656 | 0.002904 | 0.002416 | 0.046046 | 0.462926 | 0.543655 | 0.002172 | 0.009615 | 0.0 | 0.003393 | ... | 0.025281 | 0.038363 | 0.043011 | 0.491049 | 0.001819 | 0.719388 | 0.248344 | 0.199454 | 0.010204 | 0.482192 |
Alabama-Bibb | 0.002225 | 0.002414 | 0.002042 | 0.022022 | 0.746493 | 0.249127 | 0.004343 | 0.002404 | 0.0 | 0.002860 | ... | 0.008427 | 0.038363 | 0.018817 | 0.611253 | 0.001754 | 0.804422 | 0.170530 | 0.183060 | 0.040816 | 0.227397 |
Alabama-Blount | 0.005722 | 0.005738 | 0.005707 | 0.086086 | 0.880762 | 0.017462 | 0.003257 | 0.002404 | 0.0 | 0.006970 | ... | 0.012640 | 0.010230 | 0.061828 | 0.767263 | 0.004751 | 0.892857 | 0.127483 | 0.114754 | 0.040816 | 0.210959 |
5 rows × 34 columns
Step 3 Train PCA with SageMaker
Now we can apply PCA to perform dimensionality reduction. We will use SageMaker's built-in PCA algorithm.
3.1 Prepare the Model
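A sketch of constructing the PCA estimator with the SageMaker Python SDK (v2-style argument names); the instance type and output path are assumptions:

```python
import sagemaker
from sagemaker import PCA, get_execution_role

session = sagemaker.Session()
role = get_execution_role()
bucket = session.default_bucket()

# Ask for 33 components (one fewer than the 34 features) so we can
# later inspect how much variance each component captures.
pca_SM = PCA(role=role,
             instance_count=1,
             instance_type='ml.c4.xlarge',
             output_path=f's3://{bucket}/counties/',
             num_components=33,
             sagemaker_session=session)
```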
3.2 Train the Model
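The built-in algorithm consumes data in protobuf RecordSet form, which `record_set` produces from a float32 numpy array; a minimal sketch:

```python
# Convert the scaled DataFrame to float32 and wrap it as a RecordSet.
train_data_np = counties_scaled.values.astype('float32')
formatted_train_data = pca_SM.record_set(train_data_np)

# Launch the training job.
pca_SM.fit(formatted_train_data)
```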
3.3 Load Model Artifacts (w/o using Predict)
Many Amazon SageMaker algorithms, including PCA, use MXNet for computational speed, so the model artifacts are stored as MXNet arrays. After the artifact is downloaded, decompressed, and unzipped, we can load the arrays using MXNet.
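A sketch of downloading and unpacking the artifact, assuming the output path used above; the artifact is a tarball whose inner file is itself zipped:

```python
import os
import boto3
import mxnet as mx

# Fetch the model artifact written by the training job.
job_name = pca_SM.latest_training_job.name
model_key = f'counties/{job_name}/output/model.tar.gz'
boto3.resource('s3').Bucket(bucket).download_file(model_key, 'model.tar.gz')

# Decompress the tarball, then unzip the inner file.
os.system('tar -zxvf model.tar.gz')
os.system('unzip model_algo-1')

# Load the parameters as a dict of MXNet NDArrays.
pca_model_params = mx.ndarray.load('model_algo-1')
```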
3.4 PCA Model Attributes
Three types of model attributes are contained within the PCA model:

- `mean`: The mean that was subtracted from the input data in order to center it.
- `v`: The makeup of the principal components (the same as `components_` in an sklearn PCA model).
- `s`: The singular values of the components of the PCA transformation.
The singular values do not directly give the percentage of variance explained in the original feature space, but they do give it in the projected component space.
Explained Variance
From `s`, we can get an approximation of the data variance that is covered by the first n principal components. The approximate explained variance is the sum of the squared singular values of the top n components divided by the sum of the squared singular values of all components:

\begin{equation*}
\frac{\sum_{i=1}^{n} s_i^2}{\sum_{i=1}^{N} s_i^2}
\end{equation*}

where N is the total number of components.
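A sketch of this computation; note that SageMaker PCA stores the components in ascending order of importance, so the largest singular values sit at the end of `s`:

```python
s = pca_model_params['s'].asnumpy().flatten()

def explained_variance(s, n_top_components):
    """Approximate fraction of variance captured by the top n components."""
    top = s[-n_top_components:]  # largest singular values are last
    return (top ** 2).sum() / (s ** 2).sum()

print(explained_variance(s, 5))
```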
From v, we can learn more about the combinations of original features that make up each principal component.
The five largest singular values (the components are stored in ascending order, so these sit at indices 28 through 32):

 | s |
---|---|
28 | 7.991313 |
29 | 10.180052 |
30 | 11.718245 |
31 | 13.035975 |
32 | 19.592180 |
3.5 Examine Component Makeup
We can now examine the makeup of each PCA component based on the weights of the original features that are included in the component.
Each component is a linearly independent vector. The component represents a new basis in the projected component space. When we compare the feature weights, we are effectively asking: for this new basis vector, what is its correlation to the original feature space? For example, component 1 is negatively correlated with the `White` percentage in the census data, and component 2 is positively correlated with the `PrivateWork` percentage.
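A sketch of inspecting one component's weights, assuming `pca_model_params` was loaded as above; since components are stored in ascending order of importance, component 1 is the last column of `v`:

```python
import pandas as pd

v = pd.DataFrame(pca_model_params['v'].asnumpy())

def display_component_makeup(component_num):
    # Component 1 is the last column, component 2 the second-to-last, etc.
    weights = pd.Series(v.iloc[:, -component_num].values,
                        index=counties_scaled.columns)
    print(weights.sort_values(ascending=False))

display_component_makeup(1)
```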
Step 4 Deploy PCA with SageMaker
Now we can deploy the PCA model and use it as an endpoint without having to dig into the model params to transform an input.
Once the model is deployed, we no longer need to use `RecordSet`; the endpoint simply accepts a `numpy` array and yields principal components.
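A minimal deploy-and-predict sketch; the endpoint instance type is an assumption:

```python
# Spin up a real-time endpoint backed by the trained PCA model.
pca_predictor = pca_SM.deploy(initial_instance_count=1,
                              instance_type='ml.t2.medium')

# The endpoint accepts a float32 numpy array directly.
train_result = pca_predictor.predict(train_data_np)
```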
SageMaker PCA returns a list of protobuf `Record` messages of the same length as the training data, which is 3218 in this case (one per county). Each `Record` message has the following format.
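Printing a single `Record` shows the structure; the values below are illustrative:

```python
print(train_result[0])
# label {
#   key: "projection"
#   value {
#     float32_tensor {
#       values: -0.060274
#       values: 0.160527
#       ...
#     }
#   }
# }
```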
Essentially, each data point has been projected onto the new component space. We can retrieve the projection using the following syntax.
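For the first data point, for example:

```python
# Projection of the first data point onto all 33 components.
projection_0 = train_result[0].label['projection'].float32_tensor.values
```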
Now we can transform the PCA result into a DataFrame that we can work with.
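A sketch that keeps the top 7 components and labels them c1 through c7; the reversal reflects the ascending storage order noted above:

```python
import pandas as pd

PCA_list = ['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7']

# Take the last 7 values of each projection (the most important
# components) and reverse them so c1 is the top component.
counties_transformed = pd.DataFrame(
    [r.label['projection'].float32_tensor.values[-7:][::-1]
     for r in train_result],
    columns=PCA_list,
    index=counties_scaled.index)
counties_transformed.head()
```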
 | c1 | c2 | c3 | c4 | c5 | c6 | c7 |
---|---|---|---|---|---|---|---|
Alabama-Autauga | -0.060274 | 0.160527 | -0.088356 | 0.120480 | -0.010824 | 0.040452 | 0.025895 |
Alabama-Baldwin | -0.149684 | 0.185969 | -0.145743 | -0.023092 | -0.068677 | 0.051573 | 0.048137 |
Alabama-Barbour | 0.506202 | 0.296662 | 0.146258 | 0.297829 | 0.093111 | -0.065244 | 0.107730 |
Alabama-Bibb | 0.069224 | 0.190861 | 0.224402 | 0.011757 | 0.283526 | 0.017874 | -0.092053 |
Alabama-Blount | -0.091030 | 0.254403 | 0.022714 | -0.193824 | 0.100738 | 0.209945 | -0.005099 |
Step 5 Train KMeans with SageMaker
We will arbitrarily pick 8 clusters for k-means. To train the model, we again need to pass in a `RecordSet`.
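A sketch of the k-means estimator, mirroring the PCA setup; the instance choices are assumptions:

```python
from sagemaker import KMeans

kmeans = KMeans(role=role,
                instance_count=1,
                instance_type='ml.c4.xlarge',
                output_path=f's3://{bucket}/counties/',
                k=8,
                sagemaker_session=session)

# Train on the 7-dimensional component representation of each county.
kmeans_train_data = counties_transformed.values.astype('float32')
kmeans.fit(kmeans.record_set(kmeans_train_data))
```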
Step 6 Clustering
Now we can perform clustering and explore the result of clustering.
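A sketch of deploying the model and assigning every county to its closest cluster:

```python
# Deploy a k-means endpoint and classify all counties.
kmeans_predictor = kmeans.deploy(initial_instance_count=1,
                                 instance_type='ml.t2.medium')
cluster_info = kmeans_predictor.predict(kmeans_train_data)

# Each Record carries a 'closest_cluster' label.
cluster_labels = [int(c.label['closest_cluster'].float32_tensor.values[0])
                  for c in cluster_info]
```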
6.1 Explore the Resultant Clusters
Let's see which cluster each data point is assigned to. We can randomly select a few data points by index and check their cluster assignments using the same indices.
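For example:

```python
import numpy as np

# Spot-check five random counties and their assigned clusters.
indices = np.random.choice(len(cluster_labels), size=5, replace=False)
for i in indices:
    print(counties_transformed.index[i], '->', cluster_labels[i])
```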
Let's take a look at the data point distribution across clusters.
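A quick count per cluster:

```python
import pandas as pd

# Count how many counties landed in each of the 8 clusters.
print(pd.Series(cluster_labels).value_counts())
```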
6.2 Load Model Artifacts (instead of using Predict)
We want to dig a little deeper to understand where the centroids are and what they look like in the 7-dimensional component space.
There is only one set of model parameters contained within the k-means model: the cluster centroid locations in the PCA-transformed component space.
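A sketch of unpacking the k-means artifact the same way as the PCA model; the object key follows the output-path convention assumed earlier:

```python
import os
import boto3
import mxnet as mx
import pandas as pd

# Fetch and unpack the k-means model artifact.
kmeans_job_name = kmeans.latest_training_job.name
model_key = f'counties/{kmeans_job_name}/output/model.tar.gz'
boto3.resource('s3').Bucket(bucket).download_file(model_key, 'kmeans_model.tar.gz')
os.system('tar -zxvf kmeans_model.tar.gz')
os.system('unzip model_algo-1')

# A single NDArray of shape (8, 7): one centroid per cluster.
kmeans_model_params = mx.ndarray.load('model_algo-1')
cluster_centroids = pd.DataFrame(kmeans_model_params[0].asnumpy(),
                                 columns=counties_transformed.columns)
cluster_centroids
```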
 | c1 | c2 | c3 | c4 | c5 | c6 | c7 |
---|---|---|---|---|---|---|---|
0 | -0.022517 | 0.089059 | 0.171574 | -0.048067 | 0.006329 | 0.116637 | -0.023559 |
1 | -0.179321 | 0.071831 | -0.318079 | 0.070527 | -0.024473 | 0.057664 | 0.012471 |
2 | 0.378790 | 0.248040 | 0.078423 | 0.270251 | 0.078738 | -0.071632 | 0.051689 |
3 | 1.221510 | -0.239929 | -0.196319 | -0.402160 | -0.086975 | 0.091185 | 0.114385 |
4 | 0.207565 | -0.146924 | -0.094510 | -0.109183 | 0.091221 | -0.093371 | -0.076597 |
5 | -0.262178 | -0.377524 | 0.085250 | 0.084684 | 0.058606 | -0.003278 | 0.084815 |
6 | -0.173841 | 0.061865 | 0.025476 | -0.061420 | -0.052000 | -0.038968 | -0.014179 |
7 | 0.635135 | -0.585450 | 0.108298 | 0.284684 | -0.250434 | 0.026873 | -0.223216 |
6.3 Visualize Centroids in Component Space
We can't visualize a 7-dimensional centroid in Cartesian space, but we can plot a heatmap of the centroids and their locations in the transformed feature space.
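A seaborn sketch:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One row per component, one column per cluster centroid.
plt.figure(figsize=(12, 9))
ax = sns.heatmap(cluster_centroids.T, cmap='YlGnBu')
ax.set_xlabel('Cluster')
ax.set_title('Attribute value by centroid')
plt.show()
```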
6.4 Examine the Grouping
Finally, we map the cluster labels back to the PCA-transformed `DataFrame` to understand the grouping of different counties.
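A minimal sketch:

```python
# Attach each county's cluster label to its component representation.
counties_transformed['labels'] = cluster_labels
counties_transformed.sort_values('labels').head(20)
```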
 | c1 | c2 | c3 | c4 | c5 | c6 | c7 | labels |
---|---|---|---|---|---|---|---|---|
Tennessee-Wayne | 0.063113 | 0.104933 | 0.285298 | -0.041470 | -0.083603 | 0.179846 | -0.040195 | 0 |
Georgia-Morgan | -0.024062 | 0.063064 | -0.021719 | 0.181931 | 0.091352 | 0.053852 | 0.089129 | 0 |
Georgia-Murray | 0.002309 | 0.277421 | 0.149035 | -0.284987 | 0.033298 | -0.058308 | -0.006328 | 0 |
Kentucky-Garrard | -0.077484 | 0.158746 | 0.149348 | -0.117317 | 0.036888 | 0.273137 | 0.018827 | 0 |
North Carolina-Stanly | 0.004421 | 0.140494 | 0.077001 | -0.009013 | -0.050588 | 0.053955 | -0.044574 | 0 |
Kentucky-Floyd | 0.050515 | 0.081900 | 0.294386 | -0.073164 | -0.201384 | 0.176484 | 0.052943 | 0 |
Georgia-Oglethorpe | -0.023825 | 0.027835 | 0.080744 | 0.103825 | 0.138597 | 0.102568 | 0.076467 | 0 |
Kentucky-Fleming | -0.142175 | 0.087977 | 0.215186 | -0.122501 | 0.099160 | 0.152931 | 0.051317 | 0 |
Virginia-Wise | -0.036725 | 0.082495 | 0.166730 | -0.020731 | -0.054882 | 0.072931 | -0.018200 | 0 |
Kentucky-Grant | -0.124348 | 0.271452 | 0.145990 | -0.192828 | 0.035439 | 0.095130 | -0.112029 | 0 |
Georgia-Pierce | -0.001628 | 0.073671 | 0.147948 | -0.035196 | 0.075766 | 0.101444 | 0.078241 | 0 |
Georgia-Polk | 0.128671 | 0.222121 | 0.093260 | -0.101383 | 0.059719 | -0.031002 | 0.054587 | 0 |
Kentucky-Estill | 0.030883 | 0.267407 | 0.415052 | -0.190667 | 0.073437 | 0.229371 | -0.137457 | 0 |
Georgia-Putnam | 0.079529 | 0.146343 | -0.065017 | 0.147295 | 0.071503 | -0.023137 | -0.019267 | 0 |
Kentucky-Elliott | 0.076525 | -0.027717 | 0.516782 | -0.074577 | 0.152140 | 0.329290 | -0.080332 | 0 |
Georgia-Rabun | -0.020311 | -0.128632 | 0.188155 | -0.050324 | -0.024796 | 0.042600 | 0.012712 | 0 |
Kentucky-Edmonson | -0.128306 | 0.062065 | 0.189488 | -0.078180 | 0.051537 | 0.201547 | -0.129917 | 0 |
North Carolina-Stokes | -0.131478 | 0.169661 | 0.124303 | -0.096798 | -0.011888 | 0.079122 | 0.032037 | 0 |
Kentucky-Cumberland | -0.059693 | 0.071140 | 0.271004 | -0.054590 | -0.116546 | -0.000517 | 0.087996 | 0 |
Georgia-Pike | -0.145351 | 0.186637 | -0.025261 | 0.015897 | 0.128544 | 0.251605 | -0.028959 | 0 |
Step 7 Cleanup
Delete all the endpoints.
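A sketch, assuming the two predictors created above:

```python
# Tear down both endpoints to stop incurring charges.
pca_predictor.delete_endpoint()
kmeans_predictor.delete_endpoint()
```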