Population Segmentation with PCA and KMeans

We are deploying two unsupervised algorithms to perform population segmentation on US census data.

We first use principal component analysis (PCA) to reduce the dimensionality of the original census data. We then apply k-means clustering to assign each US county to a cluster based on where it lies in component space. This lets us identify counties that are similar to each other in socioeconomic terms.

import pandas as pd
import numpy as np
import os
import io

import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

import boto3
import sagemaker

Step 1 Load Data from S3

data_bucket = 'aws-ml-blog-sagemaker-census-segmentation'

s3_client = boto3.client('s3')
obj_list=s3_client.list_objects(Bucket=data_bucket)

keys=[]
for contents in obj_list['Contents']:
    keys.append(contents['Key'])

# We should only get one key, which is the CSV file we want.
if len(keys) != 1:
    raise RuntimeError('received unexpected number of keys from {}'.format(data_bucket))

data_object = s3_client.get_object(Bucket=data_bucket, Key=keys[0])
data_body = data_object["Body"].read() # in Bytes
data_stream = io.BytesIO(data_body)

census_df = pd.read_csv(data_stream, header=0, delimiter=",")
display(census_df.head())
|   | CensusId | State | County | TotalPop | Men | Women | Hispanic | White | Black | Native | ... | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | Alabama | Autauga | 55221 | 26745 | 28476 | 2.6 | 75.8 | 18.5 | 0.4 | ... | 0.5 | 1.3 | 1.8 | 26.5 | 23986 | 73.6 | 20.9 | 5.5 | 0.0 | 7.6 |
| 1 | 1003 | Alabama | Baldwin | 195121 | 95314 | 99807 | 4.5 | 83.1 | 9.5 | 0.6 | ... | 1.0 | 1.4 | 3.9 | 26.4 | 85953 | 81.5 | 12.3 | 5.8 | 0.4 | 7.5 |
| 2 | 1005 | Alabama | Barbour | 26932 | 14497 | 12435 | 4.6 | 46.2 | 46.7 | 0.2 | ... | 1.8 | 1.5 | 1.6 | 24.1 | 8597 | 71.8 | 20.8 | 7.3 | 0.1 | 17.6 |
| 3 | 1007 | Alabama | Bibb | 22604 | 12073 | 10531 | 2.2 | 74.5 | 21.4 | 0.4 | ... | 0.6 | 1.5 | 0.7 | 28.8 | 8294 | 76.8 | 16.1 | 6.7 | 0.4 | 8.3 |
| 4 | 1009 | Alabama | Blount | 57710 | 28512 | 29198 | 8.6 | 87.9 | 1.5 | 0.3 | ... | 0.9 | 0.4 | 2.3 | 34.9 | 22189 | 82.0 | 13.5 | 4.2 | 0.4 | 7.7 |

5 rows × 37 columns

Step 2 Explore & Clean Data

|   | TotalPop | Men | Women | Hispanic | White | Black | Native | Asian | Pacific | Citizen | ... | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama-Autauga | 55221 | 26745 | 28476 | 2.6 | 75.8 | 18.5 | 0.4 | 1.0 | 0.0 | 40725 | ... | 0.5 | 1.3 | 1.8 | 26.5 | 23986 | 73.6 | 20.9 | 5.5 | 0.0 | 7.6 |
| Alabama-Baldwin | 195121 | 95314 | 99807 | 4.5 | 83.1 | 9.5 | 0.6 | 0.7 | 0.0 | 147695 | ... | 1.0 | 1.4 | 3.9 | 26.4 | 85953 | 81.5 | 12.3 | 5.8 | 0.4 | 7.5 |
| Alabama-Barbour | 26932 | 14497 | 12435 | 4.6 | 46.2 | 46.7 | 0.2 | 0.4 | 0.0 | 20714 | ... | 1.8 | 1.5 | 1.6 | 24.1 | 8597 | 71.8 | 20.8 | 7.3 | 0.1 | 17.6 |
| Alabama-Bibb | 22604 | 12073 | 10531 | 2.2 | 74.5 | 21.4 | 0.4 | 0.1 | 0.0 | 17495 | ... | 0.6 | 1.5 | 0.7 | 28.8 | 8294 | 76.8 | 16.1 | 6.7 | 0.4 | 8.3 |
| Alabama-Blount | 57710 | 28512 | 29198 | 8.6 | 87.9 | 1.5 | 0.3 | 0.1 | 0.0 | 42345 | ... | 0.9 | 0.4 | 2.3 | 34.9 | 22189 | 82.0 | 13.5 | 4.2 | 0.4 | 7.7 |

5 rows × 34 columns
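The code that turns the 37-column raw table into the 34-column table indexed by "State-County" is not shown here. The following is a minimal sketch of one way to get that shape, assuming the cleaning consists of dropping rows with missing values, building the index from `State` and `County`, and dropping the three identifier columns. The frame `raw_df`, the helper `clean_census`, and the third row are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for census_df with only a few of its columns.
# The first two rows reuse values from the table above; the third row is
# invented to show how a row with a missing value is dropped.
raw_df = pd.DataFrame({
    "CensusId": [1001, 1003, 9999],
    "State": ["Alabama", "Alabama", "ExampleState"],
    "County": ["Autauga", "Baldwin", "ExampleCounty"],
    "TotalPop": [55221, 195121, 1000],
    "Unemployment": [7.6, 7.5, None],
})

def clean_census(df):
    """Drop incomplete rows, index by 'State-County', remove ID columns."""
    cleaned = df.dropna(axis=0).copy()
    cleaned.index = cleaned["State"] + "-" + cleaned["County"]
    return cleaned.drop(columns=["CensusId", "State", "County"])

clean_df = clean_census(raw_df)
print(clean_df)
```

Dropping three columns from 37 matches the 34 columns reported above, which is why this sketch assumes exactly those three were removed.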

2.1 Visualize the Data

Use histograms to plot the distribution of each feature.


2.2 Normalize the Data

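The features are on very different scales (population counts next to percentages), so we rescale every feature to the range [0, 1] with min-max scaling. This keeps features with large raw values from dominating the comparison.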
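A minimal sketch of min-max scaling with pandas is shown below. The helper name `min_max_scale` and the toy values are hypothetical; scikit-learn's `MinMaxScaler` performs the same transformation and would be an equally reasonable choice.

```python
import pandas as pd

def min_max_scale(df):
    """Rescale every column to [0, 1]; constant columns map to 0."""
    col_min = df.min()
    col_range = df.max() - col_min
    # Avoid division by zero for columns where every value is identical.
    col_range = col_range.replace(0, 1)
    return (df - col_min) / col_range

# Hypothetical toy values, not the census data.
toy = pd.DataFrame(
    {"TotalPop": [100.0, 300.0, 500.0], "White": [20.0, 60.0, 80.0]},
    index=["A", "B", "C"],
)
scaled = min_max_scale(toy)
print(scaled)
```

Each column is scaled independently, so a county's scaled value only says where it falls between the smallest and largest county for that feature.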

|   | TotalPop | Men | Women | Hispanic | White | Black | Native | Asian | Pacific | Citizen | ... | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama-Autauga | 0.005475 | 0.005381 | 0.005566 | 0.026026 | 0.759519 | 0.215367 | 0.004343 | 0.024038 | 0.0 | 0.006702 | ... | 0.007022 | 0.033248 | 0.048387 | 0.552430 | 0.005139 | 0.750000 | 0.250000 | 0.150273 | 0.000000 | 0.208219 |
| Alabama-Baldwin | 0.019411 | 0.019246 | 0.019572 | 0.045045 | 0.832665 | 0.110594 | 0.006515 | 0.016827 | 0.0 | 0.024393 | ... | 0.014045 | 0.035806 | 0.104839 | 0.549872 | 0.018507 | 0.884354 | 0.107616 | 0.158470 | 0.040816 | 0.205479 |
| Alabama-Barbour | 0.002656 | 0.002904 | 0.002416 | 0.046046 | 0.462926 | 0.543655 | 0.002172 | 0.009615 | 0.0 | 0.003393 | ... | 0.025281 | 0.038363 | 0.043011 | 0.491049 | 0.001819 | 0.719388 | 0.248344 | 0.199454 | 0.010204 | 0.482192 |
| Alabama-Bibb | 0.002225 | 0.002414 | 0.002042 | 0.022022 | 0.746493 | 0.249127 | 0.004343 | 0.002404 | 0.0 | 0.002860 | ... | 0.008427 | 0.038363 | 0.018817 | 0.611253 | 0.001754 | 0.804422 | 0.170530 | 0.183060 | 0.040816 | 0.227397 |
| Alabama-Blount | 0.005722 | 0.005738 | 0.005707 | 0.086086 | 0.880762 | 0.017462 | 0.003257 | 0.002404 | 0.0 | 0.006970 | ... | 0.012640 | 0.010230 | 0.061828 | 0.767263 | 0.004751 | 0.892857 | 0.127483 | 0.114754 | 0.040816 | 0.210959 |

5 rows × 34 columns

Step 3 Train PCA with SageMaker

Now we can apply PCA to perform dimensionality reduction. We will use SageMaker's built-in PCA algorithm.

3.1 Prepare the Model

3.2 Train the Model

3.3 Load Model Artifacts (w/o using Predict)

Many Amazon SageMaker algorithms, including PCA, use MXNet for computational speed, so the model artifacts are stored as an MXNet array. After the model archive is downloaded and unpacked, we can load the array using MXNet.

3.4 PCA Model Attributes

Three types of model attributes are contained within the PCA model.

  • mean: The mean that was subtracted from a component in order to center it.

  • v: The makeup of the principal components (same as ‘components_’ in an sklearn PCA model).

  • s: The singular values of the components for the PCA transformation.

The singular values do not exactly give the % variance from the original feature space, but can give the % variance from the projected feature space.
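To make the three attributes more concrete, here is a small NumPy sketch that computes analogous quantities with an exact singular value decomposition on random toy data. This is a stand-in for illustration, not SageMaker's implementation, and the function name `fit_pca_svd` is hypothetical. Note that NumPy returns singular values in descending order, whereas the s values listed later in this section increase with their index.

```python
import numpy as np

def fit_pca_svd(X):
    """Return (mean, v, s) for data X of shape (n_samples, n_features)."""
    mean = X.mean(axis=0)
    # SVD of the centered data: X - mean = U @ diag(s) @ Vt
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    v = vt.T  # columns of v are the principal directions
    return mean, v, s

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # hypothetical toy data
mean, v, s = fit_pca_svd(X)

# Project onto the two directions with the largest singular values
# (the first two columns here, because NumPy sorts s in descending order).
projected = (X - mean) @ v[:, :2]
print(mean.shape, v.shape, s.shape, projected.shape)
```

The projection step is the same operation the deployed endpoint performs for us later: subtract the mean, then multiply by the chosen columns of v.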

Explained Variance

From s, we can approximate the share of the data variance captured by the top n principal components. With s_i denoting the i-th largest singular value and N the total number of components, the approximate explained variance is the sum of the squared s values of the top n components divided by the sum of the squared s values of all components:

\begin{equation*} \frac{\sum_{i=1}^{n} s_i^2}{\sum_{i=1}^{N} s_i^2} \end{equation*}
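The ratio above can be computed directly from the array of singular values. The sketch below sorts the squared values itself, so it does not depend on whether s is stored in ascending or descending order. The helper name `explained_variance` and the example values are hypothetical.

```python
import numpy as np

def explained_variance(s, n_top):
    """Fraction of variance captured by the n_top largest singular values."""
    s = np.asarray(s, dtype=float)
    squared = np.sort(s ** 2)[::-1]  # largest first, regardless of input order
    return squared[:n_top].sum() / squared.sum()

# Hypothetical singular values, not the ones from the trained model.
s_example = [1.0, 2.0, 3.0, 4.0]
print(explained_variance(s_example, 2))  # (16 + 9) / 30
```

Evaluating this for increasing `n_top` is one way to choose how many components to keep before clustering.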

From v, we can learn more about the combinations of original features that make up each principal component.

|    | 0 |
|----|---|
| 28 | 7.991313 |
| 29 | 10.180052 |
| 30 | 11.718245 |
| 31 | 13.035975 |
| 32 | 19.592180 |

3.5 Examine Component Makeup

We can now examine the makeup of each PCA component based on the weights of the original features that are included in the component.

Each component is a linearly independent vector that serves as a basis vector of the projected component space. When we compare the feature weights, we are effectively asking how this new basis vector correlates with the original features. For example, component 1 is negatively correlated with the White percentage in the census data, and component 2 is positively correlated with the PrivateWork percentage.
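One way to inspect a component is to pair its weights with the feature names and rank them by absolute value. The sketch below assumes v has shape (n_features, n_components); the helper `component_makeup`, the feature names, and the weights are all hypothetical and do not come from the trained model.

```python
import numpy as np
import pandas as pd

def component_makeup(v, feature_names, component_idx, n_weights=3):
    """List the features with the largest absolute weight in one component.

    v is assumed to have shape (n_features, n_components).
    """
    weights = pd.Series(v[:, component_idx], index=feature_names)
    order = weights.abs().sort_values(ascending=False).index
    return weights.reindex(order).head(n_weights)

# Hypothetical weights for three features and two components.
v_toy = np.array([[-0.8, 0.1],
                  [0.5, 0.7],
                  [0.2, -0.6]])
features = ["feature_a", "feature_b", "feature_c"]
top = component_makeup(v_toy, features, component_idx=0, n_weights=2)
print(top)
```

The sign of each weight is kept in the result, so a negative entry reads as "this component decreases as the feature increases".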


Step 4 Deploy PCA with SageMaker

Now we can deploy the PCA model and use it as an endpoint without having to dig into the model params to transform an input.

We don't need to use RecordSet anymore once the model is deployed. It will simply accept a numpy array to yield principal components.

SageMaker PCA returns a list of protobuf Record messages with the same length as the training data, which is 3218 in this case. The protobuf Record message has the following format.

Essentially each data point is now projected onto a new component space. We can retrieve the projection by using the following syntax.

Now we can transform the PCA result into a DataFrame that we can work with.
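The conversion can be sketched as follows. The attribute path `record.label["projection"].float32_tensor.values` is an assumption about the Record layout, since the format listing is not reproduced in this text. To keep the example runnable without an endpoint, `make_record` builds hypothetical stand-in objects with that shape; `records_to_frame` is likewise a hypothetical helper.

```python
from types import SimpleNamespace
import pandas as pd

def make_record(values):
    """Build a stand-in object shaped like the assumed Record layout."""
    tensor = SimpleNamespace(values=list(values))
    projection = SimpleNamespace(float32_tensor=tensor)
    return SimpleNamespace(label={"projection": projection})

def records_to_frame(records, index, n_components):
    """Collect each record's projection into one row of a DataFrame."""
    rows = [list(r.label["projection"].float32_tensor.values) for r in records]
    columns = ["c{}".format(i + 1) for i in range(n_components)]
    return pd.DataFrame(rows, index=index, columns=columns)

# Hypothetical two-component projections for two counties.
fake_result = [make_record([0.1, -0.2]), make_record([0.3, 0.4])]
pca_df = records_to_frame(fake_result, ["County-A", "County-B"], n_components=2)
print(pca_df)
```

With the real endpoint response, the index would be the "State-County" index of the normalized DataFrame and `n_components` would be 7.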

|   | c1 | c2 | c3 | c4 | c5 | c6 | c7 |
|---|---|---|---|---|---|---|---|
| Alabama-Autauga | -0.060274 | 0.160527 | -0.088356 | 0.120480 | -0.010824 | 0.040452 | 0.025895 |
| Alabama-Baldwin | -0.149684 | 0.185969 | -0.145743 | -0.023092 | -0.068677 | 0.051573 | 0.048137 |
| Alabama-Barbour | 0.506202 | 0.296662 | 0.146258 | 0.297829 | 0.093111 | -0.065244 | 0.107730 |
| Alabama-Bibb | 0.069224 | 0.190861 | 0.224402 | 0.011757 | 0.283526 | 0.017874 | -0.092053 |
| Alabama-Blount | -0.091030 | 0.254403 | 0.022714 | -0.193824 | 0.100738 | 0.209945 | -0.005099 |

Step 5 Train KMeans with SageMaker

We will arbitrarily pick 8 clusters for KMeans. To train the model, we need to pass in RecordSet again.
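SageMaker trains its built-in KMeans remotely, so the clustering step itself is not visible in the notebook. To illustrate what assigning points to k = 8 clusters involves, here is a plain NumPy version of Lloyd's algorithm on random toy data. This is a stand-in technique for illustration only; SageMaker's implementation may differ in initialization and optimization, and the names `assign` and `kmeans_lloyd` are hypothetical.

```python
import numpy as np

def assign(X, centroids):
    """Index of the nearest centroid (Euclidean distance) for every row of X."""
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans_lloyd(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = assign(X, centroids)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final labels are always computed against the returned centroids.
    return centroids, assign(X, centroids)

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 7))  # hypothetical 7-dimensional points
centroids, labels = kmeans_lloyd(X_toy, k=8)
print(centroids.shape, np.bincount(labels, minlength=8))
```

Because the starting centroids are random, different seeds can give different clusterings, which is one reason the choice of 8 clusters should be treated as a starting point rather than a tuned value.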

Step 6 Clustering

Now we can perform clustering and explore the result of clustering.

6.1 Explore the Resultant Clusters

Let's see which cluster each data point is assigned to. We can randomly select a few data points by index and look up their cluster assignments using the same indices.

Let's take a look at the data point distribution across clusters.
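Both checks can be sketched with NumPy, assuming the cluster assignments have already been extracted into an integer array. The `labels` array below is hypothetical and only stands in for the real assignments.

```python
import numpy as np

# Hypothetical cluster labels for ten data points and k = 8 clusters.
labels = np.array([0, 3, 3, 7, 1, 0, 0, 5, 3, 2])
k = 8

# Inspect a few randomly chosen points by index.
rng = np.random.default_rng(0)
sample_idx = rng.choice(len(labels), size=3, replace=False)
for idx in sample_idx:
    print("point {} -> cluster {}".format(idx, labels[idx]))

# Count how many points fall in each cluster; minlength keeps empty clusters.
counts = np.bincount(labels, minlength=k)
print(counts)
```

Passing `minlength=k` matters when a cluster ends up empty, because otherwise the count array would be shorter than the number of clusters.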


6.2 Load Model Artifacts (instead of using Predict)

We want to dig a little deeper to understand where the centroids are and what they look like in the 7-dimensional component space.

The K-means model contains only one set of model parameters: the cluster centroid locations in the PCA-transformed component space.

|   | c1 | c2 | c3 | c4 | c5 | c6 | c7 |
|---|---|---|---|---|---|---|---|
| 0 | -0.022517 | 0.089059 | 0.171574 | -0.048067 | 0.006329 | 0.116637 | -0.023559 |
| 1 | -0.179321 | 0.071831 | -0.318079 | 0.070527 | -0.024473 | 0.057664 | 0.012471 |
| 2 | 0.378790 | 0.248040 | 0.078423 | 0.270251 | 0.078738 | -0.071632 | 0.051689 |
| 3 | 1.221510 | -0.239929 | -0.196319 | -0.402160 | -0.086975 | 0.091185 | 0.114385 |
| 4 | 0.207565 | -0.146924 | -0.094510 | -0.109183 | 0.091221 | -0.093371 | -0.076597 |
| 5 | -0.262178 | -0.377524 | 0.085250 | 0.084684 | 0.058606 | -0.003278 | 0.084815 |
| 6 | -0.173841 | 0.061865 | 0.025476 | -0.061420 | -0.052000 | -0.038968 | -0.014179 |
| 7 | 0.635135 | -0.585450 | 0.108298 | 0.284684 | -0.250434 | 0.026873 | -0.223216 |

6.3 Visualize Centroids in Component Space

We can't visualize a 7-dimensional centroid in Cartesian space, but we can plot a heatmap of the centroids and their location in transformed feature space.


6.4 Examine the Grouping

Finally, we map the labels back to the PCA-transformed census DataFrame to understand how the counties are grouped.
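A minimal pandas sketch of this mapping follows, assuming the labels are in the same row order as the PCA-transformed DataFrame. The toy frame, the label list, and the helper `attach_labels` are hypothetical.

```python
import pandas as pd

# Hypothetical projections and labels for four counties.
pca_toy = pd.DataFrame(
    {"c1": [0.1, -0.2, 0.5, 0.0], "c2": [0.3, 0.1, -0.4, 0.2]},
    index=["County-A", "County-B", "County-C", "County-D"],
)
toy_labels = [0, 1, 0, 2]

def attach_labels(df, labels):
    """Return a copy of df with a 'labels' column, sorted by cluster."""
    labeled = df.copy()
    labeled["labels"] = list(labels)
    # mergesort is stable, so rows keep their original order within a cluster.
    return labeled.sort_values("labels", kind="mergesort")

labeled_df = attach_labels(pca_toy, toy_labels)
cluster_0 = labeled_df[labeled_df["labels"] == 0]
print(cluster_0)
```

Sorting by label and viewing the first rows gives a listing in which every visible row belongs to cluster 0, which is consistent with the table in this section.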

|   | c1 | c2 | c3 | c4 | c5 | c6 | c7 | labels |
|---|---|---|---|---|---|---|---|---|
| Tennessee-Wayne | 0.063113 | 0.104933 | 0.285298 | -0.041470 | -0.083603 | 0.179846 | -0.040195 | 0 |
| Georgia-Morgan | -0.024062 | 0.063064 | -0.021719 | 0.181931 | 0.091352 | 0.053852 | 0.089129 | 0 |
| Georgia-Murray | 0.002309 | 0.277421 | 0.149035 | -0.284987 | 0.033298 | -0.058308 | -0.006328 | 0 |
| Kentucky-Garrard | -0.077484 | 0.158746 | 0.149348 | -0.117317 | 0.036888 | 0.273137 | 0.018827 | 0 |
| North Carolina-Stanly | 0.004421 | 0.140494 | 0.077001 | -0.009013 | -0.050588 | 0.053955 | -0.044574 | 0 |
| Kentucky-Floyd | 0.050515 | 0.081900 | 0.294386 | -0.073164 | -0.201384 | 0.176484 | 0.052943 | 0 |
| Georgia-Oglethorpe | -0.023825 | 0.027835 | 0.080744 | 0.103825 | 0.138597 | 0.102568 | 0.076467 | 0 |
| Kentucky-Fleming | -0.142175 | 0.087977 | 0.215186 | -0.122501 | 0.099160 | 0.152931 | 0.051317 | 0 |
| Virginia-Wise | -0.036725 | 0.082495 | 0.166730 | -0.020731 | -0.054882 | 0.072931 | -0.018200 | 0 |
| Kentucky-Grant | -0.124348 | 0.271452 | 0.145990 | -0.192828 | 0.035439 | 0.095130 | -0.112029 | 0 |
| Georgia-Pierce | -0.001628 | 0.073671 | 0.147948 | -0.035196 | 0.075766 | 0.101444 | 0.078241 | 0 |
| Georgia-Polk | 0.128671 | 0.222121 | 0.093260 | -0.101383 | 0.059719 | -0.031002 | 0.054587 | 0 |
| Kentucky-Estill | 0.030883 | 0.267407 | 0.415052 | -0.190667 | 0.073437 | 0.229371 | -0.137457 | 0 |
| Georgia-Putnam | 0.079529 | 0.146343 | -0.065017 | 0.147295 | 0.071503 | -0.023137 | -0.019267 | 0 |
| Kentucky-Elliott | 0.076525 | -0.027717 | 0.516782 | -0.074577 | 0.152140 | 0.329290 | -0.080332 | 0 |
| Georgia-Rabun | -0.020311 | -0.128632 | 0.188155 | -0.050324 | -0.024796 | 0.042600 | 0.012712 | 0 |
| Kentucky-Edmonson | -0.128306 | 0.062065 | 0.189488 | -0.078180 | 0.051537 | 0.201547 | -0.129917 | 0 |
| North Carolina-Stokes | -0.131478 | 0.169661 | 0.124303 | -0.096798 | -0.011888 | 0.079122 | 0.032037 | 0 |
| Kentucky-Cumberland | -0.059693 | 0.071140 | 0.271004 | -0.054590 | -0.116546 | -0.000517 | 0.087996 | 0 |
| Georgia-Pike | -0.145351 | 0.186637 | -0.025261 | 0.015897 | 0.128544 | 0.251605 | -0.028959 | 0 |

Step 7 Cleanup

Delete all the endpoints.
