In this project, we’ll cluster US Senators based on how they voted. The goal is to find out whether Senators voted along party lines or independently of their party.
What is unsupervised learning?
A major type of machine learning is called unsupervised learning. In unsupervised learning, we aren’t trying to predict anything. Instead, we’re finding patterns in data. One of the main unsupervised learning techniques is called clustering.
Clustering
Clustering algorithms group similar rows together. There can be one or more groups in the data, and these groups form the clusters. As we look at the clusters, we can start to better understand the structure of the data. Clustering is a key way to explore unknown data, and it’s a very commonly used machine learning technique.
The Dataset
In the US, the Senate votes on proposed legislation. Getting a bill passed by the Senate is a key step towards getting its provisions enacted. A majority vote is required to get a bill passed. The results of these votes, known as roll call votes, are public.
You can visit my GitHub repo for complete code.
Senators typically vote in accordance with how their political party votes, known as voting along party lines. In the US, the 2 main political parties are the Democrats, who tend to be liberal, and the Republicans, who tend to be conservative. Senators can also choose to be unaffiliated with a party, and vote as Independents, although very few choose to do so.
114_congress.csv contains all of the results of roll call votes from the 114th Senate. Each row represents a single Senator, and each column represents a vote. A 0 in a cell means the Senator voted No on the bill, 1 means the Senator voted Yes, and 0.5 means the Senator abstained.
Here are the relevant columns:
name — The last name of the Senator.
party — the party of the Senator. The valid values are D for Democrat, R for Republican, and I for Independent.
Several columns numbered like 00001, 00004, etc. Each of these columns represents the results of a single roll call vote.
Below are the first few rows and columns of the data.
| name | party | state | 00001 | 00004 | 00005 | 00006 |
|---|---|---|---|---|---|---|
| Alexander | R | TN | 0 | 1 | 1 | 1 |
| Ayotte | R | NH | 0 | 1 | 1 | 1 |
| Baldwin | D | WI | 1 | 0 | 0 | 1 |
| Barrasso | R | WY | 0 | 1 | 1 | 1 |
| Bennet | D | CO | 0 | 0 | 0 | 1 |
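To make the layout concrete, here is a minimal sketch that rebuilds the five sample rows above by hand (the name `sample` is only for illustration and is not part of the original notebook). It shows that the vote columns start at position 3, after `name`, `party` and `state`.

```python
import pandas as pd

# The five sample rows shown above, typed in by hand for illustration.
sample = pd.DataFrame(
    {
        "name": ["Alexander", "Ayotte", "Baldwin", "Barrasso", "Bennet"],
        "party": ["R", "R", "D", "R", "D"],
        "state": ["TN", "NH", "WI", "WY", "CO"],
        "00001": [0, 0, 1, 0, 0],
        "00004": [1, 1, 0, 1, 0],
        "00005": [1, 1, 0, 1, 0],
        "00006": [1, 1, 1, 1, 1],
    }
)

# Everything from the fourth column onward is a roll call vote.
sample_votes = sample.iloc[:, 3:]

# Column means: 1.0 means every sampled Senator voted Yes on that roll call.
yes_share = sample_votes.mean()
print(yes_share)
```

Only these numeric vote columns are useful for measuring how similarly two Senators voted.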
We’ll use an algorithm called k-means clustering to split our data into clusters. k-means clustering uses Euclidean distance to form clusters of similar Senators.
The k-means algorithm will group Senators who vote similarly on bills together, in clusters. Each cluster is assigned a center, and the Euclidean distance from each Senator to the center is computed. Senators are assigned to clusters based on which one they are closest to. From our background knowledge, we think that Senators will cluster along party lines.
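As a small worked example of the distance involved, the sketch below compares three Senators using only the four sample votes from the table above. The helper `euclidean_distance` is a hypothetical name introduced here. The real analysis uses every vote column, so the actual distances will differ.

```python
import numpy as np

# Votes on roll calls 00001, 00004, 00005, 00006 from the sample table.
alexander = np.array([0, 1, 1, 1])
ayotte = np.array([0, 1, 1, 1])
baldwin = np.array([1, 0, 0, 1])

def euclidean_distance(a, b):
    """Square root of the summed squared differences between two vote vectors."""
    return float(np.sqrt(((a - b) ** 2).sum()))

same_party = euclidean_distance(alexander, ayotte)    # identical votes
cross_party = euclidean_distance(alexander, baldwin)  # differ on three votes

print(same_party, cross_party)
```

Two Senators who vote identically are at distance 0, and each disagreement pushes them further apart.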
The k-means algorithm requires us to specify the number of clusters upfront. Because we suspect that clusters will occur along party lines, and the vast majority of Senators are either Republicans or Democrats, we’ll pick 2 for our number of clusters.
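The assign-then-update loop behind k-means can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the toy vote matrix, and the choice to pass the starting centers in explicitly are all assumptions made to keep the example deterministic.

```python
import numpy as np

def kmeans_sketch(points, initial_centers, n_iter=10):
    """Alternate between assigning points to the nearest center and moving centers."""
    centers = np.array(initial_centers, dtype=float)
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    # Final assignment against the final centers.
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1), centers

# Made-up votes for six hypothetical Senators (0 = No, 1 = Yes, 0.5 = abstain).
toy_votes = np.array([
    [0, 1, 1, 1],
    [0, 1, 1, 0.5],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0.5],
    [1, 0, 0, 1],
])

# Start from one row of each voting bloc.
labels, centers = kmeans_sketch(toy_votes, initial_centers=toy_votes[[0, 3]])
print(labels)
```

Library implementations choose the starting centers more carefully and stop once the assignments no longer change, but the core loop has the same two steps.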
Let’s get started
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
votes=pd.read_csv("114_congress.csv")
votes.head(5)
votes['party'].value_counts()
| party | count |
|---|---|
| R | 54 |
| D | 44 |
| I | 2 |
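from sklearn.cluster import KMeans
kmeans_model=KMeans(n_clusters=2,random_state=1)
# Columns 3 onward are the votes; fit_transform returns each Senator's distance to both cluster centers
senator_distances=kmeans_model.fit_transform(votes.iloc[:,3:])
labels=kmeans_model.labels_
# Compare cluster labels with party membership
pd.crosstab(labels,votes['party'])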
| labels / party | D | I | R |
|---|---|---|---|
| 0 | 41 | 2 | 0 |
| 1 | 3 | 0 | 54 |
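The relationship between the distances returned by `fit_transform` and the cluster labels can be checked on a tiny example. The sketch below uses made-up votes (`toy_votes` is not part of the dataset) and sets `n_init=10` explicitly, which is an assumption added here to keep behavior the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up votes for six hypothetical Senators in two clear voting blocs.
toy_votes = np.array([
    [0, 1, 1, 1],
    [0, 1, 1, 0.5],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0.5],
    [1, 0, 0, 1],
])

model = KMeans(n_clusters=2, random_state=1, n_init=10)

# One row per Senator, one column per cluster center.
toy_distances = model.fit_transform(toy_votes)
toy_labels = model.labels_

# Each Senator's label is the index of the nearest center.
nearest = toy_distances.argmin(axis=1)
print(toy_distances.shape, (toy_labels == nearest).all())
```

Which bloc ends up as cluster 0 or cluster 1 is arbitrary, so the cluster numbers carry no meaning on their own. That is why the crosstab against `party` is needed to interpret them.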
pd.crosstab(labels,votes['party']).plot(kind='bar',stacked=True)
x=[0,1]
l=['cluster1','cluster2']
plt.xticks(x,l)
plt.title('Clustering')
plt.xlabel('Clusters')
plt.ylabel('No. of Senators')
plt.tick_params(bottom=False,top=False,right=False,left=False)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()

Democrats who vote like Republicans
democratic_outlier=votes[(labels==1) & (votes['party']=='D')]
democratic_outlier
| id | name | party |
|---|---|---|
| 42 | Heitkamp | D |
| 56 | Manchin | D |
| 74 | Reid | D |
Independents who vote like Democrats
independents_like_democrats=votes[(labels==0) &(votes['party']=='I')]
independents_like_democrats
| id | name | party |
|---|---|---|
| 50 | King | I |
| 79 | Sanders | I |
Radical Republicans!!
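# Cube each distance so that large distances dominate, then sum over both clusters
extremism = (senator_distances ** 3).sum(axis=1)
votes['extremism']=extremism
# Most extreme Senators first
votes.sort_values('extremism',inplace=True,ascending=False)
votes[['name','party','extremism']].head()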
| id | name | party | extremism |
|---|---|---|---|
| 98 | Wicker | R | 46.250476 |
| 53 | Lankford | R | 46.046873 |
| 69 | Paul | R | 46.046873 |
| 80 | Sasse | R | 46.046873 |
| 26 | Cruz | R | 46.046873 |
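One way to read the cubing step is that it rewards being far from one of the centers more than being moderately far from both. The sketch below uses two hypothetical rows of distances (not values from the dataset) to show the effect.

```python
import numpy as np

# Hypothetical distances to the two cluster centers for two Senators.
toy_distances = np.array([
    [1.0, 3.0],  # close to one center, far from the other
    [2.0, 2.0],  # equally far from both centers
])

plain_sum = toy_distances.sum(axis=1)         # both rows total 4.0
cubed_sum = (toy_distances ** 3).sum(axis=1)  # 1 + 27 = 28 vs 8 + 8 = 16

print(plain_sum, cubed_sum)
```

With a plain sum the two Senators look equally extreme. After cubing, the one who sits far from the opposite cluster scores clearly higher, while a moderate who sits between the clusters scores lower.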
Conclusions
Based on the voting patterns, we can conclude that 3 Democrats voted much like the Republicans, and the two Independents voted like the Democrats. We were also able to find the most radical Republicans. That’s the power of clustering.
Thanks