Democrat or Republican? Part 2, Politics and the Random Forest Classifier

9 min read · Apr 9, 2021

Left: Former President Andrew Jackson’s Twitter avatar, Right: Silhouette of the Great White Hope

What Demographics Can Tell Us About Voters’ Choice of Party in House Elections

Following up on the previous Democrat or Republican? article, I felt it appropriate to expand my inquiry to ask what demographic factors other than ethnicity and race contribute to voters’ choice of party in House elections. For this expanded project, I was fortunate to have discovered the Census Bureau’s My Congressional District web application, where a host of demographic data ranging from sex and age to education level were readily available in .csv files. At the inception of this project, I was taking a course in Intermediate R Coding, so I decided to do this project in R. In addition to a notebook published on RPubs, I also made a corresponding app in the shiny framework. A video demonstration of the shiny app is included below for the reader’s enjoyment.

shiny App

shiny Application

The shiny application shown in the video is fairly simple, involving only three drop-down boxes with limited reactivity. The Actual and Predicted maps for both the selected state and congressional district are made using shapefiles from the Census Bureau’s Topologically Integrated Geographic Encoding and Referencing (TIGER) database, accessed through the tigris package. These maps, along with the graphs displaying demographic data, are rendered with ggplot2. I may return to this project and build a version with a single color-coded nationwide map displayed in leaflet, with a switch allowing the user to toggle between Actual and Predicted maps. Many scripts in the project would also benefit from being refactored from procedural and functional approaches to an object-oriented approach. This would require me to learn object orientation in R, something I plan to explore in a future article.

Model

Hypothesis

The findings outlined below suggest that population density is the primary factor associated with the party affiliation of a voting district’s elected representative. Demographics that correlate with high population density are characteristic of Democratic districts, whereas demographics that correlate with low population density are characteristic of Republican districts.

Methodology and Model Performance

Following a pattern similar to that used in the previous project, I used the Census Bureau’s 2019 American Community Survey estimates for each voting district and, where it was reasonable to do so, converted these estimates to percentages of the district’s total population. The data were then scaled to account for those variables for which percentages were not taken, such as median rent. The scaled data were then fed into the model, the randomForest package’s implementation of the random forest classifier with all of the default hyper-parameters. Considering that this was a fairly balanced binary classification problem, accuracy was chosen as the primary performance metric. The model achieved an accuracy of 80.46%, beating by 29.43 percentage points a baseline accuracy of 51.03%, in which every district was assumed to send a Democratic congressman to the House.
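The preprocessing and baseline comparison described above can be sketched in a few lines. The project itself was written in R, so the Python below is only an illustration. The function names, the use of z-scores for the scaling step, and the 222-to-213 seat split (chosen because it reproduces the 51.03% baseline across 435 districts) are assumptions, not details taken from the notebook.

```python
from collections import Counter
from statistics import mean, stdev


def to_percent(count, total):
    """Express a raw estimate as a percentage of the district's total population."""
    return 100.0 * count / total


def standardize(values):
    """Z-score a column (assumed scaling method): subtract the mean, divide by the sample sd."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Assumed seat split: 222 Democratic and 213 Republican districts.
labels = ["D"] * 222 + ["R"] * 213

baseline_pct = round(100 * majority_baseline_accuracy(labels), 2)
model_pct = 80.46  # accuracy reported in the article
improvement_points = round(model_pct - baseline_pct, 2)

print(f"baseline: {baseline_pct}%  improvement: {improvement_points} percentage points")
```

The improvement is reported in percentage points because it is a difference between two percentages, not a relative change.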

Statistical Insights

Examining the mean decrease in Gini impurity, a measure which indicates the degree to which a given variable plays a role in classifying the data, we find that the following ten variables contributed the most to the model’s final prediction.

Though the mean decrease in Gini impurity tells us the importance of a given variable in the model’s classification of a district as either Democrat or Republican, it does not tell us what relationship these variables have to the target variable. Looking at individual trees from the forest may give us a clue. Due to the sheer number of trees in our random forest model, however, it cannot be assumed that any given tree would perform well enough as a predictor to represent the forest as a whole. At this stage, a graphical approach that looks at individual, randomly selected, representative voting districts seems the most intuitive.
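To make the importance measure discussed above more concrete, here is a minimal sketch of Gini impurity and of the impurity decrease produced by a single split. It is written in Python for illustration only, and the toy labels are made up. In a random forest, decreases like this are accumulated over every split that uses a given variable and averaged over the trees. The sketch covers one split only and is not the randomForest package’s code.

```python
def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its two children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Toy node (made-up data): ten districts, evenly split between the parties.
parent = ["D"] * 5 + ["R"] * 5

# A hypothetical split on some variable that separates the parties fairly well.
left = ["D"] * 4 + ["R"] * 1
right = ["D"] * 1 + ["R"] * 4

print(gini_impurity(parent))               # a perfectly balanced two-class node
print(gini_decrease(parent, left, right))  # larger values mean a more useful split
```

Note that the decrease says only how well a variable separates the classes, not the direction of the relationship, which is why the district-level graphs are needed.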

Given the demographics with the highest mean decrease in Gini impurity, the graphs below will represent the following categories:

Race
Place of Birth
Commuter Method
Housing

Race

Left: Estimated racial composition of Massachusetts’s 7th Congressional District, Ayanna Pressley (D); Right: Same statistics for Arizona’s 8th Congressional District, Debbie Lesko (R)

From the two districts sampled above, Massachusetts District 7 and Arizona District 8, we can see clearly that the former, a Democratic district, has a greater degree of racial diversity than the latter. In the former, whites make up only 49.45% of the population compared with the latter’s 82.82%, a difference of 33.37 percentage points. Asians also make up around three times as large a share of the total population in the Massachusetts district as in the Arizona district. The percentage of persons of other races is also higher in the Democratic district than in the Republican one. As discussed in the previous article, this relationship is not to be viewed as directly causal. Whites do not universally vote for Republicans, as the causal interpretation might suggest; rather, those whites who live in more rural, less diverse parts of the country tend to. As the reader will see below, race, along with many of the other high-importance factors stated above, is largely a surrogate for population density.

As of the 2016 congressional race, Massachusetts District 7 ranked 17th in population density, while Arizona District 8 ranked lower, at 130th.

Place of Birth

Continuing our analysis with two different sampled districts, Washington District 6, a Democratic district, and Alabama District 3, a Republican district, we find that a higher ratio of foreign-born citizens to natural-born citizens seems to indicate that a given voting district will vote Democrat. The vast majority of Americans nationwide are born with citizenship rather than born non-citizens and later naturalized. According to the Census Bureau’s American Community Survey, in 2014, the total population of the United States was estimated at 314 million people. 87% were natural-born citizens, compared to only 6% naturalized citizens, with the remaining 7% being non-citizens [2]. Given this fact, the difference between two voting districts’ foreign-born shares is typically in the single digits or lower. This is the case here, where the difference between the Washington district’s and the Alabama district’s ratios of foreign-born citizens to total population is only 3.17 percentage points. A difference this small at the micro-level can translate into a large difference in the political balance of a state at the macro-level. For instance, there may not be a double-digit difference when comparing most individual voting districts with regard to their ratio of foreign-born to natural-born citizens, but at the state level, these small differences add up. According to the above-mentioned source, California, an overwhelmingly Democratic state, has a population that is 27% foreign-born, compared to Republican West Virginia’s 1.5%. Given that a great number of immigrants come from overseas or Latin America, with the former settling primarily in coastal, urban areas, these differences present further evidence for the population density hypothesis. By extension, the difference in policy preferences between folks who live in urban areas and those who live in rural and suburban areas translates into the differences in party affiliation we find between the above-sampled districts, and between Los Angeles and Charleston.

Washington District 6 ranked 335th in population density, with Alabama District 3 slightly lower, in 342nd place. The Democratic voting district’s low population density presents something of a problem for our hypothesis. It seems that geography and diversity themselves also play roles in voter behavior independent of population density. We can speculate that the district’s location on the West Coast, coupled with the fact that it contains an Indian reservation, factors into the Democratic party’s influence in the district. This latter trend can be seen in Arizona, where the model predicted that Arizona District 1, which contains part of the Navajo Nation, would vote Republican when the district actually voted Democrat. For reference, Arizona District 1 was the Democratic district with the lowest population density, ranking 424th. It should also be noted that Washington District 6 contains Olympic National Park, which may artificially lower the district’s population density.

Map of Arizona’s Congressional Districts’ Predicted and Actual Election Results; Red = Republican, Blue = Democrat

Commuter Method

Left: Relative Frequency of Car Commuters and Public Transit Users in New York’s 17th Congressional District, Mondaire Jones (D); Right: Same statistics for Pennsylvania’s 10th Congressional District, Scott Perry (R)

The comparison here between Democrat-held New York District 17 and Republican-held Pennsylvania District 10 is an obvious indicator of the relationship between population density and party affiliation. People who live and work in or near large cities tend to use public transit or simply walk to work more often than those who live in rural and suburban communities. In this case, New York District 17, being very near the most populous city in the country, would naturally have a higher ratio of folks who walk or take public transit to those who commute by car than Pennsylvania District 10. The two districts differ by 12.73 percentage points in the share of all commuters who walk or take public transit to work.

The former ranked 110th in population density, with the latter ranked the 351st most densely populated.

Housing

Another obvious stand-in for population density can be found in the shares of owner-occupied and renter-occupied units among total housing units. As the advice attributed to Mark Twain goes, “buy land, they aren’t making any more of it”: the comparative scarcity of land in and immediately around urban parts of the country drives the price of land up. This makes home-ownership a less attractive option for those living in these areas. This is demonstrated here, with owner-occupied units making up 45.09% of the total in Florida District 24, 17.75 percentage points lower than in Georgia District 12.

The former ranked 31st in population density, with the latter ranked the 350th most densely populated.

Density plot of median rents by congressional district for the United States — Red: Median rent of Michigan’s 6th Congressional District, Fred Upton (R); Blue: Same statistic for California’s 46th District, Lou Correa (D)

Expanding on the relationship between homeownership and rental ratios, population density, and party affiliation, here we see that the value of land also pushes rental prices up. Michigan District 6’s median monthly rent of $803 is only marginally lower than $854, the rent at which the density plot shown above reaches its global maximum. Compare this to California District 46’s median monthly rent of $1,693. The absolute difference between the Democratic district’s median monthly rent and that peak is over 16 times that of the Republican district. These higher rents are a strong indicator of one of the primary reasons people who live in more densely populated areas would choose Democratic representation over Republican representation. The former party’s rhetoric likely appeals to many who feel that “rent is too damn high!” Unfortunately, those who push for rent control are either ignorant of or willfully unconcerned with Basic Economics.
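The “over 16 times” comparison above is simple arithmetic on the three rent figures quoted in the paragraph. The short Python sketch below works it through; the variable and function names are only illustrative.

```python
def gap_from_peak(median_rent, peak_rent):
    """Absolute distance between a district's median rent and the plot's peak."""
    return abs(median_rent - peak_rent)


peak_rent = 854    # rent at the density plot's global maximum, in dollars
mi6_rent = 803     # Michigan District 6 median monthly rent
ca46_rent = 1693   # California District 46 median monthly rent

mi6_gap = gap_from_peak(mi6_rent, peak_rent)    # Republican district
ca46_gap = gap_from_peak(ca46_rent, peak_rent)  # Democratic district
ratio = ca46_gap / mi6_gap

print(f"MI-6 gap: ${mi6_gap}, CA-46 gap: ${ca46_gap}, ratio: {ratio:.2f}")
```

The Michigan district sits $51 below the peak and the California district $839 above it, which gives a ratio of roughly 16.5.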

Car with political advertisement supporting the campaign of the Rent Is Too Damn High Party’s candidate in New York’s 2010 gubernatorial race, James McMillan III

California District 46 ranked 21st in population density. Michigan District 6 ranked 255th in population density.

Summary

Although Washington’s 6th District did present something of a problem for the population density hypothesis, aggregate data do indeed demonstrate that the pattern typically holds true. According to the 2014 American Community Survey, 89 of the 100 most densely populated congressional districts were represented by Democrats, and 81 of the 100 least densely populated congressional districts were represented by Republicans. The differences in values and in the interest groups found in urban areas versus those found in rural areas often translate into this difference in party preference.
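A tally like the one above (party counts among the most and least dense districts) could be produced along the following lines. This is a hedged Python sketch, not the method used for the figures quoted. The record layout, the function name, and the six toy districts are all made up for illustration.

```python
from collections import Counter


def party_counts_by_density(districts, n, densest=True):
    """Count parties among the n densest (or n least dense) districts."""
    ranked = sorted(districts, key=lambda d: d["density"], reverse=densest)
    return Counter(d["party"] for d in ranked[:n])


# Toy records with invented densities (people per square mile) and parties.
toy_districts = [
    {"name": "A", "density": 30000, "party": "D"},
    {"name": "B", "density": 12000, "party": "D"},
    {"name": "C", "density": 4000, "party": "R"},
    {"name": "D", "density": 900, "party": "D"},
    {"name": "E", "density": 150, "party": "R"},
    {"name": "F", "density": 20, "party": "R"},
]

top = party_counts_by_density(toy_districts, 3, densest=True)
bottom = party_counts_by_density(toy_districts, 3, densest=False)

print("densest 3:", dict(top))
print("least dense 3:", dict(bottom))
```

With real data, the same function called with n set to 100 would give the two counts to compare against the figures above.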