Outlier Detection: Osun State 2023 Presidential Election Data Using Geospatial Analysis

Adewoye Saheed Damilola
Sep 6, 2024

Updated: Nov 10, 2024

Note: You can access the full notebook here.

Introduction

Geospatial data analysis helps to understand patterns, relationships, and trends related to specific locations on Earth. Some of the use cases of spatial analysis include:

Urban Planning: Analyzing land use patterns.
Environmental Monitoring: Tracking climate change, deforestation, and land cover changes.
Transportation: Optimizing routes and managing traffic flow.

Spatial analysis can also be used to detect outliers (anomalies) in a dataset. Common techniques include Anselin Local Moran’s I, the Getis-Ord Gi statistic, and distance-based methods.

Using the distance-based method, here is a quick report on outlier detection in the 2023 presidential election results for Osun State, Nigeria.

Dataset Preparation

I used two datasets for my analysis. One contains information about the polling units without their latitude and longitude, and the other contains the latitude and longitude of each polling unit.

import pandas as pd

# Information about polling units, without their latitude and longitude
df1 = pd.read_csv('C:/Users/WOYES/Desktop/HNG/OSUN_crosschecked.csv')

# Latitude and longitude of each polling unit
df2 = pd.read_csv('C:/Users/WOYES/Desktop/HNG/latlong.csv')

# Merge the two separate files on the polling unit name
df = df1.merge(df2, on='PU-Name', how='inner')
df.head()

I performed some exploratory analysis on the merged dataset and selected the features relevant to my analysis.

from the Author’s notebook

Neighbourhood Identification

I used the distance_matrix function from scipy.spatial to compute the distances between the latitude and longitude coordinates of all pairs of polling units. The resulting matrix helps to identify clusters of neighbouring polling units and the distances between them. I treated polling units within 1 km of each other as neighbours.

from scipy.spatial import distance_matrix

# Reset index to ensure sequential indexing
data.reset_index(drop=True, inplace=True)

# Extract latitude and longitude columns
lat_lon = data[['latitude', 'longitude']].values

# Calculate the distance matrix between all polling units
dist_matrix = distance_matrix(lat_lon, lat_lon)
dist_matrix
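
One point worth keeping in mind: distance_matrix computes plain Euclidean distances on the values it is given, so on raw latitude and longitude the entries are in degrees rather than kilometres. The excerpt above does not show how the 1 km threshold was applied, so the following is only a minimal sketch of one possible approach, using the haversine (great-circle) formula. The names haversine_km and find_neighbours and the sample coordinates are hypothetical and do not come from the notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def find_neighbours(coords, radius_km=1.0):
    """For each point, list the indices of the other points within radius_km."""
    neighbours = []
    for i, (lat_i, lon_i) in enumerate(coords):
        close = [
            j for j, (lat_j, lon_j) in enumerate(coords)
            if j != i and haversine_km(lat_i, lon_i, lat_j, lon_j) <= radius_km
        ]
        neighbours.append(close)
    return neighbours


# Hypothetical (latitude, longitude) pairs, not taken from the dataset
sample_coords = [
    (7.7700, 4.5600),
    (7.7750, 4.5600),  # roughly 0.56 km north of the first point
    (7.9000, 4.5600),  # more than 10 km away from both
]
sample_neighbours = find_neighbours(sample_coords)
```

With these sample points, the first two units are each other's only neighbours and the third has none. The pairwise loop is quadratic in the number of polling units, which is the same order of work as building a full distance matrix.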

Outlier Detection

For each party (APC, LP, PDP, NNPP) in the dataset, I calculated an outlier score as the absolute difference between the party's votes at the current polling unit and the mean of its votes at the neighbouring polling units. The neighbour mean serves as a local baseline, so a high score flags a polling unit whose result departs from the voting pattern around it. These anomalies may indicate unique local circumstances, such as strong support for a particular party in a specific neighbourhood, or discrepancies that warrant further investigation.
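
The score described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the notebook's code: the function name outlier_score, the vote counts, and the neighbour lists are all hypothetical, and giving a score of 0 to a polling unit with no neighbours is my assumption, since the text does not say how such units are handled.

```python
def outlier_score(votes, neighbours, index):
    """Absolute difference between a unit's votes and its neighbours' mean votes."""
    neighbour_ids = neighbours[index]
    if not neighbour_ids:
        return 0.0  # assumption: a unit with no neighbours gets a score of 0
    neighbour_mean = sum(votes[j] for j in neighbour_ids) / len(neighbour_ids)
    return abs(votes[index] - neighbour_mean)


# Hypothetical vote counts for one party at four polling units
apc_votes = [120, 110, 400, 95]

# Hypothetical neighbour lists: units 0, 1 and 2 are mutual neighbours, unit 3 is isolated
neighbour_lists = [[1, 2], [0, 2], [0, 1], []]

apc_scores = [
    outlier_score(apc_votes, neighbour_lists, i) for i in range(len(apc_votes))
]
```

Here unit 2 gets the highest score: its 400 votes sit far above the mean of 115 at its two neighbours.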

from the Author’s notebook

Here’s a visualization showing the outliers for each political party.

from the Author’s notebook

Sorting

Next, let’s sort the dataset by each political party’s outlier score and take the top three outlier polling units.

from IPython.display import display, HTML

out = outlier_scores.drop(columns=['Neighbour_polls'])

# Sort the dataset by the outlier scores for each party
sorted_apc = out.sort_values(by='APC_outlier', ascending=False).head(3)
sorted_lp = out.sort_values(by='LP_outlier', ascending=False).head(3)
sorted_pdp = out.sort_values(by='PDP_outlier', ascending=False).head(3)
sorted_nnpp = out.sort_values(by='NNPP_outlier', ascending=False).head(3)

# Horizontal line separator
hr = HTML('<hr style="height:3px;border:none;color:#333;background-color:#333;" />')

# Display the DataFrames
print('\033[1m' + 'Top 3 APC Outliers:' + '\033[0m')
display(HTML(sorted_apc.to_html()))
display(hr)

print('\033[1m' + 'Top 3 PDP Outliers:' + '\033[0m')
display(HTML(sorted_pdp.to_html()))
display(hr)

print('\033[1m' + 'Top 3 LP Outliers:' + '\033[0m')
display(HTML(sorted_lp.to_html()))
display(hr)

print('\033[1m' + 'Top 3 NNPP Outliers:' + '\033[0m')
display(HTML(sorted_nnpp.to_html()))

from the Author’s notebook

Conclusion

The analysis shows that all four political parties had outliers in various polling units across the state. Several factors might account for these anomalies, including campaign efforts, candidate influence, demographic differences, errors in vote counting, and ballot stuffing.

Thanks for coming this far!