Using ML k-means clustering to analyze natural hazard frequency across the United States
Ecoregions have been modeled to help researchers, scientists, and the public at large visualize and understand similarities across complex multivariate environmental factors by grouping, or clustering, areas into like categories. Great debate surrounds “whether ecoregions should be specialized for particular use or general purpose, spatially contiguous versus, or nested versus nonhierarchical” (Hargrove).
This analysis draws on the machine learning techniques used in ecoregion modeling to develop natural hazard regions of the contiguous United States. It first uses the k-means algorithm to cluster counties by hazard frequency without spatial information, and then visualizes and assesses these clusters after reattaching their spatial attributes. Instead of viewing hazard risk in the context of a single hazard category, e.g. tornado alley, this analysis views hazards holistically in order to develop a hazard profile for each cluster and region.
The dataset used for this analysis is the Spatial Hazard Events and Losses Database for the United States (SHELDUS). This is a county-level categorical hazard dataset that covers natural hazard occurrences from 1960 to the present day. Originally developed by the Hazards and Vulnerability Research Institute at the University of South Carolina, it is now supported and maintained by the Arizona State University Center for Emergency Management and Homeland Security. SHELDUS is considered to be the most cohesive and extensive dataset of its kind; this analysis uses its location data and its aggregated categorical hazard frequencies as input to the modeling and clustering algorithms.
The full SHELDUS hazard dataset was filtered to include only the contiguous United States so that the visualization section of the analysis would retain cohesive geographical, ecological, and spatial attributes, discussed in more detail below. From this filtered set, each hazard category was aggregated by the number of occurrences recorded between 1960 and 2018 in each county. At this stage, the counties were assigned an ID number, and their spatial attributes were removed so that they would have no effect on the clustering algorithm. The ID numbers were used later in the analysis to reconnect the clusters to their spatial components and visualize the data.
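The filtering and aggregation step can be sketched in pandas as follows. This is a minimal illustration only: the column names (county_fips, state, hazard, year), the sample rows, and the NON_CONTIGUOUS set are assumptions made for the example and do not reflect the actual SHELDUS schema or the original script.

```python
import pandas as pd

# Hypothetical event-level records; column names and values are
# illustrative assumptions, not the real SHELDUS schema.
events = pd.DataFrame({
    "county_fips": ["01001", "01001", "01001", "06037", "06037", "02020"],
    "state": ["AL", "AL", "AL", "CA", "CA", "AK"],
    "hazard": ["Flooding", "Flooding", "Hail", "Wildfire", "Earthquake", "Avalanche"],
    "year": [1975, 1998, 2003, 1961, 1994, 2001],
})

# Assumed list of states/territories outside the contiguous United States.
NON_CONTIGUOUS = {"AK", "HI", "PR", "GU", "VI", "AS", "MP"}

# Keep only contiguous-US events recorded between 1960 and 2018.
contiguous = events[~events["state"].isin(NON_CONTIGUOUS)]
in_window = contiguous[contiguous["year"].between(1960, 2018)]

# One row per county, one column per hazard category, values = event counts.
counts = pd.crosstab(in_window["county_fips"], in_window["hazard"]).reset_index()

# Assign an ID to each county and keep a lookup table for the later spatial join.
counts.insert(0, "county_id", range(len(counts)))
id_lookup = counts[["county_id", "county_fips"]]

# Drop the location column so that it cannot influence the clustering.
features = counts.drop(columns=["county_fips"]).set_index("county_id")
print(features)
```

Keeping id_lookup separate from features mirrors the idea described above: the clustering sees only hazard frequencies, while the ID preserves a path back to each county.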
For this model, each hazard category is given equal weight during clustering. Standardization is important when continuous variables are measured at different scales. Because frequencies vary widely between hazard categories, e.g. there are significantly more recorded floods than avalanches, the original frequencies were rescaled so that each category has zero mean and unit variance. This allows each variable, or hazard category, to contribute equally to the analysis. This was accomplished with the scikit-learn StandardScaler class, written into the Python script that operates on the data frame.
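As a small illustration of what this rescaling does, the sketch below standardizes two made-up hazard columns with very different magnitudes. The numbers are invented for the example and are not taken from SHELDUS.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up counts per county: column 0 is a frequent hazard (e.g. flooding),
# column 1 is a rare one (e.g. avalanche).
raw_counts = np.array([
    [120.0, 0.0],
    [300.0, 1.0],
    [80.0, 0.0],
    [500.0, 3.0],
])

# StandardScaler subtracts each column's mean and divides by its
# standard deviation, so every column ends up with mean 0 and variance 1.
scaler = StandardScaler()
scaled = scaler.fit_transform(raw_counts)

print(scaled.mean(axis=0))  # approximately 0 for both columns
print(scaled.std(axis=0))   # approximately 1 for both columns
```

Without this step, the Euclidean distances used by k-means would be dominated by whichever hazard happens to have the largest raw counts.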
Silhouette analysis is a visual means of studying the separation distance between resulting clusters. The plot displays a measure of how close each point in one cluster is to the points in the neighboring clusters, and provides a way to assess the optimal number of clusters for the k-means algorithm. Silhouette coefficients range from -1 to 1; within the plot, it is best to have most points between 0 and 1 and, depending on the data, a fairly consistent silhouette thickness along the y-axis for each cluster or color. After reviewing a wide variety of candidate values, the optimal number of clusters for this analysis was 28. This value was used for the k-means clustering algorithm, which subsequently divided the contiguous United States into 28 hazard regions. The scikit-learn silhouette functions were written into the same Python script.
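The search over candidate cluster counts can be sketched as below. This is an assumption-laden illustration rather than the original script: it uses synthetic data from make_blobs with four well-separated groups, and it compares the mean silhouette score for each candidate instead of inspecting the full silhouette plots described above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the standardized hazard frequencies:
# four tight, well-separated groups of points (an assumption for the demo).
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0], [10.0, -10.0]])
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=0)

# Fit k-means for each candidate number of clusters and record
# the mean silhouette coefficient (ranges from -1 to 1; higher is better).
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best number of clusters:", best_k)
```

The mean score is only a summary. The per-cluster silhouette plots remain useful for spotting clusters that are uneven in size or contain many points with negative coefficients.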
The k-means algorithm is an unsupervised machine learning clustering technique. It was used to classify the natural hazard frequency data of each county by assigning samples to 28 clusters. The algorithm chooses cluster centroids that minimize a criterion known as the inertia, or within-cluster sum of squares. By running this algorithm over the dataset, which includes 18 natural hazard categories, clusters were assigned holistically, with each category having equal weight. For example, areas with high frequencies of earthquakes, volcanic activity, and wildfires, but low frequencies of fog and wind, were clustered together. This analysis was also written into the Python script and applied to the data frame. Each cluster was then visualized with a bar graph representing the percentage of hazard occurrences within the cluster relative to total occurrences across the entire dataset. This allows us to identify each cluster’s defining hazard traits and better understand its hazard profile.
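A minimal sketch of the clustering and profiling step is shown below. The six counties, four hazard columns, and the choice of two clusters are invented to keep the example small (the analysis itself used 18 categories and 28 clusters). The profile calculation assumes one reading of the bar graphs: for each hazard, the share of its total occurrences that falls within each cluster.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented hazard counts for six counties (illustrative only).
counts = pd.DataFrame(
    {
        "Earthquake": [30, 28, 1, 0, 2, 25],
        "Wildfire": [40, 35, 2, 1, 3, 38],
        "Fog": [1, 0, 20, 22, 18, 2],
        "Wind": [5, 4, 60, 55, 58, 6],
    },
    index=pd.Index(range(6), name="county_id"),
)

# Standardize so that every hazard category carries equal weight.
scaled = StandardScaler().fit_transform(counts)

# Fit k-means; inertia_ is the within-cluster sum of squares it minimizes.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = pd.Series(model.fit_predict(scaled), index=counts.index, name="cluster")
print("inertia:", model.inertia_)

# Hazard profile: percentage of each hazard's total occurrences per cluster.
profile = counts.groupby(labels).sum().div(counts.sum(), axis="columns") * 100
print(profile.round(1))
```

Each row of profile can then be drawn as a bar graph, for example with profile.loc[cluster_label].plot.bar(), to show which hazards define that cluster.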
Now that clusters have been assigned to define natural hazard profiles without the input of spatial information, the analysis returns to the spatial information to look for patterns and define regions. Using the ID assigned in the filtering step, spatial information is joined back to the data frame, this time using the tools available in QGIS. While the regions are not entirely distinct from one another, and certainly overlap in areas, it is somewhat remarkable how spatially cohesive they tend to be. This analysis and visualization define the natural hazard profiles spatially, creating not just regions defined by high frequencies of a particular hazard category, but a holistic understanding of the risk a particular region faces from all of its natural hazards.
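The join itself was performed in QGIS, but the tabular part of the same idea can be sketched in pandas for readers who prefer to stay in Python. The table contents and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical lookup table kept aside when spatial attributes were removed.
id_lookup = pd.DataFrame({
    "county_id": [0, 1, 2],
    "county_fips": ["01001", "06037", "37001"],
})

# Hypothetical cluster assignments produced by the clustering step.
cluster_labels = pd.DataFrame({
    "county_id": [0, 1, 2],
    "cluster": [3, 7, 3],
})

# Reattach the county identifier to each cluster label via the shared ID.
# validate="one_to_one" raises an error if an ID is duplicated on either side.
joined = id_lookup.merge(cluster_labels, on="county_id", how="left",
                         validate="one_to_one")
print(joined)
```

A table like joined could then be exported and joined to a county boundary layer on the county identifier in a GIS package.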
After visualizing the clusters spatially, one starts to wonder what has caused particular outlier counties to be clustered with groups that, for the most part, are spatially cohesive. These outlier counties require further understanding, and to begin this process we identify them using centroids and distance matrices. For each county, the distance from the county centroid to the centroid of its assigned cluster is measured. This distance matrix is used to calculate the mean distance from the cluster centroid across each cluster. Any county that is more than twice the mean distance from its cluster centroid is filtered out and visualized. These 143 counties are disproportionately located on the edges of the geography analyzed, but this does not explain all of them. Overall, they are dispersed throughout the country, and this requires further exploration.
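The outlier rule described above can be sketched as follows. Several details are assumptions, since the text does not specify them: county centroids are given as planar x, y coordinates, each cluster centroid is taken as the mean of its member county centroids, and distance is Euclidean. The coordinates are invented.

```python
import numpy as np
import pandas as pd

# Invented county centroids in an assumed planar coordinate system.
counties = pd.DataFrame({
    "county_id": range(8),
    "cluster": [0, 0, 0, 0, 0, 1, 1, 1],
    "x": [0.0, 1.0, 0.0, 1.0, 10.0, 5.0, 6.0, 5.0],
    "y": [0.0, 0.0, 1.0, 1.0, 10.0, 5.0, 5.0, 6.0],
})

# Cluster centroid = mean of member county centroids (assumption),
# broadcast back to one row per county.
centroids = counties.groupby("cluster")[["x", "y"]].transform("mean")

# Euclidean distance from each county centroid to its cluster centroid.
counties["dist"] = np.hypot(counties["x"] - centroids["x"],
                            counties["y"] - centroids["y"])

# Mean distance within each cluster, again one value per county row.
mean_dist = counties.groupby("cluster")["dist"].transform("mean")

# Flag counties lying more than twice the mean distance from their centroid.
outliers = counties[counties["dist"] > 2 * mean_dist]
print(outliers[["county_id", "cluster", "dist"]])
```

Because the outlier itself pulls the cluster centroid and inflates the mean distance, this rule is conservative for small clusters; a median-based threshold would be one alternative to consider.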
Within one of the fairly well-defined regional clusters, there is one county that is a significant spatial outlier. This county, in North Carolina, has been assigned to a cluster whose other counties are consistently located in the midwestern section of the country. To understand what is happening here, we first use the hazard profile graph to understand the cluster’s traits; it shows that a defining attribute is the frequency of hail. Comparing the outlier county with the counties it borders in North Carolina, it becomes clear that this county, for some reason, has a significantly higher frequency of hail. What is causing this county to have more documented hail occurrences than its adjacent counties requires more investigation. This case study shows the potential findings that may lie within the outlier dataset.