Los Angeles Fire Risk Prediction

Los Angeles Fire Risk Prediction

1 Motivation

Our analysis offers a valuable tool for insurance companies to assess wildfire risk in Los Angeles with greater precision. By identifying areas with heightened fire risk, insurers can adjust policy pricing accordingly, say, they could sell the insurance contract with a higher price to residents in higher risk areas to mitigate potential losses. Additionally, this model allows for the discovery of untapped markets in high-risk zones. Moreover, insurance companies could provide the fire risk index to real estate platforms like Zillow, empowering homebuyers with crucial information on wildfire hazards, which could influence property value. This dual application serves both insurers and consumers, fostering informed decision-making in property insurance and real estate transactions.

2 Data Description

We use fire perimeter data downloaded from the California State Geoportal as our dependent variable. This database records fire perimeters in California from 2018 to 2021, encompassing both public and private lands. To predict the risk of wildfires, we incorporate the concentration of heat island effects across the City of LA (heat island), compile vegetation data from The Natural Communities Commonly Associated with Groundwater (NCCAG), and consider the flow of surface water from the National Hydrography Dataset by USGS (water cover) as risk factors in building our prediction model.

3 Exploratory Analysis

3.1 Data Reading and Cleaning

3.2 Data Visualization for CA Fire

fire_color <- "#FC4E2A"
fill_color <- "#FEB24C50"


for(year in 2018:2021) {

  yearly_data <- fire_data_cleaned %>% 
    filter(YEAR_ == year)
  

  static_map <- ggplot(data = yearly_data) +
    annotation_map_tile(zoom = 6) +  
    geom_sf(color = fire_color, fill = fill_color, size = 0.2) +  
    theme_minimal() +
    theme(axis.text = element_blank(), axis.title = element_blank()) +  
    labs(title = paste('California Fire Perimeters', year))
  print(static_map)
}

total_map <- ggplot(data = fire_data_cleaned) +
  annotation_map_tile(zoom = 6) + 
  geom_sf(color = fire_color, fill = fill_color, size = 0.2) +  
  theme_minimal() +
  theme(axis.text = element_blank(), axis.title = element_blank()) +  
  labs(title = 'California Fire Perimeters (2018-2021)')
print(total_map)

animate plot

fire_animation <-
  ggplot() +
   geom_sf(data = CA_boundary, alpha = 0.5) +
    geom_sf(data = fire_data_cleaned, aes(fill = YEAR_)) +
    scale_fill_manual(values = palette4) +
    labs(title = "wild fire perimeter between 2018 to 2021") +
    transition_manual(YEAR_) +
    mapTheme()

animate(fire_animation, duration=10, renderer = gifski_renderer())

animate(fire_animation, duration = 10, renderer = gifski_renderer("C:/Users/Lenovo/Desktop/PPA/final/fire_animation.gif"))

### Vegetation Distribution

ggplot(data = vegetation_data) +
  geom_sf() +
  theme_minimal() +
  labs(title = "Distribution of Vegetation ",
       subtitle = "source: Natural Communities Commonly Associated with Groundwater")

### Vegetation Distribution and Fire Relationship

crs_fire_data <- st_crs(fire_data_cleaned)
vegetation_data <- st_transform(vegetation_data, crs = crs_fire_data)

# Calculate area as numeric values
vegetation_data$vegetation_area <- as.numeric(st_area(vegetation_data))
fire_data_cleaned$fire_area <- as.numeric(st_area(fire_data_cleaned))

# Perform the spatial join
joined_data <- st_join(fire_data_cleaned, vegetation_data, join = st_intersects)

# Calculate vegetation coverage
joined_data <- joined_data %>%
  mutate(vegetation_coverage = vegetation_area / fire_area)

# Create the plot
library(scales)

min_coverage <- min(joined_data$vegetation_coverage, na.rm = TRUE)
max_coverage <- max(joined_data$vegetation_coverage, na.rm = TRUE)

# 格式化图例标签
formatted_min <- label_number(accuracy = 0.01)(min_coverage)
formatted_max <- label_number(accuracy = 0.01)(max_coverage)



ggplot() +
  geom_sf(data = joined_data, aes(fill = vegetation_coverage), size = 0.5, alpha = 0.5) +  
  geom_sf(data = fire_data_cleaned, color = "red", size = 0.1, alpha = 0.5) +  
  scale_fill_gradient(low = "#25CB10", high = "#FA7800",
                      breaks = c(min_coverage, max_coverage),  # 只显示最小值和最大值
                      labels = c(formatted_min, formatted_max),  # 格式化标签
                      na.value = NA) +   
  theme_minimal() +
  labs(title = "Vegetation Coverage Rate and Fire Incidents Relationship Map",
       fill = "Vegetation Coverage (%)") +  # 修改图例标题
  theme(legend.position = "bottom")

3.3 Animate map for LA

LAfire_animation <-
  ggplot() +
   geom_sf(data = LAcounty,alpha = 0.5) +
    geom_sf(data = LAfire_cleaned, aes(fill = YEAR)) +
    scale_fill_manual(values = palette4) +
    labs(title = "wild fire perimeter in LA between 2018 to 2022") +
    transition_manual(YEAR) +
    mapTheme()
LAfire_animation

3.4 Densitymap for LA

grid.arrange(ncol=2,
ggplot() + 
  geom_sf(data = LAcounty) +
  geom_sf(data = LAfire, colour="red", size=0.1, show.legend = "point") +
  labs(title= "LA history fire") +
  mapTheme(title_size = 14),
ggplot() + 
  geom_sf(data = LAcounty, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(LAfire)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_viridis() +
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of Fire") +
  mapTheme(title_size = 14) + theme(legend.position = "none"))

3.5 Fire Hazard Spatial Plot

# Convert your data to sf format
fire_hazard_zone_sf <- st_as_sf(fire_hazard_zone, wkt = 'geometry')

# Plot the spatial data with hazard_class coloring
ggplot() +
  geom_sf(data = fire_hazard_zone_sf, aes(fill = HAZ_CLASS)) +
  ggtitle("Fire Hazard Spatial Plot Colored by Hazard Class") +
  scale_fill_viridis(discrete = TRUE)

4 Spatial Process

In the spatial process section of our analysis, we begin by creating a fishnet grid over the Los Angeles boundary. This grid provides a structured way to analyze spatial data.

The fishnet is then used to visualize fire incidents, where we overlay the fire data onto the grid and aggregate counts of fire occurrences within each cell. We visualize this as a heatmap to identify areas with higher frequencies of fires.

Next, we model spatial features by reading and transforming various geoJSON files for heat islands, water landcover, and vegetation. These features are spatially joined to our fishnet, allowing us to examine the influence of each environmental factor within the grid cells.

Then, we aggregate these features into the fishnet, creating separate visualizations for each. This involves summing up the occurrences of heat islands, vegetation, and water cover within each cell of the fishnet and then visualizing these counts through separate heatmaps. These visualizations help to understand the spatial distribution and concentration of these risk factors across the region.

After fanalizing the fishnet, we incorporate a Nearest Neighbor analysis to enhance our feature set. This technique quantifies the proximity of each fishnet cell to the nearest high-risk features, such as heat islands, vegetation, and water bodies. We calculate this for each environmental factor, providing a nuanced view of spatial risk factors that could influence fire incidence.

For each feature—heat islands, vegetation, and water bodies—we calculate the NN distance, creating a new set of variables that capture the essence of spatial relationships. These NN features are then joined to the fishnet, enriching our dataset with spatially-derived insights.

After integrating these spatial features, we employ Local Moran’s I statistics to identify spatial autocorrelation within the fishnet grid cells. This step helps us pinpoint clusters of high and low fire incidence, revealing patterns that are not random but rather influenced by the spatial configuration of the landscape and the aforementioned risk factors.

Finally, we assess the correlation between fire count and each spatial feature, including the NN-derived variables. By plotting these correlations and running linear models, we gain a clearer understanding of how each risk factor relates to fire occurrences,

4.1 Making Fishnet

fishnet <- 
  st_make_grid(LA_Boundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[LA_Boundary] %>%    
  st_sf() %>%
  mutate(uniqueID = 1:n())

4.2 Fire Fishnet Visulization

## add a value of 1 to each crime, sum them with aggregate

fire_data <- fire_data %>%
  st_transform(crs = 3857) 



fire_net <- 
  dplyr::select(fire_data) %>% 
  mutate(countFire = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(FIRE = replace_na(countFire, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

fire_net <- st_join(fishnet, fire_net, by = c("uniqueID" = "uniqueID")) %>%
  mutate(countFire = if_else(is.na(countFire), 0, countFire))

ggplot() +
  geom_sf(data = fire_net, aes(fill = countFire), color = NA) +
  scale_fill_viridis() +
  labs(title = "Fire Accident") +
  mapTheme()

4.3 Modeling Spatial Features

Aggregate three features to our fishnet

Here we aggregate heatisland, vegetation cover, and water cover data into our fishnet.

heatisland_net <- 
  dplyr::select(Heatisland) %>% 
  mutate(countHeatisland = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countHeatisland = replace_na(countHeatisland, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

ggplot() +
  geom_sf(data = heatisland_net, aes(fill = countHeatisland), color = NA) +
  scale_fill_viridis() +
  labs(title = "Count of HeatIsland for fishnet") +
mapTheme()

vegetation_net <- 
  dplyr::select(vegetation_data) %>% 
  mutate(countvegetation = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countvegetation = replace_na(countvegetation, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

vegetation_net <- st_join(fishnet, vegetation_net, left = TRUE) %>%
  mutate(countvegetation = if_else(is.na(countvegetation), 0, countvegetation)) 

ggplot() +
  geom_sf(data = vegetation_net, aes(fill = countvegetation), color = NA) +
  scale_fill_viridis() +
  labs(title = "Count of Vegetation Cover for fishnet") +
  mapTheme()

water_net <- st_intersection(fishnet, water_landcover) %>%
  group_by(uniqueID) %>%
  summarize(countwater = n(), .groups = 'drop') %>%
  mutate(countwater = replace_na(countwater, 0))


water_net <- st_join(fishnet, water_net, left = TRUE, join = st_intersects) %>%
  mutate(countwater = if_else(is.na(countwater), 0, countwater)) 

# Plot the result
ggplot() +
  geom_sf(data = water_net, aes(fill = countwater), color = NA) +
  scale_fill_viridis() +
  labs(title = "Count of Water Cover for Fishnet") +
  theme_void()  # Use an appropriate theme

4.4 Nearest Neighbor Feature

fire_net <- 
  dplyr::select(fire_data) %>% 
  mutate(countFire = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(FIRE = replace_na(countFire, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))
water_net <- st_intersection(fishnet, water_landcover) %>%
  group_by(uniqueID) %>%
  summarize(countwater = n(), .groups = 'drop') %>%
  mutate(countwater = replace_na(countwater, 0))

vegetation_net <- 
  dplyr::select(vegetation_data) %>% 
  mutate(countvegetation = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countvegetation = replace_na(countvegetation, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

get_centroid_coords <- function(sf_obj) {
  st_coordinates(st_centroid(st_geometry(sf_obj)))
}

centroids_vars_net <- get_centroid_coords(vars_net)
water_landcover <- water_landcover[!is.na(st_geometry(water_landcover)), ]
vars_net$Heatisland.nn <- nn_function(centroids_vars_net, get_centroid_coords(Heatisland), 3)
vars_net$vegetation.nn <- nn_function(centroids_vars_net, get_centroid_coords(vegetation), 3)

water_landcover <- water_landcover %>% filter(!st_is_empty(.) & !is.na(st_area(.)))
centroids_water <- st_centroid(water_landcover)
centroids_water <- centroids_water[!is.na(st_coordinates(centroids_water)[,1]), ]
vars_net$watercover.nn <- nn_function(centroids_vars_net, get_centroid_coords(water_landcover), 3)
vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)

vars <- unique(vars_net.long.nn$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill = value), colour = NA) +
      scale_fill_viridis(option = "magma") +  # 指定使用magma色板
      labs(title = i) +
      mapTheme()  
}

do.call(grid.arrange, c(mapList, ncol = 3, top = "Nearest Neighbor risk Factors by Fishnet"))

4.5 Join NN feature to our fishnet

## important to drop the geometry from joining features

final_net <-
  left_join(fire_net, st_drop_geometry(vars_net), by="uniqueID") 
final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhood, name), by = "uniqueID") %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

# for live demo
# mapview::mapview(final_net, zcol = "District")

Local Moran’s I for fishnet grid cells

## see ?localmoran
local_morans <- localmoran(final_net$FIRE,final_net.weights, zero.policy=TRUE) %>% 
  as.data.frame()

# join local Moran's I results to fishnet
final_net.localMorans <- 
  cbind(local_morans, as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(FIRE_Count = FIRE, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
  gather(Variable, Value, -geometry)

## plotting
vars <- unique(final_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_viridis(option = "magma") +
      labs(title=i) +
      mapTheme(title_size = 14) + theme(legend.position="bottom")}

do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, Fire Risk"))

Correlation

Using NN distance to a hot spot location

correlation.long <-
  st_drop_geometry(final_net) %>%
    dplyr::select(-uniqueID, -cvID, -name) %>%
    gather(Variable, Value, -countFire)

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, countFire, use = "complete.obs"))
    
ggplot(correlation.long, aes(Value, countFire)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  labs(title = "Fire count as a function of risk factors") +
  plotTheme() 

hist(final_net$countFire, 
     main = "Histogram of Dependent Variable",
     xlab = "Fire",
     ylab = "Frequency",
     col = "blue",  
     border = "black")

5 Modeling and CV

In our approach to modeling wildfire risk, we utilized a robust cross-validation framework to ensure both accuracy and generalizability. Initially, we selected key spatial variables such as proximity to heat islands, vegetation, and water bodies to serve as our independent variables. We then performed regression analyses using these features, adopting both random k-fold and Leave-One-Group-Out (LOGO) cross-validation techniques to rigorously evaluate our model’s performance.

The random k-fold method provided a baseline for model accuracy, while the spatial LOGO cross-validation, which respects the spatial autocorrelation inherent in our data, gave us insights into the model’s generalizability across different geographical contexts.

Our analyses culminated in the comparison of Mean Absolute Error (MAE) distributions, revealing the consistency of our model’s predictive power. The resulting MAE values, alongside their standard deviations, were tabulated to offer a clear summary of our model’s performance under various cross-validation schemes.

This thorough validation process not only bolstered the credibility of our predictive model but also ensured its applicability for insurance companies. By accurately predicting high-risk fire areas, insurers can adjust premiums and identify potential customers in high-risk zones, while providing critical risk index data to real estate platforms, thereby informing potential buyers and influencing property prices.

5.1 Regression

reg.vars <- c("Heatisland.nn", "vegetation.nn", 
              "watercover.nn")

reg.ss.vars <- c("Heatisland.nn", "vegetation.nn", 
              "watercover.nn", "fire.isSig","fire.isSig.dist")

## RUN REGRESSIONS
reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countFire",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countFire, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countFire",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = cvID, countFire, Prediction, geometry)
  
reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countFire",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = name, countFire, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countFire",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countFire, Prediction, geometry)

5.2 Accuracy & Generalzability

multiple map of model errors by random k-fold and spatial cross validation

The MAE histograms indicate that our model’s predictions are more precise when spatial processes are considered, as evidenced by the tighter distribution and lower MAE values in the spatial LOGO-CV. This suggests a strong model performance with high accuracy in predicting wildfire risks.

error_by_reg_and_fold <- 
  reg.summary %>%
    group_by(Regression, cvID) %>% 
    summarize(Mean_Error = mean(Prediction - countFire, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
    geom_histogram(bins = 30, colour="black", fill = "#FDE725FF") +
    facet_wrap(~Regression) +  
    geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 8, by = 1)) + 
    labs(title="Distribution of MAE", subtitle = "k-fold cross validation vs. LOGO-CV",
         x="Mean Absolute Error", y="Count") +
    plotTheme()

Table of MAE and standard deviation MAE by regression

The table summarizes the model’s performance, with the lowest MAE and standard deviation observed in the Random k-fold CV: Spatial Process model. This indicates a consistent and accurate prediction capability, while also highlighting the importance of spatial factors in model reliability

st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable() %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "#FDE725FF") %>%
    row_spec(4, color = "black", background = "#FDE725FF") 
Regression Mean_MAE SD_MAE
Random k-fold CV: Just Risk Factors 0.67 0.63
Random k-fold CV: Spatial Process 0.51 0.50
Spatial LOGO-CV: Just Risk Factors 0.95 0.60
Spatial LOGO-CV: Spatial Process 0.64 0.38

table of raw errors by race context for a random k-fold vs. spatial cross validation regression

The Moran’s I statistics for the residuals of our spatial cross-validation do not show significant autocorrelation, which is indicative of a model that generalizes well across space without overfitting to the training data. This ensures the model’s applicability to diverse geographical areas.

neighborhood.weights <-
  filter(error_by_reg_and_fold, Regression == "Spatial LOGO-CV: Spatial Process") %>%
    group_by(cvID) %>%
      poly2nb(as_Spatial(.), queen=TRUE) %>%
      nb2listw(., style="W", zero.policy=TRUE)

filter(error_by_reg_and_fold, str_detect(Regression, "LOGO"))  %>% 
    st_drop_geometry() %>%
    group_by(Regression) %>%
    summarize(Morans_I = moran.mc(abs(Mean_Error), neighborhood.weights, 
                                 nsim = 999, zero.policy = TRUE, 
                                 na.action=na.omit)[[1]],
              p_value = moran.mc(abs(Mean_Error), neighborhood.weights, 
                                 nsim = 999, zero.policy = TRUE, 
                                 na.action=na.omit)[[3]])
## # A tibble: 2 × 3
##   Regression                         Morans_I p_value
##   <chr>                                 <dbl>   <dbl>
## 1 Spatial LOGO-CV: Just Risk Factors  -0.0322   0.539
## 2 Spatial LOGO-CV: Spatial Process     0.0232   0.387

6 Conclusion

Our analysis directly addresses the use case of enabling insurance companies to adjust policy pricing based on wildfire risk assessment and providing real estate platforms with data on property risk levels. By integrating spatial features and employing cross-validation techniques, we’ve developed a model that not only accurately predicts wildfire incidence but also takes into account the spatial distribution of environmental risk factors. This allows for more nuanced premium setting and risk communication to potential property buyers.

To further enhance the analysis, firstly we need to find more accurate data directly from departments, the current datasets are not that accurate (though we searched all the data and can not find better data in public websites) and it directly impacts the model. Moreover, we could incorporate more dynamic risk factors, such as real-time weather data or changes in land use. Additionally, integrating machine learning algorithms could potentially uncover complex, non-linear patterns within the data. Regular updates with the latest fire incident data would keep the model current, improving its predictive power. Lastly, engaging with domain experts in fire management and insurance underwriting could provide deeper insights, refining the model’s utility in real-world scenarios.