House Price Prediction

Predicting Housing Prices

1 Introduction

This project delves into house price prediction in the context of Philadelphia. It starts with an in-depth data wrangling phase, encompassing comprehensive data cleansing, enhancements, and correlation sourced from Open Data Philly and Github. Additional variables were thoughtfully introduced, facilitating a more comprehensive analysis, and resulting in an array of engaging data visualizations.

An important part of this project is Ordinary Least Squares (OLS) regression, where we illuminate the fundamental determinants wielding substantial influence over property prices. By scrutinizing key attributes such as livable area, bathroom count, and property conditions, we’ve successfully crafted a robust predictive model for housing prices, distinguished by its remarkable precision. Then we performed cross-validation test, subjecting the predictive model to rigorous evaluation. It ensures the model’s unwavering reliability and consistent accuracy, imperative in the real estate and urban planning domains.

Moreover, we completed spatial analysis in this project, transcending traditional regression modeling to account for the spatial distribution of properties and their associated prices. The discernment of spatial patterns within prediction errors uncovers latent geographic influences on housing prices, offering invaluable insights to urban planners and policymakers.

Furthermore, our project delves into the intricate landscape of prediction accuracy, unearthing disparities across neighborhoods and income contexts. By evaluating the model’s performance across diverse regions and income strata, we provide actionable insights relevant to a multitude of stakeholders, including real estate professionals, investors, and policymakers. This treasure trove of insights can underpin well-informed decisions, be it in real estate investments, urban development initiatives, or policy planning.

2 Data section

2.1 Data Wrangling

## Home Features correlation
mean_area <- mean(PA_M.sf$total_livable_area, na.rm = TRUE)
PA_M.sf <- PA_M.sf %>%
  mutate(total_livable_area = ifelse(total_livable_area == 0, mean_area, total_livable_area))

mean_bathrm <- mean(PA_M.sf$number_of_bathrooms, na.rm = TRUE)
PA_M.sf <- PA_M.sf %>%
  mutate(number_of_bathrooms = ifelse(number_of_bathrooms == 0 | is.na(number_of_bathrooms), mean_bathrm, number_of_bathrooms))


mean_distance <- mean(PA_M.sf$closest_distance_to_commercial, na.rm = TRUE)
PA_M.sf <- PA_M.sf %>%
  mutate(closest_distance_to_commercial = ifelse(closest_distance_to_commercial == 0 | is.na(closest_distance_to_commercial), mean_distance, closest_distance_to_commercial))


PA_M.sf$interior_condition <- ifelse(PA_M.sf$interior_condition == 0, 1, PA_M.sf$interior_condition)

PA_M.sf <- PA_M.sf %>%
  mutate(Age = 2023 - year_built)
PA_M.sf$Age <- ifelse(PA_M.sf$Age == -1, 0, PA_M.sf$Age)

mean_depth <- mean(PA_M.sf$depth, na.rm = TRUE)
PA_M.sf$depth[is.na(PA_M.sf$depth)] <- mean_depth

PA_M.sf$quality <- ifelse(PA_M.sf$quality_grade == "A+", 14,
                ifelse(PA_M.sf$quality_grade == "A ", 13,
                ifelse(PA_M.sf$quality_grade == "A-", 12,
                ifelse(PA_M.sf$quality_grade == "B+", 11,
                ifelse(PA_M.sf$quality_grade == "B ", 10,
                ifelse(PA_M.sf$quality_grade == "B-", 9,
                ifelse(PA_M.sf$quality_grade == "C+", 8,
                ifelse(PA_M.sf$quality_grade == "C ", 7,
                ifelse(PA_M.sf$quality_grade == "C-", 6,
                ifelse(PA_M.sf$quality_grade == "D+", 5,
                ifelse(PA_M.sf$quality_grade == "D ", 4,
                ifelse(PA_M.sf$quality_grade == "D-", 3,
                ifelse(PA_M.sf$quality_grade == "E+", 2,
                ifelse(PA_M.sf$quality_grade == "E ", 1,
                ifelse(PA_M.sf$quality_grade == "E-", 0,
                ifelse(PA_M.sf$quality_grade == "2 ", 2,
                ifelse(PA_M.sf$quality_grade == "3 ", 3,
                ifelse(PA_M.sf$quality_grade == "4 ", 4,
                ifelse(PA_M.sf$quality_grade == "5 ", 5,
                ifelse(PA_M.sf$quality_grade == "6 ", 6, NA_real_))))))))))))))))))))
PA_M.sf$quality <- ifelse(PA_M.sf$quality == 0, 1, PA_M.sf$quality)

2.2 Summary Table for Variable Descriptions

PA_M.df <- as.data.frame(PA_M.sf)

# Internal Characteristics
internal_stats <- PA_M.df %>%
  select(total_livable_area, number_of_bedrooms, number_stories, interior_condition) %>%
  gather(variable, value) %>%
  group_by(variable) %>%
  summarise(
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    Q1 = quantile(value, 0.25, na.rm = TRUE),
    Q3 = quantile(value, 0.75, na.rm = TRUE),
    Min = min(value, na.rm = TRUE),
    Max = max(value, na.rm = TRUE)
  ) %>%
  mutate(category = "Internal Characteristics")

# Spatial Characteristics
spatial_stats <- PA_M.df %>%
  select(closest_distance_to_commercial, depth) %>%
  gather(variable, value) %>%
  group_by(variable) %>%
  summarise(
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    Q1 = quantile(value, 0.25, na.rm = TRUE),
    Q3 = quantile(value, 0.75, na.rm = TRUE),
    Min = min(value, na.rm = TRUE),
    Max = max(value, na.rm = TRUE)
  ) %>%
  mutate(category = "Spatial Characteristics")

# External Characteristics
external_stats <- PA_M.df %>%
  select(exterior_condition, Age, quality) %>%
  gather(variable, value) %>%
  group_by(variable) %>%
  summarise(
    mean = mean(value, na.rm = TRUE),
    sd = sd(value, na.rm = TRUE),
    Q1 = quantile(value, 0.25, na.rm = TRUE),
    Q3 = quantile(value, 0.75, na.rm = TRUE),
    Min = min(value, na.rm = TRUE),
    Max = max(value, na.rm = TRUE)
  ) %>%
  mutate(category = "External Characteristics")


all_stats <- rbind(internal_stats, spatial_stats, external_stats)
styled_table <- kable(all_stats, format = "html", digits = 2, caption = "Summary Statistics", booktabs = TRUE) %>%
  kable_styling(full_width = FALSE) %>%
  add_header_above(c(" " = 1, "Summary Statistics" = 6, " " = 1), font_size = 15) %>%
  group_rows("Internal Characteristics", 1, 4) %>%
  group_rows("Spatial Characteristics", 5, 6) %>%
  group_rows("External Characteristics", 7, 9)

styled_table
Summary Statistics
Summary Statistics
variable mean sd Q1 Q3 Min Max category
Internal Characteristics
interior_condition 3.65 0.89 3.00 4.00 1.00 7.00 Internal Characteristics
number_of_bedrooms 2.58 1.27 2.00 3.00 0.00 31.00 Internal Characteristics
number_stories 1.96 0.55 2.00 2.00 1.00 5.00 Internal Characteristics
total_livable_area 1335.85 546.46 1050.00 1474.00 64.00 15246.00 Internal Characteristics
Spatial Characteristics
closest_distance_to_commercial 717.68 611.87 302.14 954.61 0.07 7385.67 Spatial Characteristics
depth 76.40 36.42 55.00 98.00 0.00 1115.00 Spatial Characteristics
External Characteristics
Age 86.32 51.14 73.00 103.00 0.00 2023.00 External Characteristics
exterior_condition 3.76 0.88 4.00 4.00 1.00 7.00 External Characteristics
quality 7.38 1.09 7.00 8.00 1.00 14.00 External Characteristics

This table offers a simple summary of important variable statistics. It’s divided into three main groups: Internal Characteristics, Spatial Characteristics, and External Characteristics. The numbers in the table show us basic information about these characteristics. For example, they tell us the average values, which give us a sense of what’s typical. The table also provides information about how spread out the data is, sort of like a measure of how much the values vary. Additionally, it shows the range, which is just the span from the lowest to the highest values for each of these aspects of housing. These statistics help us get a basic grasp of the data we’re working with and are useful for our housing price prediction analysis.

2.3 Correlation Matrix

PACrimes.sf <-
  PACrimes %>%
    filter(text_general_code == "Aggravated Assault Firearm",
           lat > -1) %>%
    dplyr::select(lat, lng) %>%
    na.omit() %>%
    st_as_sf(coords = c("lng", "lat"), crs = "EPSG:4326") %>%
    st_transform('ESRI:102728') %>%
    distinct()
#numericVars <- 
 # select_if(st_drop_geometry(PA_M.sf), is.numeric) %>% na.omit()

numericVars <- PA_M.sf %>% 
  st_drop_geometry() %>% 
  mutate(Age = 2023 - year_built) %>%
  select(total_livable_area, total_area, Age, number_of_bathrooms, number_of_bedrooms, crime_nn1, crime_nn2, crime_nn3, crime_nn4, crime_nn5) %>%
  na.omit()

ggcorrplot(
  round(cor(numericVars), 1), 
  p.mat = cor_pmat(numericVars),
  colors = c("#25CB10", "white", "#FA7800"),
  type="lower",
  insig = "blank") +  
    labs(title = "Correlation across numeric variables") 

# yet another way to plot the correlation plot using the corrr library
numericVars %>% 
  correlate() %>% 
  autoplot() +
  geom_text(aes(label = round(r,digits=2)),size = 2)

In this section, we examined how different aspects of houses, like their size, age, and the number of bedrooms, are related to one another. This helps us understand if some features tend to go together. Additionally, we’re looking at the connection between housing and crime rates, specifically focusing on places with a higher occurrence of aggravated assault with firearms. This helps us understand how safety concerns might affect property values, which is essential in real estate analysis.

2.4 Four Home Price Correlation Scatterplots

num_zero_or_na <- sum(PA_M.sf$total_livable_area == 0 | is.na(PA_M.sf$total_livable_area)) #check the NAdata and 0 data inside Total_livable)area
num_zero_or_na2 <- sum(PA_M.sf$total_area == 0 | is.na(PA_M.sf$total_area))
st_drop_geometry(PA_M.sf) %>% 
  dplyr::select(sale_price,  total_livable_area,exterior_condition, interior_condition, quality) %>%
  gather(Variable, Value, -sale_price) %>% 
   ggplot(aes(Value, sale_price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, ncol = 3, scales = "free") +
     labs(title = "Price as a function of continuous variables")+
  plotTheme()

This section presents scatterplots that showcase the relationships between home sale prices and various factors, such as the total livable area, exterior condition, interior condition, and overall quality. These visualizations help us understand how these factors correlate with property prices, providing valuable insights for property buyers and sellers.

2.5 Mapping Houseing Price per square

ggplot() +
  geom_sf(data = PAnhoods, fill = "grey60") +
  geom_sf(data = PA_M.sf, aes(colour = q5(sale_price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(PA_M.sf,"sale_price"),
                   name="Sale Price Value") +
  labs(title="House Sale Price Per Square Foot, PA") +
  mapTheme()

2.6 Maps of 3 Independent Variables

ggplot() +
  geom_sf(data = PAnhoods, fill = "grey60") +
  geom_sf(data = PA_M.sf, aes(colour = q5(quality)), 
          show.legend = "point", size = .65) +
  scale_colour_manual(values = palette5,
                   labels=qBr(PA_M.sf,"quality"),
                   name="House Quality Grade") +
  labs(title="House Quality Per Square Foot, PA") +
  mapTheme()

ggplot() +
  geom_sf(data = PAnhoods, fill = "grey60") +
  geom_sf(data = PA_M.sf, aes(colour = q5(as.numeric(Age))), 
          show.legend = "point", size = .65) +
  scale_colour_manual(values = palette5,
                      labels = qBr(PA_M.sf, "Age"),
                      name = "House Age") +
  labs(title = "House Age Per Square Foot, PA") +
  mapTheme()

ggplot() +
  geom_sf(data = PAnhoods, fill = "grey60") +
  geom_sf(data = PA_M.sf, aes(colour = q5(as.numeric(Age))), 
          show.legend = "point", size = .65) +
  scale_colour_manual(values = palette5,
                      labels = qBr(PA_M.sf, "total_livable_area"),
                      name = "livable Area Value") +
  labs(title = "House Age Per Square Foot, PA") +
  mapTheme()

This section includes maps displaying three independent variables: house quality grade, house age, and livable area value. These maps provide a visual representation of how these variables vary across different areas in Philadelphia. Analyzing such spatial patterns can offer insights for real estate professionals, urban planners, and policymakers, helping them make informed decisions related to housing and development in the city.

2.7 Extra Interesting Maps explored

2.7.1 crime data visualization

PA_M.sf %>%
  st_drop_geometry() %>%
  mutate(Age = 2023 - year_built) %>%
  dplyr::select(sale_price, starts_with("crime_")) %>%
  gather(Variable, Value, -sale_price) %>% 
   ggplot(aes(Value, sale_price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, nrow = 2, scales = "free") +
     labs(title = "Price as a function of continuous variables") 

## Plot assault density
ggplot() + geom_sf(data = PAnhoods, fill = "grey40") +
  stat_density2d(data = data.frame(st_coordinates(PACrimes.sf)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 linewidth = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "#25CB10", high = "#FA7800", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Aggravated Assaults with Firearm, PA") +
  mapTheme()

In this section, we present visualizations of crime data related to aggravated assaults with firearms. These visualizations help us understand the impact of crime density and type on property prices in Philadelphia. By examining the relationships between crime data and housing prices, we can gain valuable insights into the city’s real estate market and safety considerations. These insights are essential for real estate professionals, city planners, and policymakers when making decisions related to urban development and public safety.

2.7.2 Univarite correlation -> multi-variate OLS regression

2.7.3 Univarite Regression

In this section, we delve into univariate regression analysis, which focuses on the relationship between a single independent variable, “total livable area,” and the dependent variable, “sale price.” We aim to understand how changes in the livable area impact property prices in Philadelphia. Our analysis uses statistical tests and visualizations to reveal correlations and regression results.

The univariate regression analysis helps us comprehend the linear relationship between total livable area and sale price, providing insights into the influence of property size on pricing. This is valuable information for potential homebuyers, sellers, and real estate professionals in the Philadelphia area.

3 Method

Based on housing price data and other variables influencing housing prices from OpenDataPhilly, this study commenced by gathering potential determinants of housing prices, followed by data processing and cleaning. Subsequently, spatial lag analysis was employed to compute spatial lag errors, and the Moran’s I Test was instituted to assess spatial autocorrelation, ensuring the suitability of the spatial model. An OLS regression model was then formulated, integrating multiple explanatory variables such as total livable area, number of bathrooms, exterior condition, interior condition, number of stories, depth, Age, house quality, and distance to commercial area, aiming to predict housing prices. Ultimately, based on the established model, housing prices for both known and unknown properties were predicted, with these projected values juxtaposed against the actual sale prices. By calculating the predictive error from the model, a spatial representation was provided. In essence, this research synthesized data handling, spatial analysis, regression modeling, error scrutiny, and data visualization to devise an accurate housing price prediction model.

4 Results

4.1 Summary Table for Training Set

library(broom)
library(knitr)
library(kableExtra)

PA_M_join <- st_join(PA_M.sf, PAnhoods, left = TRUE)
PA_M_join_modeling <- PA_M_join[PA_M_join$toPredict == "MODELLING", ]
set.seed(123)
inTrain <- createDataPartition(y = PA_M_join_modeling$sale_price, p = 0.85, list = FALSE)
PA.training <- PA_M_join_modeling[inTrain, ]
PA.test <- PA_M_join_modeling[-inTrain, ]

model <- lm(sale_price ~ total_livable_area + number_of_bathrooms + exterior_condition + 
            interior_condition + number_stories + depth + Age + quality + 
            closest_distance_to_commercial, data = PA.training)

model_summary <- tidy(model)

kable(model_summary, caption = "Detailed OLS Regression Results", format = "html") %>%
  kable_styling()
Detailed OLS Regression Results
term estimate std.error statistic p.value
(Intercept) -11854.967170 13877.894595 -0.8542338 0.3929858
total_livable_area 208.204512 3.145602 66.1890733 0.0000000
number_of_bathrooms 77034.048072 2961.954184 26.0078459 0.0000000
exterior_condition -32830.173753 2074.113506 -15.8285328 0.0000000
interior_condition -29103.440659 2114.265215 -13.7652743 0.0000000
number_stories 2722.894441 2422.076392 1.1241984 0.2609425
depth -591.037789 38.404096 -15.3899676 0.0000000
Age 70.950815 34.208862 2.0740478 0.0380878
quality 23117.972614 1293.059556 17.8785057 0.0000000
closest_distance_to_commercial 6.800683 2.019235 3.3679505 0.0007587
PA.test <-
  PA.test %>%
  mutate(Regression = "Baseline Regression",
         sale_price.Predict = predict(model, PA.test),
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price) %>%
  filter(sale_price < 4000000) 
PA_M_join <- st_join(PA_M.sf, PAnhoods, left = TRUE)
PA_M_join_modeling <- PA_M_join[PA_M_join$toPredict == "MODELLING", ]
set.seed(123)
inTrain <- createDataPartition(y = PA_M_join_modeling$sale_price, p = 0.85, list = FALSE)
PA.training <- PA_M_join_modeling[inTrain, ]
PA.test <- PA_M_join_modeling[-inTrain, ]

model <- lm(sale_price ~ total_livable_area + number_of_bathrooms + exterior_condition + 
            interior_condition + number_stories + depth + Age + quality + 
            closest_distance_to_commercial, data = PA.training)

stargazer(model, title="OLS Regression Results", type="text", align=TRUE)
## 
## OLS Regression Results
## ===========================================================
##                                    Dependent variable:     
##                                ----------------------------
##                                         sale_price         
## -----------------------------------------------------------
## total_livable_area                      208.205***         
##                                          (3.146)           
##                                                            
## number_of_bathrooms                   77,034.050***        
##                                        (2,961.954)         
##                                                            
## exterior_condition                    -32,830.170***       
##                                        (2,074.114)         
##                                                            
## interior_condition                    -29,103.440***       
##                                        (2,114.265)         
##                                                            
## number_stories                          2,722.894          
##                                        (2,422.076)         
##                                                            
## depth                                  -591.038***         
##                                          (38.404)          
##                                                            
## Age                                      70.951**          
##                                          (34.209)          
##                                                            
## quality                               23,117.970***        
##                                        (1,293.060)         
##                                                            
## closest_distance_to_commercial           6.801***          
##                                          (2.019)           
##                                                            
## Constant                               -11,854.970         
##                                        (13,877.900)        
##                                                            
## -----------------------------------------------------------
## Observations                              19,826           
## R2                                        0.481            
## Adjusted R2                               0.481            
## Residual Std. Error              167,368.000 (df = 19816)  
## F Statistic                    2,043.955*** (df = 9; 19816)
## ===========================================================
## Note:                           *p<0.1; **p<0.05; ***p<0.01
PA.test <-
  PA.test %>%
  mutate(Regression = "Baseline Regression",
         sale_price.Predict = predict(model, PA.test),
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price) %>%
  filter(sale_price < 4000000) 

These two tables illustrates the results of an Ordinary Least Squares (OLS) regression analysis used to predict house sale prices in Philadelphia. It reveals that factors such as the total livable area, number of bathrooms, and quality have a positive impact on the sale price, with a unit increase in total livable area leading to a $208.205 increase in the sale price. Conversely, attributes like exterior condition, interior condition, and depth negatively influence the price. Notably, this model captures approximately 48.1% of the variability in sale prices, and the significance of the F-statistic confirms the collective relevance of the predictors in the model.

4.2 Summary Table of MAE and MAPE for Test Set

reg.nhood <- lm(sale_price ~ ., data = as.data.frame(PA.training) %>% 
                                 dplyr::select(name,sale_price, total_livable_area, 
                                           number_of_bathrooms, exterior_condition,
                                           interior_condition, number_stories, depth, Age, quality,
                                           closest_distance_to_commercial))

PA.test.nhood <-
  PA.test %>%
  mutate(Regression = "Neighborhood Effects",
         sale_price.Predict = predict(reg.nhood, PA.test),
         sale_price.Error = sale_price.Predict- sale_price,
         sale_price.AbsError = abs(sale_price.Predict- sale_price),
         sale_price.APE = (abs(sale_price.Predict- sale_price)) / sale_price)
coords <- st_coordinates(PA_M_join) 

neighborList <- knn2nb(knearneigh(coords, 5))

spatialWeights <- nb2listw(neighborList, style="W")

PA_M_join$lagPrice <- lag.listw(spatialWeights, PA_M_join$sale_price)
PA.test.nhood <- PA.test.nhood %>%
  mutate(sale_price.Error = ifelse(is.finite(sale_price.Error), sale_price.Error, 0))
PA.test <- PA.test %>%
  mutate(sale_price.Error = ifelse(is.finite(sale_price.Error), sale_price.Error, 0))
coords.test <- st_coordinates(PA.test) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style = "W")

replacement_value <- 0 
PA.test$sale_price.Error[is.na(PA.test$sale_price.Error)] <- replacement_value

PA.test <- PA.test %>%
  filter(is.finite(sale_price.Error)) %>%
  mutate(lagPriceError = lag.listw(spatialWeights.test, replace(sale_price.Error, is.na(sale_price.Error), 0)))


bothRegressions <- 
  rbind( dplyr::select(PA.test, starts_with("sale_price"), Regression, name) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)),
    
    dplyr::select(PA.test.nhood, starts_with("sale_price"), Regression, name) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)))  
st_drop_geometry(bothRegressions) %>%
  gather(Variable, Value, -Regression, -name) %>%
  filter(Variable == "sale_price.AbsError" | Variable == "sale_price.APE") %>%
  group_by(Regression, Variable) %>%
    summarize(meanValue = mean(Value, na.rm = T)) %>%
    spread(Variable, meanValue) %>%
    kable()
Regression sale_price.AbsError sale_price.APE
Baseline Regression 107245.50 0.6733848
Neighborhood Effects 71136.38 0.3910088

4.3 Cross-Validation Test

data(PA_M.sf)
set.seed(123)

ctrl <- trainControl(method = "cv", 
                     number = 100, 
                     summaryFunction = defaultSummary,
                     verboseIter = FALSE)


reg.cv <- train(sale_price ~ ., 
               data = st_drop_geometry(PA_M.sf)%>%
               dplyr::select(sale_price, total_livable_area,
                              number_of_bathrooms, exterior_condition,
                              interior_condition, 
                              number_stories, 
                              depth, 
                              Age, 
                              quality, 
                              closest_distance_to_commercial),
               method = "lm",
               metric = "MAE",
               trControl = ctrl, na.action =na.pass)

results <- reg.cv$resample

ggplot(results, aes(x = RMSE)) + 
  geom_histogram(fill = "skyblue", color = "black", bins = 30) + 
  labs(title = "Histogram of Cross-Validation MAE",
       x = "Mean Absolute Error (MAE)",
       y = "Frequency")

mean_mae <- mean(results$RMSE)
sd_mae <- sd(results$RMSE)

stats <- data.frame(Statistic = c("Mean MAE", "Standard Deviation of MAE"),
                       Value = c(mean_mae, sd_mae))

kable(stats, "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Statistic Value
Mean MAE 166540.99
Standard Deviation of MAE 38966.81

The histogram of the Mean Absolute Error (MAE) reveals the model’s accuracy in its predictions. Most errors cluster around the 200,000 mark, suggesting a typical deviation of this magnitude from actual prices. Additionally, the table provides a summary of the average MAE and its variability (standard deviation). The Mean MAE of 165,540.99 suggests that, on average, your model’s predictions are approximately 165,541 dollars off from the actual house prices. This can be considered a significant deviation, particularly depending on the average house prices in your dataset. The Standard Deviation of MAE being 38,966.81 indicates the variability or spread of the prediction errors. In simpler terms, while on average the error is around $165,541, the errors can deviate by about 38,967 dollars above or below this mean.

4.4 Predicted Prices as a Function of Observed Prices

bothRegressions %>%
  dplyr::select(sale_price.Predict, sale_price, Regression) %>%
    ggplot(aes(sale_price, sale_price.Predict)) +
  geom_point() +
  stat_smooth(aes(sale_price, sale_price), 
             method = "lm", se = FALSE, size = 1, colour="#FA7800") + 
  stat_smooth(aes(sale_price.Predict, sale_price), 
              method = "lm", se = FALSE, size = 1, colour="#25CB10") +
  facet_wrap(~Regression) +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
  plotTheme()

This plot illustrates the relationship between the predicted sale prices and the actual observed prices. The distribution of data points predominantly hugging the orange line suggests that the model’s predictions are generally accurate. The green line in the model closely aligns with the orange line, further indicating the precision of the predictions. However, there are regions where the green line diverges from the orange line, pointing to potential inaccuracies in predictions for those areas. Moreover, a few outliers, distant from both the green and orange lines, highlight specific areas where the model might be less reliable.

4.5 Map of Residuals for Test set

4.5.1 Plot of the Spatial Lag in Errors

From the plot, we observe a relationship between the prediction errors (sale_price.Error) and the spatial lag of these errors (lagPriceError). The spatial lag error essentially captures the relationship between a point’s prediction error and the average prediction error of its neighboring points. While the majority of data points cluster around the central area, suggesting that most predictions and their spatial lag errors are relatively small, there are some outliers distant from this cluster. This disparity could indicate larger prediction discrepancies in certain geographical locations compared to their surrounding areas. Such spatial distribution hints that the model might not have fully accounted for some geographical or neighboring factors. For instance, a unique environmental or economic characteristic in a specific area may result in significant price differences from its surrounding regions in PA, which the model might have overlooked, leading to these prediction errors.

4.5.2 Moran’s I Test Result

moranTest <- moran.mc(PA.test$sale_price.Error, 
                      spatialWeights.test, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()

This test plot depicts the observed Moran’s I value, indicated by the orange line, in relation to the distribution of Moran’s I values obtained from random permutations. As seen from the chart, most of the random Moran’s I values cluster around 0, while the observed Moran’s I value, marked by the orange line, is noticeably shifted to the right. This suggests a certain positive spatial autocorrelation in the model’s errors, indicating that adjacent areas exhibit similar modeling errors.

4.6 A Map of Predicted Values

nearest_commercial <- st_nearest_feature(PA.sf, commercial.sf)
closest_distances <- numeric(length(PA.sf))

PA.sf$closest_distance_to_commercial <- NA


for (i in 1:nrow(PA.sf)) {
  distances <- st_distance(PA.sf[i, ], commercial.sf)
  min_distance <- min(distances, na.rm = TRUE)
  PA.sf$closest_distance_to_commercial[i] <- min_distance
}
mean_area <- mean(PA.sf$total_livable_area, na.rm = TRUE)
PA.sf <- PA.sf %>%
  mutate(total_livable_area = ifelse(total_livable_area == 0, mean_area, total_livable_area))

mean_bathrm <- mean(PA.sf$number_of_bathrooms, na.rm = TRUE)
PA.sf <- PA.sf %>%
  mutate(number_of_bathrooms = ifelse(number_of_bathrooms == 0 | is.na(number_of_bathrooms), mean_bathrm, number_of_bathrooms))


mean_distance <- mean(PA.sf$closest_distance_to_commercial, na.rm = TRUE)
PA.sf <- PA.sf %>%
  mutate(closest_distance_to_commercial = ifelse(closest_distance_to_commercial == 0 | is.na(closest_distance_to_commercial), mean_distance, closest_distance_to_commercial))


PA.sf$interior_condition <- ifelse(PA.sf$interior_condition == 0, 1, PA.sf$interior_condition)

PA.sf <- PA.sf %>%
  mutate(Age = 2023 - year_built)
PA.sf$Age <- ifelse(PA.sf$Age == -1, 0, PA.sf$Age)

mean_depth <- mean(PA.sf$depth, na.rm = TRUE)
PA.sf$depth[is.na(PA.sf$depth)] <- mean_depth

PA.sf$quality <- ifelse(PA.sf$quality_grade == "A+", 14,
                ifelse(PA.sf$quality_grade == "A ", 13,
                ifelse(PA.sf$quality_grade == "A-", 12,
                ifelse(PA.sf$quality_grade == "B+", 11,
                ifelse(PA.sf$quality_grade == "B ", 10,
                ifelse(PA.sf$quality_grade == "B-", 9,
                ifelse(PA.sf$quality_grade == "C+", 8,
                ifelse(PA.sf$quality_grade == "C ", 7,
                ifelse(PA.sf$quality_grade == "C-", 6,
                ifelse(PA.sf$quality_grade == "D+", 5,
                ifelse(PA.sf$quality_grade == "D ", 4,
                ifelse(PA.sf$quality_grade == "D-", 3,
                ifelse(PA.sf$quality_grade == "E+", 2,
                ifelse(PA.sf$quality_grade == "E ", 1,
                ifelse(PA.sf$quality_grade == "E-", 0,
                ifelse(PA.sf$quality_grade == "2 ", 2,
                ifelse(PA.sf$quality_grade == "3 ", 3,
                ifelse(PA.sf$quality_grade == "4 ", 4,
                ifelse(PA.sf$quality_grade == "5 ", 5,
                ifelse(PA.sf$quality_grade == "6 ", 6, NA_real_))))))))))))))))))))
PA.sf$quality <- ifelse(PA.sf$quality == 0, 1, PA.sf$quality)
PA_predict <- PA.sf[PA.sf$toPredict %in% c("MODELLING", "CHALLENGE"), ]

predicted_values <- predict(model, newdata = PA_predict)
PA_predict$predicted_sale_price <- predicted_values

ggplot() +
  geom_sf(data = PAnhoods, fill = "grey40") +
  geom_sf(data = PA_predict, aes(colour = q5(predicted_sale_price)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(PA_predict,"predicted_sale_price"),
                   name="Quintile\nBreaks") +
  labs(title="Predicted Price Per Square Foot, PA") +
  mapTheme()

Challenge_results <- PA_predict %>% filter(toPredict == "CHALLENGE")

selected_data <- PA_predict%>% select(sale_price, predicted_sale_price)
neg <- PA_predict %>% filter(predicted_sale_price < 0)

4.7 A Map of Mean Absolute Percentage Error (MAPE) by Neighborhood

Predictions by neighborhood

st_drop_geometry(bothRegressions) %>%
  group_by(Regression, name) %>%
  summarize(mean.MAPE = mean(sale_price.APE, na.rm = T)) %>%
  ungroup() %>% 
  left_join(PAnhoods, by = c("name" = "name")) %>%
    st_sf() %>%
    ggplot() + 
      geom_sf(aes(fill = mean.MAPE)) +
      geom_sf(data = bothRegressions, colour = "grey40", size = .05) +
      facet_wrap(~Regression) +
      scale_fill_gradient(low = palette5[1], high = palette5[5],
                          name = "MAPE") +
      labs(title = "Mean test set MAPE by neighborhood") +
      mapTheme()

In the “Baseline Regression” map, we observe that while a majority of the regions are shaded in green, indicating a lower prediction error, there are still several areas, particularly in the central regions, that are colored in orange or red. This denotes that these specific areas have relatively larger prediction errors. On the other hand, The “Neighborhood Effects” map provides a visualization of the MAPE values across different neighborhoods when predicting housing prices.In the “Neighborhood Effects” map, the distribution of prediction errors appears more uniform, with most of the regions being shaded green. This suggests that by incorporating neighborhood effects, the model’s predictive performance has been enhanced.

4.8 Scatterplot plot of MAPE by Nhood as a function of mean price by Nhood

PA_test_df <- st_drop_geometry(PA.test)
bothRegressions_df <- st_drop_geometry(bothRegressions)


neighborhood_data <- merge(PA_test_df %>% select(name, sale_price),
                           bothRegressions_df %>% select(name, sale_price.APE),
                           by = "name")

neighborhood_mean_price <- neighborhood_data %>%
  group_by(name) %>%
  summarize(mean_price = mean(sale_price, na.rm = TRUE))

neighborhood_data <- merge(neighborhood_mean_price, neighborhood_data, by = "name")

ggplot(neighborhood_data, aes(x = mean_price, y = sale_price.APE)) +
  geom_point() +
  labs(x = "Neighborhood Mean Price", y = "Mean MAPE") +
  ggtitle("MAPE by Neighborhood Mean Price")

This scatterplot illustrates the relationship between average neighborhood house prices (represented on the x-axis) and the associated Mean Absolute Percentage Error (MAPE, depicted on the y-axis) of predictions. The majority of data points, denoted in black, cluster around the lower end of the MAPE axis, especially for mid-range neighborhood prices. This indicates that the model tends to produce more accurate predictions with lower errors for neighborhoods with average to slightly above-average house prices. As we transition towards the right side of the chart, where neighborhoods boast higher average prices, there’s a noticeable increase in the variability of MAPE values. This suggests some inconsistencies in the prediction accuracy for these upscale neighborhoods. There’s a notable concentration of data points between the average price values of 600000 to 700000, hinting that the model might consistently predict within a certain error percentage range for these prices.

4.9 Income Context of Predictions

grid.arrange(ncol = 1,
  ggplot() + geom_sf(data = na.omit(tracts21), aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Income Context") +
    labs(title = "Income Context") +
    mapTheme() + theme(legend.position="bottom"))

st_join(bothRegressions, tracts21) %>% 
  filter(!is.na(incomeContext)) %>%
  group_by(Regression, incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(sale_price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood income context")
Test set MAPE by neighborhood income context
Regression High Income Low Income
Baseline Regression 33% 94%
Neighborhood Effects 26% 49%

This map illustrates the income context of various neighborhoods in Philadelphia. Areas represented in orange are designated as “High Income,” while those in green are categorized as “Low Income.” It’s evident from the visualization that certain parts of Philadelphia, especially the central regions, belong to the high-income bracket, whereas the peripheral areas lean more towards the low-income category.

The table offers insights into the predictive accuracy, represented by MAPE, of two distinct regression models: the Baseline Regression and the Neighborhood Effects. For the Baseline Regression, there’s a prediction error of 33% in high-income areas and a substantial 94% in low-income ones. Meanwhile, with the Neighborhood Effects model, the error margin narrows to 26% for high-income locales and 49% for their low-income counterparts. This data suggests a noticeable enhancement in the model’s predictive prowess across both income contexts upon incorporating neighborhood effects. The improvement is especially prominent in the low-income areas, where the error has been slashed by almost half.

Furthermore, as gleaned from the previous “MAPE by Neighborhood” map, the distribution of predictive errors becomes more even-handed once neighborhood effects are factored in, with the majority of areas shaded in green. This implies a consistent predictive performance by the model across all neighborhoods.The study’s model, therefore, stands out in its ability to adapt to diverse geographical and socio-economic contexts, underlining its commendable generalizability.

5 Discussion

Our analysis has provided comprehensive insights into housing price modeling for Philadelphia, revealing the efficacy of our model and the significance of various variables. Incorporating neighborhood effects has been demonstrated as pivotal in enhancing the model’s predictive accuracy across high and low-income regions. Against the backdrop of Philadelphia’s diverse socio-economic settings, our model has showcased admirable adaptability, indicating a commendable degree of generalizability.

One of the intriguing variables included the total_livable_area, which as expected, displayed a positive correlation with sale prices; every increment in the unit of livable area is predicted to raise the sale price by 208.205 units. Similarly, the number_of_bathrooms also positively correlated with prices. On the contrary, both exterior_condition and interior_condition have negative coefficients, implying that a lower score in these conditions could likely result in a lower predicted price. The positive coefficients of number_stories and quality denote their direct relationship with predicted sale prices, with higher quality homes understandably fetching a higher price. Lastly, the closest_distance_to_commercial indicates that properties nearer to commercial zones tend to be predicted at a higher price, potentially due to the convenience it offers.

In terms of capturing price variations, especially post incorporating neighborhood effects, the model successfully accounted for a significant part of the variability. The spatial distribution of MAPE attests to the model’s capability to capture spatial variations in prices. Notably, certain central regions of Philadelphia, typically associated with higher income, displayed pronounced predictive consistency and accuracy. Conversely, some peripheral regions manifested relatively higher predictive errors. These discrepancies could potentially stem from the intrinsic complexities of these regions, with unaccounted factors, or the model’s sensitivity to outliers and extremities. The stark contrast in predictive errors between high and low-income areas in the baseline regression model underscores the importance of considering neighborhood-specific characteristics. Upon introducing the neighborhood effects, a substantial reduction in predictive errors was observed, especially in traditionally challenging low-income zones.

Moreover, from the “Mean test set MAPE by neighborhood” chart, the “Baseline Regression” predictions indicated some significant errors in Philadelphia’s central regions, particularly the high-income areas. However, when “Neighborhood Effects” were introduced, most areas turned green, denoting decreased errors and improved prediction accuracy, emphasizing the precision gained by accounting for the unique dynamics within neighborhoods. Additionally, the “MAPE by Neighborhood Mean Price” chart indicates that the model excels particularly in neighborhoods with an average price around 6e+05. But for other price tiers, especially the more expensive ones, the model’s predictions tended to be less accurate. Such discrepancies might be attributed to intricate factors affecting housing prices in these regions, such as Philadelphia’s geographic location, socio-economic backdrop, or cultural heritage. For instance, our data included incredibly affordable houses priced around a couple of tens of thousands, juxtaposed with homes nearing a million with significant age and inferior quality but still fetching an exorbitant price. The presence of such anomalies could indeed influence price predictions.

6 Conclusion

Based on our analysis of real estate prices in Philadelphia, I would recommend our model to Zillow. Firstly, the model effectively fuses neighborhood effects, which significantly improves the prediction accuracy. This has been demonstrated in different socio-economic contexts, particularly in high-and low-income areas. In addition, by analyzing several variables, such as total livable area, number of bathrooms, exterior condition, age, and house quality, we can better understand how they relate to house prices, that will provide Zillow with more targeted information.

However, each model has its limitations. Although our model performs well in some neighborhoods, in some area, especially those neighborhoods with high prices or some particularities, the prediction error is large. This may be due to the inherent complexity of those areas or the sensitivity of the model to outliers. Therefore, to further improve the generalizability and accuracy of the model, we can consider introducing more features, such as recent real estate sales data, community education facilities and transportation convenience. In addition, the use of deep learning or integrated learning methods may also improve predictive performance. Finally, the regular use of new data to update the model to adapt to the changes in the real estate market is also a key step to improve the accuracy of the model.

In summary, I would recommend that Zillow consider using our model, but also recommend continuous improvement and optimization in terms of accuracy and consistency of data, to ensure a high level of forecast accuracy in a changing real estate market.