Technical Appendix

NHD&S Solutions

Data Preparation

Properties Data

We begin building models to predict house prices in Philadelphia by loading a comprehensive dataset of property sales from the Philadelphia Office of Property Assessment. The raw dataset includes a wide range of attributes describing each property's sale, assessment, structural characteristics, and location. It contains 583,534 observations (rows) and 80 variables (columns). Because our goal is to build a predictive model for residential sale prices using structural information, census data, and spatial amenities, we prepare the data through a series of cleaning steps.

First, only variables relevant to residential property valuation were retained: transaction variables, structural characteristics, and geographic identifiers. Irrelevant or mostly incomplete fields were removed to improve model stability.

Next, we restrict the dataset to transactions occurring in 2024 and 2025. Limiting the analysis to this narrower time window reduces distortions caused by inflation and market cycles while also decreasing the computational size of the dataset. We then filter the dataset further to remove problematic observations, excluding transactions with missing sale prices and limiting the data to residential properties (category codes 1 and 2).

Further, extremely low sales (below $10,000) were removed because they typically do not represent the true value of the home and reflect sales made under special circumstances. Additional outliers were identified and removed using a three-standard-deviation rule to account for unusually high or low sale prices that could distort regression estimates.

Next, we address structural outliers. For example, a small number of properties reported extremely large livable areas (30,000 to 70,000 square feet), which are inconsistent with typical residential housing stock. We remove these properties to maintain a realistic distribution of housing characteristics. Lastly, we removed observations that were missing key structural variables we wanted to examine and possibly include in our models (for example, bedrooms, bathrooms, garage spaces).

After cleaning, we transformed the data into the Pennsylvania South State Plane coordinate system (EPSG:2272) to align with the Philadelphia County boundary.

The packages used are listed below.

Show Code
library(tidyverse)
library(sf)
library(here)
library(lubridate)
library(tigris)
library(ggplot2)
library(scales)
library(viridis)
library(tidycensus)
library(FNN)
library(caret)
library(kableExtra)
library(broom)
options(scipen = 999)
options(tigris_use_cache = TRUE)
options(tidycensus.show_progress = FALSE)
sf::sf_use_s2(FALSE)
options(sf.quiet = TRUE)

Code for properties data cleaning

Show Code
#Philly_data <- st_read(here("labs/lab_3/data/opa_properties_public.geojson"), quiet = TRUE)

#columns_keep <- c("building_code_new", "category_code", "central_air", "exterior_condition", "garage_spaces", "location", "sale_price", "number_of_bathrooms", "number_of_bedrooms", "number_of_rooms", "number_stories", "quality_grade", "sale_date", "total_area", "total_livable_area", "year_built", "objectid", "geometry") #Columns that we are selecting out from data

#Philly_data_working <- Philly_data %>%
 # select(columns_keep)

#Philly_data_working <- Philly_data_working %>%
  #mutate(sale_year = year(sale_date))

#Philly_data_working <- Philly_data_working %>%
  #filter(sale_year == 2024 | sale_year == 2025) #Note that we used: 2024 and 2025 to save on data size - also makes data less subject to inflation

#Philly_data_working <- Philly_data_working %>%
  #filter(!is.na(sale_price)) %>%
  #filter(category_code == 1 | category_code == 2) #code 1 is for residential and code 2 is for apartments and hotels

#The filtered dataset was written out at this point; reading it back in below to save space on GitHub

Philly_data_working <- st_read(here("labs/lab_3/data/Philly_data_working.geojson"), quiet = TRUE)

phillyboundary <- counties(state = "PA", cb = TRUE) %>%
  subset(NAME == "Philadelphia")

phillyboundary <- st_transform(phillyboundary, crs = 2272)
Philly_data_working <- st_transform(Philly_data_working, st_crs(phillyboundary))

#st_difference returns the points outside Philadelphia; the result is empty, so all houses fall inside the county
clipped_outside <- st_difference(Philly_data_working, st_union(phillyboundary))

#Philly_data_working <- Philly_data_working %>%
  #filter(sale_price > 5000) #tried this, but I think I want a more statistically robust method.

Philly_data_working <- Philly_data_working %>%
  filter(sale_price > 10000) %>%
  filter(between(
    sale_price,
    mean(sale_price, na.rm = TRUE) - 3*sd(sale_price, na.rm = TRUE),
    mean(sale_price, na.rm = TRUE) + 3*sd(sale_price, na.rm = TRUE)
  )) #moving outliers - first removing sales that are less than 10K, in order to get a more accurate picture of average private market sales, then looking at outliers that are more than 3 standard deviations from the mean - think this is a good method

#Exploratory analysis of total livable area shows two extreme outliers (roughly 30,000 and 70,000 sq ft); removing these

Philly_data_working <- Philly_data_working %>%
  filter(total_livable_area < 30000)

#now want to clean up bedrooms, bathrooms, stories, year-built, and quality grade, as these are going to be important factors to have a complete data set - also garage spaces and building code new

Philly_data_working <- Philly_data_working %>%
  filter(!is.na(sale_price) & !is.na(number_of_bedrooms) & !is.na(number_of_bathrooms) &
           !is.na(quality_grade) & !is.na(year_built) & !is.na(garage_spaces) &
           !is.na(building_code_new))

# I think that we shouldn't use central_air or total_area - there are lots of NAs and 0 for them respectively
#also number of rooms

Philly_data_working <- Philly_data_working %>%
  select(-central_air, -total_area, -number_of_rooms)

Properties Dataset Before and After Cleaning

Show Code
# BEFORE CLEANING
before_summary <- data.frame(
  Dataset = "Raw OPA Dataset",
  Observations = 583534,
  Variables = 80
)

# AFTER CLEANING
after_summary <- data.frame(
  Dataset = "Cleaned Housing Dataset",
  Observations = nrow(Philly_data_working),
  Variables = ncol(Philly_data_working)
)

dimension_table <- rbind(before_summary, after_summary)
library(knitr)
kable(dimension_table, caption = "Dataset Dimensions Before and After Cleaning")
Dataset Dimensions Before and After Cleaning

Dataset                   Observations   Variables
Raw OPA Dataset                 583534          80
Cleaned Housing Dataset          26799          16

Census Data

To capture the broader socioeconomic context of neighborhoods, demographic and economic variables were obtained from the American Community Survey (ACS) 5-year estimates. We retrieved Census tract-level data using the tidycensus package. We include several variables commonly associated with neighborhood housing markets, such as total population, educational attainment, racial and ethnic composition, and median household income. Separately, we also categorize neighborhoods by their median sale price into five categories, from super high value to low value.

First, the share of residents with a bachelor’s degree was calculated as the proportion of the total population holding a bachelor’s degree. Educational attainment is often used as a proxy for neighborhood socioeconomic status and human capital. Second, racial and ethnic composition variables were calculated as the proportion of residents identifying as Black, Latino, or Native American. These variables help capture patterns of residential segregation and demographic composition across neighborhoods. Population density was calculated by dividing the total tract population by the land area of the tract in square miles. Population density is an important indicator of urban form and development intensity, which can influence housing demand and property values. After computing these derived measures, the census tract geometries were projected into the same coordinate reference system as the housing dataset. A spatial join was then performed to assign each property sale the characteristics of the census tract in which it is located.

Code gathering census data

Show Code
# NEED TO GET: POPULATION, PERCENT WITH BACHELORS DEGREE, PERCENT BLACK, LATINO, OR NATIVE AMERICAN

philly_acs <- get_acs(
  geography = "tract",
  state = "PA",
  county = "Philadelphia",
  variables = c(
    population = "B01003_001",
    bachelors = "B15003_022",
    black = "B02001_003",
    latino = "B03003_003",
    native = "B02001_004",
    median_income = "B19013_001"
  ),
  year = 2024,
  survey = "acs5",
  geometry = TRUE,
  output = "wide"
)

philly_acs <- st_transform(philly_acs, st_crs(Philly_data_working))
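The derived measures and the tract-to-property spatial join described above can be sketched as follows. This is a minimal sketch assuming the wide-format column names produced by get_acs (populationE, bachelorsE, and so on); philly_sales_acs is a hypothetical name for the joined dataset.

```r
#Sketch of the derived ACS measures and the tract-to-property join
philly_acs <- philly_acs %>%
  mutate(
    pct_bachelors = bachelorsE / populationE,
    pct_black     = blackE / populationE,
    pct_latino    = latinoE / populationE,
    pct_native    = nativeE / populationE,
    #EPSG:2272 is measured in feet, so convert square feet to square miles
    area_sqmi   = as.numeric(st_area(geometry)) / 5280^2,
    pop_density = populationE / area_sqmi
  )

#assign each property sale the characteristics of its census tract
philly_sales_acs <- st_join(Philly_data_working, philly_acs)
```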

Spatial Data

Loading in and preparing our spatial data

Show Code
hoods <- st_read(here("labs/lab_3/data/philadelphia-neighborhoods.geojson"), quiet = TRUE)

hoods <- hoods %>%
  st_transform(st_crs(Philly_data_working)) 

philly_data_working_hoods <- Philly_data_working %>%
  st_join(hoods)

philly_data_working_hoods <- philly_data_working_hoods %>%
  select(-LISTNAME, -MAPNAME) %>%
  group_by(NAME) %>%
  filter(!is.na(NAME)) %>% #removing the sale that didn't have a corresponding neighborhood
  summarize(median_hood_price = median(sale_price)) 

philly_data_working_hoods <- Philly_data_working %>%
  st_join(philly_data_working_hoods)

philly_data_working_hoods <- philly_data_working_hoods %>%
  filter(!is.na(NAME))

philly_data_working_hoods <- philly_data_working_hoods %>%
 mutate(percent_rank_home_value = percent_rank(median_hood_price))

philly_data_working_hoods <- philly_data_working_hoods %>%
  mutate(
    neighborhood_category = case_when(
      percent_rank_home_value > 0.90 ~ "super high value",
      percent_rank_home_value > 0.75 ~ "high value",
      percent_rank_home_value > 0.50 ~ "upper-mid value",
      percent_rank_home_value > 0.25 ~ "lower-mid value",
      TRUE ~ "low value"
    )
  ) #Note the way that we categorized neighborhoods. Note that we could add another "ultra-high" category. 
#Loading in the needed spatial data
septa <- st_read(here("labs/lab_3/data/Transit_Stops_(Spring_2025).geojson"), quiet = TRUE)
Vacant_Block_Percentages <- st_read(here("labs/lab_3/data/Vacant_Block_Percent_Building.geojson"), quiet = TRUE)
Vacant_Indicators <- st_read(here("labs/lab_3/data/Vacant_Indicators_Bldg.geojson"), quiet = TRUE)
PPR_Props <- st_read(here("labs/lab_3/data/PPR_Properties.geojson"), quiet = TRUE)
Crimes <- st_read(here("labs/lab_3/data/crime_counts.geojson"), quiet = TRUE)
crime_data <- st_read(here("labs/lab_3/data/incidents_part1_part2/incidents_part1_part2.shp"), quiet = TRUE) 
crime_data <- crime_data %>%
  filter(!st_is_empty(geometry) & !is.na(geometry))

#Cleaning Crimes Data
# ensure geometries are valid
crime_data <- crime_data %>%
  filter(st_is_valid(geometry))

# remove missing coordinates
crime_data <- crime_data %>%
  filter(!is.na(point_x) & !is.na(point_y))

# define violent crimes; after reviewing the crime categories, we selected these as violent. We excluded robberies, even those with firearms; see notes for the full list
violent_codes <- c(
  "Aggravated Assault No Firearm",
  "Aggravated Assault Firearm",
  "Other Assaults",
  "Rape",
  "Homicide - Criminal"
)

# filter for violent crimes
violent_crimes <- crime_data %>%
  filter(text_gener %in% violent_codes)

# keep only necessary columns
violent_crimes <- violent_crimes %>%
  select(objectid, ucr_genera, text_gener, point_x, point_y, geometry)

# renaming columns
violent_crimes <- violent_crimes %>%
  rename(
    crime_code = ucr_genera,
    crime_type = text_gener
  )

# transform crs to match
violent_crimes <- st_transform(violent_crimes, 2272)

# double check there are no NAs
colSums(is.na(violent_crimes))
  objectid crime_code crime_type    point_x    point_y   geometry 
         0          0          0          0          0          0 
Show Code
# check and remove duplicates
violent_crimes <- violent_crimes %>%
  distinct(objectid, .keep_all = TRUE)

# make sure crs matches across datasets
violent_crimes <- st_transform(violent_crimes, st_crs(philly_acs))

# join crimes to census tracts by assigning each crime to the tract where it occurred
crime_tracts <- st_join(violent_crimes, philly_acs, join = st_within)

# rewrote this next line because there was a row of NAs for geoid
crime_counts <- crime_tracts %>%
  st_drop_geometry() %>%
  filter(!is.na(GEOID)) %>%   # remove crimes not assigned to a tract
  group_by(GEOID) %>%
  summarise(
    violent_crimes = n()
  )

spta_rail <- septa %>%
  filter(LineAbbr %in% c("L1 OWL", "B1 OWL", "T5", "T4", "T3", "T2", "T1", "G1"))

regional_rail_stations <- st_read(here("labs/lab_3/data/Regional_Rail_Stations.geojson"), quiet = TRUE)

In addition to structural and demographic factors, housing prices are influenced by the spatial accessibility of urban amenities. We include several spatial features constructed from OpenDataPhilly data to capture access to transportation, parks, and Center City.

Transit Accessibility

We measure transit accessibility using geospatial datasets of SEPTA transit stops and Regional Rail stations. Rail transit stops were filtered to include only lines providing high-capacity service, which we deemed more valuable for convenience than other stops. For each property, the nearest transit stop and nearest Regional Rail station were identified using the st_nearest_feature() function, and we calculated the distance from each property to both. After an initial evaluation of the model with these continuous distances, we went back and created binary indicators for whether a property is located within one-half mile of a transit station, which improved the RMSE of our model.
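The RMSE criterion we used to compare the continuous-distance and binary-indicator specifications is the usual root mean squared error; a minimal sketch follows, where the prediction vectors named in the comments are hypothetical placeholders.

```r
#Root mean squared error, used to compare model specifications
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

#e.g. compare rmse(test$sale_price, pred_continuous) with
#rmse(test$sale_price, pred_binary), where test and the pred_* vectors
#are hypothetical holdout data and predictions from the two specifications
```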

Park Accessibility

Geographic data on public parks and recreation properties were obtained from the Philadelphia Parks and Recreation department. Using spatial nearest-neighbor calculations, the distance from each property to the nearest park was computed and converted into miles. This variable captures the accessibility of green space for each residential property.

Downtown Accessibility

We include a metric measuring proximity to downtown because homes closer to the center are generally more expensive, reflecting land scarcity, access to more and higher-quality amenities, and proximity to employment hubs. To measure this relationship in our models, we select City Hall as a proxy for the city’s central business district. We created a point geometry for ‘downtown’ using the geographic coordinates of City Hall and calculated the distance from each property to it. Our preliminary analysis suggested that the relationship between house values and downtown distance may be nonlinear, so we also include a quadratic transformation of this distance in our model.

Displaying Spatial Datasets

Show Code
#Now developing a summary table for displaying datasets
spatial_summary <- data.frame(
  
  Dataset = c(
    "Raw Crime Data",
    "Cleaned Violent Crime Dataset",
    "Park Properties (PPR)",
    "Transit Stops",
    "Regional Rail Stations",
    "Vacant Block Percentages"
  ),
  
  Rows = c(
    nrow(crime_data),
    nrow(violent_crimes),
    nrow(PPR_Props),
    nrow(septa),
    nrow(regional_rail_stations),
    nrow(Vacant_Block_Percentages)
  ),
  
  Columns = c(
    ncol(crime_data),
    ncol(violent_crimes),
    ncol(PPR_Props),
    ncol(septa),
    ncol(regional_rail_stations),
    ncol(Vacant_Block_Percentages)
  )
)

kable(
  spatial_summary,
  caption = "Table X. Spatial Amenities Datasets Used in the Analysis"
)
Table X. Spatial Amenities Datasets Used in the Analysis

Dataset                          Rows   Columns
Raw Crime Data                 148837        14
Cleaned Violent Crime Dataset   34385         6
Park Properties (PPR)             504        26
Transit Stops                   22478        10
Regional Rail Stations            156         7
Vacant Block Percentages        22623         8

Exploratory Analysis

Sale Price

Show Code
skewplot1 <- ggplot() +
  geom_sf(data = phillyboundary,
          fill = "darkgrey",
          color = "white",
          linewidth = 0.4) +
  
  geom_sf(data = philly_data_working_hoods,
          aes(color = sale_price),
          alpha = 1,
          size = 0.05) +
  
  scale_color_viridis_c(
    option = "plasma",
    trans = "log10",
    labels = dollar_format()
  ) +
  
  labs(
    title = "Residential Sale Prices in Philadelphia",
    subtitle = "Log-scaled sale price",
    color = "Sale Price"
  ) +
  
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank()
  )

skewplot1

Livable area:

Livable area shows one of the strongest relationships with housing price. Larger homes consistently sell for higher prices, and the positive linear trend between log livable area and log sale price suggests that increases in square footage substantially increase housing value. This strong association supports the inclusion of livable area as a key predictor in the housing price model.

Show Code
ggplot(philly_data_working_hoods,
       aes(x = log(total_livable_area),
           y = log(sale_price))) +
  
  geom_point(alpha = 0.25) +
  
  geom_smooth(method = "lm",
              color = "red",
              linewidth = 1.2) +
  
  labs(
    title = "Housing Prices and Livable Area",
    x = "Log Livable Area",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

Exterior condition:

Sale prices increase with improvements in exterior condition. Homes classified in better exterior condition categories tend to have higher median sale prices, suggesting that visible structural maintenance and curb appeal are valued by buyers. However, there is still substantial variation in prices within each condition category, indicating that other structural and neighborhood factors also influence housing prices.

Show Code
ggplot(philly_data_working_hoods,
       aes(x = as.factor(exterior_condition),
           y = log(sale_price))) +
  
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  
  labs(
    title = "Housing Prices by Exterior Condition",
    x = "Exterior Condition",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

Building type:

Housing prices vary across building types. Certain building types, like larger single-family homes or multi-story properties, tend to command higher prices compared to smaller or more basic structures. This pattern likely reflects differences in housing size, construction quality, and overall desirability across property types.

Show Code
ggplot(philly_data_working_hoods,
       aes(x = as.factor(building_code_new),
           y = log(sale_price))) +
  
  geom_boxplot(fill = "darkorange", alpha = 0.7) +
  
  labs(
    title = "Housing Prices by Building Type",
    x = "Building Code",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

Garage spaces:

There is a positive relationship between the number of garage spaces and housing prices. Homes with more garage spaces generally sell for higher prices, reflecting the added value of off-street parking in dense urban areas like Philadelphia. The trend line indicates that each additional garage space is associated with higher expected sale prices.

Show Code
ggplot(philly_data_working_hoods,
       aes(x = garage_spaces,
           y = log(sale_price))) +
  
  geom_jitter(alpha = 0.3, width = 0.2) +
  
  geom_smooth(method = "lm",
              color = "red",
              linewidth = 1.2) +
  
  labs(
    title = "Housing Prices and Garage Spaces",
    x = "Number of Garage Spaces",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

Number of bathrooms:

Housing prices rise with the number of bathrooms. Properties with more bathrooms tend to command higher sale prices, likely because additional bathrooms increase household convenience and signal larger or more modern homes. The upward trend suggests that bathroom count is an important structural feature influencing property value.

Show Code
ggplot(philly_data_working_hoods,
       aes(x = number_of_bathrooms,
           y = log(sale_price))) +
  
  geom_jitter(alpha = 0.3, width = 0.2) +
  
  geom_smooth(method = "lm",
              color = "red",
              linewidth = 1.2) +
  
  labs(
    title = "Housing Prices and Number of Bathrooms",
    x = "Number of Bathrooms",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

Number of bedrooms:

Homes with more bedrooms generally sell for higher prices, although the relationship is less linear than for some other structural features. While additional bedrooms increase potential household capacity, the variation in prices suggests that bedroom count alone does not fully capture the value of a property, especially when other factors such as total living area and neighborhood characteristics are considered.

Show Code
ggplot(philly_data_working_hoods,
       aes(x = number_of_bedrooms,
           y = log(sale_price))) +
  
  geom_jitter(alpha = 0.3, width = 0.2) +
  
  geom_smooth(method = "lm",
              color = "red",
              linewidth = 1.2) +
  
  labs(
    title = "Housing Prices and Number of Bedrooms",
    x = "Number of Bedrooms",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

Age of structure:

The relationship between housing age and sale price appears nonlinear. Newer homes often command higher prices due to modern construction and lower maintenance needs, while moderately older homes may still retain value depending on renovations or neighborhood desirability. Very old homes may show greater variation in price, reflecting differences in renovation status and historic housing stock. We removed outliers by filtering for homes less than 300 years old.

Show Code
philly_data_working_hoods <- philly_data_working_hoods %>%
  mutate(year_built = as.numeric(year_built),
    structure_age = (2026 - year_built)) %>%
  filter(structure_age < 300)

ggplot(philly_data_working_hoods,
       aes(x = structure_age,
           y = log(sale_price))) +
  
  geom_point(alpha = 0.25) +
  
  geom_smooth(method = "lm",
              formula = y ~ x + I(x^2),
              color = "darkred",
              linewidth = 1.2) +
  
  labs(
    title = "Housing Prices and Structure Age",
    x = "Structure Age (years)",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

Spatial Distribution of Housing Values by Neighborhood

The spatial distribution of housing values reveals clear geographic patterns across Philadelphia. Higher-value neighborhoods generally cluster together, as in the corridor from Fishtown down to Queen Village, where homes fall between high value and super high value. Lower-value neighborhoods are also concentrated, particularly in parts of North and West Philadelphia, where homes are low and lower-mid value. Unsurprisingly, these spatial patterns suggest that wealth and poverty are similarly concentrated, and that neighborhoods are rarely mixed-income.

Show Code
# reorder legend items
philly_data_working_hoods <- philly_data_working_hoods %>%
  mutate(
    neighborhood_category = factor(
      neighborhood_category,
      levels = c(
        "low value",
        "lower-mid value",
        "upper-mid value",
        "high value",
        "super high value"
      )
    )
  )

# create centroids for label placement
hood_centroids <- st_centroid(hoods)

# plot
ggplot(philly_data_working_hoods) +
  
   geom_sf(data = hoods, # neighborhood basemap
          fill = "grey92",
          color = "white",
          linewidth = 0.2) +
  geom_sf_text(data = hood_centroids,
               aes(label = NAME),
               size = 2,
               color = "gray35",
               check_overlap = TRUE) +
  
  geom_sf(aes(color = neighborhood_category),
          size = 0.3,
          alpha = 0.6) +
  
  scale_color_viridis_d(option = "plasma") +
  
  labs(
    title = "Housing Prices by Neighborhood Value Category",
    subtitle = "Neighborhoods Classified by Median Home Price",
    color = "Neighborhood Category"
  ) +
    guides(color = guide_legend(
    override.aes = list(size = 4, alpha = 1)
  )) +
  
  theme_void()

Distance to Downtown

The relationship between housing prices and distance from downtown Philadelphia appears nonlinear. Housing prices initially decline as distance from the city center increases, but after several miles prices begin to rise again. This U-shaped pattern likely reflects Philadelphia’s spatial housing structure, where both central neighborhoods and some suburban areas command higher prices, while certain intermediate areas have lower housing values. This pattern supports the inclusion of a quadratic distance term in the regression model to capture this nonlinear relationship.

Show Code
cityhall <- st_point(c(-75.1635, 39.9526)) %>% #coordinates for city hall
  st_sfc(crs = 4326) %>% 
  st_transform(st_crs(philly_data_working_hoods)) #make sure crs match
Show Code
philly_data_working_hoods$dist_to_downtown <- st_distance(
  philly_data_working_hoods,
  cityhall
)


philly_data_working_hoods$dist_to_downtown_miles <- 
  as.numeric(philly_data_working_hoods$dist_to_downtown) / 5280 #convert to numeric and from feet (EPSG:2272) to miles
Show Code
ggplot(philly_data_working_hoods,
       aes(x = dist_to_downtown_miles,
           y = log(sale_price))) +
  
  geom_point(alpha = 0.25) +
  
  geom_smooth(method = "lm",
              formula = y ~ x + I(x^2),
              color = "darkred",
              linewidth = 1.2) +
  
  labs(
    title = "Housing Prices and Distance to Center City",
    x = "Distance to Downtown (miles)",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()
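The quadratic specification suggested by this plot could enter a log-price regression as sketched below; this is illustrative only and not the final model specification.

```r
#Illustrative sketch of a quadratic downtown-distance term in a
#log-price regression (not the final model specification)
fit_downtown <- lm(
  log(sale_price) ~ dist_to_downtown_miles + I(dist_to_downtown_miles^2),
  data = philly_data_working_hoods
)
summary(fit_downtown)
```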

Distance to parks

The relationship between housing prices and distance to parks appears weak. The regression line is nearly flat, indicating that distance to the nearest park has only a small association with housing prices in the dataset. Although access to parks may contribute to neighborhood quality of life, the scatter of points suggests that other structural and neighborhood factors likely play a more significant role in determining housing prices.

Show Code
#Now need to compute distance to parks for each house

PPR_Props <- PPR_Props %>% 
  st_transform(st_crs(philly_data_working_hoods))

nearest_park <- st_nearest_feature(philly_data_working_hoods, PPR_Props)

philly_data_working_hoods$nearest_park <- PPR_Props$public_name[nearest_park]

philly_data_working_hoods$dist_to_park <- st_distance(
  philly_data_working_hoods,
  PPR_Props[nearest_park, ],
  by_element = TRUE
)

philly_data_working_hoods$dist_to_park_miles <- 
  as.numeric(philly_data_working_hoods$dist_to_park) / 5280
Show Code
ggplot(philly_data_working_hoods,
       aes(x = dist_to_park_miles,
           y = log(sale_price))) +
  
  geom_point(alpha = 0.25) +
  
  geom_smooth(method = "lm",
              color = "darkgreen",
              linewidth = 1.2) +
  
  labs(
    title = "Housing Prices and Distance to Parks",
    x = "Distance to Nearest Park (miles)",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

Transit Proximity

Homes located within half a mile of transit stations appear to have slightly higher median sale prices compared to homes located farther away. This pattern suggests that accessibility to public transportation may increase housing demand by improving access to employment centers and other amenities. However, as with Regional Rail access (discussed below), the substantial overlap between the two distributions indicates that transit proximity is only one of many factors influencing property values.

Show Code
septa_rail <- st_transform(spta_rail, st_crs(philly_data_working_hoods))
regional_rail_stations <- st_transform(regional_rail_stations, st_crs(philly_data_working_hoods))

nearest_transit <- st_nearest_feature(philly_data_working_hoods, septa_rail)
nearest_RR <- st_nearest_feature(philly_data_working_hoods, regional_rail_stations)

philly_data_working_hoods$nearest_transit <- septa_rail$StopName[nearest_transit]
philly_data_working_hoods$nearestRR <- regional_rail_stations$Station_Na[nearest_RR]

philly_data_working_hoods$dist_to_transit <- st_distance(
  philly_data_working_hoods,
  septa_rail[nearest_transit, ],
  by_element = TRUE
)

philly_data_working_hoods$dist_to_RR <- st_distance(
  philly_data_working_hoods,
  regional_rail_stations[nearest_RR, ],
  by_element = TRUE
)

philly_data_working_hoods$dist_to_transit_miles <- 
  as.numeric(philly_data_working_hoods$dist_to_transit) / 5280

philly_data_working_hoods$dist_to_RR_miles <- 
  as.numeric(philly_data_working_hoods$dist_to_RR) / 5280

philly_data_working_hoods <- philly_data_working_hoods %>%
  mutate(close_proximity_to_transit = ifelse(dist_to_transit_miles <= 0.5, 1, 0),
         close_proximity_to_RR = ifelse(dist_to_RR_miles <= 0.5, 1, 0)
  ) #1 = within half a mile, matching the axis labels in the plots below

ggplot(philly_data_working_hoods,
       aes(x = as.factor(close_proximity_to_transit),
           y = log(sale_price))) +
  
  geom_boxplot(fill = "steelblue",
               alpha = 0.7) +
  
  labs(
    title = "Housing Prices and Transit Accessibility",
    x = "Within 0.5 Miles of Transit",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

Regional Rail Proximity:

The boxplot comparing homes within and beyond 0.5 miles of Regional Rail stations shows relatively similar distributions of housing prices between the two groups. While homes closer to Regional Rail appear to have slightly higher median prices, the difference is modest and there is substantial overlap between the distributions. This suggests that proximity to Regional Rail may provide some value to homeowners but is not a dominant factor in determining housing prices on its own.

Show Code
ggplot(philly_data_working_hoods,
       aes(x = as.factor(close_proximity_to_RR),
           y = log(sale_price))) +
  
  geom_boxplot(fill = "darkorange",
               alpha = 0.7) +
  
  labs(
    title = "Housing Prices and Regional Rail Accessibility",
    x = "Within 0.5 Miles of Regional Rail",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

Neighborhood Vacancy

There is a strong negative relationship between neighborhood vacancy rates and housing prices. Areas with higher percentages of vacant buildings tend to have significantly lower housing prices. The steep downward trend line indicates that vacancy may be a particularly important indicator of neighborhood distress and disinvestment. High vacancy levels often signal weaker housing demand, reduced neighborhood amenities, and lower perceived neighborhood stability, all of which can depress property values.

Show Code
#Joining vacant block percentages (from OpenDataPhilly) to the housing sales data

Vacant_Block_Percentages <- Vacant_Block_Percentages %>%
  select(buildvacpercentage, geometry)

Vacant_Block_Percentages <- st_transform(Vacant_Block_Percentages, st_crs(philly_data_working_hoods)) 

philly_data_working_hoods <- philly_data_working_hoods %>%
  st_join(Vacant_Block_Percentages)

ggplot(philly_data_working_hoods,
       aes(x = buildvacpercentage,
           y = log(sale_price))) +
  
  geom_point(alpha = 0.25) +
  
  geom_smooth(method = "lm",
              color = "purple",
              linewidth = 1.2) +
  
  labs(
    title = "Housing Prices and Neighborhood Vacancy Rates",
    x = "Percent Vacant Buildings",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

Violent Crime Rate:

The scatter plot shows a clear negative relationship between violent crime rates and housing prices. As the number of violent crimes per 1,000 residents increases, the average log sale price of homes tends to decrease. The downward sloping regression line suggests that homes located in neighborhoods with higher levels of violent crime generally sell for lower prices. However, the substantial spread of points indicates that crime alone does not determine housing values, and other structural and neighborhood factors also influence sale prices. We also believe that this pattern appears because crime tends to be more centrally located. Further, crimes may be more likely to be reported in wealthier neighborhoods because of systemic issues.

Show Code
# remove NAs
crime_data <- crime_data %>%
  filter(!st_is_empty(geometry) & !is.na(geometry))

# ensure geometries are valid
crime_data <- crime_data %>%
  filter(st_is_valid(geometry))

# remove missing coordinates
crime_data <- crime_data %>%
  filter(!is.na(point_x) & !is.na(point_y))

# Define violent crimes. After reviewing the crime categories, we selected the offenses below
# as violent; robberies (including those with firearms) were excluded. See notes for the full list.
violent_codes <- c(
  "Aggravated Assault No Firearm",
  "Aggravated Assault Firearm",
  "Other Assaults",
  "Rape",
  "Homicide - Criminal"
)

# filter for violent crimes
violent_crimes <- crime_data %>%
  filter(text_gener %in% violent_codes)

# keep only necessary columns
violent_crimes <- violent_crimes %>%
  select(objectid, ucr_genera, text_gener, point_x, point_y, geometry)

# renaming columns
violent_crimes <- violent_crimes %>%
  rename(
    crime_code = ucr_genera,
    crime_type = text_gener
  )

# transform crs to match
violent_crimes <- st_transform(violent_crimes, 2272)

# double check there are no NAs
colSums(is.na(violent_crimes))
  objectid crime_code crime_type    point_x    point_y   geometry 
         0          0          0          0          0          0 
Show Code
# check and remove duplicates
violent_crimes <- violent_crimes %>%
  distinct(objectid, .keep_all = TRUE)

# make sure crs matches across datasets
violent_crimes <- st_transform(violent_crimes, st_crs(philly_acs))

# join crimes to census tracts by assigning each crime to the tract where it occurred
crime_tracts <- st_join(violent_crimes, philly_acs, join = st_within)

# rewrote this next line because there was a row of NAs for geoid
crime_counts <- crime_tracts %>%
  st_drop_geometry() %>%
  filter(!is.na(GEOID)) %>%   # remove crimes not assigned to a tract
  group_by(GEOID) %>%
  summarise(
    violent_crimes = n()
  )
Show Code
# join crime counts back to tracts in philly acs
philly_acs <- philly_acs %>%
  left_join(crime_counts, by = "GEOID")

# replace tracts with no crimes (NA to 0)
philly_acs <- philly_acs %>%
  mutate(violent_crimes = replace_na(violent_crimes, 0))

# get crime rate per 1,000 residents
philly_acs <- philly_acs %>%
  mutate(
    violent_crime_rate = (violent_crimes / populationE) * 1000
  )

philly_data_working_hoods <- philly_data_working_hoods %>%
  st_join(philly_acs)

philly_data_working_hoods <- philly_data_working_hoods %>%
  as_tibble() %>%
  left_join(Crimes, by = "GEOID")

philly_data_working_hoods <- philly_data_working_hoods %>%
  select(-geometry.y)

philly_data_working_hoods <- philly_data_working_hoods %>%
  st_as_sf()
Show Code
ggplot(philly_data_working_hoods,
       aes(x = violent_crime_rate,
           y = log(sale_price))) +
  
  geom_point(alpha = 0.25) +
  
  geom_smooth(method = "lm",
              color = "red",
              linewidth = 1.2) +
  
  labs(
    title = "Housing Prices and Violent Crime Rates",
    x = "Violent Crimes per 1,000 Residents",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

Feature Engineering

The following section details how we created spatial variables, census features, and interaction terms, and provides justifications for our data transformations.

Show Code
#Looking at percents for tracts
philly_acs <- philly_acs %>%
  mutate(per_bachelors = bachelorsE / populationE,
         per_black = blackE / populationE,
         per_latino = latinoE / populationE, 
         per_native = nativeE / populationE,
         # 27,878,400 sq ft per sq mile (5280^2); assumes a CRS measured in feet
         population_density = populationE / (st_area(geometry) / 27878400)
  )

philly_acs$population_density <- as.numeric(philly_acs$population_density)
#joining census tract data to the cut-down housing data

philly_data_working_hoods <- philly_data_working_hoods %>%
  st_join(philly_acs)

Census Variables

We transformed our census variables by calculating the percentages of bachelor's degree holders and of Black, Latino, and Native American residents relative to each census tract's population. We also created a population density variable based on the area of each census tract geometry.

Population Density

The plot below visualizes population density in Philadelphia.

Show Code
ggplot(philly_acs) +
  geom_sf(aes(fill = population_density), color = NA) +
  
  scale_fill_viridis_c(
    option = "magma",
    trans = "log10",
    labels = comma
  ) +
  
  labs(
    title = "Population Density by Census Tract",
    subtitle = "Philadelphia",
    fill = "People per\nsq. mile"
  ) +
  
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    axis.text = element_blank(),
    axis.title = element_blank()
  )

Neighborhoods as a Fixed Effect

The code below was run earlier in our analysis; it is reproduced here in commented form to document the previous work.

Show Code
#hoods <- st_read(here("labs/lab_3/data/philadelphia-neighborhoods.geojson"))

#hoods <- hoods %>%
  #st_transform(st_crs(Philly_data_working)) 

#philly_data_working_hoods <- Philly_data_working %>%
  #st_join(hoods)

#philly_data_working_hoods <- philly_data_working_hoods %>%
 # select(-LISTNAME, -MAPNAME) %>%
  #group_by(NAME) %>%
  #filter(!is.na(NAME)) %>% #removing the sale that didn't have a corresponding neighborhood
  #summarize(median_hood_price = median(sale_price)) 

#philly_data_working_hoods <- Philly_data_working %>%
  #st_join(philly_data_working_hoods)

#philly_data_working_hoods <- philly_data_working_hoods %>%
  #filter(!is.na(NAME))

#philly_data_working_hoods <- philly_data_working_hoods %>%
 #mutate(percent_rank_home_value = percent_rank(median_hood_price))

#philly_data_working_hoods <- philly_data_working_hoods %>%
  #mutate(
    #neighborhood_category = case_when(
      #percent_rank_home_value > 0.90 ~ "super high value",
      #percent_rank_home_value > 0.75 ~ "high value",
     # percent_rank_home_value > 0.50 ~ "upper-mid value",
     # percent_rank_home_value > 0.25 ~ "lower-mid value",
      #TRUE ~ "low value"
    #)
 # ) #Note the way that we categorized neighborhoods. Note that we could add another "ultra-high" category. 

To understand the effect that neighborhood has on home values, we gathered neighborhood boundary data and then grouped neighborhoods into value categories.

We used neighborhood geometries provided by OpenDataPhilly to sort the city into neighborhoods. These shapes were compiled with feedback from residents and the Philadelphia Inquirer. After simple cleaning, we computed the median sale price for each neighborhood and assigned each neighborhood a percentile rank. We ultimately created five categories, defined by quartiles up to the 90th percentile, above which we created a "super-high value" category. We did this because median home values do not follow a normal distribution but rather have a long right tail of very-high-value neighborhoods; this method seeks to capture the effect of those neighborhoods independently.

Proximity to Transit

In our initial hypothesis, we believed that close proximity to transit would increase home values, because proximity to rail transit provides easy access to employment opportunities and other amenities in Center City. We also hypothesized that regional rail would have similar effects, but chose to model it as a separate variable rather than folding it into rapid transit, because regional rail serves a different demographic (suburban commuters) and has different service frequencies.

We defined transit as subway, elevated, and trolley service. Regional rail is defined as stations in SEPTA's regional rail network. Both datasets came from OpenDataPhilly and required basic cleaning to select the desired lines.

Counter-intuitively, we found that proximity to transit decreased rather than increased home values. This is likely because of real or perceived dis-amenities associated with transit in Philadelphia, including crime and uncleanliness. Furthermore, homes far from transit may be more likely to sit in high-value suburban-style neighborhoods in the northwest of the city, making raw distance to transit an unreliable predictor of home values city-wide.

Show Code
philly_data_working_hoods %>%
  mutate(dist_to_transit = as.numeric(dist_to_transit)) %>%
  ggplot(aes(x = dist_to_transit, y = log(sale_price))) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm") +
  theme_minimal() +
  labs(
    x = "Distance to Transit (feet)",
    y = "Log Sale Price"
  )

Looking at regional rail, however, we see the opposite effect: proximity to regional rail has a positive effect on sale price. While this may be because regional rail does not carry the same stigmas, it may also be because regional rail disproportionately serves higher-income neighborhoods. Therefore, we decided that, like transit proximity, raw distance to regional rail was not a reliable predictor.

Show Code
philly_data_working_hoods %>%
  mutate(dist_to_RR = as.numeric(dist_to_RR)) %>%
  ggplot(aes(x = dist_to_RR, y = log(sale_price))) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm") +
  theme_minimal() +
  labs(
    x = "Distance to Regional Rail (feet)",
    y = "Log Sale Price"
  )

To correct for these two factors, we created a dummy variable indicating close proximity (defined as half a mile, approximately a comfortable walking distance for a commute) to transit. This way, we can compare the effect of close proximity to regional rail or transit on home values against homes where transit commuting is less realistic.
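A minimal sketch of how the distance and dummy variables could be constructed (the `septa_stops` layer name is a hypothetical stand-in for our cleaned stops data, and we assume both layers share the projected CRS EPSG:2272, measured in feet):

```r
library(sf)
library(dplyr)

# septa_stops: hypothetical point layer of transit stops, same CRS as the homes
nearest_idx <- st_nearest_feature(philly_data_working_hoods, septa_stops)

philly_data_working_hoods <- philly_data_working_hoods %>%
  mutate(
    # element-wise distance from each home to its nearest stop, in feet
    dist_to_transit = st_distance(geometry,
                                  septa_stops[nearest_idx, ],
                                  by_element = TRUE),
    # half a mile = 2,640 feet
    close_proximity_to_transit =
      as.integer(as.numeric(dist_to_transit) <= 2640)
  )
```

The same pattern, pointed at the regional rail stations layer, would produce `close_proximity_to_RR`.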

Show Code
#Looking at transit proximity in a model

transit_adjacency <- lm(sale_price ~ as.factor(close_proximity_to_transit) + as.factor(close_proximity_to_RR), data = philly_data_working_hoods)

print(transit_adjacency)

Call:
lm(formula = sale_price ~ as.factor(close_proximity_to_transit) + 
    as.factor(close_proximity_to_RR), data = philly_data_working_hoods)

Coefficients:
                           (Intercept)  as.factor(close_proximity_to_transit)1  
                                383350                                  -38714  
     as.factor(close_proximity_to_RR)1  
                                -45062  
Show Code
#Might be interesting to see how transit impacts values at different scales - e.g. looking how it impacts Fishtown vs. N. Philly

Notably, both transit adjacency and regional rail proximity are associated with lower home values: approximately $39,000 lower for transit and $45,000 lower for regional rail, on average.

Vacancy

We included the percentage of each block that’s vacant as a predictor of home values. This is under the assumption that vacancy both indicates a lack of demand for a given area, and also has several dis-amenities such as unsightliness, crime, trash, and safety concerns. We decided to undertake a block-level analysis instead of looking at proximity to nearest vacant property, as we believe that vacant homes have a block-level impact rather than a singular impact on the nearest home. For example, we believe that a home that is further away from a vacant house, yet is on a block with over 50% vacancy will be more negatively impacted than one that is simply close to a vacant lot, but in a neighborhood which is otherwise free of vacancies.

We spatially joined vacant block percentages to the homes data set to find the block percentage for each home.

Show Code
# Joining vacant block percentages to the housing sales data

Vacant_Block_Percentages <- Vacant_Block_Percentages %>%
  select(buildvacpercentage, geometry)

Vacant_Block_Percentages <- st_transform(Vacant_Block_Percentages, st_crs(philly_data_working_hoods)) 

philly_data_working_hoods <- philly_data_working_hoods %>%
  st_join(Vacant_Block_Percentages)

Violent Crimes

We included violent crime in our predictive analysis under the assumption of a negative relationship with sale price. We specifically used violent crimes because we believe total crime counts would be more indicative of population density than of actual safety concerns. For example, we theorized that Rittenhouse would be a high-crime yet high-value neighborhood due to its high population density, large number of commercial spaces, and high volume of through traffic and visitors. We assumed that violent crime matters significantly more for home values than overall crime.

We defined violent crimes as aggravated assault, other assaults, rape, and criminal homicide. There may be bias in our crime data, as Black and brown neighborhoods are more likely to be over-policed and thus more likely to have higher reported crime rates than other neighborhoods. After filtering and basic cleaning, we aggregated crimes at the census tract level and joined the result to our homes data.

Distance to Parks

We assume that proximity to parks will have a positive relationship to sale price, as parks are a valuable neighborhood amenity. To find this, we found the shortest distance from each home in our data set to the nearest park. This provided us with the distance to parks for each home, which we used in our regression analysis.
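A minimal sketch of this nearest-park calculation (the `parks` layer name is a hypothetical stand-in for our parks data; we assume both layers share a projected CRS measured in feet):

```r
library(sf)
library(dplyr)

# parks: hypothetical polygon layer of parks, same CRS as the homes
nearest_park <- st_nearest_feature(philly_data_working_hoods, parks)

philly_data_working_hoods <- philly_data_working_hoods %>%
  mutate(
    # element-wise distance to the nearest park, converted feet -> miles
    dist_to_park_miles = as.numeric(
      st_distance(geometry, parks[nearest_park, ], by_element = TRUE)
    ) / 5280
  )
```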

Distance to Center City

We initially assumed that distance to Center City has a negative relationship with home values, because the high density of jobs and amenities clustered in Center City should cause home values to fall as distance from the center increases. As our representation of downtown, we used City Hall due to its central location in Center City.
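This distance can be sketched as follows (the City Hall coordinates are approximate, and we assume the housing data sits in a projected CRS measured in feet):

```r
library(sf)
library(dplyr)

# City Hall as a point (approximate WGS84 coordinates), projected to
# match the housing data's CRS
city_hall <- st_sfc(st_point(c(-75.1635, 39.9526)), crs = 4326) %>%
  st_transform(st_crs(philly_data_working_hoods))

philly_data_working_hoods <- philly_data_working_hoods %>%
  mutate(
    # straight-line distance in feet, converted to miles
    dist_to_downtown_miles =
      as.numeric(st_distance(geometry, city_hall)) / 5280
  )
```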

Show Code
philly_data_working_hoods %>%
  ggplot(aes(x = dist_to_downtown_miles, y = log(sale_price))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  theme_minimal() +
  labs(
    x = "Distance to Center City (miles)",
    y = "Log Sale Price"
  )

However, our assumption of a negative linear relationship was wrong: the relationship appears quadratic. Home values are highest close to Center City, fall until roughly 6.5 miles from City Hall, and then begin to increase. This is likely due to suburbanization in Philadelphia producing home values higher than those in the inner city despite being further from Center City.

Show Code
# creating new variable for square of distance to downtown to test curved relationship instead of linear
philly_data_working_hoods <- philly_data_working_hoods %>%
  mutate(dist_downtown_sq = dist_to_downtown_miles^2) 

# running quadratic model
model_quad_dist_downtown <- lm(
  log(sale_price) ~
    dist_to_downtown_miles +
    dist_downtown_sq,
  data = philly_data_working_hoods %>% st_drop_geometry()
)

summary(model_quad_dist_downtown)

Call:
lm(formula = log(sale_price) ~ dist_to_downtown_miles + dist_downtown_sq, 
    data = philly_data_working_hoods %>% st_drop_geometry())

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6016 -0.4222  0.0891  0.4559  3.1279 

Coefficients:
                         Estimate Std. Error t value            Pr(>|t|)    
(Intercept)            12.9458035  0.0133836  967.29 <0.0000000000000002 ***
dist_to_downtown_miles -0.2713299  0.0049004  -55.37 <0.0000000000000002 ***
dist_downtown_sq        0.0209777  0.0003658   57.34 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7943 on 26804 degrees of freedom
Multiple R-squared:  0.1094,    Adjusted R-squared:  0.1094 
F-statistic:  1647 on 2 and 26804 DF,  p-value: < 0.00000000000000022
Show Code
b1 <- coef(model_quad_dist_downtown)["dist_to_downtown_miles"]
b2 <- coef(model_quad_dist_downtown)["dist_downtown_sq"]

turning_point <- -b1 / (2 * b2)
turning_point
dist_to_downtown_miles 
              6.467103 
Show Code
ggplot(philly_data_working_hoods,
       aes(x = dist_to_downtown_miles, y = log(sale_price))) +
  geom_point(alpha = 0.15) +
  geom_smooth(method = "lm",
              formula = y ~ x + I(x^2),
              color = "red",
              linewidth = 1.2) +
  theme_minimal() +
  labs(
    title = "Nonlinear Relationship Between Distance to Center City and Housing Prices",
    x = "Distance to Downtown (miles)",
    y = "Log Sale Price"
  )

Nearest Neighbor Sale Price

Finally, we assumed that nearby home sales would strongly predict a home's sale price, as they act as a localized fixed effect. To produce this variable, we extracted the coordinates of each property and ran a nearest-neighbor analysis to find the five closest properties. The average of those five sale prices was then used as the neighbor sale price for each property in the dataset.

Show Code
# we need coordinates for each property
coords <- st_coordinates(philly_data_working_hoods)

# find the nearest 5 neighbors for each property, excluding itself
# (get.knn() is from the FNN package)
nn <- get.knn(coords, k = 5)

neighbor_price <- apply(nn$nn.index, 1, function(i) {
  mean(philly_data_working_hoods$sale_price[i], na.rm = TRUE)
})

philly_data_working_hoods$neighbor_sale_price <- neighbor_price
philly_data_working_hoods$log_neighbor_price <- log(neighbor_price)

summary(philly_data_working_hoods$neighbor_sale_price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  23600  171500  260580  324163  392000 3431000 

Addressing Skewness in the Data

Many of our variables were skewed with a long right tail. Because OLS inference assumes approximately normal residuals, and heavily skewed variables can distort the fit, we log-transformed these variables to yield more symmetric distributions.
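As a quick numeric check of this improvement, sample skewness can be compared before and after the transformation (a sketch using a hand-rolled skewness helper rather than any particular package):

```r
# simple sample skewness: the third standardized moment
skewness <- function(x) {
  x <- x[is.finite(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# large positive values indicate a long right tail; the log-transformed
# variable should sit much closer to zero
skewness(philly_data_working_hoods$sale_price)
skewness(log(philly_data_working_hoods$sale_price))
```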

Show Code
skewplot3 <- ggplot(philly_data_working_hoods,
       aes(x = structure_age,
           y = log(sale_price))) + 
  geom_point(alpha = 0.4) +
  theme_minimal() +
  labs(
    x = "Structure Age (years)",
    y = "Log Sale Price"
  )

skewplot1 <- ggplot(philly_data_working_hoods,
       aes(x = total_livable_area,
           y = log(sale_price))) +
  
  geom_point(alpha = 0.4) +
  labs(
    title = "Housing Prices and Livable Area",
    x = "Livable Area",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

skewplot2 <- ggplot(philly_data_working_hoods,
       aes(x = dist_to_downtown_miles,
           y = log(sale_price))) +
  
  geom_point(alpha = 0.25) +
  
  labs(
    title = "Housing Prices and Distance to Center City",
    x = "Distance to Downtown (miles)",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

skewplot4 <- ggplot(philly_data_working_hoods,
       aes(x = dist_to_park_miles,
           y = log(sale_price))) +
  
  geom_point(alpha = 0.25) +
  
  labs(
    title = "Housing Prices and Distance to Parks",
    x = "Distance to Nearest Park (miles)",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

library(gridExtra)
grid.arrange(skewplot1, skewplot2, skewplot3, skewplot4, ncol = 2)

After transforming the data through a log transformation, the data better fit a normal distribution.

Show Code
logplot3 <- ggplot(philly_data_working_hoods,
       aes(x = log(structure_age),
           y = log(sale_price))) + 
  geom_point(alpha = 0.4) +
  theme_minimal() +
  labs(
    x = "Log Structure Age (years)",
    y = "Log Sale Price"
  )

logplot1 <- ggplot(philly_data_working_hoods,
       aes(x = log(total_livable_area),
           y = log(sale_price))) +
  
  geom_point(alpha = 0.4) +
  labs(
    title = "Housing Prices and Livable Area",
    x = "Log Livable Area",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

logplot2 <- ggplot(philly_data_working_hoods,
       aes(x = log(dist_to_downtown_miles),
           y = log(sale_price))) +
  
  geom_point(alpha = 0.25) +
  
  labs(
    title = "Housing Prices and Distance to Center City",
    x = "Log Distance to Downtown (miles)",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

logplot4 <- ggplot(philly_data_working_hoods,
       aes(x = log(dist_to_park_miles),
           y = log(sale_price))) +
  
  geom_point(alpha = 0.25) +
  
  labs(
    title = "Housing Prices and Distance to Parks",
    x = "Log Distance to Nearest Park (miles)",
    y = "Log Sale Price"
  ) +
  
  theme_minimal()

library(gridExtra)
grid.arrange(logplot1, logplot2, logplot3, logplot4, ncol = 2)

Adding in structure age and removing zero values for livable area:

Show Code
philly_data_working_hoods <- philly_data_working_hoods %>%
  mutate(structure_age = (2025 - as.numeric(year_built)))

philly_data_working_hoods <- philly_data_working_hoods %>%
  filter(total_livable_area > 0)

The transformations below were performed after reviewing our Cook's distance results; the observations removed had the highest Cook's distance values.
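The influence screen itself can be reproduced along these lines (a sketch: `fitted_model` stands in for whichever `lm()` specification we ran the diagnostics on, and we assume no rows were dropped for missing values when fitting, so row indices line up):

```r
# Cook's distance for every observation in the fitted model
cooks <- cooks.distance(fitted_model)
n <- length(cooks)

# 4/n is a common rule-of-thumb cutoff for high influence
influential <- which(cooks > 4 / n)

# inspect the most influential sales before deciding what to drop
head(sort(cooks, decreasing = TRUE), 10)
philly_data_working_hoods[influential, c("location", "sale_price")]
```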

Show Code
philly_data_working_hoods <- philly_data_working_hoods %>%
  filter(sale_price < 20000000)

philly_data_working_hoods <- philly_data_working_hoods %>%
  filter(location != "6025 JACKSON ST")

philly_data_working_hoods <- philly_data_working_hoods %>%
  filter(number_of_bathrooms > 0)

Building the Models

Model One: Structural Factors

The first model we developed includes only structural home features: exterior condition, building type, number of garage spaces, number of bathrooms and bedrooms, the age of the structure, and the total livable area.

Linear Model

Show Code
structural_model <-  lm(log(sale_price) ~ as.factor(exterior_condition) + as.factor(building_code_new) + garage_spaces + number_of_bathrooms + number_of_bedrooms + I(structure_age^2) + log(total_livable_area), data = philly_data_working_hoods)
Show Code
tidy(structural_model) %>%
  mutate(
    estimate = round(estimate, 4),
    std.error = round(std.error, 4),
    statistic = round(statistic, 3),
    p.value = round(p.value, 4)
  ) %>%
  kable(
    col.names = c("Term", "Estimate", "Std. Error", "t Statistic", "P-Value"),
    caption = "Model Coefficients"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Model Coefficients
Term Estimate Std. Error t Statistic P-Value
(Intercept) 6.3022 0.1534 41.086 0.0000
as.factor(exterior_condition)2 -0.0630 0.0556 -1.133 0.2574
as.factor(exterior_condition)3 0.0377 0.0501 0.753 0.4512
as.factor(exterior_condition)4 -0.2207 0.0509 -4.334 0.0000
as.factor(exterior_condition)5 -0.7559 0.0580 -13.025 0.0000
as.factor(exterior_condition)6 -1.1188 0.0833 -13.424 0.0000
as.factor(exterior_condition)7 -1.3014 0.0712 -18.281 0.0000
as.factor(exterior_condition)8 -0.5174 0.6509 -0.795 0.4266
as.factor(building_code_new)04 -0.0154 0.1972 -0.078 0.9378
as.factor(building_code_new)05 -0.0340 0.0854 -0.398 0.6907
as.factor(building_code_new)06 -0.1151 0.1402 -0.821 0.4117
as.factor(building_code_new)07 -0.2646 0.0729 -3.632 0.0003
as.factor(building_code_new)08 0.4818 0.6518 0.739 0.4598
as.factor(building_code_new)09 -0.0958 0.0799 -1.199 0.2306
as.factor(building_code_new)10 -0.2155 0.0823 -2.618 0.0089
as.factor(building_code_new)13 -0.1013 0.1610 -0.629 0.5292
as.factor(building_code_new)14 -0.2512 0.0845 -2.971 0.0030
as.factor(building_code_new)18 -0.1467 0.2724 -0.539 0.5902
as.factor(building_code_new)21 0.0020 0.0758 0.026 0.9790
as.factor(building_code_new)22 -0.4242 0.0626 -6.774 0.0000
as.factor(building_code_new)23 -0.3234 0.0623 -5.188 0.0000
as.factor(building_code_new)24 -0.6905 0.0623 -11.077 0.0000
as.factor(building_code_new)25 -0.1669 0.0666 -2.507 0.0122
as.factor(building_code_new)26 -0.1118 0.0717 -1.560 0.1187
as.factor(building_code_new)27 -0.1789 0.0649 -2.755 0.0059
as.factor(building_code_new)28 0.0264 0.0714 0.370 0.7110
as.factor(building_code_new)29 -0.1849 0.1611 -1.148 0.2512
as.factor(building_code_new)30 0.0437 0.1334 0.327 0.7434
as.factor(building_code_new)31 -0.3637 0.1226 -2.966 0.0030
as.factor(building_code_new)32 -0.4677 0.0646 -7.243 0.0000
as.factor(building_code_new)33 -0.1770 0.1785 -0.992 0.3214
as.factor(building_code_new)34 -0.3766 0.1461 -2.577 0.0100
as.factor(building_code_new)36 -0.3889 0.0698 -5.575 0.0000
as.factor(building_code_new)40 0.6176 0.2251 2.744 0.0061
as.factor(building_code_new)41 -0.0129 0.0645 -0.200 0.8412
as.factor(building_code_new)42 -0.3480 0.0845 -4.119 0.0000
as.factor(building_code_new)43 -0.1449 0.0728 -1.991 0.0465
as.factor(building_code_new)44 0.2315 0.1201 1.928 0.0538
as.factor(building_code_new)45 -0.1528 0.0844 -1.810 0.0703
as.factor(building_code_new)46 -0.3185 0.1787 -1.782 0.0748
as.factor(building_code_new)47 -0.0946 0.0838 -1.128 0.2592
garage_spaces 0.0664 0.0092 7.227 0.0000
number_of_bathrooms 0.1792 0.0076 23.424 0.0000
number_of_bedrooms -0.1170 0.0063 -18.706 0.0000
I(structure_age^2) 0.0000 0.0000 7.880 0.0000
log(total_livable_area) 0.9300 0.0198 46.950 0.0000

Model Two: Structural Features and Census Data

This model adds census data to the structural features above: percent with a bachelor's degree, percent Black, percent Latino, population density, and median income, all at the census tract level.

Show Code
census_model <- lm(log(sale_price) ~ as.factor(exterior_condition) + as.factor(building_code_new) + garage_spaces + number_of_bathrooms + number_of_bedrooms + I(structure_age^2) + log(total_livable_area) + per_bachelors + per_black + per_latino + population_density + median_incomeE.x, data = philly_data_working_hoods)
summary(census_model)
Show Code
tidy(census_model) %>%
  mutate(
    estimate = round(estimate, 4),
    std.error = round(std.error, 4),
    statistic = round(statistic, 3),
    p.value = round(p.value, 4)
  ) %>%
  kable(
    col.names = c("Term", "Estimate", "Std. Error", "t Statistic", "P-Value"),
    caption = "Model Coefficients"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Model Coefficients
Term Estimate Std. Error t Statistic P-Value
(Intercept) 8.2975 0.1343 61.771 0.0000
as.factor(exterior_condition)2 -0.0846 0.0484 -1.750 0.0801
as.factor(exterior_condition)3 -0.0661 0.0434 -1.522 0.1281
as.factor(exterior_condition)4 -0.2382 0.0441 -5.396 0.0000
as.factor(exterior_condition)5 -0.5678 0.0505 -11.238 0.0000
as.factor(exterior_condition)6 -0.8456 0.0735 -11.507 0.0000
as.factor(exterior_condition)7 -1.0307 0.0637 -16.184 0.0000
as.factor(exterior_condition)8 -0.5937 0.5544 -1.071 0.2842
as.factor(building_code_new)04 -0.1444 0.1679 -0.860 0.3899
as.factor(building_code_new)05 -0.1040 0.0727 -1.430 0.1528
as.factor(building_code_new)06 -0.1290 0.1195 -1.079 0.2804
as.factor(building_code_new)07 -0.1628 0.0622 -2.617 0.0089
as.factor(building_code_new)08 -0.1826 0.5553 -0.329 0.7423
as.factor(building_code_new)09 -0.0792 0.0681 -1.164 0.2445
as.factor(building_code_new)10 -0.1438 0.0704 -2.042 0.0412
as.factor(building_code_new)13 -0.1129 0.1372 -0.823 0.4103
as.factor(building_code_new)14 -0.1480 0.0724 -2.043 0.0410
as.factor(building_code_new)18 0.0111 0.2321 0.048 0.9618
as.factor(building_code_new)21 -0.0792 0.0649 -1.220 0.2224
as.factor(building_code_new)22 -0.4011 0.0537 -7.469 0.0000
as.factor(building_code_new)23 -0.1595 0.0533 -2.992 0.0028
as.factor(building_code_new)24 -0.2954 0.0536 -5.515 0.0000
as.factor(building_code_new)25 -0.3333 0.0572 -5.829 0.0000
as.factor(building_code_new)26 -0.3806 0.0614 -6.200 0.0000
as.factor(building_code_new)27 -0.1370 0.0553 -2.476 0.0133
as.factor(building_code_new)28 -0.0488 0.0608 -0.802 0.4223
as.factor(building_code_new)29 -0.1355 0.1373 -0.987 0.3236
as.factor(building_code_new)30 -0.0965 0.1137 -0.849 0.3958
as.factor(building_code_new)31 -0.4324 0.1056 -4.095 0.0000
as.factor(building_code_new)32 -0.2662 0.0552 -4.825 0.0000
as.factor(building_code_new)33 -0.1750 0.1521 -1.151 0.2499
as.factor(building_code_new)34 -0.2703 0.1245 -2.171 0.0299
as.factor(building_code_new)36 -0.2197 0.0596 -3.686 0.0002
as.factor(building_code_new)40 0.2749 0.1918 1.433 0.1518
as.factor(building_code_new)41 -0.3382 0.0556 -6.088 0.0000
as.factor(building_code_new)42 -0.7940 0.0732 -10.855 0.0000
as.factor(building_code_new)43 -0.3977 0.0625 -6.368 0.0000
as.factor(building_code_new)44 -0.0396 0.1026 -0.386 0.6996
as.factor(building_code_new)45 -0.3631 0.0721 -5.038 0.0000
as.factor(building_code_new)46 -0.3605 0.1523 -2.367 0.0180
as.factor(building_code_new)47 -0.3513 0.0717 -4.903 0.0000
garage_spaces 0.0703 0.0079 8.893 0.0000
number_of_bathrooms 0.1680 0.0066 25.443 0.0000
number_of_bedrooms -0.0137 0.0056 -2.453 0.0142
I(structure_age^2) 0.0000 0.0000 1.634 0.1023
log(total_livable_area) 0.5833 0.0176 33.170 0.0000
per_bachelors 0.8713 0.0661 13.186 0.0000
per_black -0.6293 0.0183 -34.391 0.0000
per_latino -0.8919 0.0276 -32.292 0.0000
population_density 0.0000 0.0000 16.013 0.0000
median_incomeE.x 0.0000 0.0000 19.140 0.0000

Model Three: Spatial Effects

This model adds spatial effects to the structural and census variables: distance to parks, distance to Center City, proximity to transit and regional rail, block vacancy rates, and violent crime rates.

Show Code
spatial_model <- lm(log(sale_price) ~ as.factor(exterior_condition) + as.factor(building_code_new) + garage_spaces + number_of_bathrooms + number_of_bedrooms + I(structure_age^2) + log(total_livable_area) +
    per_bachelors +
    per_black +
    per_latino +
    population_density +
    median_incomeE.x + dist_to_downtown_miles + dist_to_park_miles + as.factor(close_proximity_to_transit) + as.factor(close_proximity_to_RR) + buildvacpercentage.x + violent_crime_rate.x, data = philly_data_working_hoods)
Show Code
tidy(spatial_model) %>%
  mutate(
    estimate = round(estimate, 4),
    std.error = round(std.error, 4),
    statistic = round(statistic, 3),
    p.value = round(p.value, 4)
  ) %>%
  kable(
    col.names = c("Term", "Estimate", "Std. Error", "t Statistic", "P-Value"),
    caption = "Model Coefficients"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Model Coefficients
Term Estimate Std. Error t Statistic P-Value
(Intercept) 8.5883 0.1384 62.046 0.0000
as.factor(exterior_condition)2 -0.0796 0.0489 -1.628 0.1034
as.factor(exterior_condition)3 -0.0603 0.0440 -1.371 0.1703
as.factor(exterior_condition)4 -0.2336 0.0447 -5.229 0.0000
as.factor(exterior_condition)5 -0.5393 0.0511 -10.553 0.0000
as.factor(exterior_condition)6 -0.8402 0.0744 -11.287 0.0000
as.factor(exterior_condition)7 -0.9795 0.0641 -15.291 0.0000
as.factor(exterior_condition)8 -0.6505 0.5518 -1.179 0.2385
as.factor(building_code_new)04 -0.1371 0.1671 -0.820 0.4121
as.factor(building_code_new)05 -0.0869 0.0724 -1.200 0.2303
as.factor(building_code_new)06 -0.1497 0.1191 -1.257 0.2087
as.factor(building_code_new)07 -0.1841 0.0621 -2.967 0.0030
as.factor(building_code_new)08 -0.1802 0.5528 -0.326 0.7445
as.factor(building_code_new)09 -0.0782 0.0679 -1.153 0.2490
as.factor(building_code_new)10 -0.1633 0.0702 -2.328 0.0199
as.factor(building_code_new)13 -0.1379 0.1365 -1.010 0.3125
as.factor(building_code_new)14 -0.1820 0.0724 -2.514 0.0119
as.factor(building_code_new)18 -0.0340 0.2312 -0.147 0.8829
as.factor(building_code_new)21 -0.1125 0.0653 -1.723 0.0848
as.factor(building_code_new)22 -0.4225 0.0544 -7.772 0.0000
as.factor(building_code_new)23 -0.1739 0.0532 -3.267 0.0011
as.factor(building_code_new)24 -0.3151 0.0538 -5.855 0.0000
as.factor(building_code_new)25 -0.3606 0.0580 -6.222 0.0000
as.factor(building_code_new)26 -0.4017 0.0618 -6.495 0.0000
as.factor(building_code_new)27 -0.1425 0.0551 -2.587 0.0097
as.factor(building_code_new)28 -0.0413 0.0606 -0.682 0.4952
as.factor(building_code_new)29 -0.1386 0.1367 -1.014 0.3106
as.factor(building_code_new)30 -0.0719 0.1132 -0.635 0.5257
as.factor(building_code_new)31 -0.4439 0.1053 -4.213 0.0000
as.factor(building_code_new)32 -0.2862 0.0552 -5.187 0.0000
as.factor(building_code_new)33 -0.1888 0.1514 -1.247 0.2124
as.factor(building_code_new)34 -0.2677 0.1240 -2.159 0.0309
as.factor(building_code_new)36 -0.2534 0.0597 -4.247 0.0000
as.factor(building_code_new)40 0.2206 0.1913 1.153 0.2488
as.factor(building_code_new)41 -0.3244 0.0562 -5.769 0.0000
as.factor(building_code_new)42 -0.7272 0.0741 -9.815 0.0000
as.factor(building_code_new)43 -0.4173 0.0634 -6.582 0.0000
as.factor(building_code_new)44 -0.0184 0.1029 -0.179 0.8578
as.factor(building_code_new)45 -0.3566 0.0726 -4.912 0.0000
as.factor(building_code_new)46 -0.3539 0.1517 -2.333 0.0196
as.factor(building_code_new)47 -0.3343 0.0723 -4.625 0.0000
garage_spaces 0.0594 0.0080 7.413 0.0000
number_of_bathrooms 0.1677 0.0067 25.214 0.0000
number_of_bedrooms -0.0115 0.0056 -2.033 0.0421
I(structure_age^2) 0.0000 0.0000 1.564 0.1178
log(total_livable_area) 0.5893 0.0177 33.259 0.0000
per_bachelors 0.6060 0.0715 8.478 0.0000
per_black -0.5699 0.0199 -28.628 0.0000
per_latino -0.8661 0.0283 -30.616 0.0000
population_density 0.0000 0.0000 6.263 0.0000
median_incomeE.x 0.0000 0.0000 15.895 0.0000
dist_to_downtown_miles -0.0176 0.0021 -8.485 0.0000
dist_to_park_miles 0.0471 0.0329 1.432 0.1520
as.factor(close_proximity_to_transit)1 -0.0155 0.0092 -1.681 0.0928
as.factor(close_proximity_to_RR)1 -0.0067 0.0095 -0.707 0.4795
buildvacpercentage.x -0.0108 0.0011 -9.524 0.0000
violent_crime_rate.x -0.0047 0.0004 -13.514 0.0000

Model Four: Interaction and Fixed Effects

This model builds on the spatial effects model and adds neighborhood category as a fixed effect. It also includes four interactions:

1) garage spaces with population density, since garages are space-intensive and therefore more valuable in dense areas;
2) neighborhood value category with total livable area, since additional livable area commands a premium in high-value neighborhoods;
3) population density with median income, which distinguishes wealthy suburban communities from wealthy urban communities; and
4) building vacancy with median income, which distinguishes gentrifying from non-gentrifying neighborhoods.

Show Code
FE_model <- lm(
  log(sale_price) ~ as.factor(exterior_condition) + as.factor(building_code_new) +
    garage_spaces + number_of_bathrooms + number_of_bedrooms + I(structure_age^2) +
    garage_spaces:population_density + log(total_livable_area) +
    total_livable_area:as.factor(neighborhood_category) +
    per_bachelors + per_black + per_latino + log(population_density) +
    population_density:median_incomeE.x + log(dist_to_downtown_miles) +
    log(dist_to_park_miles) + as.factor(close_proximity_to_transit) +
    as.factor(close_proximity_to_RR) + buildvacpercentage.x +
    buildvacpercentage.x:median_incomeE.x + median_incomeE.x +
    log(neighbor_sale_price) + log(violent_crime_rate.x) +
    as.factor(neighborhood_category),
  data = philly_data_working_hoods
)
Show Code
tidy(FE_model) %>%
  mutate(
    estimate = round(estimate, 4),
    std.error = round(std.error, 4),
    statistic = round(statistic, 3),
    p.value = round(p.value, 4)
  ) %>%
  kable(
    col.names = c("Term", "Estimate", "Std. Error", "t Statistic", "P-Value"),
    caption = "Model Coefficients"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Model Coefficients
Term Estimate Std. Error t Statistic P-Value
(Intercept) 4.8373 0.2863 16.898 0.0000
as.factor(exterior_condition)2 -0.0841 0.0472 -1.783 0.0746
as.factor(exterior_condition)3 -0.0855 0.0424 -2.016 0.0438
as.factor(exterior_condition)4 -0.2407 0.0431 -5.581 0.0000
as.factor(exterior_condition)5 -0.5430 0.0494 -11.001 0.0000
as.factor(exterior_condition)6 -0.8596 0.0719 -11.961 0.0000
as.factor(exterior_condition)7 -1.0033 0.0618 -16.223 0.0000
as.factor(exterior_condition)8 -0.7496 0.5324 -1.408 0.1591
as.factor(building_code_new)04 -0.0870 0.1614 -0.539 0.5899
as.factor(building_code_new)05 -0.0850 0.0702 -1.211 0.2260
as.factor(building_code_new)06 -0.0604 0.1154 -0.524 0.6005
as.factor(building_code_new)07 -0.0620 0.0600 -1.034 0.3012
as.factor(building_code_new)08 -0.2016 0.5333 -0.378 0.7054
as.factor(building_code_new)09 -0.0553 0.0655 -0.844 0.3984
as.factor(building_code_new)10 -0.0273 0.0688 -0.397 0.6915
as.factor(building_code_new)13 -0.0460 0.1318 -0.349 0.7271
as.factor(building_code_new)14 -0.0293 0.0708 -0.414 0.6792
as.factor(building_code_new)18 0.2644 0.2247 1.177 0.2394
as.factor(building_code_new)21 -0.1275 0.0632 -2.018 0.0436
as.factor(building_code_new)22 -0.2961 0.0524 -5.649 0.0000
as.factor(building_code_new)23 -0.0988 0.0517 -1.911 0.0560
as.factor(building_code_new)24 -0.1799 0.0520 -3.456 0.0005
as.factor(building_code_new)25 -0.2676 0.0560 -4.779 0.0000
as.factor(building_code_new)26 -0.2660 0.0598 -4.445 0.0000
as.factor(building_code_new)27 -0.0900 0.0532 -1.690 0.0910
as.factor(building_code_new)28 -0.0272 0.0585 -0.465 0.6419
as.factor(building_code_new)29 -0.1017 0.1320 -0.770 0.4411
as.factor(building_code_new)30 -0.0833 0.1093 -0.762 0.4460
as.factor(building_code_new)31 -0.3037 0.1017 -2.987 0.0028
as.factor(building_code_new)32 -0.1579 0.0534 -2.957 0.0031
as.factor(building_code_new)33 -0.0579 0.1461 -0.396 0.6921
as.factor(building_code_new)34 -0.2699 0.1198 -2.254 0.0242
as.factor(building_code_new)36 -0.1700 0.0576 -2.950 0.0032
as.factor(building_code_new)40 0.1713 0.1846 0.928 0.3536
as.factor(building_code_new)41 -0.2823 0.0548 -5.156 0.0000
as.factor(building_code_new)42 -0.5354 0.0745 -7.191 0.0000
as.factor(building_code_new)43 -0.4094 0.0614 -6.673 0.0000
as.factor(building_code_new)44 0.0010 0.0997 0.010 0.9917
as.factor(building_code_new)45 -0.3205 0.0702 -4.564 0.0000
as.factor(building_code_new)46 -0.3472 0.1466 -2.368 0.0179
as.factor(building_code_new)47 -0.2765 0.0703 -3.934 0.0001
garage_spaces 0.0133 0.0142 0.937 0.3489
number_of_bathrooms 0.1563 0.0065 24.005 0.0000
number_of_bedrooms -0.0072 0.0055 -1.305 0.1918
I(structure_age^2) 0.0000 0.0000 0.489 0.6246
log(total_livable_area) 0.7089 0.0371 19.114 0.0000
per_bachelors 0.3493 0.0713 4.901 0.0000
per_black -0.3113 0.0216 -14.386 0.0000
per_latino -0.4807 0.0300 -16.006 0.0000
log(population_density) -0.0027 0.0115 -0.235 0.8142
log(dist_to_downtown_miles) -0.0837 0.0097 -8.620 0.0000
log(dist_to_park_miles) -0.0030 0.0043 -0.716 0.4739
as.factor(close_proximity_to_transit)1 -0.0111 0.0090 -1.225 0.2205
as.factor(close_proximity_to_RR)1 -0.0017 0.0092 -0.189 0.8500
buildvacpercentage.x -0.0136 0.0023 -5.906 0.0000
median_incomeE.x 0.0000 0.0000 3.124 0.0018
log(neighbor_sale_price) 0.2477 0.0077 32.112 0.0000
log(violent_crime_rate.x) -0.0753 0.0087 -8.694 0.0000
as.factor(neighborhood_category)lower-mid value 0.2259 0.0337 6.698 0.0000
as.factor(neighborhood_category)upper-mid value 0.2932 0.0345 8.510 0.0000
as.factor(neighborhood_category)high value 0.2871 0.0385 7.449 0.0000
as.factor(neighborhood_category)super high value 0.3002 0.0408 7.354 0.0000
garage_spaces:population_density 0.0000 0.0000 2.670 0.0076
total_livable_area:as.factor(neighborhood_category)low value -0.0001 0.0000 -4.409 0.0000
total_livable_area:as.factor(neighborhood_category)lower-mid value -0.0002 0.0000 -6.002 0.0000
total_livable_area:as.factor(neighborhood_category)upper-mid value -0.0002 0.0000 -7.165 0.0000
total_livable_area:as.factor(neighborhood_category)high value -0.0001 0.0000 -5.620 0.0000
total_livable_area:as.factor(neighborhood_category)super high value -0.0001 0.0000 -4.595 0.0000
population_density:median_incomeE.x 0.0000 0.0000 1.076 0.2821
median_incomeE.x:buildvacpercentage.x 0.0000 0.0000 2.614 0.0089

Model Validation

K-Fold Cross Validation and RMSE

Show Code
set.seed(123)
train_control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)

cv_model_structural <- train(
  log(sale_price) ~ as.factor(exterior_condition) + as.factor(building_code_new) + garage_spaces + number_of_bathrooms + number_of_bedrooms + I(structure_age^2) + log(total_livable_area),
  data = philly_data_working_hoods,
  method = "lm",
  trControl = train_control,
  na.action = na.omit
)
Show Code
#Finding RMSE
cv_model_census <- train(
log(sale_price) ~ as.factor(exterior_condition) + as.factor(building_code_new) + garage_spaces + number_of_bathrooms + number_of_bedrooms + I(structure_age^2) + log(total_livable_area) +
    per_bachelors +
    per_black +
    per_latino +
    population_density +
    median_incomeE.x,
  data = philly_data_working_hoods,
  method = "lm",
  trControl = train_control,
  na.action = na.omit
)
Show Code
#Finding RMSE
cv_model_spatial <- train(log(sale_price) ~ as.factor(exterior_condition) + as.factor(building_code_new) + garage_spaces + number_of_bathrooms + number_of_bedrooms + I(structure_age^2) + log(total_livable_area) +
    per_bachelors +
    per_black +
    per_latino +
    population_density +
    median_incomeE.x + dist_to_downtown_miles + dist_to_park_miles + as.factor(close_proximity_to_transit) + as.factor(close_proximity_to_RR) + buildvacpercentage.x + violent_crime_rate.x, data = philly_data_working_hoods,
  method = "lm",
  trControl = train_control,
  na.action = na.omit
)
Show Code
#Finding RMSE
cv_model_FE <- train(
  log(sale_price) ~ as.factor(exterior_condition) + as.factor(building_code_new) +
    garage_spaces + number_of_bathrooms + number_of_bedrooms + I(structure_age^2) +
    garage_spaces:population_density + log(total_livable_area) +
    total_livable_area:as.factor(neighborhood_category) +
    per_bachelors + per_black + per_latino + log(population_density) +
    population_density:median_incomeE.x + log(dist_to_downtown_miles) +
    log(dist_to_park_miles) + as.factor(close_proximity_to_transit) +
    as.factor(close_proximity_to_RR) + buildvacpercentage.x +
    buildvacpercentage.x:median_incomeE.x + median_incomeE.x +
    log(neighbor_sale_price) + log(violent_crime_rate.x) +
    as.factor(neighborhood_category),
  data = philly_data_working_hoods,
  method = "lm",
  trControl = train_control,
  na.action = na.omit
)
#Typical prediction is off by roughly 70%: exp(log-scale CV RMSE) is about 1.70
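The 70% figure comes from exponentiating the cross-validated RMSE, which is computed on the log scale because the outcome is log(sale_price). As a quick check (the 0.53 log-scale RMSE is an approximation read off the comparison table, not a stored value):

```r
# Assumed log-scale CV RMSE for the FE model (~0.53), consistent with the
# exp()-transformed value of 1.700 reported in the comparison table.
rmse_log <- 0.53

# On a log outcome, RMSE translates to a multiplicative error factor:
error_factor  <- exp(rmse_log)           # typical prediction is off by this factor
percent_error <- (error_factor - 1) * 100

round(c(factor = error_factor, percent = percent_error), 2)
```

A factor of ~1.70 means a typical prediction misses the true sale price by roughly 70% in multiplicative terms.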

Plot of RMSE and R^2 for each model and Discussion

Show Code
library(knitr)
library(kableExtra)

get_caret_metrics <- function(model, model_name) {
  data.frame(
    Model = model_name,
    R_Squared = round(model$results$Rsquared, 3),
    RMSE = round(exp(model$results$RMSE), 3)  # exp(RMSE) gives a multiplicative error factor, since the outcome is log(sale_price)
  )
}

model_results <- rbind(
  get_caret_metrics(cv_model_structural, "Structural"),
  get_caret_metrics(cv_model_census, "Census"),
  get_caret_metrics(cv_model_spatial, "Spatial"),
  get_caret_metrics(cv_model_FE, "Fixed Effects and Interaction")
)

kable(model_results,
      col.names = c("Model", "R²", "RMSE"),
      caption = "Model Comparison") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE)
Model Comparison
Model R² RMSE
Structural 0.398 1.913
Census 0.548 1.738
Spatial 0.551 1.734
Fixed Effects and Interaction 0.583 1.700

Across our analysis, our best model incorporated spatial features. Though the fixed-effects model fit our data more closely, it was worse at predicting home values than the spatial features alone. This might be because our fixed-effects assumptions were wrong or theoretically faulty. Structural features certainly mattered the most: without them, we would expect our models to explain roughly 40% less of the variation in sale price.
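Because the RMSE gap between the top models is small, it helps to look at how much RMSE varies across folds before ranking them. As a self-contained sketch (simulated data standing in for `philly_data_working_hoods`, since the per-fold spread is the point here), 10-fold cross-validation by hand in base R:

```r
# Minimal 10-fold CV by hand on simulated data (a stand-in for the real
# dataset), showing how per-fold RMSEs spread around the reported average.
set.seed(123)
n  <- 500
df <- data.frame(x = rnorm(n))
df$y <- 2 + 0.8 * df$x + rnorm(n)

folds <- sample(rep(1:10, length.out = n))   # random fold assignment
fold_rmse <- sapply(1:10, function(k) {
  fit  <- lm(y ~ x, data = df[folds != k, ])          # train on 9 folds
  pred <- predict(fit, newdata = df[folds == k, ])    # predict held-out fold
  sqrt(mean((df$y[folds == k] - pred)^2))
})

c(mean = mean(fold_rmse), sd = sd(fold_rmse))
```

With the fitted caret objects above, the analogous per-fold metrics are stored in, e.g., `cv_model_FE$resample`; if the spatial-versus-FE RMSE difference is smaller than the fold-to-fold standard deviation, the ranking between those two models is not robust.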

Predicted vs. Actuals

Show Code
make_cv_pred_plot <- function(model, model_name) {
  model$pred %>%
    ggplot(aes(x = exp(obs), y = exp(pred))) +  # exp() to convert back from log
    geom_point(alpha = 0.3) +
    geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
    scale_x_continuous(labels = scales::dollar_format()) +
    scale_y_continuous(labels = scales::dollar_format()) +
    theme_minimal() +
    labs(title = model_name, x = "Actual Sale Price", y = "Predicted Sale Price")
}

p1 <- make_cv_pred_plot(cv_model_structural, "Structural")
p2 <- make_cv_pred_plot(cv_model_census, "Census")
p3 <- make_cv_pred_plot(cv_model_spatial, "Spatial")
p4 <- make_cv_pred_plot(cv_model_FE, "Fixed Effects and Interaction")
grid.arrange(p1, p2, p3, p4, ncol = 2)

Across all of our models we under-predict home values. This may be due to overestimating the impact of crime and other disamenities on home values, or to an over-reliance on structural features. To improve the model, we might include more spatial data such as school districts and proximity to high-paying jobs.
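One mechanical contributor to under-prediction is also worth noting: all four models predict log(sale_price), and naively exponentiating a log-scale prediction understates the dollar-scale mean (Jensen's inequality). A hedged base-R sketch of the effect, using Duan's smearing factor as one standard correction (simulated data, not our actual pipeline):

```r
# Back-transforming log-scale predictions with exp() alone is biased low.
set.seed(1)
n <- 1000
x <- rnorm(n)
log_price <- 12 + 0.5 * x + rnorm(n, sd = 0.5)   # simulated log sale price
price     <- exp(log_price)

fit   <- lm(log_price ~ x)
naive <- exp(fitted(fit))               # naive back-transform: biased low
smear <- mean(exp(residuals(fit)))      # Duan's smearing factor
corrected <- naive * smear              # bias-adjusted dollar predictions

round(c(actual = mean(price), naive = mean(naive),
        corrected = mean(corrected)), 0)
```

Since our predicted-vs-actual plots use `exp(pred)` directly, part of the systematic under-prediction could plausibly be removed by applying a smearing-style correction before comparing to dollar-scale sale prices.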

Spatial Predicted vs. Actuals

Show Code
philly_data_clean <- philly_data_working_hoods %>%
  na.omit() %>%
  mutate(row_id = row_number())

# Join CV predictions back to spatial data
spatial_resids <- cv_model_FE$pred %>%
  mutate(
    rowIndex = as.integer(rowIndex),
    residual = exp(obs) - exp(pred),
    predicted = exp(pred),
    actual = exp(obs)
  ) %>%
  left_join(philly_data_clean %>% mutate(rowIndex = row_number()), by = "rowIndex") %>%
  st_as_sf()

ggplot() +
  geom_sf(data = hoods, fill = "grey90", color = "grey50") +
  geom_sf(data = spatial_resids, aes(color = residual), size = 0.05, alpha = 0.5) +
  scale_color_gradient2(
    low = "blue", mid = "white", high = "red",
    midpoint = 0,
    limits = c(-200000, 200000),  # clamp extreme outliers
    oob = scales::squish,          # squish values outside limits to the ends
    labels = scales::dollar_format(),
    name = "Actual - Predicted"
  ) +
  theme_void() +
  labs(
    title = "Spatial Distribution of Residuals — FE Model",
    subtitle = "Red = Underpredicted | Blue = Overpredicted"
  )

Though our predictions are not especially accurate, we do not appear to suffer from chronic over- or under-prediction in particular neighborhoods. We may be under-predicting in the Riverwards / Kensington and over-predicting in Queen Village and South Philadelphia, but a spatial pattern is not immediately discernible.

Model Diagnostics

Q-Q Plots

Looking at the Q-Q plots, the first concern is that each model shows an S-curve, which indicates a tailed error distribution and a departure from the normality assumption underlying OLS inference. However, the Q-Q plots show "light" tails: the sample quantiles bend back toward the reference line at the extremes, meaning that while the models produced non-normal errors, those errors were not extreme. A heavy-tailed distribution would instead show points curving away from the line. Although we removed many extreme outliers while cleaning the data, this S-shape suggests the models might benefit from further cleaning.
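The light- versus heavy-tail distinction can be illustrated with simulated draws (base R graphics for brevity here, though the appendix's plots use ggplot2):

```r
# Light tails (bounded uniform draws) bend toward the reference line at the
# extremes; heavy tails (t with 3 df) curve away from it.
set.seed(42)
light <- runif(1000, -1, 1)   # light-tailed: extremes are bounded
heavy <- rt(1000, df = 3)     # heavy-tailed: extremes are exaggerated

op <- par(mfrow = c(1, 2))
qqnorm(light, main = "Light-tailed sample"); qqline(light, col = "red")
qqnorm(heavy, main = "Heavy-tailed sample"); qqline(heavy, col = "red")
par(op)
```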

Show Code
structural_resids <- cv_model_structural$pred$obs - cv_model_structural$pred$pred
q1 <- data.frame(residuals = structural_resids) %>%
  ggplot(aes(sample = residuals)) +
  stat_qq(alpha = 0.3) +
  stat_qq_line(color = "red") +
  theme_minimal() +
  labs(
    title = "Q-Q Plot — Structural Model",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  )
census_resids <- cv_model_census$pred$obs - cv_model_census$pred$pred
q2 <- data.frame(residuals = census_resids) %>%
  ggplot(aes(sample = residuals)) +
  stat_qq(alpha = 0.3) +
  stat_qq_line(color = "red") +
  theme_minimal() +
  labs(
    title = "Q-Q Plot — Census Model",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  )

spatial_resids <- cv_model_spatial$pred$obs - cv_model_spatial$pred$pred
q3 <- data.frame(residuals = spatial_resids) %>%
  ggplot(aes(sample = residuals)) +
  stat_qq(alpha = 0.3) +
  stat_qq_line(color = "red") +
  theme_minimal() +
  labs(
    title = "Q-Q Plot — Spatial Model",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  )
FE_resids <- cv_model_FE$pred$obs - cv_model_FE$pred$pred
q4<- data.frame(residuals = FE_resids) %>%
  ggplot(aes(sample = residuals)) +
  stat_qq(alpha = 0.3) +
  stat_qq_line(color = "red") +
  theme_minimal() +
  labs(
    title = "Q-Q Plot — Fixed Effects Model",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles"
  )

grid.arrange(q1, q2, q3, q4, ncol = 2)

Cook's Distance

Across all four models, the distribution of values in the Cook's distance diagnostics is fairly consistent. For the most part, Cook's distances do not exceed our threshold for further investigation. Notably, before cleaning the data further, we had an extremely high Cook's distance associated with properties without a bathroom, which prompted us to remove all properties without a bathroom.
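The 4/n cutoff drawn in the plots below is a rule of thumb rather than a formal test. A small self-contained example (simulated data with one planted outlier, not our actual pipeline) shows how it flags influential points like the no-bathroom properties we removed:

```r
# One planted outlier in a toy regression crosses the 4/n threshold.
set.seed(7)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[1] <- y[1] + 15                  # plant a single influential observation

fit <- lm(y ~ x)
cd  <- cooks.distance(fit)
flagged <- which(cd > 4 / n)       # same rule of thumb as the plots below
flagged                            # includes observation 1
```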

Show Code
make_cooks_plot <- function(model, model_name) {
  cooks <- cooks.distance(model$finalModel)
  data.frame(index = seq_along(cooks), cooks = cooks) %>%
    ggplot(aes(x = index, y = cooks)) +
    geom_col() +
    geom_hline(yintercept = 4 / nrow(model$finalModel$model),
               color = "red", linetype = "dashed") +
    theme_minimal() +
    labs(
      title = paste("Cook's Distance -", model_name),
      x = "Observation Index",
      y = "Cook's Distance"
    )
}

c1 <- make_cooks_plot(cv_model_structural, "Structural")
c2 <- make_cooks_plot(cv_model_census, "Census")
c3 <- make_cooks_plot(cv_model_spatial, "Spatial")
c4 <- make_cooks_plot(cv_model_FE, "Fixed Effects and Interaction")
grid.arrange(c1, c2, c3, c4, ncol = 2)

Residual Plot

Show Code
make_resid_plot <- function(model, model_name) {
  data.frame(
    fitted = fitted(model),
    residuals = residuals(model)
  ) %>%
    ggplot(aes(x = fitted, y = residuals)) +
    geom_point(alpha = 0.3) +
    geom_hline(yintercept = 0, color = "red") +
    theme_minimal() +
    labs(
      title = paste("Residuals vs. Fitted -", model_name),
      x = "Fitted Values",
      y = "Residuals"
    )
}

r1 <- make_resid_plot(cv_model_structural, "Structural")
r2 <- make_resid_plot(cv_model_census, "Census")
r3 <- make_resid_plot(cv_model_spatial, "Spatial")
r4 <- make_resid_plot(cv_model_FE, "Fixed Effects and Interaction")

grid.arrange(r1, r2, r3, r4, ncol = 2)

In an ideal model, residuals center tightly around 0, meaning there is little difference between predicted and actual values. Each of our models had trouble predicting values at the lower end of the sale price range. From this analysis of residuals, we can tell that each of our models performed relatively well when predicting middle-range home values.

Conclusion

Our model is not a strong predictor of home values in Philadelphia. Our best model, a spatial features model, has roughly a 70% margin of error, which means it does not do a sufficiently good job of capturing home values in Philadelphia. Though we tend to under-predict home values, there were no noticeable spatial patterns in our predictions. Furthermore, we tend to be worst at predicting high-value homes. Given our model's tendency to under-predict among high-value homes, this could result in high-income or wealthy homeowners paying less in property tax relative to lower-income homeowners.

To improve our model in the future, we recommend adding more spatial features such as school district, proximity to high-paying jobs, or proximity to cultural amenities. We also recommend using more granular geographies for fixed effects, such as ZIP codes or census tracts, since home values in Philadelphia can vary drastically within neighborhoods. For example, Brewerytown varies widely in income and home value, so the neighborhood scale may not make a strong fixed effect. Finally, we could improve our model through further data transformation to bring our variables closer to normal distributions.