Analysis behind the Big Picture: From Motivation to Validation

Image Reference: File:Noun robbery 4270.svg

We describe here our detailed data analysis.


Introduction of motivation

Because crime rates can reflect a country’s social stability, we are interested in exploring crime rates in the United States. Some of our team members major in Economics, so we want to see whether economic indicators are related to crime rates and whether some of them can predict violent or property crime rates. As predictors, we consider economic variables related to income (real GDP in millions of chained 2012 dollars, nominal GDP in millions, personal income per capita in current dollars, and real median income), inequality (Gini coefficient), and unemployment (unemployment rate). More specifically, we look for relationships between these six indicators and the two types of crime rates, and we check whether any such relationship can help predict violent or property crime rates from economic indicators. We focus on the 20 years from 2000 to 2019, and below are the questions we are interested in answering:

  1. What are the overall trends for violent crime rates and property crime rates from 2000 to 2019?
  2. If we plot one of the six economic variables mentioned above as a single predictor against the violent or property crime rate as the response variable, will the points roughly line up (which would suggest a potential linear relationship between the economic indicator and the crime rate), or will they lie more or less at random?
  3. For those plots where the points do line up quite well, can we fit a linear model to each? What will the fitted lines look like among the points?
  4. For the plots where the points are close to the fitted line and evenly distributed around it, can the corresponding predictor actually predict crime rates? In other words, can we use recent data on certain economic indicators to predict future crime rates?

Breadth of data analysis

We take the crime data from the FBI’s nationwide Uniform Crime Reporting (UCR) program, which is relatively reliable; the economic indicators come from sources such as the Bureau of Economic Analysis, the Bureau of Labor Statistics, FRED, and the World Bank (see References). In this project, we explore six economic indicators (real GDP in millions, nominal GDP in millions, personal income per capita in current dollars, real median income, the Gini coefficient, and the unemployment rate) and their relationship with crime rates by building linear models. We also test how well some of the indicators predict the two types of crime rates. We believe the breadth of the data is appropriate for a sufficiently broad analysis.

Depth of data analysis

In this project, we first asked whether the six economic indicators have any linear relationship with violent and property crime rates by drawing scatterplots. We observed that the relationship between certain indicators and crime rates does look linear. We then fit linear models with these indicators and tested whether any of them can actually predict crime rates. Overall, we pay attention not only to the trend of the crime rates themselves but also to macro factors such as GDP and median income, and we not only build linear models but also test their predictive ability. We think our analysis has depth because we start from an initial question, dig into it, raise new questions, and continue to explore them.

Modeling and Inference

In this project, we mainly use linear regression for modeling. We build linear models using nominal GDP, personal income per capita in current dollars, and real GDP as individual predictors, with the violent crime rate and the property crime rate each serving as the response variable. Below are the linear models we built and the associated predictions shown on the Big Picture page:

For violent crime rate as response variable:

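# add_predictions(), crossv_loo(), formulas() and mse() used below come from modelr;
# it is attached here in case the sourced script does not already do so.
library(modelr)
source(here::here("static/load_and_clean_data.R"))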
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.5     ✓ dplyr   1.0.3
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## New names:
## * `` -> ...1
## * `` -> ...2
#nominal GDP as predictor:
current_GDP_violent_model <-
  lm(violent_crime_rate ~ current_dollar_GDP, 
     data = US_no_Gini)

current_GDP_violent_model_pred <-
  US_no_Gini %>%
  add_predictions(current_GDP_violent_model)

#real GDP as predictor:
real_GDP_violent_model <-
  lm(violent_crime_rate ~ real_GDP, 
     data = US_no_Gini)

real_GDP_violent_model_pred <-
  US_no_Gini %>%
  add_predictions(real_GDP_violent_model)

#personal income per capita in current dollars as predictor:
income_capita_violent_model <-
  lm(violent_crime_rate ~ income_per_capita, 
     data = US_no_Gini)

income_capita_violent_model_pred <-
  US_no_Gini %>%
  add_predictions(income_capita_violent_model)

For property crime rate as response variable:

#nominal GDP as predictor:
current_GDP_property_model <- 
  lm(property_crime_rate ~ current_dollar_GDP, 
     data = US_no_Gini)

current_GDP_property_model_pred <-
  US_no_Gini %>%
  add_predictions(current_GDP_property_model)

#real GDP as predictor:
real_GDP_property_model <-
  lm(property_crime_rate ~ real_GDP, 
     data = US_no_Gini)

real_GDP_property_model_pred <-
  US_no_Gini %>%
  add_predictions(real_GDP_property_model)

#personal income per capita in current dollars as predictor:
income_capita_property_model <-
  lm(property_crime_rate ~ income_per_capita, 
     data = US_no_Gini)

income_capita_property_model_pred <-
  US_no_Gini %>%
  add_predictions(income_capita_property_model)

In addition, to test whether recent values of certain economic variables are effective at predicting crime rates, we use data from 2000 to 2014 to build linear models and use those models to predict the crime rates for 2015–2019. We then compare the predictions to the actual crime rates.

Below is how we test the models for violent crime rates:

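# Training years: 2000-2014. %in% tests membership; year == 2000:2014 would
# instead compare element by element with recycling.
train_data <-
  US_no_Gini %>%
  filter(year %in% 2000:2014)

# Test years: 2015-2019.
test_data <-
  US_no_Gini %>%
  filter(year %in% 2015:2019)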

#nominal GDP as predictor:
current_GDP_violent_model_test <-
  lm(violent_crime_rate ~ current_dollar_GDP, 
     data = train_data)

current_GDP_violent_model_test_pred <-
  test_data %>% 
  add_predictions(current_GDP_violent_model_test)

#real GDP as predictor:
real_GDP_violent_model_test <-
  lm(violent_crime_rate ~ real_GDP, 
     data = train_data)

real_GDP_violent_model_test_pred <-
  test_data %>% 
  add_predictions(real_GDP_violent_model_test)

#personal income per capita in current dollars as predictor:
income_capita_violent_model_test <-
  lm(violent_crime_rate ~ income_per_capita, 
     data = train_data)

income_capita_violent_model_test_pred <-
  test_data %>% 
  add_predictions(income_capita_violent_model_test)

Below is how we test the models for property crime rates:

#nominal GDP as predictor:
current_GDP_property_model_test <- 
  lm(property_crime_rate ~ current_dollar_GDP, 
     data = train_data)

current_GDP_property_model_test_pred <-
  test_data %>% add_predictions(current_GDP_property_model_test)

#real GDP as predictor:
real_GDP_property_model_test <-
  lm(property_crime_rate ~ real_GDP, 
     data = train_data)

real_GDP_property_model_test_pred <-
  test_data %>% add_predictions(real_GDP_property_model_test)

#personal income per capita in current dollars as predictor:
income_capita_property_model_test <-
  lm(property_crime_rate ~ income_per_capita, 
     data = train_data)

income_capita_property_model_test_pred <-
  test_data %>% add_predictions(income_capita_property_model_test)
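
One way to make the comparison between predictions and actual rates concrete is to summarise each test-period fit with a root mean squared error. The sketch below is only an illustration: `holdout_rmse()` is a hypothetical helper that is not part of the original analysis, and the small `toy` data frame holds synthetic stand-in values rather than the real `US_no_Gini` table.

```r
# Hypothetical helper (not from the original analysis): fit on the training
# years, predict the test years, and return the test-period RMSE.
holdout_rmse <- function(formula, train, test) {
  model <- lm(formula, data = train)
  predicted <- predict(model, newdata = test)
  actual <- test[[all.vars(formula)[1]]]  # the response is the first variable
  sqrt(mean((actual - predicted)^2))
}

# Synthetic stand-in data (assumed values, NOT the FBI or BEA figures).
set.seed(1)
toy <- data.frame(year = 2000:2019)
toy$real_GDP <- 13 + 0.3 * (toy$year - 2000)
toy$property_crime_rate <-
  3600 - 250 * (toy$real_GDP - 13) + rnorm(20, sd = 40)

# Same split as in the text: 2000-2014 for training, 2015-2019 for testing.
toy_train <- toy[toy$year %in% 2000:2014, ]
toy_test  <- toy[toy$year %in% 2015:2019, ]

toy_rmse <- holdout_rmse(property_crime_rate ~ real_GDP, toy_train, toy_test)
toy_rmse
```

With the real objects, the same helper could presumably be called as `holdout_rmse(property_crime_rate ~ real_GDP, train_data, test_data)`, which gives one number per model to compare across predictors.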

To validate our results, we did the following:

We use cross validation to validate our results. We choose leave-one-out cross validation, as in the lecture, and calculate the mse for each individual indicator. We then take the square root of each mse value so that the error has the same unit as the crime rates, and arrange the values in ascending order. Because the Gini coefficient is only recorded through 2018, we built two datasets: US_with_Gini and US_no_Gini. US_with_Gini contains all six economic indicators but only up to 2018, while US_no_Gini contains five economic indicators (without the Gini coefficient) up to 2019.

US_with_Gini <-
  US_with_Gini %>%
  filter(!is.na(Gini_coefficient))

crimeratecv1 <- crossv_loo(US_with_Gini)
crimeratecv2 <- crossv_loo(US_no_Gini)

cvmse <- function(formulas, cv){
  cv %>% mutate(mod = map(train, ~lm(formulas, data = .))) %>%
    mutate(mse = map2_dbl(mod, test, mse)) %>%
    pull(mse) %>%
    mean()
}
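
Because each leave-one-out fold holds out a single year, the per-fold mse is just one squared error, so the value reported after taking the square root can be written as below. The notation is introduced here for illustration and does not appear in the original: $y_i$ is the observed crime rate in year $i$, $\hat{y}_{(-i)}$ is the prediction for year $i$ from the model fitted without that year, and $n$ is the number of years in the dataset.

```latex
\mathrm{RMSE}_{\mathrm{LOO}}
  = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_{(-i)}\right)^{2}}
```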

We did the cross validation four times (2 datasets times 2 response variables):

(1). violent crime rate as response variable and with all six economic indicators

#(1)
table_one_predictor_violent_with_gini<-
  formulas(~violent_crime_rate,
           realgdp = ~real_GDP,
           nominalgdp = ~current_dollar_GDP,
           unemploymentrates = ~average_rates, 
           medianincome = ~median_income, 
           incomepercapita = ~income_per_capita, 
           gini = ~Gini_coefficient) %>%
  map_df(cvmse, crimeratecv1) %>% 
  pivot_longer(everything(), values_to ="result", names_to = "model") %>% 
  mutate(result = sqrt(result)) %>%
  arrange(result)
table_one_predictor_violent_with_gini
## # A tibble: 6 x 2
##   model             result
##   <chr>              <dbl>
## 1 nominalgdp          23.9
## 2 incomepercapita     24.7
## 3 realgdp             26.7
## 4 gini                50.9
## 5 unemploymentrates   52.6
## 6 medianincome        56.3

(2). violent crime rate as response variable and with five economic indicators

#(2)
table_one_predictor_violent_no_gini<-
  formulas(~violent_crime_rate,
           realgdp = ~real_GDP,
           nominalgdp = ~current_dollar_GDP,
           unemploymentrates = ~average_rates, 
           medianincome = ~median_income, 
           incomepercapita = ~income_per_capita) %>%
  map_df(cvmse, crimeratecv2) %>% 
  pivot_longer(everything(), values_to ="result", names_to = "model") %>% 
  mutate(result = sqrt(result)) %>%
  arrange(result)
table_one_predictor_violent_no_gini
## # A tibble: 5 x 2
##   model             result
##   <chr>              <dbl>
## 1 nominalgdp          24.3
## 2 incomepercapita     25.0
## 3 realgdp             26.7
## 4 unemploymentrates   54.6
## 5 medianincome        58.8

(3). property crime rate as response variable and with all six economic indicators

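#(3)
# Note: the table printed below is identical to table (1), i.e. it reflects
# violent_crime_rate as the response; rerunning with property_crime_rate
# as written here will give different values.
table_one_predictor_property_with_gini<-
  formulas(~property_crime_rate,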
           realgdp = ~real_GDP,
           nominalgdp = ~current_dollar_GDP,
           unemploymentrates = ~average_rates, 
           medianincome = ~median_income, 
           incomepercapita = ~income_per_capita, 
           gini = ~Gini_coefficient) %>%
  map_df(cvmse, crimeratecv1) %>% 
  pivot_longer(everything(), values_to ="result", names_to = "model") %>% 
  mutate(result = sqrt(result)) %>%
  arrange(result)
table_one_predictor_property_with_gini
## # A tibble: 6 x 2
##   model             result
##   <chr>              <dbl>
## 1 nominalgdp          23.9
## 2 incomepercapita     24.7
## 3 realgdp             26.7
## 4 gini                50.9
## 5 unemploymentrates   52.6
## 6 medianincome        56.3

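(4). property crime rate as response variable and with five economic indicators

#(4)
# Note: the table printed below is identical to table (2), i.e. it reflects
# violent_crime_rate as the response; rerunning with property_crime_rate
# as written here will give different values.
table_one_predictor_property_no_gini<-
  formulas(~property_crime_rate,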
           realgdp = ~real_GDP,
           nominalgdp = ~current_dollar_GDP,
           unemploymentrates = ~average_rates, 
           medianincome = ~median_income, 
           incomepercapita = ~income_per_capita) %>%
  map_df(cvmse, crimeratecv2) %>% 
  pivot_longer(everything(), values_to ="result", names_to = "model") %>% 
  mutate(result = sqrt(result)) %>%
  arrange(result)
table_one_predictor_property_no_gini
## # A tibble: 5 x 2
##   model             result
##   <chr>              <dbl>
## 1 nominalgdp          24.3
## 2 incomepercapita     25.0
## 3 realgdp             26.7
## 4 unemploymentrates   54.6
## 5 medianincome        58.8

We find that in all four tables, the three smallest values of the square root of mse always occur when nominal GDP, real GDP, and personal income per capita in current dollars are the individual predictors. Moreover, while the square root of mse for nominal GDP, real GDP, and personal income per capita in current dollars is always less than 30, it exceeds 50 for all the remaining indicators. This matches our initial finding on the Big Picture page that only the scatterplots with nominal GDP, real GDP, or personal income per capita in current dollars as the individual predictor show points that roughly form straight lines.

In addition, the cross validation can help explain why nominal GDP, real GDP, and personal income per capita in current dollars predict the property crime rate well but not the violent crime rate. Although the square root of mse for these three predictors is less than 30 with either the violent crime rate or the property crime rate as the response variable, the magnitudes of the two crime rates are quite different. In table_1_edit, the violent crime rate is in the hundreds (per 100,000 people) while the property crime rate is in the thousands (per 100,000 people). Therefore, a deviation of 30 matters more for the violent crime rate than for the property crime rate. This supports our conclusion that nominal GDP, real GDP, and personal income per capita in current dollars predict the property crime rate well but do not predict the violent crime rate well.
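
The scale argument can be checked by dividing the error by a typical level of the response. The numbers below are illustrative assumptions only (a violent crime rate of about 400 and a property crime rate of about 2,500 per 100,000 people, with the same error of 30 assumed for both); they are not values taken from table_1_edit, and `relative_error()` is a hypothetical helper.

```r
# Hypothetical helper: express an error as a share of a typical response level.
relative_error <- function(rmse, typical_level) {
  rmse / typical_level
}

# Assumed, illustrative levels (NOT taken from the data).
violent_share  <- relative_error(30, 400)   # error relative to a rate in the hundreds
property_share <- relative_error(30, 2500)  # error relative to a rate in the thousands

c(violent = violent_share, property = property_share)
```

Under these assumed levels, the same error of 30 is 7.5% of the violent crime rate but only 1.2% of the property crime rate.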

Overall, our modeling suggests that, in recent data, nominal GDP, real GDP, and personal income per capita in current dollars can predict the property crime rate well but cannot predict the violent crime rate well. At the same time, even though these three indicators give relatively good predictions of property crime rates, our modeling and conclusions have limitations. We chose only six economic indicators among many, and other economic indicators might be more directly relevant to crime rates. We also consider only linear models with one predictor at a time; other models and combinations of several predictors are possible. In addition, crime rates are affected not only by the economy but also by political and social factors and by policy, and some of these factors may influence each other. The influential factors are complex and hard to separate into clear categories. We believe we can still make progress by studying more of them, but there is a long way to go before we have a relatively accurate model for predicting crime rates.

Clarity of Figures

The figures in the post and the Big Picture section are easy to read and informative. At first, the numbers on the x-axis were too crowded to read, but we found a way to separate them (that is, by not displaying all the numbers on the same row). All plots have clear x-axis and y-axis labels and convey straightforward, intuitive meanings that help us further our analysis. Moreover, we initially had many separate plots, but we combined them using facet_wrap, which makes it easier to compare and contrast them.
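
The two layout ideas mentioned here can be sketched in ggplot2 as follows. This is an assumption about how such a plot might be built, not the code behind the actual figures: the data are synthetic, the object names are hypothetical, and `guide_axis(n.dodge = 2)` is just one available way to spread axis labels over two rows.

```r
library(ggplot2)

# Synthetic long-format data (assumed values for illustration only).
toy_long <- data.frame(
  year = rep(2000:2019, times = 2),
  crime_type = rep(c("violent", "property"), each = 20),
  rate = c(seq(500, 380, length.out = 20), seq(3600, 2100, length.out = 20))
)

crime_trend_plot <-
  ggplot(toy_long, aes(x = factor(year), y = rate, group = 1)) +
  geom_line() +
  geom_point() +
  # One panel per crime type, each with its own y scale.
  facet_wrap(~ crime_type, scales = "free_y") +
  # Stagger the year labels over two rows so they do not overlap.
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
  labs(x = "Year", y = "Rate per 100,000 people")

crime_trend_plot
```

Faceting keeps the panels side by side on a shared x-axis, which is what makes the two trends easy to compare.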

Clarity of Explanations

Each analysis has a clear introduction and explanation. We label the x and y axes of every figure to make it more intuitive and straightforward, which helps explain our results, and we also explain in the text what is on each axis. In addition, we explain what the plots are for, that is, why they are there and what they are trying to show. For the observed phenomena, we provide interpretations that suggest further analysis. We mostly analyze how economic factors might predict crime rates, and we note that other factors may also play a role in increasing or decreasing crime rates.

Organization and cleanliness

Our code is easy to read, and we organize the content into separate sections.

References:

“Count the Number of Characters (or Bytes or Width).” https://stat.ethz.ch/R-manual/R-devel/library/base/html/nchar.html. Accessed 09 April 2021.

“Databases, Tables & Calculators by Subject.” U.S. Bureau of Labor Statistics, https://data.bls.gov/timeseries/LNS14000000?years_option=all_years. Accessed 19 March 2021.

“Extract data frame cell value.” datacamp, https://campus.datacamp.com/courses/model-a-quantitative-trading-strategy-in-r/chapter-1-introduction-to-r-for-trading?ex=4. Accessed 09 April 2021.

MacQueen, Don. “[R] Extract year from date.” 09 March 2015, https://stat.ethz.ch/pipermail/r-help/2015-March/426643.html. Accessed 08 April 2021.

“Real Median Household Income in the United States.” FRED, https://fred.stlouisfed.org/series/MEHOINUSA672N. Accessed 01 April 2021.

“Rename Data Frame Columns in R.” Datanovia, https://www.datanovia.com/en/lessons/rename-data-frame-columns-in-r/. Accessed 08 April 2021.

“SAGDP2N Gross domestic product (GDP) by state 1/Gross domestic product (GDP) by state: All industry total (Millions of current dollars).” Bureau of Economic Analysis, https://apps.bea.gov/itable/iTable.cfm?ReqID=70&step=1. Accessed 01 April 2021.

“SAGDP9N Real GDP by state 1/Real GDP by state: All industry total (Millions of chained 2012 dollars).” Bureau of Economic Analysis, https://apps.bea.gov/itable/iTable.cfm?ReqID=70&step=1. Accessed 01 April 2021.

“Specify cells for reading.” readxl, https://readxl.tidyverse.org/reference/cell-specification.html. Accessed 08 April 2021.

“Table 1.” FBI:UCR, https://ucr.fbi.gov/crime-in-the-u.s/2019/crime-in-the-u.s.-2019/topic-pages/tables/table-1. Accessed 19 March 2021.

“World Development Indicators.” THE WORLD BANK, https://databank.worldbank.org/reports.aspx?source=2&series=SI.POV.GINI&country=USA. Accessed 01 April 2021.

“File:Noun robbery 4270.svg.” WIKIMEDIA COMMONS, https://commons.wikimedia.org/wiki/File:Noun_robbery_4270.svg. Accessed 05 May 2021. (header picture)