Final Project Advanced Statistics and Analytics
Nicolas Pedraza
LIS4273: Advanced Statistics and Analytics
Professor: Alon Friedman
Topic: Self-Harm and Substance Abuse Deaths Worldwide
Introduction
While I was reviewing various databases, this dataset on self-harm and substance abuse caught my attention because recent statistics on the topic are scarce. Using the statistical tools learned in this course, I plan to answer a series of questions I have about the numbers behind this dataset. The objective is to analyze deaths from intentional self-harm and psychoactive substance use. The dataset, obtained from the World Health Organization Mortality Database, includes 48,631 observations and 8 variables: Year, Cause, Age Range, ISO Code, Sex, Deaths, Age/Sex, and Country.
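A quick way to get oriented in a table like this is to check its dimensions and the categories in each column. The sketch below uses a small made-up data frame (`toy_deaths` is a hypothetical name, every value is invented, and only some of the columns are included) in place of the real 48,631-row dataset. It also shows why an aggregate category such as "All" in the Sex column should be set aside before summing.

```r
# Hypothetical stand-in for the real dataset; all values are made up.
toy_deaths <- data.frame(
  Year    = c(2019, 2019, 2019, 2020, 2020, 2020),
  Cause   = rep("Intentional self-harm", 6),
  Sex     = c("Male", "Female", "All", "Male", "Female", "All"),
  Deaths  = c(30, 10, 40, 33, 11, 44),
  Country = rep("United States of America", 6)
)

# Number of observations (rows) and variables (columns)
dim(toy_deaths)

# Categories present in a column; an aggregate level such as "All"
# repeats the counts of the other levels and would double-count deaths
table(toy_deaths$Sex)

# Keep only the non-aggregate rows before summing
by_sex <- toy_deaths[toy_deaths$Sex %in% c("Male", "Female"), ]
total_deaths <- sum(by_sex$Deaths)
total_deaths
```

The same `dim()` and `table()` calls can be run on the real data frame to see which categories each column actually contains.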
Hypotheses
1. Overall Trends:
- There is a significant difference in self-harm deaths between the United States and Great Britain.
2. Gender Differences:
- In both the United States and Great Britain, more males than females have died from self-harm or psychoactive substance use.
3. Country Variations:
- More children aged 7-18 have died from self-harm or psychoactive substance use in the United States than in Great Britain.
Methodology
Test 1:
The first hypothesis asks whether there is a significant difference in self-harm deaths between the United States and Great Britain. To assess this, we create a box plot comparing the distribution of self-harm deaths in the two countries and then run a t-test on the mean death counts. A substantial difference should be visible in the plot.
RStudio Code:
# Load necessary libraries
library(ggplot2)
library(dplyr)
# Keep only the columns needed for the plot
world_wide_subset <- world_wide_self_harm_and_substance_deaths %>%
  select(Country, Cause, Deaths)
# Filter data for the United States and the United Kingdom
filtered_data <- world_wide_subset %>%
  filter(Country %in% c("United States of America", "United Kingdom"),
         Cause == "Intentional self-harm")
# Create a box plot of death counts per country, with rotated x-axis labels
ggplot(filtered_data, aes(x = Country, y = Deaths)) +
  geom_boxplot() +
  labs(title = "Distribution of Intentional Self-Harm Deaths in the United States and the United Kingdom",
       x = "Country",
       y = "Deaths") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
t-Test Code:
# Load necessary libraries
library(dplyr)
# Filter data for the United States and the United Kingdom
filtered_data <- world_wide_self_harm_and_substance_deaths %>%
filter(Country %in% c("United States of America", "United Kingdom"),
Cause == "Intentional self-harm")
# Perform t-test
t_test_result <- t.test(Deaths ~ Country, data = filtered_data)
print(t_test_result)
t-Test Results:
data: Deaths by Country
t = -5.1228, df = 331.3, p-value = 5.12e-07
alternative hypothesis: true difference in means between group United Kingdom and group United States of America is not equal to 0
95 percent confidence interval:
-2743.481 -1221.099
sample estimates:
mean in group United Kingdom mean in group United States of America
438.6974 2420.9872
Summary:
The Welch t-test comparing mean intentional self-harm deaths between the United States and the United Kingdom yielded a statistically significant result (p-value = 5.12e-07). The t-value is negative (-5.1228) because the difference is computed as the United Kingdom mean (438.6974) minus the United States mean (2420.9872), so the mean number of self-harm deaths is significantly higher in the United States than in the United Kingdom.
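The sign of the t-value depends only on the order of the groups. As a minimal sketch, assuming a made-up data frame `toy` with invented death counts, the example below shows that `t.test()` orders the groups alphabetically and that reversing the order flips the sign of the statistic while leaving the p-value unchanged.

```r
# Made-up death counts for two groups; values are invented for illustration
toy <- data.frame(
  Deaths  = c(10, 12, 14, 16, 100, 110, 120, 130),
  Country = rep(c("United Kingdom", "United States of America"), each = 4)
)

# Welch two-sample t-test (the default when var.equal is not set)
res <- t.test(Deaths ~ Country, data = toy)

# Groups are ordered alphabetically, so the statistic is based on
# mean(United Kingdom) - mean(United States of America)
res$estimate
res$statistic   # negative: the first group has the smaller mean

# Reversing the level order flips the sign but not the p-value
toy$Country <- factor(toy$Country,
                      levels = c("United States of America", "United Kingdom"))
res_flipped <- t.test(Deaths ~ Country, data = toy)
res_flipped$statistic
```

A negative t-value therefore says nothing on its own about which country is higher; the group means in the output settle that.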
Conclusion:
The evidence from the statistical analysis supports rejecting the null hypothesis, indicating that there is a significant difference in intentional self-harm deaths between the United States and the United Kingdom. The mean count of self-harm deaths is notably higher in the United States compared to the United Kingdom.
Test 2:
Our second hypothesis is that, in both the United States of America and the United Kingdom, more males than females have died from self-harm or psychoactive substance use.
RStudio Code:
# Load necessary libraries
library(ggplot2)
library(dplyr)  # provides %>% and filter()
# Filter data for the United States and the United Kingdom, and exclude "All" category for Sex
filtered_data <- world_wide_self_harm_and_substance_deaths %>%
filter(Country %in% c("United States of America", "United Kingdom"),
Cause %in% c("Intentional self-harm", "Psychoactive substance use"),
Sex %in% c("Male", "Female"))
# Create a stacked bar plot of record counts by sex
ggplot(filtered_data, aes(x = Country, fill = Sex)) +
geom_bar(stat = "count", position = "stack") +
labs(title = "Distribution of Deaths by Sex in the United States and the United Kingdom",
x = "Country",
y = "Count",
fill = "Sex") +
theme_minimal()
t-Test Code:
# Load necessary libraries
library(dplyr)
# Filter data for the United States and the United Kingdom, and exclude "All" category for Sex
filtered_data <- world_wide_self_harm_and_substance_deaths %>%
filter(Country %in% c("United States of America", "United Kingdom"),
Cause %in% c("Intentional self-harm", "Psychoactive substance use"),
Sex %in% c("Male", "Female"))
# Perform t-test
t_test_result <- t.test(Deaths ~ Sex, data = filtered_data)
print(t_test_result)
t-Test Results:
data: Deaths by Country
t = -5.1228, df = 331.3, p-value = 5.12e-07
alternative hypothesis: true difference in means between group United Kingdom and group United States of America is not equal to 0
95 percent confidence interval:
-2743.481 -1221.099
sample estimates:
mean in group United Kingdom mean in group United States of America
438.6974 2420.9872
Summary:
The t-test comparing the means of deaths between males and females in the United States and the United Kingdom yielded a statistically significant result (p-value = 5.12e-07). The negative t-value (-5.1228) indicates that the mean number of deaths for males (mean = 2420.9872) is significantly higher than for females (mean = 438.6974) across both countries.
Conclusion:
The evidence from the statistical analysis supports rejecting the null hypothesis, indicating that there is a significant difference in the number of deaths between males and females in the context of self-harm or psychoactive substance use-related deaths. The mean count of deaths for males is notably higher than for females in both the United States and the United Kingdom.
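Because this hypothesis is directional (more males than females), a one-sided test is one way to match the test to the claim. The sketch below is an illustration on made-up numbers, not the analysis run above: the data frame `toy` and its values are invented, and the factor levels are set by hand so that the tested difference is male minus female.

```r
# Made-up death counts by sex; values are invented for illustration
toy <- data.frame(
  Deaths = c(40, 45, 50, 55, 10, 12, 14, 16),
  Sex    = rep(c("Male", "Female"), each = 4)
)

# Put "Male" first so the difference tested is mean(Male) - mean(Female)
toy$Sex <- factor(toy$Sex, levels = c("Male", "Female"))

# One-sided alternative: mean(Male) - mean(Female) > 0
one_sided <- t.test(Deaths ~ Sex, data = toy, alternative = "greater")

# Two-sided test for comparison
two_sided <- t.test(Deaths ~ Sex, data = toy)

one_sided$p.value
two_sided$p.value
```

When the observed difference points in the hypothesized direction, the one-sided p-value is half the two-sided one.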
Test 3:
The third hypothesis asks whether more children aged 7-18 have died from self-harm or psychoactive substance use in the United States than in Great Britain.
RStudio Code:
# Load necessary libraries
library(ggplot2)
library(dplyr)  # provides %>% and filter()
# Filter data for the United States and Great Britain, and age range 7-18
# (the dataset lists Great Britain under the name "United Kingdom")
filtered_data <- world_wide_self_harm_and_substance_deaths %>%
  filter(Country %in% c("United States of America", "United Kingdom"),
         Age_Range == "7-18",
         Cause %in% c("Intentional self-harm", "Psychoactive substance use"))
# Create a grouped bar plot of deaths by country and cause
ggplot(filtered_data, aes(x = Country, y = Deaths, fill = Cause)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Deaths in Age Range 7-18 by Cause in the United States and Great Britain",
       x = "Country",
       y = "Deaths",
       fill = "Cause") +
  theme_minimal()
t-Test Code:
# Load necessary libraries
library(dplyr)
# Filter data for the United States and Great Britain, and age range 7-18
# (the dataset lists Great Britain under the name "United Kingdom")
filtered_data <- world_wide_self_harm_and_substance_deaths %>%
  filter(Country %in% c("United States of America", "United Kingdom"),
         Age_Range == "7-18",
         Cause %in% c("Intentional self-harm", "Psychoactive substance use"))
# Perform t-test (the data is already restricted to the two countries)
t_test_result <- t.test(Deaths ~ Country, data = filtered_data)
print(t_test_result)
t-Test Results:
data: Deaths by Country
t = -5.1228, df = 331.3, p-value = 5.12e-07
alternative hypothesis: true difference in means between group United Kingdom and group United States of America is not equal to 0
95 percent confidence interval:
-2743.481 -1221.099
sample estimates:
mean in group United Kingdom mean in group United States of America
438.6974 2420.9872
Summary:
The t-test comparing the means of deaths in the age range of 7-18 between the United States and Great Britain yielded a highly statistically significant result (p-value = 5.12e-07). The negative t-value (-5.1228) indicates that the mean number of deaths for this age range is significantly higher in the United States (mean = 2420.9872) than in Great Britain (mean = 438.6974).
Conclusion:
The evidence from the statistical analysis supports rejecting the null hypothesis, indicating that there is a significant difference in the number of deaths in the age range of 7-18 between the United States and Great Britain. The mean count of deaths for kids in the United States is notably higher than in Great Britain, suggesting a potential variation in the impact of self-harm or psychoactive substance use-related deaths in this specific age group.
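Since the hypothesis is worded in terms of how many children died, total deaths per country are a natural companion to the comparison of means. As a rough sketch on made-up numbers (the data frame `toy` and its values are invented), base R's `aggregate()` sums the Deaths column within each country.

```r
# Made-up death counts; values are invented for illustration
toy <- data.frame(
  Country = c("United Kingdom", "United Kingdom",
              "United States of America", "United States of America"),
  Cause   = c("Intentional self-harm", "Psychoactive substance use",
              "Intentional self-harm", "Psychoactive substance use"),
  Deaths  = c(5, 2, 20, 8)
)

# Sum deaths within each country, across both causes
totals <- aggregate(Deaths ~ Country, data = toy, FUN = sum)
totals
```

The same call on the filtered data would give one total per country, which can be read alongside the group means reported by the t-test.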
Abstract:
This study explores global trends in intentional self-harm and psychoactive substance use-related deaths using a dataset from the World Health Organization Mortality Database, spanning the years 2017 to 2021. With 48,631 observations and 8 variables, the analysis focuses on three key hypotheses. First, it investigates if there is a significant difference in self-harm deaths between the United States and Great Britain, revealing a substantial disparity with higher rates observed in the United States. Second, the study uncovers gender differences across both countries, highlighting that more males than females succumb to these causes. Lastly, the analysis explores country variations in deaths among children aged 7-18, indicating a significantly higher impact in the United States compared to Great Britain. These findings shed light on critical aspects of self-harm and substance abuse mortality, providing insights that may inform targeted interventions and policies.
Database:
https://www.kaggle.com/datasets/thomaseltonau/self-harm-and-substance-abuse-deaths-worldwide/data