Data Science by Nicolas Pedraza

Posts

Showing posts from March, 2024

Module # 10 assignment

March 25, 2024

In this blog post, I'll delve into time series analysis using ggplot2, a powerful data visualization package in R. Time series data consists of observations recorded at regular time intervals, which makes it central to fields such as finance, economics, and environmental science. Visualizing time series data helps reveal patterns and trends and supports decisions based on those insights.

Code:

    # Load ggplot2, which also provides the economics dataset
    library(ggplot2)

    # Extract year from date
    year <- function(x) as.POSIXlt(x)$year + 1900
    economics$year <- year(economics$date)

    # Plot unemployment rate over time
    plot_unemployment <- ggplot(economics, aes(x = date, y = unemploy / pop)) +
      geom_line() +
      labs(title = "Unemployment Rate Over Time", x = "Year", y = "Unemployment Rate") +
      theme_minimal()
    print(plot_unemployment)

This plot provides a clear visualization of how the unemploymen...

Module # 11 Debugging and defensive programming in R

March 25, 2024

The bug lies within the tukey_multiple function, which calls tukey.outlier on each column. tukey_multiple is designed to flag outliers in each column of a matrix x using the Tukey method and then determine the rows where all columns have outliers. The flaw is in how the logical results are combined: the code uses the && operator, which is not element-wise. && works on single logical values, so it cannot combine a whole column of outlier flags. The element-wise operator is &.

Fixed CODE:

    tukey_multiple <- function(x) {
      # Start with a logical array of the same shape as x
      outliers <- array(TRUE, dim = dim(x))
      # Flag the outliers in each column
      for (j in seq_len(ncol(x))) {
        outliers[, j] <- tukey.outlier(x[, j])
      }
      # A row is reported only if every column flags it
      outlier.vec <- apply(outliers, 1, all)
      return(outlier.vec)
    }

In this fixed version, we loop through each column of the input matrix x, identify outliers using tukey.outlier, and store the results in the outliers matrix. Then we use the apply function to check for rows where all columns have outliers and return a logical vector indicating such rows.
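To see the fix in action, here is a minimal, self-contained sketch. The post does not show tukey.outlier itself, so the version below is my own assumption: it flags values more than 1.5 times the interquartile range beyond the quartiles. The small test matrix is also made up for illustration.

```r
# Hypothetical helper (not from the original assignment): flags values
# lying more than 1.5 * IQR below Q1 or above Q3.
tukey.outlier <- function(v) {
  q <- quantile(v, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  v < q[1] - fence | v > q[2] + fence
}

# Fixed version from the post: one logical flag per cell, then a row-wise all()
tukey_multiple <- function(x) {
  outliers <- array(TRUE, dim = dim(x))
  for (j in seq_len(ncol(x))) {
    outliers[, j] <- tukey.outlier(x[, j])
  }
  apply(outliers, 1, all)
}

# Element-wise AND returns one result per position
elementwise <- c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)

# Made-up data: only the last row is extreme in both columns
x <- cbind(a = c(1, 2, 3, 4, 100), b = c(10, 11, 12, 13, 500))
flags <- tukey_multiple(x)
print(flags)
```

By contrast, && is meant for single TRUE/FALSE values, such as the condition of an if statement. Depending on the R version, passing it longer vectors either silently uses only the first elements or raises an error, which is why it does not belong in a column-wise computation.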

Module # 9 assignment

March 17, 2024

CODE:

    # Load necessary library
    library(ggplot2)

    # CigarettesB is assumed to come from the AER package
    data("CigarettesB", package = "AER")

    # Create the plot: packs against price, with income mapped to color
    scatterplot_matrix <- ggplot(CigarettesB, aes(x = price, y = packs, color = income)) +
      geom_point() +
      labs(title = "Scatterplot Matrix of Cigarette Data", x = "Price", y = "Packs", color = "Income") +
      theme_minimal()

    # Print the plot
    print(scatterplot_matrix)

Judgment on Multi-Variable Visualization: Multivariate visualization is a powerful way to explore relationships between multiple variables simultaneously. In the context of the cigarette data, the plot above places three variables in one panel (strictly a single scatterplot with income mapped to color, rather than a full scatterplot matrix), so we can see the relationships between the price of cigarettes, the number of packs sold, and the income associated with each observation. This type of visualization enables us to identify patterns, clusters, and trends that may not be apparent when examining variables individually. By visualizing multiple variables...

Module # 10 Building your own R package

March 17, 2024

Package: Pedraza
Title: Package for statistical analysis and visualization
Version: 0.1.0.9000
Authors@R: person("Nicolas", "Pedraza", email = "nicolas32@usf.edu", role = c("aut", "cre"))
Description: Pedraza is an R package designed to facilitate statistical
    analysis and visualization. It provides a set of functions for data
    manipulation, exploratory data analysis, hypothesis testing, regression
    analysis, and plotting. With Pedraza, users can efficiently perform
    various statistical tasks, from simple descriptive statistics to complex
    modeling techniques. Whether you are a beginner or an experienced data
    analyst, Pedraza aims to streamline your workflow and enhance your data
    analysis capabilities.
Depends: R (>= 3.1.2)
License: CC0
LazyData: true
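A quick way to sanity-check a DESCRIPTION file like this one is to read it back with base R's read.dcf. The sketch below is only an illustration: it writes a shortened copy of the fields to a temporary file (the long Description field is left out, and the Authors@R line is written with person() and without the email, both my own simplifications) and then inspects what R parses.

```r
# Shortened copy of the DESCRIPTION fields from the post (illustrative only)
desc_lines <- c(
  "Package: Pedraza",
  "Title: Package for statistical analysis and visualization",
  "Version: 0.1.0.9000",
  'Authors@R: person("Nicolas", "Pedraza", role = c("aut", "cre"))',
  "Depends: R (>= 3.1.2)",
  "License: CC0",
  "LazyData: true"
)

# Write to a temporary location so no real package is touched
desc_file <- file.path(tempdir(), "DESCRIPTION")
writeLines(desc_lines, desc_file)

# read.dcf returns a character matrix with one column per field
desc <- read.dcf(desc_file)
fields <- colnames(desc)
print(fields)

# The version string can be parsed and compared like any package version
version <- package_version(desc[1, "Version"])
is_dev_version <- version > "0.1.0"

# Authors@R holds R code; evaluating it should yield a person object
author <- eval(parse(text = desc[1, "Authors@R"]))
```

The .9000 suffix is a common convention for marking a development version, which is why the comparison against 0.1.0 is a handy check.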

Module # 9 Visualization in R

March 10, 2024

Basic Histogram: Distribution of Cigarette Prices

Our journey begins with a basic histogram, shedding light on the distribution of cigarette prices in the dataset.

    # CigarettesB is assumed to come from the AER package
    data("CigarettesB", package = "AER")

    # Basic Histogram
    hist(CigarettesB$price, main = "Distribution of Cigarette Prices", xlab = "Price")

The histogram illustrates the spread of prices, giving us a glimpse into the variability and concentration within the dataset. Peaks and troughs in the histogram reveal potential clusters or outliers, setting the stage for further exploration.

Lattice Scatterplot Matrix: Unveiling Multivariate Relationships

Next, we employ a lattice scatterplot matrix, a powerful tool for understanding relationships between multiple variables simultaneously.

    # Lattice Scatterplot Matrix
    library(lattice)
    splom(~CigarettesB[, c("packs", "price", "income")], main = "Scatterplot Matrix")

The scatterplot matrix allows us to identify patterns and correlations between "packs,...

Module # 8 Correlation Analysis and ggplot2

March 03, 2024

Exploring Relationships in mtcars: A Visual Analytics Journey

In data analysis, the power of visualization cannot be overstated. Visual analytics provides a lens through which we can unravel intricate patterns and relationships hidden within datasets. Inspired by Stephen Few, I set out to explore the mtcars dataset using the versatile ggplot2 package in RStudio.

In my opinion, Few's recommendation to use a grid is not merely an organizational suggestion; it is an insight into how we perceive and comprehend data visually. The grid layout in our scatter plot matrix aids comparisons and acts as a visual roadmap, guiding us through the relationships within the dataset. It lets us draw connections and identify trends efficiently, and it reflects Few's emphasis on simplicity and clarity in visualizations.

Conclusion: Grids and Visu...
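As a rough sketch of the kind of grid described above, the snippet below builds a scatter plot matrix and a matching correlation table for mtcars. The excerpt does not say which variables were used, so the four chosen here are my assumption, and I use base R's pairs() instead of ggplot2 to keep the example free of extra packages.

```r
# Variables picked for illustration; the post does not list the ones it used
vars <- c("mpg", "hp", "wt", "disp")

# Pairwise Pearson correlations, rounded for easier reading
cor_matrix <- round(cor(mtcars[, vars]), 2)
print(cor_matrix)

# Grid of scatterplots: every variable plotted against every other
pairs(mtcars[, vars], main = "Scatter Plot Matrix of mtcars")
```

Reading the grid and the table side by side makes it easier to connect each panel to the strength and direction of its correlation.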

Module # 8 Input/Output, string manipulation and plyr package

March 03, 2024

Exploring Gender Patterns and Academic Performance: The 'I' Factor

Introduction: In a quest to uncover intriguing patterns within student demographics and academic achievements, we delved into a dataset revealing noteworthy insights. Specifically, our analysis focused on students whose names contain the letter 'i'. What emerged from this exploration was a fascinating correlation between the presence of the letter 'i' in a student's name, their gender, and academic performance.

Gender Disparities: Our examination revealed a predominant association between the letter 'i' and female names. Of the students with 'i' in their name, a substantial majority were females.

Academic Excellence Among 'I'-Named Females: Diving deeper into the academic performance of students with 'i' in their names, a striking trend unfolded. Among these individuals, seven females secured the highest accolade, an 'A' grade. This impressive achiev...
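The filtering step behind this kind of analysis can be sketched in a few lines of base R. The data frame below is entirely made up for illustration (it is not the assignment's dataset, and the column names Name, Sex, and Grade are assumptions), and I use grepl rather than plyr to keep the example dependency-free.

```r
# Hypothetical toy data; not the dataset analyzed in the post
students <- data.frame(
  Name  = c("Lisa", "Mike", "Raul", "Nina", "Bob"),
  Sex   = c("Female", "Male", "Male", "Female", "Male"),
  Grade = c("A", "B", "A", "A", "C"),
  stringsAsFactors = FALSE
)

# TRUE for every name containing the letter 'i', in either case
has_i <- grepl("i", students$Name, ignore.case = TRUE)
i_students <- students[has_i, ]

# Gender breakdown among the 'i'-named students
sex_counts <- table(i_students$Sex)
print(sex_counts)

# Number of 'i'-named females with an 'A' grade
female_a <- sum(i_students$Sex == "Female" & i_students$Grade == "A")
print(female_a)
```

Setting ignore.case = TRUE matters here because names such as "Ian" start with a capital 'I' that a case-sensitive match would miss.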