Data Science by Nicolas Pedraza

Posts

LIS4370[StatVizR - Final Project]

April 26, 2024

StatVizR: Simplifying Statistical Analysis and Visualization in R
Nicolas Pedraza
LIS4370 Final Project
4/26/2024

Greetings! Today I introduce StatVizR, an R package designed to streamline statistical analysis and visualization tasks. Whether you're a seasoned data analyst or just dipping your toes in data science, StatVizR is here to make your life easier.

What is StatVizR?
StatVizR is more than just another R package - it's your go-to toolkit for everything related to statistical analysis and visualization. With a comprehensive set of functions, StatVizR empowers users to efficiently manipulate data, explore patterns, conduct hypothesis tests, perform regression analysis, and create insightful visualizations.

Why StatVizR?
You might be wondering, "Why should I choose StatVizR over other packages?" Well, here are a few reasons:
1. Ease of Use: StatVizR is designed with user-friendliness in mind. Whether you're a begin...

Final Project [LIS4317]

April 25, 2024

Project Documentation: Exploring Top Spotify Songs Introduction: In an era dominated by digital streaming platforms, the music industry has undergone a profound transformation in how music is created, distributed, and consumed. Among these platforms, Spotify stands out as a global leader, providing users with access to a vast library of songs and personalized music recommendations. Understanding the factors that contribute to the success of top Spotify songs is crucial for artists, record labels, and industry analysts seeking to navigate the evolving landscape of music consumption. In this project, we embark on a journey to explore the intricacies of top Spotify songs, aiming to unravel the underlying patterns and trends that shape their popularity. Through a combination of exploratory data analysis (EDA) techniques and visual analytics methodologies, we explore the vast reservoir of Spotify data to uncover insights that inform strategic decision-making a...

Module # 13 {LIS4317}

April 15, 2024

In this blog post, we'll delve into the world of animated scatter plots using R, leveraging the ggplot2 and animation packages to create dynamic visualizations. Animated plots can be powerful tools for conveying changes and patterns over time, making complex data more accessible and engaging. Setting the Stage: To begin, we must ensure that the required R packages (animation and ggplot2) are installed and loaded. These packages offer robust functionality for creating animations and generating sophisticated plots, respectively. Crafting the Animation: Our goal is to construct a series of scatter plots, each representing a distinct frame in our animation. We'll use randomly generated data points within specified limits (-3 to 3 on both the x- and y-axes) to create variability across frames. Bringing It to Life: Upon executing the code, a GIF named scatter_animation.gif will be generated in your working directory. This file contains a sequence of frames, each showca...
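The excerpt above does not include the code itself, so the following is only a minimal sketch of the approach it describes. The frame count, point count, seed, and the helper name make_frame are assumptions chosen for illustration. Writing the GIF relies on animation::saveGIF(), which in turn needs ImageMagick or GraphicsMagick installed, so that step is guarded and only runs in an interactive session.

```r
library(ggplot2)

set.seed(42)      # assumed seed, only so the sketch is reproducible
n_frames <- 10    # assumed number of frames

# Hypothetical helper: builds one frame of random points within -3..3 on both axes
make_frame <- function(i) {
  pts <- data.frame(x = runif(20, -3, 3), y = runif(20, -3, 3))
  ggplot(pts, aes(x = x, y = y)) +
    geom_point() +
    coord_cartesian(xlim = c(-3, 3), ylim = c(-3, 3)) +
    labs(title = paste("Frame", i))
}

frames <- lapply(seq_len(n_frames), make_frame)

# Writing the GIF requires the animation package and ImageMagick/GraphicsMagick
if (interactive() && requireNamespace("animation", quietly = TRUE)) {
  animation::saveGIF(
    for (p in frames) print(p),
    movie.name = "scatter_animation.gif",
    interval = 0.5   # seconds between frames
  )
}
```

Fixing the axis limits with coord_cartesian() keeps every frame on the same scale, so only the points appear to move between frames.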

Exploring Social Network Visualization with R: Successes and Challenges

April 08, 2024

In the realm of data visualization, exploring social networks can yield fascinating insights into relationships and connectivity. Recently, I delved into this area using R, leveraging packages like GGally, network, sna, and ggplot2. Here’s a recount of my journey, highlighting both successes and challenges encountered along the way. Package Installation and Setup: The initial step was straightforward: installing and loading the necessary packages. Using install.packages() and library() commands, I quickly integrated GGally, network, sna, and ggplot2 into my R environment. Generating Random Network Data: I utilized the rgraph() function from the sna package to create a random graph consisting of 10 nodes. Setting mode = "graph" and tprob = 0.5 produced a symmetric adjacency matrix, which the network package then converts into an undirected network object. Visualizing the Network: With the network data prepared, I used ggnet2() from GGally to generate a visualization of the social network. This function creates an aesthetically pleasing graph r...
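As a rough sketch of the workflow described above (the post's own code is not shown in this excerpt), the steps might look like the following. The seed, node size, and label setting are assumptions for illustration; rgraph() is called here from sna, and the adjacency matrix is wrapped with network() before plotting.

```r
library(network)
library(sna)
library(ggplot2)
library(GGally)

set.seed(1)  # assumed seed, only for reproducibility

# Random 10-node graph; mode = "graph" yields a symmetric adjacency matrix
adj <- rgraph(10, mode = "graph", tprob = 0.5)

# Wrap the matrix in an undirected network object
net <- network(adj, directed = FALSE)

# Draw the network; size and label are illustrative choices
p <- ggnet2(net, size = 6, label = TRUE)
print(p)
```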

Module # 11 assignment

April 01, 2024

CODE:

    # Load required libraries
    library(ggplot2)
    library(ggthemes)

    # Sample data
    data <- data.frame(
      x = rnorm(100),
      y = rnorm(100)
    )

    # Create scatter plot
    scatter_plot <- ggplot(data, aes(x = x, y = y)) +
      geom_point() +
      theme_tufte()

    # Display plot
    scatter_plot

Module # 10 assignment

March 25, 2024

In this blog post, I'll delve into the world of time series analysis using ggplot2, a powerful data visualization package in R. Time series data involves observations collected or recorded at regular time intervals, making it a crucial area of study in various fields such as finance, economics, and environmental science. Visualizing time series data not only helps in understanding patterns and trends but also aids in making informed decisions based on the insights gained.

Code:

    # Load ggplot2, which also provides the economics dataset
    library(ggplot2)

    # Extract year from date
    year <- function(x) as.POSIXlt(x)$year + 1900
    economics$year <- year(economics$date)

    # Plot unemployment rate (unemployed as a share of population) over time
    plot_unemployment <- ggplot(economics, aes(x = date, y = unemploy / pop)) +
      geom_line() +
      labs(title = "Unemployment Rate Over Time", x = "Year", y = "Unemployment Rate") +
      theme_minimal()

    print(plot_unemployment)

This plot provides a clear visualization of how the unemploymen...

Module # 11 Debugging and defensive programming in R

March 25, 2024

The bug in the code lies within the tukey_multiple function, not in tukey.outlier. tukey_multiple is designed to identify outliers in each column of a matrix x using the Tukey method, and then determine the rows where all columns have outliers. The bug is in the logic of the loop: the && operator works on a single logical value only (and since R 4.3.0 it raises an error when given a longer vector), whereas the element-wise & operator is what the column-by-column comparison needs. Because the outliers array starts out all TRUE, the fixed version below drops the AND entirely and assigns the result of tukey.outlier directly.

Fixed CODE:

    tukey_multiple <- function(x) {
      outliers <- array(TRUE, dim = dim(x))
      for (j in 1:ncol(x)) {
        outliers[, j] <- tukey.outlier(x[, j])
      }
      outlier.vec <- apply(outliers, 1, all)
      return(outlier.vec)
    }

In this fixed version, we loop through each column of the input matrix x, identify outliers using tukey.outlier, and store the results in the outliers matrix. Then, we use the apply function to check for rows where all columns have outliers and return a logical vector indicating such rows.
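The post does not show tukey.outlier itself, so the version below is a hypothetical stand-in that applies the usual 1.5 × IQR rule; the test matrix is likewise made up for illustration. The sketch keeps the AND from the original loop but uses the element-wise & operator, which gives the same result as assigning tukey.outlier's output directly.

```r
# Hypothetical helper (not from the post): TRUE where a value lies more than
# 1.5 * IQR below the first quartile or above the third quartile
tukey.outlier <- function(v) {
  q <- unname(quantile(v, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  v < q[1] - fence | v > q[2] + fence
}

tukey_multiple <- function(x) {
  outliers <- array(TRUE, dim = dim(x))
  for (j in seq_len(ncol(x))) {
    # Element-wise AND: & compares whole vectors, && does not
    outliers[, j] <- outliers[, j] & tukey.outlier(x[, j])
  }
  apply(outliers, 1, all)
}

# Toy matrix: only the last row is extreme in both columns
x <- cbind(c(1:9, 100), c(11:19, 500))
result <- tukey_multiple(x)
which(result)
```

Using seq_len(ncol(x)) instead of 1:ncol(x) is a small defensive choice: it avoids looping over c(1, 0) if the matrix ever has zero columns.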

Module # 9 assignment

March 17, 2024

CODE

    # Load necessary libraries
    library(ggplot2)
    library(AER)  # provides the CigarettesB dataset

    data("CigarettesB")

    # Create a scatter plot of packs against price, with income mapped to color
    scatterplot_matrix <- ggplot(CigarettesB, aes(x = price, y = packs, color = income)) +
      geom_point() +
      labs(title = "Scatterplot of Cigarette Data", x = "Price", y = "Packs", color = "Income") +
      theme_minimal()

    # Print the scatter plot
    print(scatterplot_matrix)

Judgment on Multi-Variable Visualization: Multivariate visualization is a powerful way to explore relationships between multiple variables simultaneously. In the context of the cigarette data, a scatter plot with income mapped to color allows us to visualize the relationships between the price of cigarettes, the number of packs sold, and the income associated with each observation. This type of visualization enables us to identify patterns, clusters, and trends that may not be apparent when examining variables individually. By visualizing multiple variables...

Module # 10 Building your own R package

March 17, 2024

Package: Pedraza
Title: Package for statistical analysis and visualization
Version: 0.1.0.9000
Authors@R: "Nicolas Pedraza <nicolas32@usf.edu> [aut, cre]"
Description: Pedraza is an R package designed to facilitate statistical analysis and visualization. It provides a set of functions for data manipulation, exploratory data analysis, hypothesis testing, regression analysis, and plotting. With Pedraza, users can efficiently perform various statistical tasks, from simple descriptive statistics to complex modeling techniques. Whether you are a beginner or an experienced data analyst, Pedraza aims to streamline your workflow and enhance your data analysis capabilities.
Depends: R (>= 3.1.2)
License: CC0
LazyData: true

Module # 9 Visualization in R

March 10, 2024

Basic Histogram: Distribution of Cigarette Prices

Our journey begins with a basic histogram, shedding light on the distribution of cigarette prices in the dataset.

    # Load the CigarettesB dataset from the AER package
    library(AER)
    data("CigarettesB")

    # Basic Histogram
    hist(CigarettesB$price, main = "Distribution of Cigarette Prices", xlab = "Price")

The histogram vividly illustrates the spread of prices, giving us a glimpse into the variability and concentration within the dataset. Peaks and troughs in the histogram reveal potential clusters or outliers, setting the stage for further exploration.

Lattice Scatterplot Matrix: Unveiling Multivariate Relationships

Next, we employ a lattice scatterplot matrix, a powerful tool for understanding relationships between multiple variables simultaneously.

    # Lattice Scatterplot Matrix
    library(lattice)
    splom(~CigarettesB[, c("packs", "price", "income")], main = "Scatterplot Matrix")

The scatterplot matrix allows us to identify patterns and correlations between "packs,...

Module # 8 Correlation Analysis and ggplot2

March 03, 2024

Exploring Relationships in mtcars: A Visual Analytics Journey In the realm of data analysis, the power of visualization cannot be overstated. Visual analytics provides a unique lens through which we can unravel intricate patterns and relationships hidden within datasets. Inspired by Stephen Few, I embarked on a journey to explore the mtcars dataset using the versatile ggplot2 package in RStudio. In my opinion, Few's recommendation to use a grid is not just a mere organizational suggestion; it's a profound insight into how we perceive and comprehend data visually. The grid layout in our scatter plot matrix not only aids in comparisons but acts as a visual roadmap, guiding us through the complexity of relationships within the dataset. The grid layout becomes a powerful ally in our exploration, allowing us to draw connections and identify trends with efficiency. It serves as a testament to Few's emphasis on simplicity and clarity in visualizations. Conclusion: Grids and Visu...

Module # 8 Input/Output, string manipulation and plyr package

March 03, 2024

Exploring Gender Patterns and Academic Performance: The 'I' Factor Introduction: In a quest to uncover intriguing patterns within student demographics and academic achievements, we delved into a dataset revealing noteworthy insights. Specifically, our analysis focused on students whose names contain the letter 'i'. What emerged from this exploration was a fascinating correlation between the presence of the letter 'i' in a student's name, their gender, and academic performance. Gender Disparities: Our examination revealed a predominant association between the letter 'i' and female names. Of the students with 'i' in their name, a substantial majority were females. Academic Excellence Among 'I'-Named Females: Diving deeper into the academic performance of students with 'i' in their names, a striking trend unfolded. Among these individuals, seven females secured the highest accolade – an 'A' grade. This impressive achiev...

Module # 7 R Object: S3 vs. S4 assignment (R Programming)

February 25, 2024

1. How do you tell what OO system (S3 vs. S4) an object is associated with?
Use isS4(): it returns TRUE for S4 objects. If isS4() is FALSE but is.object() is TRUE (the object carries a class attribute), the object belongs to S3. class() alone is not conclusive, because it also reports an implicit class such as "numeric" or "matrix" for plain base objects. For an S4 class, the showClass() function from the 'methods' package displays its definition.

2. How do you determine the base type (like integer or list) of an object?
The typeof() function can be used to determine the base type of an object. Additionally, class() can provide information about the class, which may indicate the type.

3. What is a generic function?
A generic function is a function that behaves differently depending on the class of its arguments. It allows you to use the same function name for different methods tailored to specific classes.

4. What are the main differences between S3 and S4?
S3 is a simpler and more informal object-oriented system, relying on naming conventions and the class attribute. S4 is a more formal and structured system with an explicit definition of classe...
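The answers above can be checked with a short sketch. Everything named here (student_s3, StudentS4, oo_system, describe) is hypothetical and exists only to illustrate the checks; it assumes an object with a class attribute that is not S4 can be treated as S3.

```r
library(methods)

# S3: an ordinary list tagged with a class attribute
s3_obj <- structure(list(name = "Ana"), class = "student_s3")

# S4: a formally defined class with a typed slot
setClass("StudentS4", representation(name = "character"))
s4_obj <- new("StudentS4", name = "Ana")

# Hypothetical helper: report which OO system an object belongs to
oo_system <- function(x) {
  if (isS4(x)) {
    "S4"
  } else if (is.object(x)) {
    "S3"
  } else {
    "base"
  }
}

oo_system(s3_obj)
oo_system(s4_obj)
oo_system(1:3)

# Base types are independent of the class
typeof(s3_obj)
typeof(1:3)

# An S3 generic dispatches on the class of its first argument
describe <- function(x, ...) UseMethod("describe")
describe.student_s3 <- function(x, ...) paste("S3 student:", x$name)
describe.default <- function(x, ...) "no specific method"

describe(s3_obj)
describe(1:3)
```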

Module # 7 assignment (Visual Analytics)

February 21, 2024

In this example, I've created two scatter plots, one for mpg vs. hp and another for mpg vs. wt, using mtcars and arranged them in a grid using the grid.arrange function. My opinion on Few's recommendations. Simplicity: Few emphasizes simplicity to avoid overwhelming the audience. The scatter plots are relatively simple, displaying two variables at a time (mpg vs. hp, mpg vs. wt). This simplicity aids in easy interpretation without unnecessary complexity. Clarity: This is a key principle in Few's recommendations. The use of grid arrangement helps in comparing the two scatter plots side by side, making it easier for viewers to identify patterns and differences between the variables. Accuracy: This is crucial, and Few often advocates for accurate representation of data. The scatter plots accurately represent the relationship between variables, allowing viewers to make informed observations about the distribution and correlation. Color Usage: Few often advises agai...
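The excerpt refers to the plots without showing the code, so here is one way the arrangement could be reproduced. It assumes ggplot2 for the scatter plots and gridExtra for grid.arrange(); the object names and axis labels are illustrative, not taken from the post.

```r
library(ggplot2)
library(gridExtra)

# mpg vs. hp
p_hp <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(title = "mpg vs. hp", x = "Horsepower", y = "Miles per gallon")

# mpg vs. wt
p_wt <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "mpg vs. wt", x = "Weight (1000 lbs)", y = "Miles per gallon")

# Draw both plots side by side; the layout object is returned invisibly
combined <- grid.arrange(p_hp, p_wt, ncol = 2)
```

Keeping mpg on the y-axis of both panels means the two plots share a vertical scale, which supports the side-by-side comparison discussed above.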

Module # 6 assignment (Visual Analytics)

February 19, 2024

Code:

    # Create a vector for the bar chart
    x <- c(40, 30, 20, 10)

    # Display the vector
    x
    # [1] 40 30 20 10

    # Create a basic bar chart
    barplot(x)

    # Add names to the elements of the vector
    names(x) <- c("Red", "Blue", "Green", "Brown")

    # Display the updated vector
    x
    #   Red  Blue Green Brown
    #    40    30    20    10

    # Create a bar chart with labels
    barplot(x)

    # Define colors for the bar chart
    mycolors <- c("red", "blue", "green", "brown")

    # Create a bar chart with custom colors
    barplot(x, col = mycolors)

Clarity: The use of custom colors can enhance the clarity of the bar chart by making it visually appealing and helping distinguish between different categories. However, it's essential to choose colors that don't compromise accessibility for color-blind individuals.

Simplicity: While the custom colors can make the chart visually interesting, they might introduce complexity. Too many colors or ...

Module # 6 Doing math in R part 2 (R Programming)

February 19, 2024

1. Consider A=matrix(c(2,0,1,3), ncol=2) and B=matrix(c(5,2,4,-1), ncol=2).

a) Find A + B

    A <- matrix(c(2, 0, 1, 3), ncol = 2)
    B <- matrix(c(5, 2, 4, -1), ncol = 2)
    result_addition <- A + B
    print(result_addition)
         [,1] [,2]
    [1,]    7    5
    [2,]    2    2

Description: matrix() fills its values column by column, so the columns of A are (2, 0) and (1, 3), and the columns of B are (5, 2) and (4, -1). Adding element by element gives the columns (2+5, 0+2) = (7, 2) and (1+4, 3+(-1)) = (5, 2), which R prints as the rows [ 7 5 ] and [ 2 2 ].

b) Find A - B

    result_subtraction <- A - B
    print(result_subtraction)
         [,1] [,2]
    [1,]   -3   -3
    [2,]   -2    4

Description: Subtracting element by element gives the columns (2-5, 0-2) = (-3, -2) and (1-4, 3-(-1)) = (-3, 4), which R prints as the rows [ -3 -3 ] and [ -2 4 ].

2. Using the diag() function to build a matrix of size 4 with the following values in the diagonal 4, 1, 2, 3

    diagonal_values <- c(4, 1, 2, 3)
    result_matrix <- diag(diagonal_values)
    print(result_matrix)
         [,1] [,2] [,3] [,4]
    [1,]    4    0...

Module # 5 assignment (Visual Analytics)

February 12, 2024

In undertaking this week's assignment, the decision was made to employ a scatter plot, a tool recognized for its ability to convey complex data relationships. While considering alternative visualization methods, it became apparent that some options posed challenges in deciphering the intended message, leading to a preference for the clarity offered by the scatter plot. Within the presented data, a notable observation emerges: specifically, 60% of the average time for position 23 is documented at 22.80 seconds. This statistic provides a succinct overview of the temporal dynamics associated with this specific position. A more nuanced perspective is gained when examining positions 30-31, revealing that the average time for 80% of the race extends to 30.40 seconds. Noteworthy is the incremental time difference of 7.6 seconds from the average time of position 23. This underscores the significance of seemingly marginal time differentials in the competitive landscape, illustrating the c...

Module # 5 Doing Math [R Programming]

February 08, 2024

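    # Creating matrices A and B
    A <- matrix(1:100, nrow = 10)   # 10 x 10, but singular (its rank is 2)
    B <- matrix(1:1000, nrow = 10)  # 10 x 100, so not square

    # Attempting the inverse of matrix A; solve() stops with an error because A is singular
    A_inverse <- tryCatch(solve(A), error = function(e) conditionMessage(e))

    # Attempting the determinant of matrix B; det() stops with an error because B is not square
    B_det <- tryCatch(det(B), error = function(e) conditionMessage(e))

    # Printing the results (each is the error message when the computation fails)
    print("Inverse of Matrix A:")
    print(A_inverse)
    print("Determinant of Matrix B:")
    print(B_det)

1. A_inverse <- solve(A): This line attempts to calculate the inverse of matrix A using the solve() function. Because every column of A is a linear combination of two vectors, A is singular and has no inverse, so solve() raises an error; tryCatch() stores the error message instead of halting the script.
2. B_det <- det(B): This line attempts to calculate the determinant of matrix B using the det() function. The determinant is defined only for square matrices, and B is 10 x 100, so det() raises an error as well.
3. print("Inverse of Matrix A:") and print(A_inverse): These lines print the header and the outcome of the inverse attempt for Matrix A.
4. print("Determinant of Matrix B:") and print(B_det): These lines print the header and the outcome of the determinant attempt for Matrix B.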
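Since solve() only works for square, non-singular matrices and det() only for square ones, it can help to check those conditions first. The helper below, safe_inverse, is a hypothetical name introduced for illustration, and the small matrix C is a made-up invertible example; this is a sketch of one defensive approach, not part of the original assignment.

```r
# Hypothetical helper: return the inverse, or NULL when none exists
safe_inverse <- function(m) {
  # The inverse is only defined for square matrices
  if (!is.matrix(m) || nrow(m) != ncol(m)) {
    return(NULL)
  }
  # A square matrix is invertible only when it has full rank
  if (qr(m)$rank < nrow(m)) {
    return(NULL)
  }
  solve(m)
}

A <- matrix(1:100, nrow = 10)        # square but singular
B <- matrix(1:1000, nrow = 10)       # not square
C <- matrix(c(2, 0, 1, 3), ncol = 2) # square and invertible

is.null(safe_inverse(A))
is.null(safe_inverse(B))
C_inverse <- safe_inverse(C)
C %*% C_inverse   # should be (numerically) the 2 x 2 identity matrix
```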

Module # 4 assignment [Visual Analytics]

February 04, 2024

In this week's dataset analysis, I opted for a distinctive visualization approach, as illustrated above. Instead of the conventional U.S. map, I found this presentation style to be more intuitive for interpreting the data. Unlike the typical geographic map that aggregates collisions for entire states, our dataset includes data for individual state counties, making it challenging to discern specific details in a standard map view. In the showcased visualization, each square corresponds to a distinct county within various states. This format enables us to pinpoint the exact locations of each accident, offering a granular perspective rather than a holistic view. Unlike the state-level summary presented on a traditional map, this approach allows for a more detailed examination of where each incident occurs within individual counties. As observed in the visualization, the squares positioned towards the lower right corner indicate instances where there are either zero or only one collisi...

Module # 4 Programming structure in R [ R Programming ]

January 30, 2024

[CODE]

    # Data
    Frequency <- c(0.6, 0.3, 0.4, 0.4, 0.2, 0.6, 0.3, 0.4, 0.9, 0.2)
    BloodPressure <- c(103, 87, 32, 42, 59, 109, 78, 205, 135, 176)
    FirstAssessment <- c(1, 1, 1, 1, 0, 0, 0, 0, NA, 1)
    SecondAssessment <- c(0, 0, 1, 1, 0, 1, 1, 1, 1, 1)
    FinalDecision <- c(0, 1, 0, 1, 0, 1, 0, 1, 1, 1)

    # Create a data frame
    hospital_data <- data.frame(Frequency, BloodPressure, FirstAssessment, SecondAssessment, FinalDecision)

    # Boxplot
    par(mfrow=c(1,2)) # Set up a 1x2 grid for side-by-side plots
    boxplot(BloodPressure ~ FirstAssessment, data=hospital_data, main="Blood Pressure vs. First Assessment", xlab="First Assessment", ylab="Blood Pressure")
    boxplot(BloodPressure ~ SecondAssessment, data=hospital_data, main="Blood Pressure vs. Second Assessment", xlab="Second Assessment", ylab="Blood Pressure")

    # Histogram
    hist(BloodPressure, main="Histogram of Blood Pressure", xlab="Blood Pressure", ylab=...