In this take-home Exercise, I will explore the relationships between different features in participants dataset.
In this take-home exercise, I will explore the relationships between different features in participants dataset, and try to find the which kind of people have higher joviality.
The packages required are tidyverse (included relevant packages for data analyses such as ggplot2, readr and dplyr), readxl, ggrepel and knitr.
The code chunk below is used to install and load the required packages onto RStudio.
The code chunk below import Participants.csv from data
folder into R by using read_csv()
of readr
package and save it as an tibble data frame called
participants.
participants <- read_csv("data/Participants.csv")
glimpse(participants)
Rows: 1,011
Columns: 7
$ participantId <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
$ householdSize <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ haveKids <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~
$ age <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2~
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",~
$ interestGroup <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", ~
$ joviality <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380~
From the information of this data set above, we can see that there are 7 different columns in this data set, but participant Id column which presents different participant, has over 1000 unique values. I will not include this feature in my analysis since is not a meaningful feature for participants. As for the rest features, plot the distribution for each one as a preparation for further analysis.
The code chunk below plot a bar chart by using geom_bar()
by ggplot2
package.
p_householdsize <- ggplot(data = participants,
aes(x = householdSize))+
geom_bar(color="grey25",
fill="grey90") +
ggtitle("Distribution of Household Size")
p_havekids <- ggplot(data=participants,
aes(x = haveKids)) +
geom_bar(color="grey25",
fill="grey90") +
ggtitle("Distribution of Have Kids")
p_age <- ggplot(data=participants,
aes(x = age)) +
geom_histogram(boundary = 100,
color="grey25",
fill="grey90") +
coord_cartesian(xlim=c(16, 61)) +
ggtitle("Distribution of Age")
p_edu <- ggplot(data=participants,
aes(x = educationLevel)) +
geom_bar(boundary = 100,
color="grey25",
fill="grey90") +
ggtitle("Distribution of Education Level")
p_interest <- ggplot(data=participants,
aes(x = interestGroup)) +
geom_bar(boundary = 100,
color="grey25",
fill="grey90") +
ggtitle("Distribution of Interest Group")
p_joviality<- ggplot(data=participants,
aes(x = joviality)) +
geom_density() +
ggtitle("Distribution of Joviality")
(p_householdsize/p_havekids/p_interest)|(p_age/p_edu/p_joviality)
In the participants, most of them don’t have kids and have high school or college education level. There is no a dominant interest in these participants, and the joviality distribution is quite uniform. The numbers of participants with different household size are quite equal. The age distribution is ruleless.
From the distributions figure of features above, we can see that the age feature has too many unique value. We can combine them into different age group for our analysis.
The code chunk below plot a bar chart by using mutate to build a new column based on age and save the new dataset as participants_age.
From the distribution figures and the features name, I want to see the relationship between some features.
ggplot(data = participants_age,
aes(x = householdSize, fill = haveKids))+
geom_bar()+
ggtitle("Distribution of household size with haveKids feature")
From the chart we can have a conclusion, the participants whose household size equals to 3, all of them have kids.
pk1 <- ggplot(data = participants,
aes(x = educationLevel, fill = haveKids))+
geom_bar() +
ggtitle("Have kids according to Education Level")
pk2 <- ggplot(data = participants_age,
aes(x = age_group, fill = haveKids))+
geom_bar() +
ggtitle("Have kids according to Age Group")
pk3 <- ggplot(data = participants_age,
aes(x = interestGroup, fill = haveKids))+
geom_bar() +
ggtitle("Have kids according to Interest Group")
pk1
pk2
pk3
Interest group have higher influence on the situation of having kids than education level and age group.
ggplot(data = participants_age,
aes(x = age_group, fill = educationLevel))+
geom_bar() +
ggtitle("Age Group according to Education Level")
From different age group, the education level distribution is quite same.
ggplot(participants_age, aes(x = interestGroup, y = joviality, fill = haveKids)) +
introdataviz::geom_split_violin(alpha = .4, trim = FALSE) +
geom_boxplot(width = .2, alpha = .6, fatten = NULL, show.legend = FALSE) +
stat_summary(fun.data = "mean_se", geom = "pointrange", show.legend = F,
position = position_dodge(.175)) +
scale_y_continuous(breaks = seq(0, 1, 0.2),
limits = c(0, 1)) +
scale_fill_brewer(palette = "Dark2", name = "Have Kids")
From this chart we can see the participants in interest group G, if they have kids, they are much happier than those who don’t have kids in the same interest group. The same situation is also shown in interest group I, but it is not obvious as group G. Except these two interest groups, the average joviality don’t change too much according to whether the participants have kids. Although the average of joviality is quite same, but some distribution is quite different, in interest group C, if the participants don’t have kids, they are tend to very happy or very unhappy, while if they have kids, their joviality distribution is quite uniform, don’t have big fluctuation.