1. Overview

In this take-home exercise, I explore the relationships between features in the participants dataset and try to identify which kinds of people have higher joviality.

2. Getting Started

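The packages required are tidyverse (which includes packages for data analysis such as ggplot2, readr and dplyr), readxl, ggrepel and knitr. The charts in this exercise also rely on patchwork for combining plots and, in section 4.3.4, on introdataviz for the split violin plot.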

The code chunk below installs the required packages if they are missing and loads them in RStudio.

packages <- c('tidyverse')

for (p in packages) {
  # Install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

library(patchwork)

3. Importing Data

The code chunk below imports Participants.csv from the data folder into R using read_csv() from the readr package and saves it as a tibble data frame called participants.

participants <- read_csv("data/Participants.csv")
glimpse(participants)

Rows: 1,011
Columns: 7
$ participantId  <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,~
$ householdSize  <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
$ haveKids       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU~
$ age            <dbl> 36, 25, 35, 21, 43, 32, 26, 27, 20, 35, 48, 2~
$ educationLevel <chr> "HighSchoolOrCollege", "HighSchoolOrCollege",~
$ interestGroup  <chr> "H", "B", "A", "I", "H", "D", "I", "A", "G", ~
$ joviality      <dbl> 0.001626703, 0.328086500, 0.393469590, 0.1380~

4. Analysis

The output above shows that the data set has 7 columns. The participantId column identifies each participant and has over 1,000 unique values, so I exclude it from the analysis because it is not a meaningful feature of the participants. For the remaining features, I plot the distribution of each one as preparation for further analysis.
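One way to confirm that a column is an identifier rather than a feature is to count its distinct values. The sketch below uses base R on a small made-up data frame (toy_participants and its values are hypothetical, not the real file), so it only illustrates the check.

```r
# Toy stand-in for the participants data (values are made up for illustration)
toy_participants <- data.frame(
  participantId = 0:5,
  householdSize = c(3, 3, 2, 1, 2, 1),
  haveKids      = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Number of distinct values in each column
n_unique <- sapply(toy_participants, function(x) length(unique(x)))
n_unique

# A column with one distinct value per row behaves like an ID
id_like <- names(n_unique)[n_unique == nrow(toy_participants)]

# Keep only the columns that are not ID-like
toy_features <- toy_participants[, setdiff(names(toy_participants), id_like),
                                 drop = FALSE]
names(toy_features)
```

On the real data, the same idea would be applied to participants instead of the toy data frame.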

4.1 Distribution of features

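The code chunk below plots bar charts with geom_bar(), a histogram with geom_histogram() and a density curve with geom_density() from the ggplot2 package. The six plots are then arranged with patchwork, where / stacks plots vertically and | places them side by side.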

p_householdsize <- ggplot(data = participants,
       aes(x = householdSize))+
  geom_bar(color="grey25", 
           fill="grey90") +
  ggtitle("Distribution of Household Size")

p_havekids <- ggplot(data=participants, 
             aes(x = haveKids)) +
  geom_bar(color="grey25", 
           fill="grey90") + 
  ggtitle("Distribution of Have Kids")

p_age <- ggplot(data=participants, 
             aes(x = age)) +
  geom_histogram(boundary = 100,
                 color="grey25", 
                 fill="grey90") +
  coord_cartesian(xlim=c(16, 61)) +
  ggtitle("Distribution of Age")

p_edu <- ggplot(data=participants, 
             aes(x = educationLevel)) +
  # boundary applies to histograms only, so it is not passed to geom_bar()
  geom_bar(color="grey25", 
           fill="grey90") +
  ggtitle("Distribution of Education Level")

p_interest <- ggplot(data=participants, 
             aes(x = interestGroup)) +
  geom_bar(color="grey25", 
           fill="grey90") +
  ggtitle("Distribution of Interest Group")

p_joviality<- ggplot(data=participants, 
             aes(x = joviality)) +
  geom_density() +
  ggtitle("Distribution of Joviality")

(p_householdsize/p_havekids/p_interest)|(p_age/p_edu/p_joviality)

Most participants do not have kids, and most have a high school or college education level. No interest group dominates, and the joviality distribution is fairly uniform. The household sizes are roughly equally represented. The age distribution shows no clear pattern.
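Readings such as "most participants" can be backed with proportions instead of being judged from bar heights alone. The following is a minimal sketch in base R; the toy vector is made up for illustration and does not reproduce the real counts.

```r
# Made-up haveKids values standing in for participants$haveKids
toy_have_kids <- c(TRUE, FALSE, FALSE, FALSE, TRUE,
                   FALSE, FALSE, FALSE, FALSE, TRUE)

# Counts per category, then the share of each category
kids_counts <- table(toy_have_kids)
kids_share  <- prop.table(kids_counts)
kids_share
```

The same two calls could be applied to educationLevel or interestGroup to see whether one category clearly dominates.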

4.2 Data Processing

The distribution figure above shows that the age feature has too many unique values. We can combine them into age groups for the analysis.

The code chunk below uses mutate() with cut() to build a new age_group column based on age and saves the new data set as participants_age.

participants_age <- participants %>%
  mutate(age_group = cut(age, breaks = c(17,25,35,45,55,60)))
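It helps to see how cut() treats the break points. By default each interval is open on the left and closed on the right, so an age equal to a break falls in the lower group, and values at or below the first break become NA. The sketch below applies the same breaks to a few made-up ages; the labels vector is an illustrative choice, not part of the original analysis.

```r
# A few made-up ages, including values that sit exactly on a break
ages   <- c(18, 25, 26, 45, 60, 17)
breaks <- c(17, 25, 35, 45, 55, 60)

# Default labels use interval notation such as "(17,25]"
age_group <- cut(ages, breaks = breaks)
data.frame(age = ages, age_group = age_group)
# 25 lands in (17,25] and 17 becomes NA because the first interval excludes 17

# Optional: more readable labels (hypothetical naming, one label per interval)
age_group_labelled <- cut(ages, breaks = breaks,
                          labels = c("18-25", "26-35", "36-45", "46-55", "56-60"))
```

If the data contained ages of 17 or below, or above 60, the breaks would need to be widened to avoid missing values in age_group.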

4.3 Feature Relationship Exploration

Based on the distribution figures and the feature names, I want to examine the relationships between some of the features.

4.3.1 Household Size and Have Kids

ggplot(data = participants_age,
       aes(x = householdSize, fill = haveKids))+
  geom_bar()+
  ggtitle("Distribution of household size with haveKids feature")

From the chart we can conclude that all participants whose household size equals 3 have kids.
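A conclusion like this can be cross-checked with a contingency table. The sketch below is base R on made-up rows (toy_households is hypothetical), so it shows the check rather than the real counts.

```r
# Made-up rows standing in for the householdSize and haveKids columns
toy_households <- data.frame(
  householdSize = c(1, 1, 2, 2, 3, 3, 3),
  haveKids      = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)
)

# Cross-tabulate household size against having kids
size_by_kids <- table(householdSize = toy_households$householdSize,
                      haveKids      = toy_households$haveKids)
size_by_kids

# Share of participants with kids within each household size
kids_share_by_size <- tapply(toy_households$haveKids,
                             toy_households$householdSize, mean)
kids_share_by_size
```

A share of exactly 1 for a household size means every participant of that size has kids.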

4.3.2 Other features that may influence whether people have kids

pk1 <- ggplot(data = participants,
              aes(x = educationLevel, fill = haveKids))+
        geom_bar() +
        ggtitle("Have kids according to Education Level")

pk2 <- ggplot(data = participants_age,
              aes(x = age_group, fill = haveKids))+
        geom_bar() +
        ggtitle("Have kids according to Age Group")

pk3 <- ggplot(data = participants_age,
              aes(x = interestGroup, fill = haveKids))+
        geom_bar() +
        ggtitle("Have kids according to Interest Group")

pk1

pk2

pk3

Interest group appears to have a greater influence on whether participants have kids than education level or age group does.
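Stacked counts are hard to compare when the groups differ in size. One option, shown here as a sketch and not as part of the original analysis, is to draw the bars with position = "fill" so that each bar shows proportions. The toy data frame is made up for illustration.

```r
library(ggplot2)

# Made-up rows standing in for interestGroup and haveKids
toy_interest <- data.frame(
  interestGroup = c("A", "A", "A", "A", "B", "B", "C", "C", "C"),
  haveKids      = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

# position = "fill" rescales every bar to height 1,
# so the fill colours show the share with and without kids
pk_fill <- ggplot(data = toy_interest,
                  aes(x = interestGroup, fill = haveKids)) +
  geom_bar(position = "fill") +
  ylab("Proportion") +
  ggtitle("Share of participants with kids by Interest Group")

pk_fill
```

Swapping interestGroup for educationLevel or age_group would give comparable charts for the other two features.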

4.3.3 Age Group and Education Level

ggplot(data = participants_age,
       aes(x = age_group, fill = educationLevel))+
  geom_bar() +
  ggtitle("Age Group according to Education Level")

Across the age groups, the education level distribution is much the same.

4.3.4 Interest group and kids influence on joviality

ggplot(participants_age, aes(x = interestGroup, y = joviality, fill = haveKids)) +
  introdataviz::geom_split_violin(alpha = .4, trim = FALSE) +
  geom_boxplot(width = .2, alpha = .6, fatten = NULL, show.legend = FALSE) +
  stat_summary(fun.data = "mean_se", geom = "pointrange", show.legend = FALSE, 
               position = position_dodge(.175)) +
  scale_y_continuous(breaks = seq(0, 1, 0.2), 
                     limits = c(0, 1)) +
  scale_fill_brewer(palette = "Dark2", name = "Have Kids")

From this chart we can see that participants in interest group G who have kids are much happier than those in the same group who do not. The same pattern appears in interest group I, though less clearly than in group G. In the other interest groups, average joviality does not change much with whether participants have kids. Although the averages are similar, some distributions differ considerably: in interest group C, participants without kids tend to be either very happy or very unhappy, whereas the joviality of those with kids is fairly uniform, without large fluctuations.
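The visual reading can be complemented with group summaries. The sketch below computes the mean and standard deviation of joviality per interest group and haveKids value using base R. The toy values are made up so that one group has equal means but different spreads; they are not taken from the real data.

```r
# Made-up rows standing in for interestGroup, haveKids and joviality
toy_jov <- data.frame(
  interestGroup = c("G", "G", "G", "G", "C", "C", "C", "C"),
  haveKids      = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE),
  joviality     = c(0.8, 0.9, 0.3, 0.4, 0.5, 0.5, 0.1, 0.9)
)

# Mean joviality for each combination of interest group and haveKids
jov_mean <- aggregate(joviality ~ interestGroup + haveKids,
                      data = toy_jov, FUN = mean)

# Standard deviation for the same combinations, to compare spread
jov_sd <- aggregate(joviality ~ interestGroup + haveKids,
                    data = toy_jov, FUN = sd)

jov_mean
jov_sd
```

In the toy data, group C has the same mean with and without kids but a much larger standard deviation without kids, which is the kind of difference a mean alone would hide.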

Take-home Exercise 1