Introduction to the data set

The data we use in this project come from two datasets on Portuguese students and their performance in math (395 observations) and Portuguese (649 observations) courses. 382 students belong to both datasets. We mainly work with the two datasets separately, but some of our analysis uses the joint dataset. Both datasets contain the same 33 variables, covering school, sex, age, the students’ study and lifestyle habits, family details, and three grades. Because the earlier grades (G1, G2) are helpful in predicting the final grade (G3), we explore models both with and without G1 and G2. The full list and description of predictors can be found at

We first perform Exploratory Data Analysis to understand the relationships between predictors and grades, including trends between grades and school, parents’ jobs, alcohol consumption, health, and family relationships.

We then build a linear model to predict G3 to get an idea of the results a basic model can achieve.

We then attempt to explore and answer various questions including: What are the most important variables in predicting a student’s grade? Which variables have a similar effect in predicting grades? What attributes are common among similarly performing students? What attributes are most helpful in predicting the school a student attends? What attributes are most helpful in predicting internet access?

To investigate these questions, we use a variety of methods and models, including clustering, random forests, lasso, subset selection, and PCA.

d1 =
d2 =
d3 =
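The column names G3.x and G3.y used below are consistent with a joint table built by an inner join of the math and Portuguese tables. The following is only a minimal sketch of that idea on invented toy data: the data frames `math_toy` and `port_toy`, their values, and the matching columns in `key_cols` are all assumptions for illustration, not the actual definitions of d1, d2, and d3.

```r
# Toy stand-ins for the math and Portuguese tables; all values are invented.
math_toy <- data.frame(
  school = c("GP", "GP", "MS"),
  sex    = c("F", "M", "F"),
  age    = c(15, 16, 17),
  G3     = c(10, 0, 14)
)
port_toy <- data.frame(
  school = c("GP", "MS", "MS"),
  sex    = c("F", "F", "M"),
  age    = c(15, 17, 18),
  G3     = c(12, 15, 9)
)

# Hypothetical matching key; the real analysis may match on other columns.
key_cols <- c("school", "sex", "age")

# Inner join: keeps only students found in both tables.
# Non-key columns sharing a name get the default suffixes ".x" and ".y".
joint_toy <- merge(math_toy, port_toy, by = key_cols)
print(joint_toy)
```

In this toy example, two students appear in both tables, and the joined table carries the math grade as G3.x and the Portuguese grade as G3.y.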

Exploratory Data Analysis (EDA)

We will create various plots to understand the relationships between predictors in both the math and Portuguese data sets.

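# Reshape the joint data to long format: one row per student and course
d3 %>%
  gather(`G3.x`, `G3.y`, key="course", value="grade") %>%
  ggplot() +
  # Side-by-side bars: count of students at each final grade, by course
  geom_bar(aes(x=grade, fill=course), position="dodge") + 
  ggtitle("Distribution of final grades in Math and Portuguese courses") +
  # G3.x is the math grade and G3.y is the Portuguese grade
  scale_fill_discrete(name = "Course", labels = c("Math", "Portuguese"))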

c(mean(d3$G3.x), mean(d3$G3.y))
## [1] 10.38743 12.51571

As seen in the plot and the summary statistics, students who are in both the math and Portuguese courses have a higher mean final grade in Portuguese than in math.

c(mean(d1$G3), mean(d2$G3))
## [1] 10.41519 11.90601

While the mean final grade of all math students (10.42) is close to that of the subset of students in both math and Portuguese (10.39), the mean final grade in Portuguese differs more (11.91 for all students vs 12.52 for the subset). This makes sense because all but 13 of the 395 math students are in the combined dataset, whereas the Portuguese dataset has 267 additional students (649 − 382).
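The reported counts and means also let us back out the mean final grade of the students who are not in the joint dataset. The sketch below uses only numbers printed above and assumes that each of the 382 joint rows corresponds to exactly one row in each course dataset; the helper `mean_of_rest` is introduced here for illustration.

```r
# Counts and means reported in the text and output above.
n_math <- 395; n_port <- 649; n_both <- 382
mean_math_all <- 10.41519; mean_math_both <- 10.38743
mean_port_all <- 11.90601; mean_port_both <- 12.51571

# Mean of the students NOT in the subset, from the identity
# n_all * mean_all = n_sub * mean_sub + n_rest * mean_rest.
mean_of_rest <- function(n_all, mean_all, n_sub, mean_sub) {
  (n_all * mean_all - n_sub * mean_sub) / (n_all - n_sub)
}

port_only_mean <- mean_of_rest(n_port, mean_port_all, n_both, mean_port_both)
math_only_mean <- mean_of_rest(n_math, mean_math_all, n_both, mean_math_both)
print(round(c(port_only = port_only_mean, math_only = math_only_mean), 2))
```

Under that assumption, the Portuguese-only students average about 11.03 in G3, well below the 12.52 of the joint subset, which accounts for the lower overall Portuguese mean. The math-only students average about 11.23, but with so few of them the overall math mean barely moves.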

c(mean(d1$G1), mean(d1$G2), mean(d1$G3))
## [1] 10.90886 10.71392 10.41519
c(mean(d3$G1.x), mean(d3$G2.x), mean(d3$G3.x))
## [1] 10.86126 10.71204 10.38743

For both all math students and the subset in the joint dataset, the mean grade decreases slightly from G1 to G3.

c(mean(d2$G1), mean(d2$G2), mean(d2$G3))
## [1] 11.39908 11.57011 11.90601
c(mean(d3$G1.y), mean(d3$G2.y), mean(d3$G3.y))
## [1] 12.11257 12.23822 12.51571

In Portuguese, the mean grade increases slightly from G1 to G3, both for the Portuguese students in the joint dataset (d3) and for all Portuguese students (d2).

mathGrades <- d1 %>%
  gather(`G1`, `G2`, `G3`, key="semester", value="grade") %>%
  ggplot() +
  geom_bar(aes(x=grade, fill=semester), position="dodge") + 
  ggtitle("Distribution of three grades in Math")
portGrades <- d2 %>%
  gather(`G1`, `G2`, `G3`, key="semester", value="grade") %>%
  ggplot() +
  geom_bar(aes(x=grade, fill=semester), position="dodge") +
  ggtitle("Distribution of three grades in Portuguese")
grid.arrange(mathGrades, portGrades)

Comparing the grade distributions of math and Portuguese, we see the increasing trend in Portuguese. The decreasing trend in math is most likely due to the growing number of students with a grade of 0 from G1 to G3. These plots suggest that G1 and G2 should be effective in predicting G3. The summary statistics above show no drastic differences in grades between the subsets and all students in either math or Portuguese.
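One way to check whether zero grades drive a falling mean is to count the zeros in each period and compare the means with and without them. The sketch below does this on a small invented data frame `toy`; the values are hypothetical and only illustrate the computation, which could be applied to the G1, G2, and G3 columns of d1 in the same way.

```r
# Toy grades (invented values) standing in for G1, G2, G3 of one course.
toy <- data.frame(
  G1 = c(10, 12, 8, 14, 9),
  G2 = c(10, 13, 0, 14, 9),
  G3 = c(11, 13, 0, 15, 0)
)
grade_cols <- c("G1", "G2", "G3")

# Number of zero grades in each period.
zero_counts <- colSums(toy[grade_cols] == 0)

# Mean per period, with and without the zero grades.
mean_all <- colMeans(toy[grade_cols])
mean_nonzero <- sapply(toy[grade_cols], function(g) mean(g[g != 0]))

print(rbind(zero_counts, mean_all, mean_nonzero))
```

In the toy data the overall mean falls from G1 to G3 while the mean of the nonzero grades rises, which is the pattern one would expect if zeros alone explain the decline.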

mathGrades2 <- ggplot(d1) +
  geom_bar(aes(x=school, fill=as.factor(G3)), position="dodge") +
  ggtitle("Distribution of Math grades by school") +
  theme(legend.position = "none")

portGrades2 <- ggplot(d2) +
  geom_bar(aes(x=school, fill=as.factor(G3)), position="dodge") +
  ggtitle("Distribution of Portuguese grades by school") + 
  theme(legend.position = "none")
grid.arrange(mathGrades2, portGrades2)

We see similar trends across the two schools in both math and Portuguese. In general, grades tend to be higher at the Gabriel Pereira (GP) school.
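Because the bar charts show counts rather than averages, a per-school summary gives a more direct comparison. The sketch below computes the mean final grade and group size by school on an invented data frame `toy_school`; the values are hypothetical, and the same calls could be applied to the school and G3 columns of d1 or d2.

```r
# Toy data (invented values): school code and final grade per student.
toy_school <- data.frame(
  school = c("GP", "GP", "GP", "MS", "MS"),
  G3     = c(12, 10, 14, 9, 11)
)

# Mean final grade and number of students per school.
mean_by_school <- tapply(toy_school$G3, toy_school$school, mean)
n_by_school <- table(toy_school$school)

print(mean_by_school)
print(n_by_school)
```

Reporting the group sizes alongside the means matters here, since a difference in means between schools of very different sizes is easier to misread from count-based plots.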

schoolMath <- ggplot(d1, aes(x=G3)) +
  geom_density(aes(color=school)) +
  ggtitle("Distribution of Math students' grades by school")

schoolPort <- ggplot(d2, aes(x=G3)) +
  geom_density(aes(color=school)) +
  ggtitle("Distribution of Portuguese students' grades by school")
grid.arrange(schoolMath, schoolPort)