Choosing a major is a difficult decision almost everyone has to make. This milestone will influence every student’s career trajectory for good or bad.
Young people are encouraged to weight several factors before making a final decision such as:
Since its a major life decision, we want to informe young people about the advantages or disadvantages they may encounter when selecting an specific major. Hence, we will compare the recommendations from three different methods for determining the optimal number of clusters, apply a k-means clustering analysis, and visualize the results.
We will use a year-long survey of 1.2 million people with only a bachelor’s degree by PayScale.Inc that can be found here.
library(ggplot2)
library(tidyr)
library(dplyr)
library(readr)
library(cluster)
library(factoextra)
majors <- read.csv('Us_college.csv')
col_names=c('College.Major', 'Starting.Median.Salary','Mid.Career.Median.Salary',
'Career.Percent.Growth', 'Percentile.10','Percentile.25',
'Percentile.75','Percentile.90')
names(majors) <- col_names
print(head(majors))
## College.Major Starting.Median.Salary Mid.Career.Median.Salary
## 1 Accounting $46,000.00 $77,100.00
## 2 Aerospace Engineering $57,700.00 $101,000.00
## 3 Agriculture $42,600.00 $71,900.00
## 4 Anthropology $36,800.00 $61,500.00
## 5 Architecture $41,600.00 $76,800.00
## 6 Art History $35,800.00 $64,900.00
## Career.Percent.Growth Percentile.10 Percentile.25 Percentile.75 Percentile.90
## 1 67.6 $42,200.00 $56,100.00 $108,000.00 $152,000.00
## 2 75.0 $64,300.00 $82,100.00 $127,000.00 $161,000.00
## 3 68.8 $36,300.00 $52,100.00 $96,300.00 $150,000.00
## 4 67.1 $33,800.00 $45,500.00 $89,300.00 $138,000.00
## 5 84.6 $50,600.00 $62,200.00 $97,000.00 $136,000.00
## 6 81.3 $28,800.00 $42,200.00 $87,400.00 $125,000.00
majors_clean <- majors %>%
mutate_at(vars(Starting.Median.Salary:Percentile.90),
function(x) as.numeric(gsub("[\\$,]","",x))) %>%
mutate(Career.Percent.Growth = Career.Percent.Growth/100)
First step, we have to determine the number of clusters we should model. There are several methods to get to an aproach, since one its not enough for the task. Hence we will use 3 techniques to optimize the number of clusters:
data <- majors_clean %>%
select(Starting.Median.Salary, Mid.Career.Median.Salary,
Percentile.10, Percentile.90) %>% scale()
Our first method compares the total variation within clusters for different values of k to the null hypothesis, making this to maximize the gap. This hypothesis refers to a uniformly distributed simulated reference dataset with no observable clusters, generated by aligning with the principle components of our original dataset.
gap_stat <- clusGap(data, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
gap_static <- fviz_gap_stat(gap_stat)
gap_static
In the above visualization we see how much the variance is explained by k clusters in our dataset, making the ideal value 3 for our clusters.
This method evaluates the clusters quality by how well each point fits within a cluster, maximizing average silhouette width.
silhouette <- fviz_nbclust(data, kmeans, method = "silhouette")
silhouette
Finally, the elbow method, which is the most common to use for this type of situations. This method plots the percent variance against the number of clusters.
elbow <- fviz_nbclust(data, kmeans, method = "wss")
elbow
As you can see there is a break point at the value 3, making the curve to bend like an elbow. This indicates the optimal point at which adding more clusters will no longer explain a significant amount of the variance.
Once we have our results (2, 3 and 3) we can be sure that working with 3 clusters will be the best for our model.
set.seed(354)
num <- 3
k_means <- kmeans(data, num, iter.max = 15, nstart = 25)
majors_labld <- majors_clean %>%
mutate(clusters = k_means$cluster)
Accordingly to this, lets start by getting a view to how each cluster compares in terms of starting vs mid career salaries.
career <- ggplot(majors_labld,
aes(x=Starting.Median.Salary,y=Mid.Career.Median.Salary,
color=factor(clusters))) +
geom_point(alpha=0.75,size=6) +
scale_color_manual(name="Clusters",values=c("#2A9D8F","#E9C46A", "#E76F51")) +
ggtitle('Clusters by Starting vs Mid Career Median Salaries') +
scale_x_continuous(labels = scales::dollar) +
scale_y_continuous(labels = scales::dollar) +
labs(x='Starting Median Salary', y='Mid Career Median Salary')
career
As shown in the visualization the data follows a linear trend meaning that the higher your starting salary is, the higher your mid career salary will be. The clusters provide a level of delineation that also supports this.
Furthermore we also can see two outliners from cluster 1 and 3 that may be explained if we go deeper into our analysis and get insights of the career salaries percentiles.
percentiles <- majors_labld %>%
select(College.Major, Percentile.10, Percentile.25,
Mid.Career.Median.Salary, Percentile.75,
Percentile.90, clusters) %>%
gather(key=percentile, value=salary, -c(College.Major, clusters)) %>%
mutate(percentile=factor(percentile,levels=c('Percentile.10','Percentile.25',
'Mid.Career.Median.Salary','Percentile.75','Percentile.90')))
It seems this cluster is characterized by job stability and sets in the middle of the road in our dataset starting off not too low and not too high in the lowest percentile. It also represents the majors with the greatest difference between the lowest and highest percentiles.
cluster_1 <- ggplot(percentiles[percentiles$clusters==1,],
aes(x=percentile,y=salary,
group=College.Major, color=College.Major, order=salary)) +
geom_point() +
geom_line() +
theme(axis.text.x = element_text(size=7, angle=25)) +
ggtitle('Cluster 1: Applied Sciences Pathway') +
labs(x='Percentile', y='Salary')
cluster_1
If you are good at math and want financial security you should consider one of this mayors. These engineering majors represent the highest growth potential in the 90th percentile, as well as the best security in the 10th percentile rankings. We see one of the outliers, now identifiable as Physician Assistant lagging in the highest percentiles.
cluster_2 <- ggplot(percentiles[percentiles$clusters==2,],
aes(x=percentile,y=salary,
group=College.Major, color=College.Major)) +
geom_point() +
geom_line() +
ggtitle('Cluster 2: Engineering and Math Pathway') +
theme(axis.text.x = element_text(size=7, angle=25)) +
labs(x='Percentile', y='Salary')
cluster_2
Not all is science and math, there are also good majors with a lot of opportunities. If you are passionate about art or have the hearth to help others these are good options. However the majors surrounding this cluster represent the lowest percentiles with limited growth opportunity.
Music major is the riskiest with the lowest 10th percentile salary, but Drama has the highest growth potential in the 90th percentile for this cluster. Nursing is the outlier culprit of this cluster, with a higher safety net in the lowest percentile to the median.
cluster_3 <- ggplot(percentiles[percentiles$clusters==3,],
aes(x=percentile,y=salary,
group=College.Major, color=College.Major)) +
geom_point() +
geom_line() +
ggtitle('Cluster 3: Art and Human Sciences Pathway') +
theme(axis.text.x = element_text(size=7, angle=25)) +
labs(x='Percentile', y='Salary')
cluster_3
Dealing with unsupervised data always requires skill and some creativity. It is recommended the use of not just one method to look after the ideal number of clusters but to use more to be certain and get a good model.
Yes, it is important to focus on high starting career salaries when choosing a major and to consider the growth potential. However, keep in mind that whether a major falls to the sciences, engineering or arts and human sciences cluster, one’s financial destiny is influenced by numerous other factors including the school attended, passion or talent for the subject.
If you are curious about the factors mentioned above, a similar analysis to evaluate them can be conducted on the additional data provided by the Wall Street Journal article, comparing salary potential by type and region of college attended.
Regardless of the salaries, job opportunities and other factors, I want to remind you the most important thing: Follow your passions!
College Major | Start Median Salary | Mid Median Salary | Percent Growth | Percntl 10 | Percntl 25 | Percntl 75 | Perctnl 90 | Cluster |
---|---|---|---|---|---|---|---|---|
Accounting | 46000 | 77100 | 0.676 | 42200 | 56100 | 108000 | 152000 | 3 |
Aerospace Engineering | 57700 | 101000 | 0.750 | 64300 | 82100 | 127000 | 161000 | 2 |
Agriculture | 42600 | 71900 | 0.688 | 36300 | 52100 | 96300 | 150000 | 3 |
Anthropology | 36800 | 61500 | 0.671 | 33800 | 45500 | 89300 | 138000 | 1 |
Architecture | 41600 | 76800 | 0.846 | 50600 | 62200 | 97000 | 136000 | 3 |
Art History | 35800 | 64900 | 0.813 | 28800 | 42200 | 87400 | 125000 | 1 |
Biology | 38800 | 64800 | 0.670 | 36900 | 47400 | 94500 | 135000 | 1 |
Business Management | 43000 | 72100 | 0.677 | 38800 | 51500 | 102000 | 147000 | 3 |
Chemical Engineering | 63200 | 107000 | 0.693 | 71900 | 87300 | 143000 | 194000 | 2 |
Chemistry | 42600 | 79900 | 0.876 | 45300 | 60700 | 108000 | 148000 | 3 |
Civil Engineering | 53900 | 90500 | 0.679 | 63400 | 75100 | 115000 | 148000 | 2 |
Communications | 38100 | 70000 | 0.837 | 37500 | 49700 | 98800 | 143000 | 3 |
Computer Engineering | 61400 | 105000 | 0.710 | 66100 | 84100 | 135000 | 162000 | 2 |
Computer Science | 55900 | 95500 | 0.708 | 56000 | 74900 | 122000 | 154000 | 2 |
Construction | 53700 | 88900 | 0.655 | 56300 | 68100 | 118000 | 171000 | 2 |
Criminal Justice | 35000 | 56300 | 0.609 | 32200 | 41600 | 80700 | 107000 | 1 |
Drama | 35900 | 56900 | 0.585 | 36700 | 41300 | 79100 | 153000 | 1 |
Economics | 50100 | 98600 | 0.968 | 50600 | 70600 | 145000 | 210000 | 2 |
Education | 34900 | 52000 | 0.490 | 29300 | 37900 | 73400 | 102000 | 1 |
Electrical Engineering | 60900 | 103000 | 0.691 | 69300 | 83800 | 130000 | 168000 | 2 |
English | 38000 | 64700 | 0.703 | 33400 | 44800 | 93200 | 133000 | 1 |
Film | 37900 | 68500 | 0.807 | 33900 | 45500 | 100000 | 136000 | 1 |
Finance | 47900 | 88300 | 0.843 | 47200 | 62100 | 128000 | 195000 | 3 |
Forestry | 39100 | 62600 | 0.601 | 41000 | 49300 | 78200 | 111000 | 1 |
Geography | 41200 | 65500 | 0.590 | 40000 | 50000 | 90800 | 132000 | 1 |
Geology | 43500 | 79500 | 0.828 | 45000 | 59600 | 101000 | 156000 | 3 |
Graphic Design | 35700 | 59800 | 0.675 | 36000 | 45500 | 80800 | 112000 | 1 |
Health Care Administration | 38800 | 60600 | 0.562 | 34600 | 45600 | 78800 | 101000 | 1 |
History | 39200 | 71000 | 0.811 | 37000 | 49200 | 103000 | 149000 | 3 |
Hospitality & Tourism | 37800 | 57500 | 0.521 | 35500 | 43600 | 81900 | 124000 | 1 |
Industrial Engineering | 57700 | 94700 | 0.641 | 57100 | 72300 | 132000 | 173000 | 2 |
Information Technology (IT) | 49100 | 74800 | 0.523 | 44500 | 56700 | 96700 | 129000 | 3 |
Interior Design | 36100 | 53200 | 0.474 | 35700 | 42600 | 72500 | 107000 | 1 |
International Relations | 40900 | 80900 | 0.978 | 38200 | 56000 | 111000 | 157000 | 3 |
Journalism | 35600 | 66700 | 0.874 | 38400 | 48300 | 97700 | 145000 | 1 |
Management Information Systems (MIS) | 49200 | 82300 | 0.673 | 45300 | 60500 | 108000 | 146000 | 3 |
Marketing | 40800 | 79600 | 0.951 | 42100 | 55600 | 119000 | 175000 | 3 |
Math | 45400 | 92400 | 1.035 | 45200 | 64200 | 128000 | 183000 | 3 |
Mechanical Engineering | 57900 | 93600 | 0.617 | 63700 | 76200 | 120000 | 163000 | 2 |
Music | 35900 | 55000 | 0.532 | 26700 | 40200 | 88000 | 134000 | 1 |
Nursing | 54200 | 67000 | 0.236 | 47600 | 56400 | 80900 | 98300 | 1 |
Nutrition | 39900 | 55300 | 0.386 | 33900 | 44500 | 70500 | 99200 | 1 |
Philosophy | 39900 | 81200 | 1.035 | 35500 | 52800 | 127000 | 168000 | 3 |
Physician Assistant | 74300 | 91700 | 0.234 | 66400 | 75200 | 108000 | 124000 | 2 |
Physics | 50300 | 97300 | 0.934 | 56000 | 74200 | 132000 | 178000 | 2 |
Political Science | 40800 | 78200 | 0.917 | 41200 | 55300 | 114000 | 168000 | 3 |
Psychology | 35900 | 60400 | 0.682 | 31600 | 42100 | 87500 | 127000 | 1 |
Religion | 34100 | 52000 | 0.525 | 29700 | 36500 | 70900 | 96400 | 1 |
Sociology | 36500 | 58200 | 0.595 | 30700 | 40400 | 81200 | 118000 | 1 |
Spanish | 34000 | 53100 | 0.562 | 31000 | 40000 | 76800 | 96400 | 1 |