US College Majors Cluster Analysis

Introduction

Choosing a major is a difficult decision almost everyone has to make. This milestone will influence every student’s career trajectory, for better or worse.

Young people are encouraged to weigh several factors before making a final decision, such as:

Employment rates in the field
Advanced degree opportunities
Salary expectations
Overall program cost

Objective

Since it's a major life decision, we want to inform young people about the advantages and disadvantages they may encounter when selecting a specific major. To do so, we will compare the recommendations from three different methods for determining the optimal number of clusters, apply a k-means clustering analysis, and visualize the results.

Data

We will use data from a year-long PayScale Inc. survey of 1.2 million people who hold only a bachelor’s degree, which can be found here.

library(ggplot2)
library(tidyr)
library(dplyr)
library(readr)
library(cluster)
library(factoextra)

# Load the survey data and give the columns consistent names
majors <- read.csv('Us_college.csv')
col_names <- c('College.Major', 'Starting.Median.Salary','Mid.Career.Median.Salary',
               'Career.Percent.Growth', 'Percentile.10','Percentile.25',
               'Percentile.75','Percentile.90')
names(majors) <- col_names
print(head(majors))

##           College.Major Starting.Median.Salary Mid.Career.Median.Salary
## 1            Accounting             $46,000.00               $77,100.00
## 2 Aerospace Engineering             $57,700.00              $101,000.00
## 3           Agriculture             $42,600.00               $71,900.00
## 4          Anthropology             $36,800.00               $61,500.00
## 5          Architecture             $41,600.00               $76,800.00
## 6           Art History             $35,800.00               $64,900.00
##   Career.Percent.Growth Percentile.10 Percentile.25 Percentile.75 Percentile.90
## 1                  67.6    $42,200.00    $56,100.00   $108,000.00   $152,000.00
## 2                  75.0    $64,300.00    $82,100.00   $127,000.00   $161,000.00
## 3                  68.8    $36,300.00    $52,100.00    $96,300.00   $150,000.00
## 4                  67.1    $33,800.00    $45,500.00    $89,300.00   $138,000.00
## 5                  84.6    $50,600.00    $62,200.00    $97,000.00   $136,000.00
## 6                  81.3    $28,800.00    $42,200.00    $87,400.00   $125,000.00

Data Cleaning

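# Strip "$" and "," from the salary columns and convert them to numeric,
# then express career growth as a proportion instead of a percentage
majors_clean <- majors %>% 
    mutate_at(vars(Starting.Median.Salary:Percentile.90), 
              function(x) as.numeric(gsub("[\\$,]","",x))) %>%
    mutate(Career.Percent.Growth = Career.Percent.Growth/100)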
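As a quick check of the cleaning logic above, the following minimal sketch applies the same regular expression to a single salary string and recomputes the growth figure from the two medians. The sample values come from the Accounting row printed earlier; the variable names are illustrative only, and the growth formula is an assumption inferred from the numbers rather than something the survey documents.

```r
# Strip "$" and "," from a currency string, then convert to numeric
raw_salary <- "$46,000.00"
start <- as.numeric(gsub("[\\$,]", "", raw_salary))

# Mid-career median for the same major, parsed the same way
mid <- as.numeric(gsub("[\\$,]", "", "$77,100.00"))

# Assumed definition: growth = (mid - start) / start, as a percentage.
# For Accounting this matches the Career.Percent.Growth value shown above.
growth_pct <- round((mid - start) / start * 100, 1)
growth_pct
```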

Clustering Analysis

As a first step, we have to determine how many clusters to model. There are several ways to approach this, and no single method is reliable enough on its own. Hence we will use 3 techniques to choose the number of clusters:

Gap statistic method
Silhouette method
Elbow method

data <- majors_clean %>%
    select(Starting.Median.Salary, Mid.Career.Median.Salary, 
           Percentile.10, Percentile.90) %>% scale()
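K-means relies on Euclidean distance, so columns with larger magnitudes would dominate if left unscaled. The sketch below, on a made-up two-column matrix (not the majors data), illustrates what scale() does to each column.

```r
# Toy matrix: two columns on very different scales (illustrative values)
x <- cbind(start  = c(35000, 46000, 57000),
           growth = c(0.5, 0.7, 0.9))

# Subtract each column's mean and divide by its standard deviation
z <- scale(x)

colMeans(z)      # approximately 0 for each column
apply(z, 2, sd)  # 1 for each column
```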

Gap statistic method

Our first method compares the total within-cluster variation for different values of k against its expected value under a null hypothesis; the best k is the one that maximizes this gap. The null hypothesis here is a simulated reference dataset, uniformly distributed with no observable clusters, generated in alignment with the principal components of our original dataset.

gap_stat <- clusGap(data, FUN = kmeans, nstart = 25, K.max = 10, B = 50)

gap_static <- fviz_gap_stat(gap_stat)
gap_static

The visualization above shows the gap statistic for each candidate value of k, and it points to 3 as the ideal number of clusters for our dataset.
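To see the numbers behind a gap statistic plot, here is a minimal sketch on synthetic data (three made-up blobs, not the majors dataset). It assumes the cluster package is installed and uses its maxSE() helper with the "firstSEmax" rule to pick k; the toy data and variable names are illustrative.

```r
library(cluster)

set.seed(42)  # clusGap draws random reference datasets, so fix the seed

# Synthetic data: three well-separated 2-D blobs (illustrative only)
blob_centers <- rbind(c(0, 0), c(6, 6), c(0, 6))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, blob_centers[i, 1], 0.5),
        rnorm(30, blob_centers[i, 2], 0.5))
}))

gap_toy <- clusGap(toy, FUN = kmeans, nstart = 25, K.max = 6, B = 50)

# One row per k, with columns logW, E.logW, gap and SE.sim
gap_toy$Tab

# Choose k from the gap values and their simulation standard errors
k_hat <- maxSE(gap_toy$Tab[, "gap"], gap_toy$Tab[, "SE.sim"],
               method = "firstSEmax")
k_hat
```

Setting a seed before clusGap() is worth doing on real data as well, since the reference datasets are simulated and the recommended k can shift slightly between runs.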

Silhouette Method

This method evaluates cluster quality by how well each point fits within its cluster; the best k is the one that maximizes the average silhouette width.

silhouette <- fviz_nbclust(data, kmeans, method = "silhouette")
silhouette
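The average silhouette width can also be computed directly. The sketch below does so for k = 2 to 6 on synthetic data (three made-up blobs, not the majors dataset), assuming the cluster package is installed; names such as avg_sil are illustrative.

```r
library(cluster)

set.seed(42)

# Synthetic data: three well-separated 2-D blobs (illustrative only)
blob_centers <- rbind(c(0, 0), c(6, 6), c(0, 6))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, blob_centers[i, 1], 0.5),
        rnorm(30, blob_centers[i, 2], 0.5))
}))
d <- dist(toy)  # pairwise Euclidean distances

# Average silhouette width for each k (not defined for k = 1)
avg_sil <- sapply(2:6, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6
avg_sil

# The k with the largest average width is the one this method recommends
best_k <- as.integer(names(which.max(avg_sil)))
best_k
```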

Elbow Method

Finally, the elbow method, which is the most commonly used in this type of situation. This method plots the total within-cluster sum of squares (the variance left unexplained) against the number of clusters.

elbow <- fviz_nbclust(data, kmeans, method = "wss")
elbow

As you can see, there is a break point at k = 3, where the curve bends like an elbow. This marks the point beyond which adding more clusters no longer explains a significant amount of additional variance.
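The quantity behind an elbow plot is the total within-cluster sum of squares, which kmeans() returns as tot.withinss. A minimal base R sketch on synthetic data (three made-up blobs, not the majors dataset) shows how the curve can be computed by hand:

```r
set.seed(42)

# Synthetic data: three well-separated 2-D blobs (illustrative only)
blob_centers <- rbind(c(0, 0), c(6, 6), c(0, 6))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, blob_centers[i, 1], 0.5),
        rnorm(30, blob_centers[i, 2], 0.5))
}))

# Total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Share of the k = 1 value that remains at each k; the elbow is where
# this stops dropping sharply
data.frame(k = 1:8, wss = round(wss, 1), remaining = round(wss / wss[1], 3))
```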

K-means Algorithm

The three methods returned 2, 3 and 3, so we will work with 3 clusters, the value most of them agree on.

set.seed(354)
num <- 3
k_means <- kmeans(data, num, iter.max = 15, nstart = 25)

majors_labld <- majors_clean %>%
    mutate(clusters = k_means$cluster)
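Keep in mind that the cluster numbers returned by kmeans() are arbitrary and can change with the seed or between runs. One way to keep the numbering stable is to relabel clusters by a meaningful quantity. The sketch below, on made-up salary figures in thousands (not the survey data), renumbers clusters by their mean starting salary; all names are illustrative.

```r
set.seed(1)

# Toy salaries in thousands of dollars (illustrative values)
toy <- data.frame(start = c(35, 36, 38, 45, 47, 49, 58, 61, 63),
                  mid   = c(55, 57, 60, 78, 80, 85, 95, 103, 107))

fit <- kmeans(scale(toy), centers = 3, nstart = 25)

# Mean starting salary per original cluster label
cluster_means <- sapply(split(toy$start, fit$cluster), mean)

# Map each original label to its rank, so 1 = lowest starting salary
new_label <- rank(cluster_means)
toy$cluster <- unname(new_label[as.character(fit$cluster)])

# Cluster means now increase with the cluster number
tapply(toy$start, toy$cluster, mean)
```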

With the clusters assigned, let's start by looking at how each cluster compares in terms of starting vs mid career salaries.

career <- ggplot(majors_labld, 
                 aes(x=Starting.Median.Salary,y=Mid.Career.Median.Salary, 
                     color=factor(clusters))) + 
      geom_point(alpha=0.75,size=6) + 
      scale_color_manual(name="Clusters",values=c("#2A9D8F","#E9C46A", "#E76F51")) + 
      ggtitle('Clusters by Starting vs Mid Career Median Salaries') + 
      scale_x_continuous(labels = scales::dollar) + 
      scale_y_continuous(labels = scales::dollar) + 
      labs(x='Starting Median Salary', y='Mid Career Median Salary')
career

As shown in the visualization, the data follows a linear trend: the higher your starting salary, the higher your mid career salary tends to be. The clusters provide a level of delineation that also supports this.

Furthermore, we can see two outliers from clusters 1 and 3 that may be explained if we go deeper into our analysis and look at the mid career salary percentiles.
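The linear trend can also be checked numerically. As a rough sketch, the snippet below computes the correlation and a simple linear fit for the six majors printed in the Data section; running the same calls on the full majors_clean data frame would be the real check, and the variable names here are illustrative.

```r
# Starting and mid career median salaries for the first six majors
start <- c(46000, 57700, 42600, 36800, 41600, 35800)
mid   <- c(77100, 101000, 71900, 61500, 76800, 64900)

# Pearson correlation between the two columns
r <- cor(start, mid)
r

# Intercept and slope of a simple linear fit of mid on start
coef(lm(mid ~ start))
```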

percentiles <- majors_labld %>%
    select(College.Major, Percentile.10, Percentile.25, 
           Mid.Career.Median.Salary, Percentile.75, 
           Percentile.90, clusters) %>%
    gather(key=percentile, value=salary, -c(College.Major, clusters)) %>%
    mutate(percentile=factor(percentile,levels=c('Percentile.10','Percentile.25',
            'Mid.Career.Median.Salary','Percentile.75','Percentile.90')))
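The reshaping step above turns one row per major into one row per major and percentile. In current versions of tidyr, gather() is superseded by pivot_longer(); a minimal sketch of the equivalent call on a two-row sample from the data (assuming a tidyr version that provides pivot_longer()) looks like this:

```r
library(tidyr)

# Two majors in wide format, values taken from the rows printed earlier
wide <- data.frame(
  College.Major = c("Accounting", "Aerospace Engineering"),
  Percentile.10 = c(42200, 64300),
  Percentile.90 = c(152000, 161000)
)

# One row per major and percentile, like gather(key, value, -College.Major)
long <- pivot_longer(wide, cols = -College.Major,
                     names_to = "percentile", values_to = "salary")
long
```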

Cluster 1: Applied Sciences Pathway

This cluster seems to be characterized by job stability and sits in the middle of the road in our dataset, starting off neither too low nor too high in the lowest percentile. It also contains the majors with the greatest difference between the lowest and highest percentiles.

cluster_1 <- ggplot(percentiles[percentiles$clusters==1,], 
    aes(x=percentile,y=salary, 
    group=College.Major, color=College.Major)) +
    geom_point() +
    geom_line() +
    ggtitle('Cluster 1: Applied Sciences Pathway') +
    theme(axis.text.x = element_text(size=7, angle=25)) +
    labs(x='Percentile', y='Salary')
cluster_1
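The "difference between the lowest and highest percentiles" can be made concrete by subtracting the 10th from the 90th percentile for each major. The sketch below does this for a handful of majors taken from the table at the end of this post; the selection is arbitrary and only meant to illustrate the calculation.

```r
# 10th and 90th percentile mid career salaries for four majors
sub <- data.frame(
  College.Major = c("Accounting", "Agriculture", "Architecture", "Chemistry"),
  Percentile.10 = c(42200, 36300, 50600, 45300),
  Percentile.90 = c(152000, 150000, 136000, 148000)
)

# Spread between the top and bottom of the salary distribution
sub$spread <- sub$Percentile.90 - sub$Percentile.10

# Majors ordered from widest to narrowest spread
sub[order(-sub$spread), ]
```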

Cluster 2: Engineering and Math Pathway

If you are good at math and want financial security, you should consider one of these majors. These engineering majors represent the highest growth potential in the 90th percentile, as well as the best security in the 10th percentile rankings. We see one of the outliers, now identifiable as Physician Assistant, lagging in the highest percentiles.

cluster_2 <- ggplot(percentiles[percentiles$clusters==2,], 
    aes(x=percentile,y=salary, 
    group=College.Major, color=College.Major)) +
    geom_point() +
    geom_line() +
    ggtitle('Cluster 2: Engineering and Math Pathway') +
    theme(axis.text.x = element_text(size=7, angle=25)) +
    labs(x='Percentile', y='Salary')
cluster_2

Cluster 3: Art and Human Sciences Pathway

Not everything is science and math; there are other good majors with plenty of opportunities. If you are passionate about art or have the heart to help others, these are good options. However, the majors in this cluster have the lowest percentiles and limited growth opportunity.

Music is the riskiest major, with the lowest 10th percentile salary, while Drama has the highest growth potential in the 90th percentile for this cluster. Nursing is the outlier of this cluster, with a higher safety net from the lowest percentile up to the median.

cluster_3 <- ggplot(percentiles[percentiles$clusters==3,], 
    aes(x=percentile,y=salary, 
    group=College.Major, color=College.Major)) +
    geom_point() +
    geom_line() +
    ggtitle('Cluster 3: Art and Human Sciences Pathway') +
    theme(axis.text.x = element_text(size=7, angle=25)) +
    labs(x='Percentile', y='Salary')

cluster_3

Conclusions and recommendations

Working with unlabeled data in an unsupervised setting always requires skill and some creativity. We recommend using more than one method to find the ideal number of clusters, so the choice rests on more evidence than a single heuristic.

Yes, it is important to focus on high starting career salaries when choosing a major and to consider the growth potential. However, keep in mind that whether a major falls into the sciences, engineering, or arts and human sciences cluster, one’s financial destiny is influenced by numerous other factors, including the school attended and passion or talent for the subject.

If you are curious about the factors mentioned above, a similar analysis to evaluate them can be conducted on the additional data provided by the Wall Street Journal article, comparing salary potential by type and region of college attended.

Regardless of salaries, job opportunities and other factors, I want to remind you of the most important thing: Follow your passions!

Top College Majors Sorted by Career Percent Growth
College Major	Start Median Salary	Mid Median Salary	Percent Growth	Percentile 10	Percentile 25	Percentile 75	Percentile 90	Cluster
Accounting	46000	77100	0.676	42200	56100	108000	152000	3
Aerospace Engineering	57700	101000	0.750	64300	82100	127000	161000	2
Agriculture	42600	71900	0.688	36300	52100	96300	150000	3
Anthropology	36800	61500	0.671	33800	45500	89300	138000	1
Architecture	41600	76800	0.846	50600	62200	97000	136000	3
Art History	35800	64900	0.813	28800	42200	87400	125000	1
Biology	38800	64800	0.670	36900	47400	94500	135000	1
Business Management	43000	72100	0.677	38800	51500	102000	147000	3
Chemical Engineering	63200	107000	0.693	71900	87300	143000	194000	2
Chemistry	42600	79900	0.876	45300	60700	108000	148000	3
Civil Engineering	53900	90500	0.679	63400	75100	115000	148000	2
Communications	38100	70000	0.837	37500	49700	98800	143000	3
Computer Engineering	61400	105000	0.710	66100	84100	135000	162000	2
Computer Science	55900	95500	0.708	56000	74900	122000	154000	2
Construction	53700	88900	0.655	56300	68100	118000	171000	2
Criminal Justice	35000	56300	0.609	32200	41600	80700	107000	1
Drama	35900	56900	0.585	36700	41300	79100	153000	1
Economics	50100	98600	0.968	50600	70600	145000	210000	2
Education	34900	52000	0.490	29300	37900	73400	102000	1
Electrical Engineering	60900	103000	0.691	69300	83800	130000	168000	2
English	38000	64700	0.703	33400	44800	93200	133000	1
Film	37900	68500	0.807	33900	45500	100000	136000	1
Finance	47900	88300	0.843	47200	62100	128000	195000	3
Forestry	39100	62600	0.601	41000	49300	78200	111000	1
Geography	41200	65500	0.590	40000	50000	90800	132000	1
Geology	43500	79500	0.828	45000	59600	101000	156000	3
Graphic Design	35700	59800	0.675	36000	45500	80800	112000	1
Health Care Administration	38800	60600	0.562	34600	45600	78800	101000	1
History	39200	71000	0.811	37000	49200	103000	149000	3
Hospitality & Tourism	37800	57500	0.521	35500	43600	81900	124000	1
Industrial Engineering	57700	94700	0.641	57100	72300	132000	173000	2
Information Technology (IT)	49100	74800	0.523	44500	56700	96700	129000	3
Interior Design	36100	53200	0.474	35700	42600	72500	107000	1
International Relations	40900	80900	0.978	38200	56000	111000	157000	3
Journalism	35600	66700	0.874	38400	48300	97700	145000	1
Management Information Systems (MIS)	49200	82300	0.673	45300	60500	108000	146000	3
Marketing	40800	79600	0.951	42100	55600	119000	175000	3
Math	45400	92400	1.035	45200	64200	128000	183000	3
Mechanical Engineering	57900	93600	0.617	63700	76200	120000	163000	2
Music	35900	55000	0.532	26700	40200	88000	134000	1
Nursing	54200	67000	0.236	47600	56400	80900	98300	1
Nutrition	39900	55300	0.386	33900	44500	70500	99200	1
Philosophy	39900	81200	1.035	35500	52800	127000	168000	3
Physician Assistant	74300	91700	0.234	66400	75200	108000	124000	2
Physics	50300	97300	0.934	56000	74200	132000	178000	2
Political Science	40800	78200	0.917	41200	55300	114000	168000	3
Psychology	35900	60400	0.682	31600	42100	87500	127000	1
Religion	34100	52000	0.525	29700	36500	70900	96400	1
Sociology	36500	58200	0.595	30700	40400	81200	118000	1
Spanish	34000	53100	0.562	31000	40000	76800	96400	1