- ML Formulation of the business problem
- Dataset Analysis
- Exploratory Data Analysis
- Feature Engineering | Data preparation
- Custom Stacking Classifier
- Model Deployment
- Future Improvements
This was a competition conducted on Kaggle. The objective of this competition was to build an ML model that will predict an employee’s access needs, given his/her job role.
When a new employee joins the company, he or she needs access to a variety of systems and portals at different levels, depending on designation, business unit, role, and so on. The employee then goes to a supervisor or manager to get that access approved, and a knowledgeable supervisor takes time to grant it manually in order to overcome access obstacles. As employees move throughout a company, this access discovery/recovery cycle wastes a nontrivial amount of time and money. Throughout, employees are allowed or denied access to resources manually.
The aim of this case study is to minimize the human involvement required to grant or revoke employee’s access needs so that manual access transactions are minimized as the employee’s attributes change over time.
2. ML Formulation of the business problem
This is a binary classification problem. The model takes an employee’s information and a resource code and predicts whether access should be granted: ‘1’ if access is granted and ‘0’ otherwise.
- Misclassification can be costly. Giving an employee unauthorized access to restricted company portals or resources can be dangerous.
- There is no low-latency requirement. It is okay if the model takes some time (a few seconds or minutes) to give its prediction.
The performance metric used for this case study is Area under the ROC curve (AUC).
AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0, and one whose predictions are 100% correct has an AUC of 1.0.
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
- True Positive Rate (TPR): the fraction of all actual positive samples that the model correctly predicts as positive, TP / (TP + FN).
- False Positive Rate (FPR): the fraction of all actual negative samples that the model incorrectly predicts as positive, FP / (FP + TN).
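To make the metric concrete: AUC can also be read as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python on made-up labels and scores; the function name `roc_auc` is my own, and in practice a library routine such as sklearn's `roc_auc_score` would normally be used instead.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy labels and scores (made up for illustration).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 0.75: three of the four pairs are ordered correctly
```

The pairwise loop is quadratic, which is fine for illustration but too slow for large datasets.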
3. Dataset Analysis
Data for this case study is real historical data, collected from 2010 & 2011. We are provided with two .csv files- train.csv & test.csv. You can download the data from here.
Train.csv is used for training models. Each row has the ACTION (ground truth), RESOURCE, and information about the employee’s role at the time of approval. Test.csv is the test set for which predictions should be made. Each row asks whether an employee having the listed characteristics should have access to the listed resource.
Now, let’s analyze train.csv by loading it into a pandas DataFrame.
There are 32,769 rows and 10 columns (9 features and 1 class label). All features are categorical and are stored as integer IDs (‘int’ data type). There are no NULL/NaN values. Let’s look at each column separately.
- ACTION : It is the class label(not available for test data). ACTION is 1 if the resource was approved, 0 if the resource was not.
- RESOURCE : An ID for each resource. We have 7518 unique IDs for this feature in train dataset.
- MGR_ID : The EMPLOYEE ID of the manager of the current EMPLOYEE ID record; an employee may have only one manager at a time. Unique ids = 4243.
- ROLE_ROLLUP_1: Company role grouping category id 1 (e.g. US Engineering). Unique ids = 128.
- ROLE_ROLLUP_2 : Company role grouping category id 2 (e.g. US Retail). Unique ids = 177.
- ROLE_DEPTNAME : Company role department description (e.g. Retail). Unique ids = 449.
- ROLE_TITLE : Company role business title description (e.g. Senior Engineering Retail Manager). Unique ids = 343.
- ROLE_FAMILY_DESC : Company role family extended description (e.g. Retail Manager, Software Engineering). Unique ids = 2358.
- ROLE_FAMILY : Company role family description (e.g. Retail Manager). Unique ids = 67.
- ROLE_CODE : Company role code; this code is unique to each role (e.g. Manager). Unique ids = 343.
4. Exploratory Data Analysis(EDA)
4.1 Distribution of class label
Firstly let’s see the frequency of occurrence of both class labels.
It is quite clear from the above pie chart that the data for this problem is highly imbalanced. Around 94.2% of total requests made by employees are getting approved.
I haven’t used any data balancing technique in this case study: I didn’t want to introduce synthetic data by oversampling, and I didn’t want to lose information by undersampling. Instead, I used stratified sampling when splitting the data into train and cross-validation sets for hyperparameter tuning.
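As a rough illustration of what stratified sampling does, the sketch below splits indices class by class so that both parts keep about the same approval rate. It is a dependency-free stand-in written for this explanation, not the code used in the case study; the function name, the 20% validation fraction, and the synthetic labels are all assumptions. With scikit-learn, `train_test_split(..., stratify=y)` or `StratifiedKFold` serve the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx) so each class keeps roughly its share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Synthetic labels with roughly the same 94/6 imbalance as ACTION (not the real data).
labels = [1] * 940 + [0] * 60
train_idx, val_idx = stratified_split(labels)
val_rate = sum(labels[i] for i in val_idx) / len(val_idx)
print(len(train_idx), len(val_idx), val_rate)
```

A plain random split could, by chance, leave very few rejected requests in the validation set; splitting per class avoids that.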
4.2 Univariate Analysis of each feature
There are certain categories with more requests than others, but the distribution/density of the two class labels is very similar.
Now, let’s see the percentage of requests getting approved for each category by plotting stacked bar plots. The orange region indicates the approved requests out of total requests. As we can see from the images below, most requests are approved in every category.
MGR_ID has 4243 unique categories. Some IDs had a lower percentage of requests approved. IDs ‘54618’ and ‘19832’ in particular had an approval rate of only around 50%, which is much lower than the overall average (94%).
ROLE_CODE and ROLE_TITLE have the same number of unique categories (343), and EDA of these two features also showed some similarity between them. Their stacked bar plots looked exactly the same, and the numbers (avg, total_requests, approved_requests) for some of the ROLE_CODEs and ROLE_TITLEs were also the same. They could just be different names (code and title) given to the same role.
In ROLE_ROLLUP_1, 21,407 of the 32,769 requests (about 65%) came from a single ID, ‘117961’.
I performed a similar analysis for all the features; you can check the entire EDA in my GitHub repository.
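One way to test the suspicion that ROLE_CODE and ROLE_TITLE are two names for the same role is to check whether the two columns map one-to-one. The sketch below does this in plain Python on made-up rows; the helper name `is_one_to_one` and the ID values are hypothetical, and the result on the real train.csv is not claimed here.

```python
from collections import defaultdict


def is_one_to_one(rows, col_a, col_b):
    """True if every value of col_a pairs with exactly one value of col_b and vice versa."""
    a_to_b, b_to_a = defaultdict(set), defaultdict(set)
    for row in rows:
        a_to_b[row[col_a]].add(row[col_b])
        b_to_a[row[col_b]].add(row[col_a])
    return all(len(v) == 1 for v in a_to_b.values()) and all(
        len(v) == 1 for v in b_to_a.values()
    )


# Made-up rows; the IDs are placeholders, not values from train.csv.
rows = [
    {"ROLE_TITLE": 101, "ROLE_CODE": 9001, "ROLE_FAMILY": 7},
    {"ROLE_TITLE": 102, "ROLE_CODE": 9002, "ROLE_FAMILY": 7},
    {"ROLE_TITLE": 101, "ROLE_CODE": 9001, "ROLE_FAMILY": 8},
]
print(is_one_to_one(rows, "ROLE_TITLE", "ROLE_CODE"))    # True
print(is_one_to_one(rows, "ROLE_TITLE", "ROLE_FAMILY"))  # False
```

If the check returns True on the full dataset, one of the two columns carries no extra information and can be dropped.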
4.3 Checking Correlation between features
Let’s first check the dependency between each feature and the class label. Since all the features in this case study are categorical, I used the Phi_K Correlation Analyzer library to check dependency. Phi_K is a correlation coefficient based on several refinements to Pearson’s hypothesis test of independence of two variables. It has the following advantages:
- It works consistently between categorical, ordinal and interval variables.
- It captures non-linear dependency.
- It reverts to the Pearson correlation coefficient in case of a bi-variate normal input distribution.
You can read more about Phi_K here. A value closer to ‘1’ indicates higher dependency. As you can see from the image below, all values are low, so nothing conclusive can be said about which feature is most strongly associated with the class label.
Now, let’s check the correlation between features by plotting a heatmap.
ROLE_TITLE and ROLE_CODE have a very high value (0.95), indicating they are highly correlated, which is not ideal for our model. As discussed earlier, they are very similar, so keeping both will not improve the performance of the model; if we want, we can drop either one of them in later steps.
There seems to be no correlation between the rest of the features, so we are good to go to the feature engineering step.
5. Feature Engineering | Data preparation
5.1 One-Hot Encoding
One-hot encoding is a representation of categorical variables as binary vectors. It is appropriate for categorical data where no ordinal relationship exists between categories. Each categorical variable is represented by a binary vector that has one element for each unique category; the element for the observed category is set to 1 and all other elements to 0. Each feature is thus replaced with an n-dimensional vector (n = number of unique categories).
I used sklearn’s OneHotEncoder to convert each feature separately, and then stacked the results together to get a 15626-dim sparse vector (we have 9 features with 15626 unique categories in total).
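The idea can be sketched without any library: give every (column, category) pair its own output position, then represent a row by the positions of its 1s. This is only a toy illustration of what an encoder like sklearn's OneHotEncoder does, not its implementation; the function names and the ID values are made up. Unseen categories are simply skipped here, similar in spirit to OneHotEncoder's `handle_unknown="ignore"` option.

```python
def fit_one_hot(rows):
    """Assign one output index to every (column, category) pair seen in the data."""
    n_cols = len(rows[0])
    index = {}
    for col in range(n_cols):
        for value in sorted({row[col] for row in rows}):
            index[(col, value)] = len(index)
    return index


def transform_one_hot(row, index):
    """Sparse representation: sorted positions of the 1s (at most one per column)."""
    return sorted(
        index[(col, value)] for col, value in enumerate(row) if (col, value) in index
    )


# Made-up rows with two ID columns (think RESOURCE, MGR_ID); the values are placeholders.
rows = [(10, 500), (20, 500), (30, 600)]
index = fit_one_hot(rows)
print(len(index))                           # 5 = 3 + 2 unique categories
print(transform_one_hot((20, 600), index))  # [1, 4]
```

The output width is the sum of the unique category counts over all columns, which is how 9 features turn into 15626 dimensions.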
The one-hot encoding technique has two main drawbacks:
- For high-cardinality variables (those with many unique categories, as in our case), the dimensionality of the transformed vector becomes unmanageable.
- The mapping is completely uninformed: “similar” categories are not placed closer to each other in embedding space.
To overcome these drawbacks, let’s try the ‘learned embedding’ technique.
5.2 Learned Embedding
Embedding is another way of encoding categorical variables and, unlike one-hot encoding, the transformed vectors are not sparse. An embedding maps discrete categorical variables to vectors of continuous numbers.
Each category is mapped to a distinct vector, and the properties of the vector are learned while training a neural network. These are useful because they can reduce the dimensionality of categorical variables (from 15626 to 100 in our case) and meaningfully represent categories in the transformed space.
Here I used Keras’s built-in Embedding layer for this task.
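Conceptually, an embedding layer is a trainable lookup table with one row per category. The plain-Python sketch below shows only the lookup and its equivalence to multiplying a one-hot vector by a weight matrix; it does not train anything and is not the Keras layer itself. The table sizes, the initialisation range, and all function names are assumptions chosen for illustration.

```python
import random


def make_embedding_table(n_categories, dim, seed=0):
    """Randomly initialised table; a neural network would update these weights in training."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(n_categories)]


def embed(category_index, table):
    """An embedding lookup is just row selection."""
    return table[category_index]


def one_hot_times_table(category_index, table):
    """The same result written as (one-hot vector) x (weight matrix)."""
    one_hot = [1.0 if i == category_index else 0.0 for i in range(len(table))]
    dim = len(table[0])
    return [sum(one_hot[i] * table[i][j] for i in range(len(table))) for j in range(dim)]


# 5 categories mapped to 3-dimensional dense vectors (toy sizes).
table = make_embedding_table(n_categories=5, dim=3)
vector = embed(2, table)
print(len(vector))  # 3
```

Because the lookup replaces a wide sparse vector with a short dense one, the dimensionality no longer grows with the number of categories.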
On Original Data
Let’s first check the performance of various machine learning algorithms on raw data. Tree-based models (XGBoost, CatBoost, RandomForest) performed much better than the linear classifiers (LR, SVM) on raw data. I used GridSearchCV/RandomizedSearchCV with stratified sampling to tune the hyperparameters of each model.
On One Hot Encoded Data
Now, let’s check on one-hot encoded data. Every model performed well on this data, and Logistic Regression and SVM were comparatively better. Because the data is 15k-dimensional, hyperparameter tuning for XGBoost took some time to give optimal results; the rest of the models didn’t take much time.
Neural Network on Embedding
To check performance on embedded data, I created a simple neural network model with 3 dense layers and 2 dropout layers (for regularization).
Comparing performance of each model
After hyperparameter tuning various models on different datasets, let’s evaluate the models (with their best parameters) on the test dataset and submit the results to Kaggle.
These were the results I got on Kaggle-
The CatBoost model’s private score of 0.908 was the best among all models.
Let’s see which feature was the most important for prediction by plotting the feature importances of the CatBoost classifier.
The RESOURCE feature was the most important, followed by ROLE_DEPTNAME and MGR_ID.
Now, let’s build a custom ensemble model by combining the best-performing models, and see if we can get better results than with a single classifier.
7. Custom Stacking Classifier
Now, let’s build a custom stacking classifier using the following steps:
- Split the whole data into train and test sets (80–20).
- Split the 80% train set into D1 and D2 (50–50). From D1, sample with replacement to create d1, d2, d3, …, dk (k samples).
- Create k models and train each one on one of these k samples.
- Pass the D2 set to each of these k models to get k predictions for D2, one from each model.
- Using these k predictions, create a new dataset. Since we already know the corresponding target values for D2, we can train a meta-model on these k predictions.
- For model evaluation, use the 20% of the data kept aside as the test set. Pass the test set to each of the base models to get k predictions, create a new dataset from these k predictions, and pass it to the meta-model to get the final prediction. Using this final prediction and the targets for the test set, we can calculate the model’s performance score.
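The steps above can be sketched end to end in plain Python. To keep the example dependency-free, the base learners and the meta-model here are deliberately simple stand-ins (a per-category approval-rate model and a small gradient-descent logistic regression), not the CatBoost/XGBoost/RandomForest models from the case study; the synthetic data, k=4, and all function names are assumptions made for illustration.

```python
import math
import random


def fit_rate_model(rows, labels, col):
    """Stand-in base learner: smoothed approval rate per category of one column."""
    prior = sum(labels) / len(labels)
    counts = {}
    for row, y in zip(rows, labels):
        n, s = counts.get(row[col], (0, 0))
        counts[row[col]] = (n + 1, s + y)
    rates = {c: (s + prior) / (n + 1) for c, (n, s) in counts.items()}
    return lambda row: rates.get(row[col], prior)


def fit_logistic(X, y, epochs=200, lr=0.5):
    """Stand-in meta-model: logistic regression fitted by batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0

    def predict(xi):
        return 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))

    for _ in range(epochs):
        errors = [predict(xi) - yi for xi, yi in zip(X, y)]
        grad_w = [sum(e * xi[j] for e, xi in zip(errors, X)) / len(X) for j in range(len(w))]
        grad_b = sum(errors) / len(X)
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return predict


def fit_stacking(rows, labels, k=4, seed=0):
    """Split train into D1/D2, fit k base models on bootstrap samples of D1,
    then fit the meta-model on the base models' predictions for D2."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    half = len(idx) // 2
    d1, d2 = idx[:half], idx[half:]

    base_models = []
    for i in range(k):
        sample = [rng.choice(d1) for _ in d1]  # sampling with replacement
        sample_rows = [rows[j] for j in sample]
        sample_labels = [labels[j] for j in sample]
        base_models.append(fit_rate_model(sample_rows, sample_labels, col=i % len(rows[0])))

    meta_X = [[m(rows[j]) for m in base_models] for j in d2]
    meta_model = fit_logistic(meta_X, [labels[j] for j in d2])
    return lambda row: meta_model([m(row) for m in base_models])


# Synthetic data: two ID-like columns; approval depends on the first one (made up).
rng = random.Random(1)
rows = [(rng.randrange(5), rng.randrange(3)) for _ in range(500)]
labels = [0 if r[0] == 0 and rng.random() < 0.7 else 1 for r in rows]

split = int(0.8 * len(rows))  # 80-20 train/test split (rows are already in random order)
predict = fit_stacking(rows[:split], labels[:split])
test_scores = [predict(r) for r in rows[split:]]
print(len(test_scores), min(test_scores), max(test_scores))
```

The held-out scores in `test_scores` are what would be compared against the test targets to compute AUC. Because D2 is never seen by the base models during training, the meta-model learns from predictions that are not inflated by overfitting.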
Here, I tuned the number of base models as a hyperparameter instead of hard-coding it. For base models I picked CatBoost, XGBoost, Logistic Regression, and RandomForest classifiers, with Logistic Regression as the meta-classifier.
Let’s check Kaggle score-
Model performance was decent but not better than the CatBoost model. So we will go ahead with the CatBoost model as our final model for deployment.
8. Model Deployment
I used a Flask API to deploy the model on an AWS EC2 instance. You can check it out here.
Just fill in the employee’s information and resource ID and click on the ‘Register’ button. You will get a message informing you about the request.
9. Future Improvements
- I didn’t create any new features; I just transformed the original features using different encoding techniques. New features could be constructed by combining 2–3 original features, by using autoencoders, or by any other technique.
- An improved neural network model could be applied.
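As a small illustration of the first idea, the sketch below builds one new categorical feature for every pair of original columns by joining their IDs. It is only one possible way to combine features and was not part of this case study; the helper name, the column subset, and the ID values are hypothetical.

```python
from itertools import combinations


def add_pair_features(row, columns):
    """Append one combined categorical feature for every pair of original columns."""
    combined = dict(row)
    for a, b in combinations(columns, 2):
        combined[f"{a}__{b}"] = f"{row[a]}_{row[b]}"
    return combined


# Made-up row; the IDs are placeholders, not values from train.csv.
row = {"RESOURCE": 111, "MGR_ID": 222, "ROLE_DEPTNAME": 333}
features = add_pair_features(row, ["RESOURCE", "MGR_ID", "ROLE_DEPTNAME"])
print(features["RESOURCE__MGR_ID"])  # 111_222
print(len(features))                 # 6 = 3 original + 3 pairwise features
```

Each combined value can then be encoded like any other category, letting a linear model pick up interactions it could not express from the single columns alone.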