Employee Access Prediction with Machine Learning

Amazon Employee Access Challenge

‘ACTION’: ACTION is 1 if the resource was approved, 0 if the resource was not

‘RESOURCE’: An ID for each resource

Training Data
Test Data
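
A minimal sketch of loading the competition files, assuming the standard Kaggle file names train.csv and test.csv:

import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

y = train["ACTION"]                 # binary target: 1 = approved, 0 = not approved
X = train.drop(columns=["ACTION"])  # the categorical ID features

print(X.shape, test.shape)
print(y.value_counts(normalize=True))  # how balanced the approvals are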

I used Logistic Regression, Gradient Boosting, Decision Tree and Random Forest. Gradient Boosting, Decision Tree and Random Forest are tree-based models, which predict the value of the target variable by splitting the source set into successively smaller subsets. It turns out Logistic Regression gave me the best result, with a Kaggle submission score of 0.88, while the Decision Tree model gave the worst result.
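As a rough sketch of this comparison (reusing X and y from the loading step above; the hyperparameters are scikit-learn defaults, not the tuned values):

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=200),
    "Gradient Boosting": GradientBoostingClassifier(),
}

# Score each model with 5-fold cross-validated AUC.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")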

In addition, I also applied an ensemble method, combining the two best-performing models, the Gradient Boosting classifier and the Logistic Regression classifier, which did not give me a better score.
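One natural way to build such an ensemble is scikit-learn's VotingClassifier with soft voting, which averages the two models' predicted probabilities; this is a sketch of the idea rather than necessarily the exact setup I ran.

from sklearn.ensemble import VotingClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

ensemble = VotingClassifier(
    estimators=[
        ("gbc", GradientBoostingClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average class probabilities instead of hard votes
)

scores = cross_val_score(ensemble, X, y, cv=5, scoring="roc_auc")
print(f"Ensemble mean AUC: {scores.mean():.3f}")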

Because features strongly influence both the predictive model I use and the results I can achieve, I did some feature engineering on my data. Specifically, I tried three techniques: one-hot encoding, filtering out unimportant features, and hybrid features.

In order to do the one-hot encoding, I had to combine the training data and testing data, encode them together, and then split them apart again. After this step, I got a better result. However, one-hot encoding combined with the Decision Tree model is very time-consuming.
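A sketch of that step, assuming the test file's row identifier column is named "id" (as in the Kaggle files); fitting one encoder on the concatenated rows guarantees both halves end up with identical columns.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

n_train = len(X)
combined = pd.concat([X, test.drop(columns=["id"])], axis=0)  # "id" column name assumed

# Fit a single encoder on the combined rows, then split back into train and test.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(combined)   # sparse matrix, one column per category

X_encoded = encoded[:n_train]
test_encoded = encoded[n_train:]
print(X_encoded.shape, test_encoded.shape)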

Eliminating Unimportant Features
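One common way to do this, sketched below with an illustrative threshold, is to rank the features by Random Forest importance and drop the least important ones.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values()
print(importances)

# Keep only features above an (illustrative) importance threshold.
X_filtered = X[importances[importances > 0.05].index]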
Hybrid Features
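Hybrid features here are assumed to be combinations of the original categorical IDs; the sketch below pairs two columns as an example (MGR_ID is another ID column in the dataset, and the particular pairing is only illustrative).

# Concatenate two ID columns into one combined categorical feature.
X_hybrid = X.copy()
X_hybrid["RESOURCE_MGR"] = X["RESOURCE"].astype(str) + "_" + X["MGR_ID"].astype(str)
# The new column can then be one-hot encoded together with the originals.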

In order to get the best performance, I tried to optimize the estimators by tuning hyperparameters: the number of leaves or depth of a tree, the learning rate, the inverse of regularization strength, etc. These are "high-level" properties of the model set before the training process. The core of hyperparameter optimization is to run multiple trials in a single training job. For each trial, there is a complete run over the training set with the chosen hyperparameters sampled within a specified range. The search also keeps track of the result of each trial and makes adjustments. After all this, I get a complete summary of all the trials with the most effective hyperparameter values. With a set of optimal hyperparameters, I improved my learning algorithm. Instead of grid search, I used randomized search to do the optimization: the two explore the same space of parameters, but the running time for randomized search is much lower. Here is a sample of hyperparameter optimization code:
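(A sketch: the search ranges below are illustrative, not the exact ones from my runs; it tunes the Gradient Boosting classifier with RandomizedSearchCV.)

from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(2, 8),            # depth of each tree
    "learning_rate": uniform(0.01, 0.2),   # shrinkage applied at each boosting step
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    param_distributions=param_distributions,
    n_iter=20,              # number of random trials
    scoring="roc_auc",
    cv=5,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)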

Logistic Regression is a powerful tool for modeling a binomial outcome (0 or 1), or the probability of the default class. It explains the relationship between the dependent binary variable, which is ACTION in our dataset, and the other variables: it predicts the probability of occurrence of an event by fitting the data to a logit function.
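As a minimal illustration (C, the inverse of regularization strength mentioned earlier, is set to an arbitrary value here; X_encoded is the one-hot encoded matrix from above):

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=1.0, max_iter=1000)
lr.fit(X_encoded, y)
p_approved = lr.predict_proba(X_encoded)[:, 1]  # P(ACTION = 1) from the fitted logit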

On the other hand, the Decision Tree model tends to overfit: it creates over-complex trees that do not generalize beyond my training data. Furthermore, as shown above, my dataset has many categorical variables encoded as numerical values, and attributes with more levels may be favored because of the biased information gain in tree-based models.

In the future, to get better performance, I may…

If the result shows only 0s and 1s, there might be a problem with overfitting. Basically, my learning algorithm keeps developing hypotheses that reduce training-set error at the cost of increased test-set error. Since tree-based models tend to overfit the data compared with the Logistic Regression model, I tried to avoid relying on them. As for underfitting (when the model performs poorly even on the training data), adding new domain-specific features (feature combinations) or using another classifier may help to better capture the relationship between the input values and the target values.
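One quick check, offered here as a workflow sketch rather than something from the analysis above, is to compare the training score with a cross-validated score; a large gap between the two points to overfitting.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

tree = DecisionTreeClassifier().fit(X, y)
train_auc = roc_auc_score(y, tree.predict_proba(X)[:, 1])       # score on the data it saw
cv_auc = cross_val_score(tree, X, y, cv=5, scoring="roc_auc").mean()  # score on held-out folds
print(f"train AUC = {train_auc:.3f}, cross-validated AUC = {cv_auc:.3f}")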

ROC curves are used to assess the performance of my classifiers over their entire operating range. Across the ROC curves I got, the AUC value (a score measuring a model's performance) grew from 0.8 to 0.99.

Gradient Boosting Classifier ROC
Logistic Regression Classifier ROC
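For reference, a curve like the ones above can be produced on a held-out split roughly as follows (the split size and variable names are illustrative):

import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X_tr, X_val, y_tr, y_val = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_val)[:, 1]

fpr, tpr, _ = roc_curve(y_val, probs)
plt.plot(fpr, tpr, label=f"Logistic Regression (AUC = {roc_auc_score(y_val, probs):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance level
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()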
