For my fifth project I analyzed a classification dataset I found on Kaggle for detecting fraudulent credit card transactions. I chose this data because of the tiny percentage of transactions (0.06%) that need to be correctly classified as fraud, as well as the volume of data provided. This scenario is one I had yet to encounter, and it is a bit out of my wheelhouse since I have less experience in this industry than in my previous projects. Naturally, I needed to do additional research to better understand the landscape, the potential pitfalls, and the industry standards. This project was very iterative, and I can only imagine how much longer it would have taken without that research.
The data I used in this project was actually fairly clean, which is to be expected given that credit card companies want to protect their customers' anonymity. They are willing to provide this data, but it arrives very clean because they have already worked with it before releasing it for analysis, which is not normally the case. Data cleaning is a big part of any data scientist's process on a project, so not having much to do at this stage threw me off at first, although I was very appreciative after a few moments. In this case, the iterations required to properly classify the extremely small percentage of fraudulent charges more than made up for not having to clean the data.
I started by checking whether there was any need for dimensionality reduction given the size of this dataset (284,807 rows, 31 columns). After checking different levels of PCA, I found that the variance explained was fairly insignificant even in extreme cases where nearly all of the features were removed. I ended up using all the features, and since the number of features would not exactly qualify the dataset as 'Big Data', the time needed to test each model was not excessive. I then checked for independence between variables using a correlation heat map in Seaborn, which I found especially helpful because all of the feature names had been anonymized. Even with extensive knowledge of the industry, I would not have been able to take any approach other than treating all features equally in both cases.
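Here is a minimal sketch of those two checks. It assumes the Kaggle file is saved locally as "creditcard.csv" with the usual column names for that dataset ("Class" as the target); the exact plotting choices are mine, not necessarily what I ran originally.

```python
# Sketch of the dimensionality-reduction and correlation checks described above.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("creditcard.csv")            # shape (284807, 31)
X = df.drop(columns="Class")
y = df["Class"]

# How much variance is retained at different numbers of components?
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)
print(pca.explained_variance_ratio_.cumsum())

# Correlation heat map to eyeball independence between the anonymized features
plt.figure(figsize=(12, 10))
sns.heatmap(X.corr(), cmap="coolwarm", center=0)
plt.show()
```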
Taking this same approach, I started modeling the data, beginning with a baseline model: a basic Decision Tree Classifier. It performed only slightly better than random, with an accuracy of about 68%, which left much room for improvement. I then switched my approach and used an exhaustive search over the decision tree's parameters to prune and tune the model and see what kind of improvement I could achieve. I did this with sklearn's GridSearchCV, which provided parameters that increased the model's accuracy to 99.95% on the training set and 99.94% on the testing set. This would be a tremendous feat in most cases, but in this scenario the high percentage is very misleading. As mentioned before, the fraudulent transaction rate is 0.06%, meaning that if the model simply marked ALL transactions as non-fraudulent, it would still have an accuracy of 99.94%. The reason I chose this dataset was to overcome that last 0.06%, and that means performing further iterations and testing different models that might be able to handle this issue.
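A rough sketch of the baseline tree and the exhaustive search follows; the parameter grid is illustrative rather than the exact one I used.

```python
# Baseline decision tree, then GridSearchCV over pruning/tuning parameters.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("creditcard.csv")
X, y = df.drop(columns="Class"), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Baseline: an untuned decision tree
baseline = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("baseline test accuracy:", baseline.score(X_test, y_test))

# Exhaustive search for better parameters (grid values are an assumption)
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)
print("train accuracy:", grid.score(X_train, y_train))
print("test accuracy:", grid.score(X_test, y_test))
```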
Next, I broke into my ensemble toolbox and modeled a Random Forest Classifier. As I am sure you are aware, this is a collection of decision tree classifiers that come together to provide better results by relying on the "wisdom of the crowd". In implementing this model, I saw that the baseline version started out at the same accuracy as the tuned Decision Tree Classifier (99.94%) on the training data. This was promising, and I again performed an exhaustive search for the model's optimal parameters and was able to increase the accuracy on my testing data. Unfortunately, the accuracy only rose to 99.96%, and at best the model caught about a third of the fraudulent transactions, which is not acceptable and not something I want to attach my name to. Given what I have learned about classification models, I knew that getting past this point was going to be very hard, although thanks to all the data scientists who came before me, there was a model that can handle this task readily available to me.
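A similar sketch for the random forest, reusing the train/test split from the previous snippet. The important extra step is checking recall on the fraud class, since plain accuracy hides how many fraudulent rows are actually caught; the grid values are again only placeholders.

```python
# Random forest with a small illustrative grid, plus fraud-class recall.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score

rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [100, 300], "max_depth": [10, None]},
    cv=5, scoring="accuracy", n_jobs=-1)
rf_grid.fit(X_train, y_train)

y_pred = rf_grid.predict(X_test)
print("test accuracy:", rf_grid.score(X_test, y_test))
print("fraud recall:", recall_score(y_test, y_pred))   # fraction of fraud caught
print(confusion_matrix(y_test, y_pred))
```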
Finally, I broke out my secret weapon: AdaBoost. As a refresher, this model fits a sequence of weak learners and, after each round, puts higher weight on the observations that were misclassified, so that the next learners focus on the hard cases. In this way, it is able to pick up the transactions the previous models were unable to classify. The proof of this is that the baseline model's accuracy (83.80%) jumped all the way up to 100.00% after tuning with its optimal parameters. While this was very promising, I was instantly fearful of overfitting and expected to see a huge drop in accuracy when introducing my testing data. Fortunately, this was not the case: the accuracy held on both the testing data and new data the model had yet to see. I mentioned earlier that the dataset had over 250,000 rows, and after my research I knew I was going to be detecting an extremely small amount of fraudulent activity, so I did not use all the rows when training/testing. I wanted to be sure the model maintained its integrity when new data was introduced, as would be the case once the model was deployed.
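A final sketch for the boosted model, again reusing the earlier split. By default AdaBoost in sklearn boosts shallow decision-tree stumps; the grid values here are assumptions for illustration.

```python
# AdaBoost with an illustrative grid, checking accuracy and fraud recall.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import recall_score

ada_grid = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    {"n_estimators": [50, 200, 500], "learning_rate": [0.1, 0.5, 1.0]},
    cv=5, scoring="accuracy", n_jobs=-1)
ada_grid.fit(X_train, y_train)

print("train accuracy:", ada_grid.score(X_train, y_train))
print("test accuracy:", ada_grid.score(X_test, y_test))
print("fraud recall:", recall_score(y_test, ada_grid.predict(X_test)))
```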
I am happy to say that my model held up quite well when new data was introduced, with no drop in accuracy. I am glad I took the time to do my research in this field, as there were plenty of opportunities to make novice mistakes and ultimately waste my time spinning my wheels. This project was the most iterative since my initial project, and while that is not as rewarding, it is part of the process and, if done correctly, is not nearly as frustrating as my first project was. I hope you were able to get value out of this blog and gain a better understanding of classification problems where the outcome of interest makes up only a tiny fraction of the dataset.