I am sure that at least once in a data scientist's career, you will work with Microsoft's Northwind database in some capacity. For me, it was a statistical analysis of the data built around four hypotheses. Only one hypothesis was given; the other three were left up to me, which I am sure will not be the case once I am employed as a data scientist. I thoroughly enjoyed the freedom of this project, as I saw it as an opportunity to showcase both my ability to produce an answer for any project given and my ability to surface insights that may be going unnoticed. The latter is of higher value in my eyes: it not only shows an "above and beyond" attitude, it opens up opportunities for future projects that will hopefully involve my input.
Initially, I was more concerned about digesting and implementing all of the statistical methods I had learned leading up to this project; ironically, it was an issue with the database syntax that turned out to be my most frustrating problem. While this was not ideal, I am grateful for the experience, as it lifted the tunnel vision I had and helped build confidence in everything I had learned. With that said, here is a high-level overview of my second project as a data scientist.
Before I get into each hypothesis's overview, I want to go over the tools I used throughout this project. Using Python, I accessed the database via SQLite and used SQL commands to join tables as needed. Any alteration to the data frame after the initial extraction from the database was done using pandas, as I prefer it and feel significantly more comfortable manipulating data with it. All visualizations were done via matplotlib; I normally prefer seaborn, but since the visualizations were not going to be used in a presentation afterward, I stuck with matplotlib. The simple calculations were done with either numpy or Python's built-in functions such as sum() and abs(), while all statistical analyses were done with scipy functions that are covered in more detail in each hypothesis's overview.
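To make that workflow concrete, here is a minimal sketch of the extraction step. The file and table names are assumptions based on a common distribution of the Northwind SQLite file, so adjust them to match your own copy:

```python
import sqlite3
import pandas as pd

# Assumed file name; common copies ship as Northwind_small.sqlite.
conn = sqlite3.connect("Northwind_small.sqlite")

# Pull a table into a DataFrame with a plain SQL query; all further
# wrangling happens in pandas rather than SQL.
order_details = pd.read_sql_query("SELECT * FROM OrderDetail;", conn)
print(order_details.head())

conn.close()
```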
The null hypothesis given to me to test was: "Discounted goods do NOT have a statistically significant effect on the quantity ordered." This was a fairly straightforward H0 to test and required only one table from the SQL database to gather all the data needed. For this initial H0, I gathered the information by accessing the ProductDetail table, which had data on whether a product was discounted, to what degree, and the quantity ordered. Using a simple for loop, I extracted the quantity ordered for discounted orders and non-discounted orders and placed them in separate lists.

These two lists were then plotted as CDFs on the same axes and compared with the two-sample KS test. The small KS statistic (0.12) suggested the two distributions were similar in shape (p-value = .0476), so I felt comfortable comparing them without leaning on the normality assumption.

I was then able to implement a Monte Carlo method. The reason I say "a" Monte Carlo method is the vague outline of the method, which I actually enjoyed. The freedom to design the method around the data at hand is really appealing and, quite frankly, empowering. I will not lie: I went back and studied the method from multiple sources just to increase my level of confidence before creating and implementing it. It was definitely a high point in the project and, if I am being honest, a confidence booster. If you want the exact method used, I have included a link here to the project where you can take a look for yourself. My Monte Carlo method ran 10,000 times and produced the mean quantity for both discounted and non-discounted purchases (27.09 units and 21.71 units, respectively).

Using Cohen's D to get a standardized effect size of 8.91, it is clear that discounts have a HUGE effect on quantity sold. With this information, we are able to reject the H0 and accept the H1: "Discounted goods DO have a statistically significant effect on the quantity ordered." Finally, I ran an ANOVA to answer a secondary question asked of me: "If the H0 is rejected, which discounts provide the most effect?" I categorized the Discount column and set the target column to Quantity. The results indicated that the most influential discounts were 5%, 15%, and 25%, although it is worth mentioning that all discounts above 5%, excluding 6%, were almost as effective as those three.
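Since the exact notebook is linked above, I will not reproduce it here, but a sketch of the general shape looks something like the following. The data below is placeholder data standing in for the two quantity lists, and this is my general approach rather than a line-for-line copy of the notebook:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Placeholder data standing in for the discounted / non-discounted
# quantity lists pulled from the database (not the real values).
discounted = rng.normal(27, 19, 800).clip(min=1)
full_price = rng.normal(22, 18, 1300).clip(min=1)

# Two-sample KS test compares the two empirical CDFs directly.
ks_stat, ks_p = stats.ks_2samp(discounted, full_price)

def bootstrap_means(sample, n_runs=10_000):
    """Monte Carlo (bootstrap) estimate of the sampling distribution of the mean."""
    return np.array([rng.choice(sample, size=len(sample)).mean()
                     for _ in range(n_runs)])

def cohens_d(a, b):
    """Standardized effect size using the pooled standard deviation."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

disc_means = bootstrap_means(discounted)
full_means = bootstrap_means(full_price)
print(ks_stat, ks_p, disc_means.mean(), full_means.mean())
print(cohens_d(discounted, full_price))

# The follow-up ANOVA uses scipy.stats.f_oneway, passing one array of
# quantities per discount level: stats.f_oneway(*groups_by_discount).
```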
The initial hypothesis led to my second H0: "Discounts do NOT have a statistically significant effect on the average (mean) revenue per order." My initial thought was that I would reject the H0, since the average revenue brought in should decrease due to the discount provided. I used the exact same process as above to test this H0, although, to my dismay, the effect size was very small (Cohen's D = 0.12) and the difference in average revenue per type of order, as generated by the Monte Carlo method, was 27 cents. While it is never ideal to fail to reject the H0, ultimately you have to listen to the data and the results it provides.
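The only real change from the first test was the metric being resampled. Assuming the standard Northwind column names (UnitPrice, Quantity, Discount), the revenue computation is a quick pandas step; this sketch works at the order-line level, and aggregating to the order level is one groupby away:

```python
# Assumed column names from the order details table; revenue reflects
# the discount actually applied to each line.
order_details["Revenue"] = (order_details["UnitPrice"]
                            * order_details["Quantity"]
                            * (1 - order_details["Discount"]))

# Split into the two groups, then feed them to the same Monte Carlo
# and Cohen's D routines as before.
rev_disc = order_details.loc[order_details["Discount"] > 0, "Revenue"]
rev_full = order_details.loc[order_details["Discount"] == 0, "Revenue"]
```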
The third H0, "Customers do NOT have a statistically significant bias towards purchasing discounted goods," was actually the most challenging. This was mainly due to the database committing the cardinal sin of naming one of its tables after a SQL keyword (Order). As much as I wish I could say I knew exactly what the issue was right off the bat, I explored every other "solution" before it finally clicked. Adding to the degree of difficulty, this was the first time I needed to join tables via SQL commands; it is not that the join itself is difficult, but that it was harder to trust the results given the previous mishap. Ultimately, I was able to confirm that the join was successful, and my confidence in the results I was about to produce was restored.

The process I used for this hypothesis was similar to the previous two, with the following differences. I created a list of each customer's discounted and non-discounted purchasing percentages. Then, using the KS test to compare the discounted percentages against a normal distribution, I was able to reject the KS null hypothesis that the two distributions are the same (KS statistic: .52) with a high degree of certainty (p-value ≈ 0). This indicated that the distribution of discounted percentages was skewed. After plotting a histogram of the data, it was clear that it was negatively skewed, which indicates that customers DO have a bias towards discounted goods. It is worth mentioning that the histogram showed over 30 customers who ONLY bought discounted goods. This was also confirmed by Cohen's D (1.55), indicating a very large effect. With this, I was able to reject the H0 and accept the H1: "Customers DO have a statistically significant bias towards purchasing discounted goods." These results confirmed my initial assumption, although I must say I was not expecting such a large effect on customers' purchasing habits.
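For anyone who hits the same wall: the fix is simply to quote the reserved table name. Here is a sketch of the join and the normality test, again with file, table, and column names that are assumptions based on a common copy of the database:

```python
import sqlite3
import pandas as pd
from scipy import stats

conn = sqlite3.connect("Northwind_small.sqlite")  # assumed file name

# "Order" collides with the ORDER BY keyword, so the table name must be
# quoted (double quotes in SQLite; square brackets also work).
query = """
SELECT o.CustomerId, od.Discount, od.Quantity
FROM "Order" AS o
JOIN OrderDetail AS od ON od.OrderId = o.Id;
"""
purchases = pd.read_sql_query(query, conn)
conn.close()

# Share of each customer's purchases that were discounted.
pct_disc = (purchases.assign(Discounted=purchases["Discount"] > 0)
                     .groupby("CustomerId")["Discounted"].mean())

# One-sample KS test against a normal distribution fitted with the
# sample's own mean and standard deviation.
ks_stat, p_val = stats.kstest(pct_disc, "norm",
                              args=(pct_disc.mean(), pct_disc.std()))
print(ks_stat, p_val)
```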
The last H0, "The perfect order fulfillment is below the industry standard index of 90%," was actually very simple, but it addressed something that had started to concern me as I explored the database: in the Products table, many products' "Units In Stock" were significantly lower than their "Units On Order." After calculating the percentage of units on backorder, the results reinforced that concern: 38% of all units were on backorder. Given this, I was certain that more suppliers were needed to ensure products are delivered on time, so as not to impact the customer experience negatively.

The perfect order fulfillment (POF) index is calculated with the following formula: (proportion of orders delivered on time) x (proportion of orders delivered complete) x (proportion of damage-free orders) x (proportion of orders with accurate documentation) x 100. The first two proportions were calculable from the database, and they produced a POF index of 95%. This is well above the industry standard and allows us to reject the H0, with room to spare in case some orders lacked proper documentation or were not damage-free. Although my initial insight was off, the fact that a healthy percentage of orders are delivered on time is ultimately a good thing. The biggest takeaway from this hypothesis is that more data is needed to accurately calculate this very common supply chain statistic, and that data should be collected going forward.
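For completeness, the calculation itself is tiny. The two proportions the data cannot support are assumed to be 100%, which is an assumption worth stating out loud, and the numbers below are illustrative rather than the exact proportions from the project:

```python
def perfect_order_fulfillment(on_time, complete, damage_free=1.0, documented=1.0):
    """Perfect Order Fulfillment index as a percentage.

    All four arguments are proportions in [0, 1]. The last two default
    to 1.0 because the Northwind data has no damage or documentation
    records -- an assumption, not a measurement.
    """
    return on_time * complete * damage_free * documented * 100

# Illustrative proportions only; the real values would come from the
# orders data, e.g. comparing ShippedDate to RequiredDate for on-time
# delivery and counting fully shipped orders for completeness.
print(perfect_order_fulfillment(on_time=0.96, complete=0.99))  # ~95.0
```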
This project had a lot of variety, and it was very fulfilling to handle the different obstacles that presented themselves. Using different statistical functions based on the circumstances at hand reinforced my confidence in the material learned throughout the lessons leading up to this project. The value I am able to provide is why I decided to become a data scientist, and I look forward to deepening my knowledge of all pertinent subject matters.