In our last blog, we briefly introduced statistical modelling (SM), which organisations used to transform data into business insights before machine learning (ML) came into the picture. Continuing our ML history blog series, this second article will shed some light on SM: more precisely, how it differs (if at all) from ML, and how businesses can decide which one is better suited to their needs. Both SM and ML are grounded in statistics, so we will first look at how each relates to statistics, and then compare their differences.
Here is the article outline; feel free to jump to the section that interests you most:
- ML is not just glorified Statistics
- SM is an approximation of reality
- Differences between SM and ML
- How to choose between SM and ML for your business?
- Automated and explainable ML
1. ML is not just glorified Statistics
Statistics is a mathematical science concerned with the collection, analysis, interpretation, and presentation of data. Since ML finds patterns in large amounts of data, it naturally builds on a statistical framework.
However, ML draws upon many other fields of mathematics and computer science, for example:
- ML theory (mathematics & statistics)
- ML algorithms (optimisation)
- ML implementations (computer science & engineering)
2. SM is an approximation of reality
SM uses a simple mathematical function to approximate reality and, optionally, to make predictions from that approximation. For example, if we want to show that the price of a house is related to its square footage, we may use a statistical model (e.g. Y = aX + b) to understand this relationship. We may collect data on 20 houses and test the repeatability of the relationship, so that we can accurately characterise it and make inferences.
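The house-price model above can be sketched in a few lines: a minimal ordinary least squares fit of Y = aX + b in plain Python. The 20 houses below are invented for the illustration, not real data.

```python
# Ordinary least squares fit of Y = aX + b for the house-price example.
# The data is synthetic, chosen only to illustrate the method.

def fit_line(xs, ys):
    """Return slope a and intercept b minimising squared error for Y = aX + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form OLS estimates: a = cov(X, Y) / var(X), b = mean_y - a * mean_x
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b

# Hypothetical data: square footage and price (in £1,000s) for 20 houses.
sqft  = [650, 720, 800, 850, 900, 950, 1000, 1100, 1150, 1200,
         1250, 1300, 1400, 1500, 1550, 1600, 1700, 1800, 1900, 2000]
price = [195, 210, 230, 240, 255, 265, 280, 305, 315, 330,
         340, 350, 375, 400, 410, 420, 450, 470, 495, 520]

a, b = fit_line(sqft, price)
print(f"price ≈ {a:.3f} * sqft + {b:.1f}")
```

Because the model has only two parameters, both are directly interpretable: `a` is the estimated price increase per extra square foot, and in a full statistical treatment we would also test its significance.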
3. Differences between SM and ML
The biggest difference between statistics and ML is their purpose. While statistical models are used to find and explain relationships between variables, ML models are built to provide accurate predictions without explicit programming. Although some statistical models can make predictions, their accuracy is usually limited because they cannot capture complex relationships in the data. ML models, on the other hand, can provide better predictions, but they are more difficult to understand and explain.
Statistical models explicitly specify a probabilistic model for the data and identify variables that are usually interpretable and of special interest, such as effects of predictor variables. In addition to identifying relationships between variables, statistical models establish both the scale and significance of the relationship.
By contrast, ML models are more empirical. ML usually does not impose relationships between predictors and outcomes, nor isolate the effect of any single variable. Let's go back to the house example. Previously we used statistical modelling to understand the relationship between the price and a specific variable: square footage. If we have data on 20 million houses with 200 features each, and our main goal is to predict house prices, we may use a machine learning model (e.g. a neural network) with all 200 variables. We may not understand the relationships between the variables or how the model arrives at its output, but what we are after is accurate predictions.
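To make the contrast concrete, here is a toy neural network, a single hidden layer trained by gradient descent, fitting a curved relationship (y = x²) that the straight-line model Y = aX + b cannot capture. This is a minimal sketch in plain Python, not a production framework, and the data is synthetic; note that the learned weights have no direct interpretation, which is exactly the trade-off described above.

```python
import math
import random

# A one-hidden-layer neural network fitted by stochastic gradient descent.
# It learns y = x^2 without being told the relationship in advance.
random.seed(0)
HIDDEN = 8
LR = 0.01

# Parameters: hidden-layer weights/biases and output weights/bias.
w1 = [random.uniform(-1, 1) for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [random.uniform(-1, 1) for _ in range(HIDDEN)]
b2 = 0.0

def predict(x):
    """Forward pass: tanh hidden layer, linear output."""
    hidden = [math.tanh(w1[j] * x + b1[j]) for j in range(HIDDEN)]
    return sum(w2[j] * hidden[j] for j in range(HIDDEN)) + b2, hidden

# Synthetic training data: y = x^2 on [-1, 1].
data = [(x / 10.0, (x / 10.0) ** 2) for x in range(-10, 11)]

def mse():
    return sum((predict(x)[0] - y) ** 2 for x, y in data) / len(data)

loss_before = mse()
for _ in range(2000):
    for x, y in data:
        out, hidden = predict(x)
        err = out - y  # gradient of squared error w.r.t. the output (up to a factor of 2)
        for j in range(HIDDEN):
            grad_hidden = err * w2[j] * (1 - hidden[j] ** 2)  # chain rule through tanh
            w2[j] -= LR * err * hidden[j]
            w1[j] -= LR * grad_hidden * x
            b1[j] -= LR * grad_hidden
        b2 -= LR * err
loss_after = mse()
print(f"MSE before training: {loss_before:.4f}, after: {loss_after:.4f}")
```

The network fits the curve well, yet inspecting `w1` and `w2` tells us nothing readable about how price relates to square footage; this is the interpretability cost of ML.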
The table below compares some key differences between SM and ML:

| | Statistical modelling | Machine learning |
| --- | --- | --- |
| Primary purpose | Explain relationships between variables | Make accurate predictions |
| Approach | Explicit probabilistic model; effects of interest are isolated | Empirical; relationships are not pre-specified |
| Interpretability | High: scale and significance of effects | Low: models are harder to explain |
| Typical data | Small datasets (e.g. an Excel file) | Large datasets with many features |
4. How to choose between SM and ML for your business?
In the business world, data analytics requires an in-depth understanding of business problems as well as highly accurate predictions. To achieve the desired results, companies need to know which situation calls for which model. This depends on the input data (such as data type and volume), the importance of understanding the relationships between variables, and ultimately on the decisions to be taken.
You can apply SM when:
- You understand specific interaction effects between variables. You have prior knowledge about their relationships; for example, before you analyse weight and height, you know there is a positive linear relationship between the two.
- Interpretability is important. You have to comply with strict regulations that require you to understand exactly how the models work, especially when a decision affects a person's life.
- Your data is small. You can observe and process the dataset personally; for example, it fits in an Excel file.
For example, hospitals want to identify people at risk of an emergency hospital admission. Understanding the characteristics of these patients is very important: this information is useful for designing intervention strategies to improve their care outcomes. For instance, with patient data from five Primary Care Trusts in England, an analyst may choose SM to prioritise patients for preventive care.
You can apply ML when:
- High predictive accuracy is your goal. For example, if you work in an insurance company, you don't want to accept a fraudster's false claim and pay it out, so you want your model for predicting fraudulent claims to be as accurate as possible.
- Interpretability is less important. You do not care much about why a decision was made. Being able to understand the model is ideal, but not a must.
- Your data is big. You won't be able to process the data in person; for example, individual patients' charts and complex information about diagnoses, treatments, medications and more.
ML is very good at capturing interactions that are not pre-specified. For example, companies usually have a massive customer database with hundreds of variables, without knowing which variables define a certain type of customer. To segment customers into different types for personalised marketing, the model needs to predict the segment membership of individuals with high accuracy. In this case, an ML model is the better choice.
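One common ML approach to this kind of segmentation is clustering, for instance k-means, which groups customers without being told what the segments are. The sketch below runs a minimal k-means on invented two-feature customer data (annual spend and visit frequency); the numbers are assumptions chosen purely to illustrate the idea.

```python
import random

# Minimal k-means clustering for customer segmentation.
# Synthetic data: two deliberately distinct groups of customers, described by
# (annual spend in £, visits per year). Invented for the sketch.
random.seed(1)
customers = [(random.gauss(200, 30), random.gauss(5, 1)) for _ in range(50)] + \
            [(random.gauss(900, 50), random.gauss(20, 2)) for _ in range(50)]

def kmeans(points, k, iters=20):
    """Lloyd's algorithm: alternate assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                      + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans(customers, k=2)
print("segment centroids:", centroids)
```

The algorithm recovers the two customer segments from the data alone; with hundreds of real features, the same idea scales, at the cost of the segments being harder to describe in human terms.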
5. Automated and Explainable ML
SM has a long history, while ML is still evolving. We have talked about ML's weaknesses: limited explainability, heavy data pre-processing requirements, and the need for an expensive and scarce expert team. Today, more advanced ML has managed to tackle these barriers, enabling more real-world ML applications.
TurinTech’s Evolutionary AutoML empowers people with different skill sets to automatically build accurate and explainable ML models. TurinTech automates the end-to-end ML process, accelerating work that usually takes months down to days and removing technical entry barriers for citizen data scientists. From marketers to engineers to data scientists, everyone can build expert-level ML models with ease. Powered by our proprietary research in evolutionary optimisation, models evolve hundreds of times towards optimal results based on customised criteria. Both SM and ML models are created and ranked by user-defined criteria, so users can simply choose the one they prefer. In addition to this transparent process, the TurinTech platform provides easy-to-understand explanations of why and how models make certain predictions.