Machine Learning - Model Training Tips

You don't really need to know the implementation details to build and configure a machine learning model. However, a good understanding of how things work can help you identify the appropriate model, the right training algorithm and a good set of hyperparameters for your task.

Bear in mind that training a model means searching for the combination of model parameters that minimizes the cost function (i.e. the error between the model's output and the training set).

Such understanding will also help you debug models and carry out error analysis more efficiently.

Linear regression is a linear model that can be trained in two main ways:

Closed-form equation: directly computes the model parameters that best fit the model to the training set, i.e. those that minimize the cost function over the training set.
Iterative optimization approach: Gradient Descent (GD) gradually tweaks the model parameters to minimize the cost function over the training set, eventually converging to a minimum. It comes in several variants: Batch GD, Mini-batch GD and Stochastic GD.

But what if your data is more complex than a simple straight line? Polynomial regression is a more complex model that can fit non-linear datasets. However, since it has more parameters than linear regression, it is more prone to overfitting the training set.

In this part, we list points to watch and tips to keep in mind when dealing with common core modeling techniques:

prediction use cases: linear regression, polynomial regression
classification use cases: logistic regression, softmax regression

Linear Regression tips:

Normal Equation

Cost function: Although the most common performance measure for a regression model is the Root Mean Square Error (RMSE), it is simpler to minimize the Mean Square Error (MSE), and it leads to the same result, because the parameters that minimize a positive function also minimize its square root.
Normal Equation (closed form): It is a mathematical equation that directly gives the parameters that minimize the cost function.
1. Cons - Computational Complexity: The Normal Equation relies on inverting an 'n x n' matrix (n being the number of features), a computation whose cost grows much faster than linearly with 'n' (typically between O(n^2.4) and O(n^3), depending on the implementation). It therefore becomes very slow when 'n' grows large.
2. Pros - Linear: The computation is linear with regard to the number of instances in the training set, so it handles large training sets efficiently, provided they fit in memory.
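As a minimal sketch of the closed-form approach, the snippet below solves the Normal Equation with NumPy on synthetic data. The data-generating line (y = 4 + 3x plus noise) and all variable names are assumptions made for illustration, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (illustrative): y = 4 + 3x + Gaussian noise
m = 200
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0.0, 0.5, size=m)

# Add the bias column x0 = 1 so the intercept is learned as theta[0]
X_b = np.c_[np.ones((m, 1)), X]

# Normal Equation: theta = (X^T X)^(-1) X^T y
# Solving the linear system avoids forming the explicit inverse.
theta_normal = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# Least-squares solver, usually more robust when X^T X is close to singular
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

print("Normal Equation:", theta_normal)
print("lstsq:          ", theta_lstsq)
```

Both estimates should land close to the intercept and slope used to generate the data. In practice a least-squares solver is often the safer choice, since X^T X may not be invertible (for example with redundant features).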

Gradient Descent

Use case: Better suited for cases where there are a large number of features, or too many training instances to fit in memory.
Learning rate: it is an important hyperparameter to set carefully, since:
- a value that is too low will make the algorithm take a long time to converge
- a value that is too high may make the algorithm diverge, failing to find a good solution
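To make the effect of the learning rate concrete, here is a small sketch that runs Batch GD on the MSE cost for a fixed number of steps with three learning rates. The helper name batch_gd_mse, the synthetic data and the three values of eta are illustrative assumptions.

```python
import numpy as np

def batch_gd_mse(X_b, y, eta, n_steps):
    """Run batch gradient descent on the MSE cost and return the final MSE."""
    m = len(y)
    theta = np.zeros(X_b.shape[1])
    for _ in range(n_steps):
        gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
        theta = theta - eta * gradients
    return float(np.mean((X_b @ theta - y) ** 2))

rng = np.random.default_rng(0)

# Synthetic data (illustrative): y = 4 + 3x + Gaussian noise
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0.0, 0.5, size=m)
X_b = np.c_[np.ones((m, 1)), X]

# Same number of steps, three learning rates: too low, reasonable, too high
final_mse = {eta: batch_gd_mse(X_b, y, eta, n_steps=50) for eta in (0.001, 0.1, 1.0)}
for eta, mse in final_mse.items():
    print(f"eta={eta}: MSE after 50 steps = {mse:.4g}")
```

On this data, the smallest rate has barely moved after 50 steps, the middle one is close to the noise floor, and the largest one blows up.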
Cost function: Not all cost functions have a single minimum (bowl shape), so the algorithm may get stuck in a local minimum instead of reaching the global one.
Random initialization of parameters: This is a common way to start the Gradient Descent algorithm. However, it can lead to different outcomes (different resulting minima) when the cost function has more than one minimum.
MSE cost function: Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function, which means it has no local minima, just one global minimum. Gradient Descent is therefore guaranteed to approach arbitrarily close to the global minimum, provided you wait long enough and the learning rate is not too high.
Feature scaling: When using GD, you should ensure that all features have a similar scale (e.g. using Scikit-Learn's StandardScaler), or else it will take much longer to converge.
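A minimal sketch of feature scaling with Scikit-Learn's StandardScaler follows. The two features and their scales are made up for illustration. The scaler is fitted on the training set only and then reused on the test set, so that no information from the test set leaks into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Two hypothetical features on very different scales
X_train = np.c_[rng.normal(0.0, 1.0, 500), rng.normal(5000.0, 2000.0, 500)]
X_test = np.c_[rng.normal(0.0, 1.0, 100), rng.normal(5000.0, 2000.0, 100)]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on the training set only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print("mean per feature:", X_train_scaled.mean(axis=0))
print("std per feature: ", X_train_scaled.std(axis=0))
```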
Implementation variants:
- Batch GD: Instead of computing the partial derivatives one parameter at a time, this approach computes the gradient vector, which contains them all, in one go. Each step involves the whole batch of training data, which makes it terribly slow on very large training sets. However, it scales well with the number of features: training a Linear Regression model with hundreds of thousands of features is much faster with GD than with the Normal Equation.
- Stochastic GD: In contrast to Batch GD, it picks a single random instance at every step and computes the gradients based only on that instance. This makes it much faster and able to handle very large training sets. However, the cost function bounces up and down instead of decreasing smoothly, so once the algorithm stops, the final parameter values are generally good but not optimal. When the cost function is very irregular, this randomness helps the algorithm jump out of local minima, so Stochastic GD has a better chance of finding the global minimum than Batch GD does. But it also means that the algorithm can never settle at the minimum. One solution is to gradually reduce the learning rate, a process akin to simulated annealing. The function that determines the learning rate at each iteration is called the learning schedule.
- Mini-batch GD: It sits between Batch and Stochastic GD, computing the gradients on small random subsets of the training set called mini-batches. Its progress is less erratic than SGD's, so it ends up walking around a bit closer to the minimum than SGD. However, it may be harder for it to escape from local minima (for problems that have local minima, unlike Linear Regression).
Convergence accuracy: Batch GD actually stops at the minimum, while SGD and Mini-batch GD keep walking around near it. However, Batch GD takes a lot of time at each step, and SGD and Mini-batch GD would also reach the minimum with a good learning schedule.
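The three variants differ only in how many instances feed each gradient step, so they can be sketched with a single function. Everything below is an illustrative assumption: the function names, the schedule eta = t0 / (t + t1) with t0 = 5 and t1 = 50, the epoch counts and the synthetic data. It is a sketch for Linear Regression with the MSE cost, not a general-purpose optimizer.

```python
import numpy as np

def learning_schedule(t, t0=5.0, t1=50.0):
    """Hypothetical schedule: the learning rate shrinks as the step count t grows."""
    return t0 / (t + t1)

def gradient_descent(X_b, y, batch_size, n_epochs=50, seed=0):
    """Minimize the MSE with gradients computed on batches of `batch_size` rows.

    batch_size == len(y) -> Batch GD
    batch_size == 1      -> Stochastic GD
    anything in between  -> Mini-batch GD
    """
    rng = np.random.default_rng(seed)
    m, n = X_b.shape
    theta = rng.normal(size=n)  # random initialization
    t = 0
    for _ in range(n_epochs):
        indices = rng.permutation(m)  # reshuffle the training set every epoch
        for start in range(0, m, batch_size):
            batch = indices[start:start + batch_size]
            X_batch, y_batch = X_b[batch], y[batch]
            gradients = (2 / len(batch)) * X_batch.T @ (X_batch @ theta - y_batch)
            theta = theta - learning_schedule(t) * gradients
            t += 1
    return theta

rng = np.random.default_rng(42)

# Synthetic data (illustrative): y = 4 + 3x + Gaussian noise
m = 200
X = 2 * rng.random((m, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0.0, 0.5, size=m)
X_b = np.c_[np.ones((m, 1)), X]

theta_batch = gradient_descent(X_b, y, batch_size=m, n_epochs=1000)
theta_sgd = gradient_descent(X_b, y, batch_size=1, n_epochs=50)
theta_mini = gradient_descent(X_b, y, batch_size=20, n_epochs=200)

print("Batch GD:     ", theta_batch)
print("Stochastic GD:", theta_sgd)
print("Mini-batch GD:", theta_mini)
```

Batch GD is given more epochs here because it takes only one step per epoch, whereas SGD takes one step per instance. With the decaying learning rate, all three should end up near the intercept and slope used to generate the data.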

Polynomial Regression tips:

Scikit-Learn's PolynomialFeatures(degree=d) class transforms the training set by adding powers of each feature, along with all combinations of features up to the given degree, which makes the model capable of finding relationships between features. For example, with two features a and b, PolynomialFeatures with degree=3 would add not only the features a^2, a^3, b^2 and b^3, but also the combinations ab, a^2b and ab^2. So beware of the combinatorial explosion of the number of features.
Learning curves help assess whether a model is too complex or too simple, and thus whether it is overfitting or underfitting the training set. If the model is underfitting the training set, adding more training examples will not help: you need to use a more complex model (e.g. a higher-degree polynomial) or come up with better features.
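One way to obtain learning curves is Scikit-Learn's learning_curve helper, sketched below on an underfitting case: a straight line fitted to quadratic data. The data, the number of training sizes and the 5-fold cross-validation are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(7)

# Illustrative quadratic data that a straight line will underfit
m = 300
X = 6 * rng.random((m, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0.0, 1.0, size=m)

train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# Scores are negated RMSE values, averaged here over the cross-validation folds
train_rmse = -train_scores.mean(axis=1)
valid_rmse = -valid_scores.mean(axis=1)

for size, tr, va in zip(train_sizes, train_rmse, valid_rmse):
    print(f"{int(size):4d} instances: train RMSE={tr:.2f}, validation RMSE={va:.2f}")
```

For an underfitting model, both curves typically plateau close to each other at a fairly high error, well above the noise level (standard deviation 1.0 in this data). An overfitting model would instead show a low training error with a clear gap to the validation error.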
Increasing a model's complexity will typically increase its variance and reduce its bias. Conversely, reducing a model's complexity increases its bias and reduces its variance.
Regularization is a technique used to reduce overfitting. Ridge Regression, Lasso Regression and ElasticNet are different ways to constrain (regularize) a polynomial model.
Ridge Regression adds a regularization term to the cost function, which forces the learning algorithm not only to fit the data but also to keep the model weights as small as possible.
Lasso regression automatically performs feature selection and outputs a sparse model (with few non-zero feature weights). It tends to completely eliminate the weights of the least important features.
ElasticNet is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both.
Ridge is a good default regularization technique, but if you suspect that only a few features are actually useful, then you should prefer either Lasso or ElasticNet.
ElasticNet is generally preferred over Lasso, since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.
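The three regularized models above can be compared with a short sketch. The data (10 features, of which only the first two matter) and the values of alpha and l1_ratio are illustrative assumptions, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Illustrative data: 10 features, only the first two influence the target
m, n = 200, 10
X = rng.normal(size=(m, n))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.5, size=m)

# Regularized models are sensitive to feature scale, so scale first
X_scaled = StandardScaler().fit_transform(X)

models = {
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),  # l1_ratio sets the Lasso/Ridge mix
}

n_zero_weights = {}
for name, model in models.items():
    model.fit(X_scaled, y)
    n_zero_weights[name] = int(np.sum(model.coef_ == 0.0))
    print(f"{name:12s} weights: {np.round(model.coef_, 2)}")

print(n_zero_weights)
```

On data like this, Ridge shrinks the weights of the useless features without making them exactly zero, while Lasso tends to set most of them to exactly zero.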
Early Stopping is another way to regularize iterative learning algorithms such as Gradient Descent: training stops as soon as the validation error (e.g. the RMSE) reaches a minimum. However, with Stochastic and Mini-batch GD, the curves are not so smooth, and it may be hard to know whether you have reached the minimum or not. One solution is to stop only after the validation error has stayed above its minimum for some time, then roll back to the parameters where it was lowest.
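Here is a minimal NumPy sketch of the idea, using plain Batch GD on a deliberately over-flexible degree-12 polynomial. Rather than halting at the first increase, it runs a fixed number of epochs and keeps a copy of the parameters with the lowest validation RMSE. The data, the degree, the learning rate and the helper design_matrix are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative noisy quadratic data, split into training and validation sets
m = 120
x = 6 * rng.random(m) - 3
y = 0.5 * x ** 2 + x + 2 + rng.normal(0.0, 1.0, size=m)

def design_matrix(x, degree, mean=None, std=None):
    """Polynomial features x, x^2, ..., x^degree, standardized, plus a bias column."""
    powers = np.column_stack([x ** d for d in range(1, degree + 1)])
    if mean is None:
        mean, std = powers.mean(axis=0), powers.std(axis=0)
    return np.c_[np.ones(len(x)), (powers - mean) / std], mean, std

X_train, mean, std = design_matrix(x[:80], degree=12)
X_valid, _, _ = design_matrix(x[80:], degree=12, mean=mean, std=std)  # training statistics
y_train, y_valid = y[:80], y[80:]

theta = np.zeros(X_train.shape[1])
best_rmse, best_epoch, best_theta = np.inf, -1, theta.copy()
eta = 0.01

for epoch in range(2000):
    gradients = (2 / len(y_train)) * X_train.T @ (X_train @ theta - y_train)
    theta = theta - eta * gradients
    valid_rmse = float(np.sqrt(np.mean((X_valid @ theta - y_valid) ** 2)))
    if valid_rmse < best_rmse:
        # Remember the best parameters so we can roll back to them afterwards
        best_rmse, best_epoch, best_theta = valid_rmse, epoch, theta.copy()

final_rmse = valid_rmse
print(f"best validation RMSE {best_rmse:.3f} at epoch {best_epoch}")
print(f"validation RMSE after the last epoch: {final_rmse:.3f}")
```

Using best_theta instead of the final theta is the roll-back step. A production version would usually also stop once the validation error has not improved for a set number of epochs, instead of always running to the end.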

Classification tips:

Logistic Regression is commonly used for classification, as it estimates the probability that an instance belongs to a particular class (e.g. how likely is it that an email is spam?). It relies on a sigmoid (S-shaped) function that outputs a number between 0 and 1: the closer to 0, the less likely; the closer to 1, the more likely.
Bad news: There is no known closed-form equation to compute the value of parameters that minimize the logistic regression cost function.
Good news: The cost function is convex, so Gradient Descent (or any other optimization algorithm) is guaranteed to find the global minimum if the learning rate is not too large and you wait long enough.
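To illustrate, the sketch below trains a Logistic Regression model from scratch with Batch GD on the log loss. The synthetic one-feature dataset, the learning rate and the number of iterations are assumptions chosen for illustration. In practice you would more likely use a library implementation such as Scikit-Learn's LogisticRegression.

```python
import numpy as np

def sigmoid(t):
    """S-shaped logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def log_loss(theta, X_b, y):
    """Logistic Regression cost: the average negative log-likelihood."""
    p = np.clip(sigmoid(X_b @ theta), 1e-12, 1 - 1e-12)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

rng = np.random.default_rng(11)

# Illustrative 1-D binary problem: class 1 becomes more likely as x grows
m = 300
x = rng.normal(0.0, 2.0, size=m)
y = (rng.random(m) < sigmoid(1.5 * x - 0.5)).astype(float)
X_b = np.c_[np.ones(m), x]  # bias column + feature

theta = np.zeros(2)
eta = 0.1
initial_cost = log_loss(theta, X_b, y)

for _ in range(2000):
    gradients = X_b.T @ (sigmoid(X_b @ theta) - y) / m
    theta = theta - eta * gradients

final_cost = log_loss(theta, X_b, y)
proba = sigmoid(X_b @ theta)              # estimated probability of class 1
predictions = (proba >= 0.5).astype(int)  # predict class 1 when p >= 0.5

print(f"cost: {initial_cost:.3f} -> {final_cost:.3f}, theta = {theta}")
```

Because the cost is convex, the starting point does not matter here: any initialization leads towards the same minimum, given a suitable learning rate.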
Softmax Regression, or Multinomial Logistic Regression, is a generalization of Logistic Regression that supports multiple classes directly.
The Softmax Regression classifier predicts only one class at a time (it is multiclass, not multioutput), so it should be used only with mutually exclusive classes such as different types of plants. You cannot use it to recognize multiple people in one picture.
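The softmax function itself is short enough to sketch in NumPy. The class scores below are hypothetical values chosen for illustration. In a real model they would come from one linear score function per class.

```python
import numpy as np

def softmax(scores):
    """Turn each row of class scores into probabilities that sum to 1."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

# Hypothetical scores for 2 instances over 3 mutually exclusive classes
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])

probabilities = softmax(scores)
predicted_class = probabilities.argmax(axis=1)  # exactly one class per instance

print(np.round(probabilities, 3))
print(predicted_class)
```

Since the probabilities in each row sum to 1, raising the probability of one class necessarily lowers the others, which is why the classes have to be mutually exclusive.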