Slides from a university course about regression. The document explores regression and supervised learning methods, focusing on Ridge Regression and Lasso, with examples and mathematical formulas. It is useful for understanding model evaluation through metrics like MSE, MAE, and R-squared, and for addressing overfitting.
Supervised learning is a setting in which the data come from an unknown function f that maps an input x to an output t: D = {(x, t)}. The input variables x are usually called features or attributes; the output variables t are also called targets or labels. The goal is to find the best possible approximation of the mapping function f. The function f does not need to be deterministic: it can have statistical uncertainty. Two assumptions must be made: that the relationship f exists for the data in the domain D, and that it is repeatable over the same dataset in time.
Depending on the nature of t, there will be different forms of supervised learning:
If we imagine drawing the set F of all the possible functions that go from the input space to the target space, we can assume that it contains exactly one function f, the true one, that we want to approximate. In any problem, we define an input, a target, and a domain of (usually infinitely many) functions mapping the input into the target. Supervised learning solves an optimization problem: it approximates the function f given the dataset D.
When defining the hypothesis space, we cannot be sure that it will contain the optimal approximation. In the example, h1 is the best approximation of f within the hypothesis space H. If the family of functions chosen to find an approximation is larger, then we can hope that the larger hypothesis space contains the true function f. If the optimization process is good enough, we will then be able to find exactly the mapping function f. However, this is not always the case.
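[Figure: the function space F with the true function f; the hypothesis space H with its best approximation h1, and the larger hypothesis space H2 in which h2 = f.]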
The true problem of supervised learning is that f is unknown, therefore it is almost impossible to end up having a "nice" loss function as in the image. This is the key problem of supervised learning, the most important concept behind machine learning.
Let's consider the true function f generating the data in the image.
The data are given by the red line plus some noise, so each target can be represented as t = f(x) + ε, where ε denotes the noise.
When trying to approximate the red line, all we can rely on are a few data points (the circles are 4 data points).
If we choose a family of linear approximation functions (a very simple choice), the best thing we can learn is something like the line in the image, which is very different from the original function f.
The real problem when having just these four points is that we have no way of evaluating the approximation function h1. That's because we cannot tell how far h1 is from f in the other regions of the input space, where we have no points. Hence, we cannot improve our approximation function.
h2 does not correspond to the optimal value of the loss function. Sometimes, a restriction in the hypothesis space can lead to better solutions, as larger mistakes can be avoided thanks to that restriction.
We can increase the size of the hypothesis space by moving to a quadratic model, obtaining h2. From the point of view of the 4 known points, this curve seems to fit perfectly. However, if we compare the errors made with respect to the true function (the red line), h1 makes the smaller error.
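[Figure: the function space F containing the true function f and the hypothesis spaces H2 and H3, with the approximations h2 and h3.]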
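The comparison between a linear and a quadratic hypothesis space can be sketched numerically. This is only an illustration under stated assumptions: the "true" function (a sine), the four input points, and the noise level are all hypothetical and are not the ones in the slides. The error against f is computable here only because we chose f ourselves; in a real problem it is unknown.

```python
import numpy as np

# Hypothetical "true" function, chosen only for illustration.
def f(x):
    return np.sin(x)

rng = np.random.default_rng(0)

# Four noisy observations of f (the "circles" in the figure).
x_train = np.array([-2.0, -0.5, 1.0, 2.5])
t_train = f(x_train) + rng.normal(scale=0.1, size=x_train.shape)

# Dense grid used to measure the distance from f itself.
x_grid = np.linspace(-3.0, 3.0, 200)

def mse(y, t):
    """Mean squared error between predictions y and targets t."""
    return float(np.mean((y - t) ** 2))

results = {}
for degree in (1, 2):
    coeffs = np.polyfit(x_train, t_train, deg=degree)  # least-squares fit
    train_error = mse(np.polyval(coeffs, x_train), t_train)
    true_error = mse(np.polyval(coeffs, x_grid), f(x_grid))
    results[degree] = (train_error, true_error)
    print(f"degree {degree}: train MSE = {train_error:.4f}, "
          f"MSE vs f = {true_error:.4f}")
```

The training error of the quadratic fit can never exceed that of the linear fit, because the linear models are a subset of the quadratic ones. The error with respect to f carries no such guarantee, which is the point made above.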
The linear regression model is not as limiting as it may appear, and it is not used only in the simplest cases.
In the regression problem, we have some data consisting of input variables (e.g., x) and a target variable (e.g., y). The aim is to find a model to predict the value of the target variable based on the input variables.
In the simple setting reported in the figure, the easiest way to predict the variable y is to use the average of y itself as a predictor. It is a very simple baseline, but it represents the simplest possible model and gives a reference value against which to evaluate other models. Since it is the baseline, we expect all other models to do better.
A simple linear model is basically a line that tries to minimize the errors while fitting all the data points. Note that we are not only training the red line to find the relationship between x and y, but also to apply this model to a new set of data points. It is likely that on the new data the model will make some mistakes. Still, the aim of training is to minimize this error when we apply the model to different datasets.
[Figure: the training data points (y against x) and the fitted line y = 34.574x + 53.176.]
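The average baseline and the simple linear model can be compared on a small example. This is a minimal sketch on hypothetical toy data (not the data of the figure), using `np.polyfit` as one convenient way to obtain the least-squares line.

```python
import numpy as np

# Hypothetical toy data, not taken from the slides.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
t = np.array([-15.0, 20.0, 50.0, 90.0, 120.0])

# Baseline: always predict the average of the targets.
baseline = t.mean()
baseline_mse = float(np.mean((t - baseline) ** 2))

# Simple linear model: least-squares line t ~ w0 + w1 * x.
w1, w0 = np.polyfit(x, t, deg=1)  # polyfit returns the highest degree first
line_mse = float(np.mean((t - (w0 + w1 * x)) ** 2))

print(f"baseline prediction = {baseline:.1f}, MSE = {baseline_mse:.1f}")
print(f"line: w0 = {w0:.1f}, w1 = {w1:.1f}, MSE = {line_mse:.1f}")
```

On the training data a least-squares line with an intercept cannot do worse than the average, since the constant predictor is the special case w1 = 0. On new data this is not guaranteed.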
We need to quantify in an objective way the error made in order to minimize it. In general, mistakes are represented with loss functions. The most convenient is the residual sum of squares.
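$$\mathrm{RSS}(w_0, w_1) = \sum_{i=1}^{N} \epsilon_i^2 = \sum_{i=1}^{N} \bigl(t_i - (w_0 + w_1 x_i)\bigr)^2$$
[Figure: loss values for different choices of the weights (w0, w1): L = 2.8 · 10^5, L = 3.3 · 10^5, L = 7.6 · 10^3, L = 5.4 · 10^4, L = 1.0 · 10^5, L = 1.0 · 10^6.]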
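The residual sum of squares can be computed in a few lines. This is a minimal sketch: the function name `rss` and the three data points are assumptions made for the example.

```python
import numpy as np

def rss(w0, w1, x, t):
    """Residual sum of squares of the line w0 + w1 * x on the data (x, t)."""
    residuals = t - (w0 + w1 * x)
    return float(np.sum(residuals ** 2))

# Hypothetical data lying exactly on the line t = 1 + 2x.
x = np.array([0.0, 1.0, 2.0])
t = np.array([1.0, 3.0, 5.0])

print(rss(1.0, 2.0, x, t))  # the generating line: prints 0.0
print(rss(0.0, 2.0, x, t))  # every residual equals 1: prints 3.0
```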
From a mathematical perspective, the RSS is a quadratic loss function. Thus, in order to find the optimal values of w0 and w1, we look for the minimum of this function. To do so, we compute the derivative with respect to each weight.
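$$\nabla \mathrm{RSS}(w_0, w_1) = \begin{bmatrix} -2 \sum_{i=1}^{N} \bigl(t_i - (w_0 + w_1 x_i)\bigr) \\ -2 \sum_{i=1}^{N} \bigl(t_i - (w_0 + w_1 x_i)\bigr)\, x_i \end{bmatrix}$$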
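The two partial derivatives can be checked numerically against finite differences. This is a sketch on hypothetical data; the names `rss` and `rss_gradient` are introduced here only for illustration.

```python
import numpy as np

def rss(w0, w1, x, t):
    """Residual sum of squares of the line w0 + w1 * x."""
    return float(np.sum((t - (w0 + w1 * x)) ** 2))

def rss_gradient(w0, w1, x, t):
    """Gradient of the RSS: partial derivatives w.r.t. w0 and w1."""
    residuals = t - (w0 + w1 * x)
    return np.array([-2.0 * np.sum(residuals),
                     -2.0 * np.sum(residuals * x)])

# Hypothetical data.
x = np.array([0.0, 1.0, 2.0])
t = np.array([1.0, 3.0, 5.0])

# Analytic gradient at w0 = w1 = 0.
analytic = rss_gradient(0.0, 0.0, x, t)

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.array([
    (rss(eps, 0.0, x, t) - rss(-eps, 0.0, x, t)) / (2 * eps),
    (rss(0.0, eps, x, t) - rss(0.0, -eps, x, t)) / (2 * eps),
])

print(analytic)  # [-18. -26.]
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```

At w0 = w1 = 0 the residuals are simply the targets (1, 3, 5), so the two components are -2 · 9 = -18 and -2 · 13 = -26.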
Since the RSS is a multivariate function (it depends on both w0 and w1), differentiating it gives us two functions. Together they form the gradient, the vector containing all the partial derivatives of a function. Then, we have two possible approaches:
[Figure: the RSS as a function of the weight w, showing the gradient ∇RSS, the initial weight, and the optimum w*.]
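The two approaches are not spelled out in the text that survives here. A common pair, assumed in this sketch, is to set the gradient to zero and solve for the weights in closed form, or to start from an initial weight and follow the negative gradient iteratively (gradient descent). The data, learning rate, and number of iterations below are hypothetical choices.

```python
import numpy as np

# Hypothetical toy data.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
t = np.array([-15.0, 20.0, 50.0, 90.0, 120.0])

# Design matrix with a column of ones for the bias w0.
X = np.column_stack([np.ones_like(x), x])

# Approach 1 (assumed): closed-form least-squares solution.
w_closed, *_ = np.linalg.lstsq(X, t, rcond=None)

# Approach 2 (assumed): gradient descent from an initial weight of zero.
w = np.zeros(2)
learning_rate = 0.01
for _ in range(2000):
    residuals = t - X @ w
    gradient = -2.0 * X.T @ residuals  # gradient of the RSS
    w = w - learning_rate * gradient

print(w_closed)  # approximately [53. 34.]
print(w)         # approximately the same weights
```

The learning rate must be small enough for the iteration to converge; with these data 0.01 is sufficient, but this value is not general.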
The simple linear model can be extended to a higher-dimensional input space. We use the same approach as before, now with D variables. The first variable is always equal to 1, so the input vector consists of 1 followed by D-1 variables. This is because the final model is the scalar product of the input vector and the weight vector: since the first weight w0 is a baseline, we multiply it by 1 (the first variable) so that its value is not changed.
$$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{D-1} w_j x_j = \mathbf{w}^T \mathbf{x}, \qquad \mathbf{x} = (1, x_1, \ldots, x_{D-1})$$
w0 is the bias parameter associated with the dummy variable 1. Note that the T above the w means transposed: transposition is necessary for the scalar product (it must be a row vector multiplied by a column vector).
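The role of the dummy variable can be seen in a short sketch. The feature values and the weights are hypothetical; stacking the samples as rows of a matrix `X` is a convenience of this example, so that all scalar products are computed at once.

```python
import numpy as np

# Hypothetical inputs: 3 samples with D - 1 = 2 features each.
features = np.array([[2.0, 3.0],
                     [0.0, 1.0],
                     [4.0, -1.0]])

# Prepend the dummy variable 1 to every input vector.
X = np.column_stack([np.ones(len(features)), features])

# Hypothetical weights: w0 (bias), w1, w2.
w = np.array([1.0, 0.5, -2.0])

# Each prediction is the scalar product w^T x of one row with w.
y = X @ w
print(y)  # [-4. -1.  5.]
```

For the first sample, 1 · 1 + 0.5 · 2 - 2 · 3 = -4: the bias enters the sum unchanged because it is multiplied by the dummy 1.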
Even though we can generalize to multiple variables, we still have the problem of linearity. The model shown before is constrained to be linear, thus higher-order relationships cannot be well approximated.
A linear combination of the input variables is not always enough to model the data. However, we don't need the model to be linear in the input variables, as long as the regression model is linear in the parameters. Indeed, we can increase the order of the input variables and still have a scalar product between weights and inputs. To increase the order, we apply a feature mapping φ to the input vector.
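We can define a model using non-linear basis functions:
$$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}), \qquad \boldsymbol{\phi}(\mathbf{x}) = (1, \phi_1(\mathbf{x}), \ldots, \phi_{M-1}(\mathbf{x}))^T$$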
In this case, we are adding a new variable x², the feature obtained by squaring the variable x. If we combine the two variables, we can identify a new linear model using x and x² in a higher-dimensional space (from 1D to 2D). Note that the new variable may be more significant than the original one: we may end up with a feature space of the same dimension as the input space (if only the new variable is kept), or both variables may be significant, in which case the feature space has a higher dimension. In the end, the solution in the original input space will appear as in the yellow balloon.
[Figure: the data plotted as t against x, and in the feature space with axes x and x².]
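The quadratic feature mapping can be sketched as follows. The mapping φ(x) = (1, x, x²) matches the example above, while the function name `phi` and the noise-free quadratic data are assumptions of this sketch. The model stays linear in the weights, so an ordinary least-squares solver is enough.

```python
import numpy as np

def phi(x):
    """Feature mapping x -> (1, x, x^2), applied to each sample."""
    return np.column_stack([np.ones_like(x), x, x ** 2])

# Hypothetical noise-free data from t = 1 - 2x + 3x^2.
x = np.linspace(-2.0, 2.0, 9)
t = 1.0 - 2.0 * x + 3.0 * x ** 2

# Linear least squares in the feature space (1, x, x^2).
Phi = phi(x)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# The prediction is still a scalar product: y = w^T phi(x).
y = Phi @ w
print(np.round(w, 6))
```

Because the data are noise-free, the fitted weights recover the generating coefficients (1, -2, 3) up to rounding. With noisy data they would only approximate them.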