Is Machine Learning just glorified curve fitting?
One of the first mathematical concepts we learn in school is the addition of two numbers, and we build on that to learn the other simple arithmetic operations. A few years later comes algebra, and along with it the magical variable "x", which seems to be able to encode all the unknowns in the universe! Given an equation (e.g. 4x + 5 = 10), we solve it to figure out the value of "x" that satisfies it. At the next level of abstraction comes the concept of a "function", which encodes a relationship between two or more variables. If we know how the variable "y" depends on "x" (denoted y = f(x)), then for any given value of "x" we can use this relationship to find the value of "y". Simple examples are trigonometric functions like sin(x) or cos(x).
Evaluating a variable y for a known function f(x) and a given value of x is easy. But what if we know the values of y for a few values of x, and need to figure out what the function f(x) is? This is where curve fitting comes in! For this to work, though, we need a reasonable guess for the form of f(x). For example, through visual inspection we may see that y = f(x) is a quadratic function or a higher-degree polynomial.
Let's take a quadratic function for simplicity: y = ax² + bx + c
The objective in curve fitting is then to figure out the values of the three unknown parameters (a, b and c) using known values of x and y. Of course, a single pair of x and y values would not be enough; we need a large enough number of (x, y) pairs to get the job done. How large a number depends on how complicated our function f(x) is and how noisy the x and y data is. If our data is perfect and free of errors, then we need just three points to fit a quadratic function (since there are three unknown parameters). But that is seldom the case in practice. For noisy data, there is no simple formula to figure out how much data will be needed, and this is one of the things that makes Data Science as much an art as a science.
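To make this concrete, here is a minimal sketch in Python using NumPy's polyfit. The true coefficients (a = 2, b = -3, c = 1) and the noise level are illustrative assumptions, not anything prescribed above:

```python
import numpy as np

# Illustrative ground truth (an assumption): y = 2x^2 - 3x + 1
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 30)
y = 2 * x**2 - 3 * x + 1 + rng.normal(scale=0.5, size=x.size)  # noisy samples

# Fit a degree-2 polynomial: recover estimates of a, b, c from the (x, y) pairs
a, b, c = np.polyfit(x, y, deg=2)
print(f"a ≈ {a:.2f}, b ≈ {b:.2f}, c ≈ {c:.2f}")
```

With 30 noisy points, the recovered coefficients come close to, but do not exactly match, the true values; with noise-free data, three points would suffice.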
Machine Learning (ML) takes this concept of curve fitting to another level altogether! Of course, from a purely mathematical perspective it is still just a more complicated version of conventional curve fitting, but from a conceptual perspective there are fundamental differences that ML practitioners need to understand. A small caveat: this also depends on which ML method we are talking about. For example, Logistic Regression is one of the most popular ML techniques and is widely used for binary classification, but it can be considered very close to conventional curve fitting. More advanced ML techniques like Artificial Neural Networks (ANNs), on the other hand, are significantly different from conventional curve fitting right from the start. So what makes ANNs so special and different?
The most special property of ANNs is that they are universal function approximators (a result known as the universal approximation theorem). For any given data set consisting of input values and the corresponding output values, ANNs can, in theory, find an approximate function that connects the two. Of course, actually finding this approximate function can be very hard if the data quality is not good enough (e.g. highly noisy), or if the function is too complicated (e.g. highly nonlinear)! The good news is that over the years, ML folks have developed some amazing techniques to find such approximate functions using ANNs for a wide variety of datasets.
So the challenge for an ANN is not just to approximate a single known function, but to find an appropriate function from the space of all possible functions that best fits the given data. This is what makes ANNs very different from conventional curve fitting, and it is also what makes them so magical and ubiquitous in ML!
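As a minimal sketch of this idea, here is an ANN fitting a nonlinear function using scikit-learn. The target function, the network size, and the other settings are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Illustrative target (an assumption for this sketch): y = sin(3x) + 0.5x
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(500, 1))
y = np.sin(3 * x).ravel() + 0.5 * x.ravel()

# A small ANN: note that no functional form for f(x) is specified anywhere;
# the network searches the space of functions it can represent.
ann = MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=5000, random_state=0)
ann.fit(x, y)

print(ann.predict([[0.5]]))  # network's estimate of f(0.5)
print(np.sin(1.5) + 0.25)    # true value, for comparison
```

Unlike the quadratic fit earlier, we never told the model which family of curves to search in; we only chose the network's capacity.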
The second aspect that differentiates ANNs from conventional curve fitting is that ANNs have to find an approximate function that does well on the test data (which the algorithm has not seen) as well as on the training data (which the algorithm uses to find the appropriate function parameter values). This is the concept of generalisation, as opposed to pure optimisation. In other words, the usefulness of an ANN lies not just in explaining a given set of data that is currently available, but also in making good predictions on new data that one may come across in the future. In technical terms, this is about avoiding the problem of over-fitting. It is this point that makes even Logistic Regression different from simple curve fitting.
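Continuing the sketch above, a held-out test set makes this distinction concrete; the split ratio and noise level are again illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Same illustrative data as before (assumed target: y = sin(3x) + 0.5x), now noisy
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(500, 1))
y = np.sin(3 * x).ravel() + 0.5 * x.ravel() + rng.normal(scale=0.1, size=500)

# Hold out test data that the model never sees during training
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

ann = MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=5000, random_state=0)
ann.fit(x_train, y_train)

# Similar scores on both sets suggest the model generalises rather than over-fits;
# a high training score with a much lower test score would signal over-fitting.
print("train R^2:", ann.score(x_train, y_train))
print("test  R^2:", ann.score(x_test, y_test))
```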
The relationship between ML and curve fitting is perhaps similar to that between a biological system and particle physics. At the fundamental level, it's all an interplay of particles, but understanding how biological systems work requires a lot more than just the Standard Model of particle physics.