Comparing Machine Learning Methods for Dimension Reduction
Data analysts working in an ever-growing number of fields have access to big datasets. But a massive quantity of observations doesn’t guarantee that analysts can understand the important structures in the data. The datasets might have many variables, or dimensions, that are relevant to the analysis. As a result, analysts quickly run into the curse of dimensionality: every additional dimension makes the data sparser, making it harder and harder to find compelling patterns. Dimension reduction techniques, as the moniker suggests, counter the curse of dimensionality by transforming the data into lower-dimensional representations that are less sparse than the original high-dimensional data while preserving its important variation.
To understand what’s going on here, check out the video below. In the real world, the object is a 3-D arrangement of pearls; but when we watch the video on our screen, we see that 3-D object projected as a 2-D image. That is dimension reduction. Note also that, as the object rotates, there is no obvious pattern in most configurations. Some configurations are not useful because the points cluster too closely; sometimes one pearl will even be hidden behind another. That is why many machine learning algorithms attempt to maximize variance: they’re trying to avoid these dense clusters that conceal information. Instead, we want to uncover the high-variation configurations that allow us to detect meaningful patterns.
Keep watching until the end. Eventually, the rotation is just right, and suddenly what had been a meaningless jumble of pearls clicks into place as a clearly recognizable shape: the Big Dipper constellation. Order from chaos. This discovery of pattern through dimension reduction, and the search for the “best possible” reduction, is key to the science of machine learning.
There are various ways of accomplishing the goal of dimension reduction. The most computationally straightforward method is Principal Component Analysis (PCA), which uses linear transformations to find the dimensions that best capture the variation in the data. This method performs best when the structure of the data is linear; as the data becomes increasingly nonlinear and complex, PCA loses its usefulness because the linear assumptions it relies on become less descriptive of the actual data space.
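To make this concrete, here is a minimal PCA sketch (assuming scikit-learn and a small synthetic dataset, neither of which comes from the discussion above): the algorithm finds the orthogonal directions of greatest variance and projects the data onto the top few.

```python
# Minimal PCA sketch using scikit-learn on toy data (illustrative assumptions).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))        # 500 observations in 10 dimensions
X[:, 0] *= 5                          # give one direction much more variance

pca = PCA(n_components=2)             # keep the two highest-variance directions
X_2d = pca.fit_transform(X)           # linear projection into 2-D

print(X_2d.shape)                     # (500, 2)
print(pca.explained_variance_ratio_)  # share of variance captured per component
```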
If the data is nonlinear, as is often the case in high-dimensional real-world datasets, we need to move past PCA to methods that can fit a more complex data structure: a nonlinear manifold. A nonlinear manifold is a complex shape that may be twisted and contorted but remains smooth at every point (in mathematical terms, it is everywhere differentiable). In other words, it’s something like a sheet of rubber, which might be extremely crumpled and deformed but will never have a sharp point. Of the methods that can accommodate nonlinear contexts, Locally Linear Embedding (LLE) is probably the simplest modification of PCA, in that it retains PCA’s linear assumptions. Nevertheless, it can model nonlinear data by restricting the scope of its analysis to local areas of the data. Although the manifold may be nonlinear, as we zoom in to smaller and smaller neighborhoods, the space looks more and more linear. That local chunking allows LLE to reconstruct a nonlinear manifold using linear methods.
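To see that local-linearity idea in code, here is a sketch that unrolls a classic nonlinear manifold, the swiss roll, into two dimensions (scikit-learn and its toy swiss-roll generator are assumptions for illustration):

```python
# Sketch of Locally Linear Embedding (LLE) on a swiss-roll toy manifold.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, color = make_swiss_roll(n_samples=1500, random_state=0)  # 3-D rolled-up sheet

# n_neighbors controls how "local" each linear patch is: every point is
# reconstructed from its nearest neighbors, and those linear weights are
# preserved when the points are placed in the low-dimensional embedding.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_unrolled = lle.fit_transform(X)

print(X_unrolled.shape)            # (1500, 2)
print(lle.reconstruction_error_)   # how well the local relationships were kept
```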
However, LLE is not perfect. It often fails to separate the data satisfactorily, because it is designed to maximize covariance even at the expense of structure. For instance, the covariance-maximizing arrangement could place many points at the center of the map, with a relatively small number of high-covariance points driving the objective. This can produce maps that appear squashed or folded around the origin, which makes it difficult to see the underlying structural separations in the data.
Other approaches, such as self-organizing maps (SOM) and autoencoders (AE), involve a conceptual shift from linear projection of points to a neural approach in which changes in weights and neuron biases drive the variation. This is attractive because, in contrast to hard partitioning algorithms, which can conceal uncertainty, neural techniques allow for smooth learning. In linear cases, these techniques essentially replicate PCA, but they also have the generalizability needed for nonlinear scenarios. One drawback of SOM (and other neural methods) is that they begin with random initializations, which means the clustering may vary between runs of the code. In addition, because of the number of weights and biases being estimated, these algorithms are more computationally intensive.
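As a rough sketch of the neural approach, the snippet below builds a tiny autoencoder (PyTorch and the toy data are assumptions made purely for illustration): an encoder squeezes the data through a low-dimensional bottleneck, a decoder tries to reconstruct the original input, and reconstruction error drives the updates to weights and biases.

```python
# Minimal autoencoder sketch in PyTorch (framework and data are assumptions).
import torch
from torch import nn

class Autoencoder(nn.Module):
    def __init__(self, n_features, n_latent=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(),
            nn.Linear(32, n_latent),           # the low-dimensional code
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, 32), nn.ReLU(),
            nn.Linear(32, n_features),         # map back to the original space
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

X = torch.randn(500, 10)                       # toy data: 500 points, 10 dimensions
model = Autoencoder(n_features=10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):                       # weights start random, so runs differ
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)                # reconstruction error drives learning
    loss.backward()
    optimizer.step()

embedding = model.encoder(X).detach()          # (500, 2) nonlinear projection
```

Note that without the ReLU nonlinearities this network would learn essentially the same subspace as PCA; the nonlinear activations are what let it follow a curved manifold.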
Another downside of these neural methods is that they can be difficult to interpret; even when they produce separation, it is not always clear how the machine identified that structure. In contrast, t-distributed Stochastic Neighbor Embedding (t-SNE) and its non-probabilistic counterpart, Uniform Manifold Approximation and Projection (UMAP), are nonlinear dimension reduction methods that output more intuitive representations of the data. These techniques use attractive and repulsive forces to group the data. They are sensitive to both local and global variation, resulting in visual representations that capture the differences and similarities in the data in a way that feels more human-readable. UMAP in particular excels both at creating meaningful clusters and at placing similar clusters close together.
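For a sense of how these methods are invoked in practice, here is a sketch that runs t-SNE and UMAP on the same small dataset (scikit-learn’s digits data is an assumed stand-in, and UMAP comes from the separate umap-learn package). The perplexity and n_neighbors parameters are the main knobs that balance local against global structure.

```python
# Sketch comparing t-SNE and UMAP embeddings of the same toy dataset.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import umap  # provided by the umap-learn package

X, y = load_digits(return_X_y=True)   # 1,797 handwritten digits, 64 dimensions each

# perplexity (t-SNE) and n_neighbors (UMAP) trade off local vs. global structure.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
X_umap = umap.UMAP(n_components=2, n_neighbors=15, random_state=0).fit_transform(X)

print(X_tsne.shape, X_umap.shape)     # both (1797, 2), ready to plot and compare
```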
There is no easy way to know, through preliminary data exploration alone, which of these approaches will work best. Of course, there are some rules of thumb: PCA is unlikely to be useful when the data is highly nonlinear, as is often the case in high-dimensional real-world settings; and with high-dimensional data in contexts where heavy computational loads are unacceptable, neural approaches may not be appropriate. But beyond that, it’s not easy to predict which approach will be the most useful.
However, testing multiple methods and comparing their outputs makes it clearer which method (or methods) is most appropriate for the given application. In some cases, as with LLE and AE, reconstruction error can help us quantify how well the dimension reduction has preserved the information in the higher-dimensional data. In addition, supervised methods can help us evaluate the predictive power of an unsupervised model. But the reality is that there is often no truly objective way to judge which method is “best.” In those cases, analysts should try several candidate methods on their test data and evaluate the relative strengths and robustness of each approach for the application at hand.
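As a closing sketch of those two evaluation ideas, the snippet below computes a reconstruction error for PCA (a method that can map points back into the original space) and runs a simple supervised check on the embedded coordinates; the dataset and classifier are assumptions chosen only for illustration.

```python
# Sketch of two evaluation strategies: reconstruction error and a supervised check.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)

# 1) Reconstruction error: map the 2-D points back to 64-D and measure the loss.
X_back = pca.inverse_transform(X_2d)
print("mean squared reconstruction error:", np.mean((X - X_back) ** 2))

# 2) Predictive check: do the 2-D coordinates still separate the classes?
scores = cross_val_score(KNeighborsClassifier(), X_2d, y, cv=5)
print("mean cross-validated accuracy on the embedding:", scores.mean())
```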