Visualizing High Dimensional Data - PCA, t-SNE and UMAP

Much of the data we deal with lives naturally in a high-dimensional space. As humans in a 3-dimensional world, we have difficulty visualizing such data. Effective visualization often helps us gain insights into the data we are working with. To do so, we need tools that reduce the number of dimensions to 1, 2 or 3. Fortunately, many such tools are already implemented in popular data science packages like scikit-learn, and visualizing the data is often as easy as calling fit_transform(data).

# Packages we use for plotting and data handling

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

In subsequent code snippets, data is a NumPy array or a slice of a pandas DataFrame, df[feature_cols], where each row is a data point and each column is a feature dimension.
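For concreteness, here is one minimal way to set up data and df. The choice of scikit-learn's digits dataset is just an assumption for illustration; any numeric feature matrix would do.

from sklearn.datasets import load_digits

digits = load_digits()
feature_cols = [f'pixel_{i}' for i in range(digits.data.shape[1])]
df = pd.DataFrame(digits.data, columns=feature_cols)
df['y'] = digits.target  # class labels, used as the `hue` in the plots below
data = df[feature_cols].values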

A colab notebook for this post is available here.

PCA: Principal Component Analysis

PCA finds the direction along which the most variance is observed and sets it as the first component. It then finds the direction with the largest remaining variance, orthogonal to the first, sets it as the next component, and repeats until the desired number of components is obtained. We can often stop well below the original number of dimensions while still capturing the majority of the variance in the data.
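To make the procedure concrete, here is a minimal sketch of PCA via the singular value decomposition (pca_via_svd is a hypothetical helper for illustration; in practice scikit-learn's PCA handles this, and more, for you):

import numpy as np

def pca_via_svd(X, n_components):
    """Toy PCA: center the data, then project onto the top singular vectors."""
    X_centered = X - X.mean(axis=0)  # directions are found on centered data
    # Rows of Vt are the principal directions, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = S**2 / (len(X) - 1)  # variance along each direction
    return X_centered @ Vt[:n_components].T, explained_variance[:n_components]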

As a visualization method, PCA works well when the data is already linearly separable. However, it may not be as useful if the data lies on a lower-dimensional manifold embedded in a high-dimensional space. It is also relatively cheap to compute, making it a good first thing to try.

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_result = pca.fit_transform(data)  # project onto the top two components
df['pca_0'] = pca_result[:, 0]
df['pca_1'] = pca_result[:, 1]
# Fraction of the total variance captured by each component
print(f'Explained var: {pca.explained_variance_ratio_}')

plt.figure(figsize=(16,10))
sns.scatterplot(
    x='pca_0', y='pca_1',
    hue="y",
    palette=sns.color_palette("colorblind", 10),
    data=df,
    legend="full",
    alpha=0.3
)

t-SNE: t-distributed Stochastic Neighbor Embedding

Suppose our data is inherently low-dimensional but lives in a high-dimensional space; a rolled-up 2D sheet (the "swiss roll") or a tangled strand of string are common examples. PCA and other linear methods would not be effective visualizations here. t-SNE addresses this by converting distances between points into neighbor probabilities and finding a low-dimensional embedding that preserves those local neighborhoods.
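As an illustration, scikit-learn can generate such a swiss roll (the sample size and noise level here are arbitrary choices):

from sklearn.datasets import make_swiss_roll

# 3D points lying on a rolled-up 2D sheet; t is the position along the roll
X_roll, t = make_swiss_roll(n_samples=1000, noise=0.05, random_state=42)

A linear projection cannot unroll this sheet, but a method that only tries to keep nearby points nearby can.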

from sklearn.manifold import TSNE

# perplexity roughly sets the effective number of neighbors per point
tsne = TSNE(n_components=2, verbose=1, perplexity=50, n_iter=300)
tsne_result = tsne.fit_transform(data)
df['tsne_0'] = tsne_result[:, 0]
df['tsne_1'] = tsne_result[:, 1]

plt.figure(figsize=(16,10))
sns.scatterplot(
    x='tsne_0', y='tsne_1',
    hue="y",
    palette=sns.color_palette("colorblind", 10),
    data=df,
    legend="full",
    alpha=0.3
)

t-SNE, however, has several hyperparameters, and setting them poorly can lead to misreading the manifold. Here's a good interactive post on how each of these parameters matters and how to avoid common pitfalls when using t-SNE as a visualization technique: How to Use t-SNE Effectively (distill.pub)
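One practical sanity check, in the spirit of that post, is to run t-SNE at several perplexities and only trust structure that persists across them. A minimal sketch, assuming the data array from above and numeric class labels in df['y']:

# Structure that persists across perplexities is more likely to be real
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
for ax, perplexity in zip(axes, [5, 30, 50]):
    emb = TSNE(n_components=2, perplexity=perplexity).fit_transform(data)
    ax.scatter(emb[:, 0], emb[:, 1], c=df['y'], cmap='tab10', alpha=0.3)
    ax.set_title(f'perplexity={perplexity}')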

UMAP: Uniform Manifold Approximation and Projection

UMAP is a method that isn't included in scikit-learn; it lives in the separate umap-learn package (pip install umap-learn). Using it is almost exactly the same as using scikit-learn methods.

import umap

umap_reducer = umap.UMAP()
umap_result = umap_reducer.fit_transform(data)

df['umap_0'] = umap_result[:, 0]
df['umap_1'] = umap_result[:, 1]

plt.figure(figsize=(16,10))
sns.scatterplot(
    x='umap_0', y='umap_1',
    hue="y",
    palette=sns.color_palette("colorblind", 10),
    data=df,
    legend="full",
    alpha=0.3
)
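Like t-SNE, UMAP has hyperparameters worth varying, most notably n_neighbors (trading local versus global structure) and min_dist (how tightly points pack in the embedding). A sketch along the same lines as the t-SNE check above, with arbitrary values:

fig, axes = plt.subplots(1, 3, figsize=(16, 5))
for ax, n_neighbors in zip(axes, [5, 15, 50]):
    emb = umap.UMAP(n_neighbors=n_neighbors, min_dist=0.1).fit_transform(data)
    ax.scatter(emb[:, 0], emb[:, 1], c=df['y'], cmap='tab10', alpha=0.3)
    ax.set_title(f'n_neighbors={n_neighbors}')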

Other Methods

scikit-learn is an amazing package. It includes several other dimension reduction methods, such as Isomap, MDS and locally linear embedding in sklearn.manifold, all with a largely similar API.
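For instance, Isomap and MDS follow the same fit_transform pattern used throughout this post:

from sklearn.manifold import Isomap, MDS

# Same fit_transform pattern as PCA, t-SNE and UMAP above
isomap_result = Isomap(n_components=2).fit_transform(data)
mds_result = MDS(n_components=2).fit_transform(data)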


Changelog

  • 2021-06-29 Initial version
  • 2021-07-01 Clarity on code, intro and some additional points on PCA

TODO?

  • Add additional reading as references
  • Add some useful insights and use cases