Visualizing High Dimensional Data - PCA, t-SNE and UMAP
Much of the data we deal with lives naturally in a high-dimensional space. Being humans in a 3-dimensional world, we have difficulty visualizing such data. Effective visualization often helps us gain insight into the data we are dealing with, and to get there we need tools that reduce the number of dimensions to 1, 2 or 3. Fortunately, many such tools are already implemented in popular data science packages like scikit-learn, and visualizing the data is often as easy as a call to fit_transform(data).
# Packages we use for plotting
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
In the subsequent code snippets, data is an array or a slice of a pandas DataFrame, df[feature_cols], where each row is a data point and each column is a feature dimension.
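To make the snippets self-contained, here is one possible setup. It uses scikit-learn's digits dataset purely as a stand-in; any numeric feature matrix with a label column y works the same way.
from sklearn.datasets import load_digits

# Example data: 1797 8x8 digit images, flattened to 64 features each
digits = load_digits()
feature_cols = [f'pixel_{i}' for i in range(digits.data.shape[1])]
df = pd.DataFrame(digits.data, columns=feature_cols)
df['y'] = digits.target  # labels, used as the hue in the plots below
data = df[feature_cols]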
A Colab notebook for this post is available here.
PCA: Principal Component Analysis
PCA finds the direction along which the data shows the most variance and sets it as the first component. It then finds the direction with the largest remaining variance, orthogonal to the first, sets it as the next component, and repeats until the desired number of components is obtained. We can often stop well below the original number of dimensions while still capturing the majority of the variance in the data.
As a visualization method, PCA works well when the data is already linearly separable. It is less useful when the data lies on a lower-dimensional manifold embedded in a high-dimensional space. It is also relatively cheap to compute, which makes it a good first thing to try.
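A quick way to see how few components suffice is to look at the cumulative explained variance. A minimal sketch, assuming the data array above (the 90% threshold is an arbitrary choice for illustration):
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA with all components to inspect the variance spectrum
pca_full = PCA().fit(data)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
# Smallest number of components capturing at least 90% of the variance
n_components_90 = int(np.searchsorted(cumulative, 0.90)) + 1
print(f'{n_components_90} components capture 90% of the variance')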
from sklearn.decomposition import PCA

# Project the data onto its two highest-variance directions
pca = PCA(n_components=2)
pca_result = pca.fit_transform(data)
df['pca_0'] = pca_result[:, 0]
df['pca_1'] = pca_result[:, 1]
# Fraction of total variance captured by each component
print(f'Explained var: {pca.explained_variance_ratio_}')
plt.figure(figsize=(16,10))
sns.scatterplot(
x='pca_0', y='pca_1',
hue="y",
palette=sns.color_palette("colorblind", 10),
data=df,
legend="full",
alpha=0.3
)
t-SNE: t-distributed Stochastic Neighbor Embedding
Suppose our data is inherently low-dimensional but lives in a high-dimensional space; a rolled-up 2D sheet (the swiss roll) and a tangled strand of string are common examples. In such cases, PCA and other linear methods are not effective visualizations. t-SNE instead works to preserve local neighborhoods, keeping points that are close in the original space close in the low-dimensional embedding.
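scikit-learn even ships a generator for the swiss roll, handy as a toy dataset for comparing the methods in this post (parameters here are illustrative):
from sklearn.datasets import make_swiss_roll

# 3D points lying on a rolled-up 2D sheet; t is the position along the roll
X_roll, t = make_swiss_roll(n_samples=1000, noise=0.05)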
from sklearn.manifold import TSNE

# perplexity roughly sets the size of the neighborhood each point considers
# n_iter=300 is on the low end; increase it if the embedding has not settled
tsne = TSNE(n_components=2, verbose=1, perplexity=50, n_iter=300)
tsne_result = tsne.fit_transform(data)
df['tsne_0'] = tsne_result[:, 0]
df['tsne_1'] = tsne_result[:, 1]
plt.figure(figsize=(16,10))
sns.scatterplot(
x='tsne_0', y='tsne_1',
hue="y",
palette=sns.color_palette("colorblind", 10),
data=df,
legend="full",
alpha=0.3
)
t-SNE, however, has hyperparameters, and setting them poorly can lead to misreading the structure of the data. Here's a good interactive post on how each of these parameters matters and how to avoid common pitfalls when using t-SNE as a visualization technique: How to Use t-SNE Effectively (distill.pub).
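One practical takeaway from that post is to never trust a single setting: run the embedding at a few perplexities and compare. A minimal sketch (the perplexity values are illustrative, and df['y'] is assumed numeric as in the setup above):
# Compare embeddings across a few perplexity values
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, perplexity in zip(axes, [5, 30, 50]):
    result = TSNE(n_components=2, perplexity=perplexity).fit_transform(data)
    ax.scatter(result[:, 0], result[:, 1], c=df['y'], cmap='tab10', alpha=0.3)
    ax.set_title(f'perplexity={perplexity}')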
UMAP: Uniform Manifold Approximation and Projection
UMAP is not included in scikit-learn; it comes from the separate umap-learn package. Using it is almost exactly the same as using scikit-learn's methods.
import umap

umap_reducer = umap.UMAP()
umap_result = umap_reducer.fit_transform(data)
df['umap_0'] = umap_result[:, 0]
df['umap_1'] = umap_result[:, 1]
plt.figure(figsize=(16,10))
sns.scatterplot(
x='umap_0', y='umap_1',
hue="y",
palette=sns.color_palette("colorblind", 10),
data=df,
legend="full",
alpha=0.3
)
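Like t-SNE, UMAP has hyperparameters worth knowing about: n_neighbors trades off local versus global structure, and min_dist controls how tightly points are packed in the embedding. A quick sketch with the two key knobs made explicit (the values shown are UMAP's documented defaults):
# Same embedding as above, with the main hyperparameters spelled out
umap_reducer = umap.UMAP(n_neighbors=15, min_dist=0.1)
umap_result = umap_reducer.fit_transform(data)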
Other Methods
scikit-learn is an amazing package. It includes several other dimension reduction methods, such as Isomap, locally linear embedding and MDS, all with a largely similar API.
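For instance, Isomap follows the same fit_transform pattern (a minimal sketch, reusing the data array from above):
from sklearn.manifold import Isomap

# Same construct-then-fit_transform pattern as the methods above
isomap_result = Isomap(n_components=2).fit_transform(data)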
Changelog
- 2021-06-29 Initial version
- 2021-07-01 Clarity on code, intro and some additional points on PCA
TODO?
- Add additional reading as references
- Add some useful insights and use cases