Unsupervised learning

What is unsupervised learning and how is it implemented? Learn more about it here.

Definition

Unsupervised learning is an ML method in which the algorithm is trained on unlabeled data, i.e. no predefined correct answers are provided. The system must discover structures, patterns, or correlations in the input data on its own. In contrast to supervised learning, there is no direct feedback of right or wrong from a trainer. Unsupervised learning often aims to find hidden clusters, groupings, or data distributions, or to reduce the dimensionality of the data (for visualization or preprocessing).

Methods and examples

Typical unsupervised learning tasks include:

  • Clustering: The algorithm groups data points that are similar to each other. Example: a marketing analysis has customer data (age, purchasing behavior, etc.) without labels. Clustering might reveal that there are, say, three main customer groups (e.g. "young bargain hunters", "family-oriented buyers", "luxury segment"). Methods: k-means, hierarchical clustering, etc.

  • Dimension reduction: Much data has very high dimensionality (e.g. pixel-by-pixel image data). Unsupervised techniques such as PCA (Principal Component Analysis) or t-SNE attempt to distil out the essential variables or project the data into fewer dimensions so that as much information as possible is retained. This helps with visualization or noise removal.

  • Association analysis: Finding rules for which characteristics frequently occur together. Well-known example: Shopping basket analyses in retail ("Customers who bought product A also bought B").

  • Generative models: Models such as autoencoders or GANs learn the distribution of the input data in order to generate new, similar data or to detect anomalies (e.g. an autoencoder learns a compressed representation of the data; outliers that cannot be reconstructed well from it stand out).
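
The clustering idea from the list above can be sketched in a few lines. This is a minimal, illustrative implementation of Lloyd's k-means on synthetic two-dimensional "customer" data (the three generated groups and all variable names are assumptions for the example, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three synthetic "customer" groups in 2-D (e.g. age, spend) -- no labels are used.
data = np.vstack([
    rng.normal(loc=[25, 20], scale=2.0, size=(50, 2)),
    rng.normal(loc=[45, 60], scale=2.0, size=(50, 2)),
    rng.normal(loc=[65, 95], scale=2.0, size=(50, 2)),
])

def kmeans(points, k, n_iter=50, seed=0):
    """Minimal Lloyd's k-means: alternate assignment and centroid update."""
    r = np.random.default_rng(seed)
    # Initialize centroids with k randomly chosen data points.
    centroids = points[r.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

labels, centroids = kmeans(data, k=3)
print(np.bincount(labels))  # cluster sizes found without any labels
```

In practice one would use a library implementation (e.g. scikit-learn's `KMeans`, which adds smarter initialization and convergence checks), but the two alternating steps shown here are the whole algorithm.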
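Dimension reduction via PCA can likewise be sketched directly from its definition: center the data and take the top singular vectors. The 5-dimensional data with 2 hidden directions below is a constructed example, not from the original text:

```python
import numpy as np

rng = np.random.default_rng(1)

# 100 samples in 5 dimensions, but almost all variance lies along 2 hidden directions.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
data = latent @ mixing + 0.05 * rng.normal(size=(100, 5))

def pca(points, n_components):
    """PCA via SVD of the centered data matrix."""
    centered = points - points.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()  # variance share per component
    # Project onto the leading principal directions.
    return centered @ vt[:n_components].T, explained

projected, explained = pca(data, n_components=2)
print(projected.shape)          # the 5-D data mapped to 2-D
print(explained[:2].sum())      # share of variance kept by the first 2 components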
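The shopping-basket example of association analysis boils down to counting item co-occurrences. A tiny sketch with hypothetical baskets (the items and baskets are invented for illustration):

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (one item set per transaction).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"bread", "milk"},
    {"beer", "chips", "bread"},
]

# Count how often each unordered item pair occurs together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support: share of all baskets containing the pair.
support = {p: c / len(baskets) for p, c in pair_counts.items()}
# Confidence of the rule "bread -> butter": fraction of bread baskets that also contain butter.
count_bread = sum("bread" in b for b in baskets)
conf_bread_butter = pair_counts[("bread", "butter")] / count_bread
print(conf_bread_butter)  # 0.5: half of the bread baskets also contain butter
```

Real association-rule miners (e.g. the Apriori algorithm) prune this search so it scales to millions of transactions, but support and confidence are computed exactly as above.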

Advantages and challenges

Unsupervised learning is particularly useful when there is little or no labeled data, which is often the situation in practice, since creating labels can be expensive. It can reveal hidden structures that no one had thought to look for. However, evaluation is often more difficult: since there is no clear target, a human has to interpret the discovered structures and judge whether they are meaningful. Two analysts may interpret the same cluster solution differently. An algorithm may also find trivial patterns, or patterns that are irrelevant to the question at hand, so human expertise is important for validating the results.

Use cases

  • Data exploration: Unsupervised learning is used in unknown data sets to get a feel for possible groupings or relationships (e.g. segmentation of user behavior on a website without predefined categories).

  • Anomaly detection: Here, a model implicitly learns what is "normal" (without "normal" being labeled). Anything that deviates significantly from this is marked as an anomaly. For example, in network security (recognizing unusual access patterns) or in quality inspection (recognizing rejects based on unusual sensor profiles).

  • Feature learning: In unsupervised pre-training, neural networks first learn patterns from unlabeled data (e.g. as autoencoders), and the learned representations are then used for supervised fine-tuning. In this way, large amounts of unlabeled data give the model a basic structure (pretraining).

  • Bioinformatics: Clustering of gene expression data to discover new cell types or disease groups without knowing in advance how many groups or which ones are present.
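
The anomaly-detection use case above can be illustrated with the simplest possible "model of normal": estimate mean and spread from the data itself and flag readings far outside it. The sensor values and the 4-standard-deviation threshold are assumptions chosen for the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

# "Normal" sensor readings plus two injected outliers -- no labels are used.
normal = rng.normal(loc=10.0, scale=1.0, size=200)
readings = np.concatenate([normal, [25.0, -4.0]])

# Learn what "normal" looks like from the data itself (mean and spread),
# then flag readings more than 4 standard deviations away from the mean.
mean, std = readings.mean(), readings.std()
z_scores = np.abs(readings - mean) / std
anomalies = np.flatnonzero(z_scores > 4.0)
print(anomalies)  # indices of the flagged readings
```

Production systems use richer models of "normal" (density estimates, isolation forests, autoencoder reconstruction error), but the principle is the same: deviation from an unsupervised model of the data marks the anomaly.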

Overall, unsupervised learning is a kind of journey of discovery through the data: it can bring surprises to light, but the discovery process is harder to evaluate automatically than in supervised learning. The two types of learning complement each other, and unsupervised steps are often integrated into supervised pipelines (e.g. first finding clusters, then labeling them with expert knowledge, and then training a supervised model).