Machine Learning

Nonfiction | Book | Adult | Published in 1986

Plot Summary

Mitchell presents a comprehensive introduction to the field of machine learning, which he defines as the study of how to build computer programs that automatically improve with experience. The textbook draws on concepts from artificial intelligence, statistics, information theory, computational complexity, neurobiology, and other disciplines to survey the major algorithms, theoretical results, and applications that form the core of this interdisciplinary field.

Mitchell opens by establishing a formal definition of learning: A computer program learns from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T improves with experience E. He illustrates this through successful applications such as the SPHINX speech recognition system, the ALVINN autonomous vehicle, and the TD-GAMMON program, which achieved world-class backgammon play. He frames machine learning as fundamentally a search problem, in which algorithms search through a space of possible hypotheses to find the one that best fits observed data and prior knowledge.

To ground these abstractions, Mitchell walks through the design of a checkers-playing program, introducing core design choices: the task, the performance measure, the source of training experience, the target function and its representation, and the learning algorithm. He introduces the LMS (least mean squares) weight update rule, a form of stochastic gradient descent that iteratively adjusts weights to minimize squared error. The final system comprises four interacting modules: a Performance System that plays using the learned function, a Critic that extracts training examples from game traces, a Generalizer that fits the function to examples, and an Experiment Generator that proposes new games.

Mitchell then addresses concept learning, the problem of inferring a boolean-valued function from labeled examples. He defines a partial ordering over hypotheses, where one hypothesis is more general than another if every instance satisfying the second also satisfies the first, and shows how this ordering enables efficient search. FIND-S begins with the most specific hypothesis and generalizes minimally to cover each positive example but cannot detect inconsistent data or confirm uniqueness. The CANDIDATE-ELIMINATION algorithm maintains the full version space, the set of all hypotheses consistent with the training data, compactly represented by its most general boundary G and most specific boundary S. Mitchell proves the Version Space Representation Theorem, establishing that the version space equals the set of hypotheses between G and S in the partial ordering.

This leads to a fundamental analysis of inductive bias, the minimal set of assumptions that, together with training data, deductively justify a learner's classifications. An unbiased hypothesis space renders a learner unable to classify unseen instances, making bias-free learning futile. Mitchell distinguishes preference biases, which favor certain hypotheses, from restriction biases, which exclude certain hypotheses, arguing that preference biases are generally more desirable because they do not risk excluding the true target function.

Decision tree learning classifies instances by sorting them through a tree structure where each internal node tests an attribute, each branch corresponds to an attribute value, and each leaf assigns a classification. The ID3 algorithm grows trees top-down by selecting the attribute with the highest information gain, the expected reduction in entropy, a measure of impurity in a collection of examples, at each node. Mitchell addresses overfitting, formally defined as the situation where a hypothesis fits training data well but performs worse than an alternative on unseen data, and presents post-pruning strategies as remedies.

Artificial neural networks receive detailed treatment, beginning with perceptrons, threshold linear units that can represent basic logical functions like AND and OR but not XOR. The BACKPROPAGATION algorithm trains multilayer networks by propagating error terms backward from output units through hidden layers, enabled by the sigmoid activation function, a smooth S-shaped curve that is differentiable. Mitchell shows that sufficiently large networks can approximate any function to arbitrary accuracy and demonstrates how the 8x3x8 identity network independently discovers a binary encoding at its hidden layer. Overfitting strategies include weight decay, which penalizes large weights during training, and k-fold cross-validation, which repeatedly splits data into k training and testing folds to monitor generalization performance.

Mitchell presents statistical methods for evaluating hypotheses, including confidence intervals derived from Normal approximations and cross-validation procedures for comparing learning algorithms using the t distribution.

Bayesian learning serves as both a practical method and an analytical framework. Mitchell proves that under uniform priors and noise-free data, every consistent learner outputs a maximum a posteriori (MAP) hypothesis, and that maximum likelihood under Normally distributed noise minimizes the sum of squared errors. The naive Bayes classifier, which assumes conditional independence of attributes given the class, achieved 89% accuracy classifying articles into 20 newsgroups. Bayesian belief networks encode richer conditional independence structures as directed acyclic graphs. The EM algorithm handles learning with unobserved variables by alternating between estimating hidden values and maximizing expected log-likelihood.

Computational learning theory formalizes sample complexity. Mitchell defines PAC (Probably Approximately Correct) learnability, which requires a learner to output a low-error hypothesis with high probability in polynomial time. The Vapnik-Chervonenkis (VC) dimension, the size of the largest instance set that a hypothesis space can shatter by representing every possible classification, provides a complexity measure applicable to infinite spaces. The WEIGHTED-MAJORITY algorithm achieves bounded mistake counts by maintaining and adjusting weights for a pool of prediction algorithms.

Instance-based methods defer generalization until query time. The k-NEAREST NEIGHBOR algorithm classifies by majority vote among stored examples, with distance-weighted variants emphasizing closer neighbors. Locally weighted regression fits local approximations for each query. Radial basis function networks bridge instance-based and neural network approaches using localized Gaussian kernels. Case-based reasoning extends these ideas to rich symbolic representations.

Genetic algorithms maintain a population of hypotheses, selecting the fittest for reproduction via crossover and mutation operators. The Schema Theorem characterizes how patterns propagate through generations proportionally to their relative fitness. Genetic programming evolves computer programs represented as parse trees. The Baldwin effect shows that individual learning can accelerate evolutionary progress even without Lamarckian inheritance, the passing of traits acquired during an organism's lifetime.

Learning sets of first-order rules provides greater expressive power than propositional representations. The FOIL algorithm learns first-order Horn clauses, logical rules consisting of one conclusion and several conditions, using an information-gain measure to select among candidate literals. Mitchell frames induction as the inverse of deduction and presents inverse resolution as a constructive method for generating hypotheses. The PROGOL system combines inverse entailment, a technique that generates hypotheses by inverting what examples logically imply, with bounded general-to-specific search.

Explanation-based learning uses a domain theory, the background knowledge or rules supplied to the learner, and deductive reasoning to generalize from individual examples. The PROLOG-EBG algorithm explains each example using the domain theory, then generalizes by working backward through the proof to identify the most general conditions under which the explanation holds. This analytical approach can discover features not explicit in instance descriptions.

Methods for combining inductive and analytical learning address imperfect prior knowledge. The KBANN algorithm initializes a neural network equivalent to a domain theory and refines it using BACKPROPAGATION. TANGENTPROP incorporates prior knowledge as desired derivatives of the target function, reducing error on handwritten digit recognition. The EBNN algorithm extends TANGENTPROP by automatically extracting training derivatives from a neural-network-based domain theory. The FOCL algorithm augments search with theory-derived candidate specializations, selecting among all candidates by information gain.

The book concludes with reinforcement learning, where an agent learns to maximize cumulative discounted reward through environmental interaction. The Q learning algorithm iteratively updates estimates of the value of each state-action pair based on observed rewards and successor-state values. Mitchell proves convergence for deterministic Markov decision processes (MDPs), models where the next state and reward depend only on the current state and action. The TD(lambda) method generalizes Q learning by blending estimates from different lookahead distances. Using neural networks for function approximation enables scaling to large state spaces, as demonstrated by TD-GAMMON, though convergence guarantees no longer hold.

We’re just getting started

Add this title to our list of requested Study Guides!