Key words: Bayesian Network Learning, Training Sample Size, Overfitting, Underfitting
Bayesian networks have often been used to classify data sets for which a relatively large number of training sites is available (e.g., remotely sensed imagery). They have been shown to classify more accurately than Gaussian statistical methods (e.g., maximum-likelihood classifiers), especially when the number of training sites is large relative to the total number of classification sites and attributes. Their utility in solving complex classification problems where the proportion of available training samples is small has not been examined. To address this question, this paper examines the shape of the classification accuracy curve as the proportion of training samples decreases, and compares Bayesian network learning (BNL) performance to that achieved with traditional statistical methods (cluster and discriminant analysis). Several Bayesian networks were constructed to represent the problem of predicting dominant overstory and understory species at a particular field site based on a suite of environmental measurements. Data from the Oregon Woody Plant and Environment Database (OWPED) were used to test the ability of the networks to generalize when varying proportions of the total number of observations (n = 2254) were used as training data. The classification accuracy curves and measures of attribute value variability of each network were quantified and compared to determine whether there are common curve responses to variation in both the proportion of training sites available and attribute variability. Knowledge of the shape of this curve helps in assessing whether BNL is an appropriate technique for exploring a particular data set.
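
As an illustration of what a classification accuracy curve measures, the following minimal sketch trains a classifier on increasing proportions of a data set and scores it on the remaining observations. It rests on several assumptions that are not taken from this study: the data are synthetic (they are not OWPED records and only match its number of observations), the classifier is a naive Bayes model (the simplest Bayesian network structure, standing in for the networks described here), and the names `make_synthetic`, `train_naive_bayes`, `predict`, and `accuracy_curve` are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def make_synthetic(n, seed=0):
    """Synthetic stand-in data (an assumption, not OWPED): three binary
    attributes whose distribution depends on a two-valued class label."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.choice(["A", "B"])
        p_one = 0.8 if label == "A" else 0.3
        attrs = tuple(1 if rng.random() < p_one else 0 for _ in range(3))
        rows.append((attrs, label))
    return rows


def train_naive_bayes(rows):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    for attrs, label in rows:
        for i, value in enumerate(attrs):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def predict(model, attrs, n_values=2):
    """Return the class with the highest log posterior (Laplace smoothed)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(attrs):
            smoothed = (value_counts[(i, label)][value] + 1) / (count + n_values)
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy_curve(rows, proportions, seed=0):
    """Train on the first p of the shuffled rows, test on the remainder."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    curve = []
    for p in proportions:
        n_train = max(1, int(round(p * len(shuffled))))
        train, test = shuffled[:n_train], shuffled[n_train:]
        model = train_naive_bayes(train)
        correct = sum(predict(model, attrs) == label for attrs, label in test)
        curve.append((p, correct / len(test)))
    return curve


rows = make_synthetic(2254)  # same size as the data set, but synthetic values
curve = accuracy_curve(rows, [0.01, 0.05, 0.1, 0.25, 0.5])
for proportion, accuracy in curve:
    print(f"training proportion {proportion:.2f}: accuracy {accuracy:.3f}")
```

Plotting accuracy against training proportion gives the curve whose shape is of interest. Because the test set here is whatever remains after the training split, its size changes with the proportion; a fixed held-out test set is an equally reasonable design choice.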