In Proceedings of the Nineteenth Annual Conference of the Cognitive Science Society. Michael Shafto and Pat Langley (Eds.). Mahwah, New Jersey: Lawrence Erlbaum, 1997, pp. 161-166.

Recent Work in Computational Scientific Discovery

Lindley Darden (darden at umd.edu)
Committee on the History and Philosophy of Science, Department of Philosophy
University of Maryland, College Park, MD 20742 USA

Abstract

This paper reviews work in computational scientific discovery. After a brief discussion of its history, the focus will be on work since 1990. The second half of the paper discusses the author's use of three methods for studying reasoning strategies in scientific change: historical-philosophical vs. live-in-the-lab vs. computational, pointing out advantages and disadvantages of the computational method.

Introduction

There are a number of approaches to the study of reasoning in scientific discovery. In addition to computational approaches, work continues in cognitive science (e.g., Schunn & Dunbar, 1996), in laboratory studies (e.g., Darden & Cook, 1994; Dunbar, 1995) and in philosophy of science (e.g., Bechtel & Richardson, 1993; Darden, 1991; Kleiner, 1993; Nersessian, 1992; Nickles, 1994; Schaffner, 1993; Spirtes, Glymour & Scheines, 1993). Unfortunately, of the over 200 papers and abstracts submitted for the Philosophy of Science Association meeting in 1996, none were on the topic of reasoning in scientific discovery (Darden, Ed., 1996; 1997). Most philosophers of science do not view discovery as a central topic in the field, despite continuing work by those of us called "friends of discovery" (Nickles, Ed. 1980). It is encouraging that the Cognitive Science Society is sponsoring this Symposium on Scientific Discovery.

This paper will briefly review the history of computational scientific discovery that uses methods from artificial intelligence. (Non-cognitive, non-AI computational work is outside the scope of this paper.) The first part of the paper will concentrate on the work since 1990 (Shrager & Langley, Eds., 1990). The extensive reference list provides a guide for further reading. The second half of the paper will compare three methods used in my own work on reasoning strategies in scientific change. Finally, I will point out advantages and disadvantages of the computational approach from my perspective as a philosopher of science.

Pioneering Work

The study of computational scientific discovery emerged from the view that science is a problem solving activity, that heuristics for problem solving can be applied to the study of scientific discovery in either historical or contemporary cases, and that methods in artificial intelligence provide techniques for building computational systems. Pioneers in this work are Bruce Buchanan (e.g., 1982) and Herbert Simon (e.g., 1977). Buchanan was trained as a philosopher of science at a time when the profession was dominated by Popper's (1965) view that there is no logic of discovery. Buchanan stated the new research program:

"The traditional problem of finding an effective method for formulating true hypotheses that best explain phenomena has been transformed into finding heuristic methods that generate plausible explanations. The problem of giving rules for producing true scientific statements has been replaced by the problem of finding efficient heuristic rules for culling the reasonable candidates for an explanation from an appropriate set of possible candidates" [and finding methods for constructing the candidates] (Buchanan 1985, 110-111).

Discovery as heuristic search in a search space enabled AI methods to be applied to discovery tasks.

The first expert system, DENDRAL, was a scientific discovery system. It formed hypotheses about chemical compounds, given mass-spectrographic data (Lindsay, Buchanan, Feigenbaum, & Lederberg, 1980; 1993). This was followed by Meta-DENDRAL, which discovered new rules in mass-spectrographic analysis, so as to by-pass the problem of getting rules from experts (Buchanan & Feigenbaum, 1978). Although its original algorithm was a computational realization of Lederberg's systematic scan strategy (Lederberg, 1965), DENDRAL was built to carry out a contemporary, difficult scientific task rather than as a model of human cognition.

A more historical-cognitive approach was the aim of the work on BACON, which rediscovered various scientific laws by finding patterns in numerical data (Langley, Simon, Bradshaw & Zytkow, 1987). Simon's early work on finding patterns in sequences (Simon & Kotovsky, 1963) was extended in BACON to heuristic search for patterns in numerical data. The most creative of BACON's abilities was the decomposition of relational data to conjecture intrinsic properties in one or more of the objects engaging in the relations. This step went beyond curve-fitting and was based on the metaphysical assumption that an entity's relational properties are caused by its intrinsic properties. In addition to the data-driven tasks modeled in BACON, the group also investigated theory-driven discovery in STAHL. One wonders to what extent these programs model actual cognitive processes of historical scientists, as opposed to finding strategies which are sufficient to reproduce the historical results. As with most simulations, they provide "how possibly" accounts. Using studies of notebook evidence, the KEKADA system (Kulkarni & Simon, 1988) modeled reasoning patterns in some discoveries of the biochemist Hans Krebs and focused on responses to surprising experimental results, helping to dispel the mystery of serendipity in discovery.
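
The data-driven search for numerical regularities described above for BACON can be illustrated with a minimal sketch. It is not the BACON program itself, which is considerably richer (for example, it can postulate intrinsic properties). The sketch assumes only two heuristics: if two terms rise together, try their ratio; if one rises as the other falls, try their product; stop when a term is nearly constant. The function names, the 2% tolerance, and the rounded planetary values (distance D in astronomical units, period P in years) are choices made for this illustration.

```python
from itertools import combinations

def is_constant(values, tol=0.02):
    """True if every value lies within a relative tolerance of the mean."""
    mean = sum(values) / len(values)
    return all(abs(v - mean) <= tol * abs(mean) for v in values)

def trend(xs, ys):
    """+1 if ys rises strictly as xs rises, -1 if it falls strictly, else 0."""
    pairs = sorted(zip(xs, ys))
    diffs = [b[1] - a[1] for a, b in zip(pairs, pairs[1:])]
    if all(d > 0 for d in diffs):
        return 1
    if all(d < 0 for d in diffs):
        return -1
    return 0

def term_values(data, names, exps):
    """Evaluate the product of the base variables raised to the exponents in exps."""
    out = []
    for row in zip(*(data[n] for n in names)):
        x = 1.0
        for val, e in zip(row, exps):
            x *= val ** e
        out.append(x)
    return out

def find_invariant(data, max_rounds=5, tol=0.02):
    """Search for a near-constant product of powers of the observed variables."""
    names = list(data)
    # Each term is a tuple of exponents; start with the raw variables.
    terms = [tuple(int(i == j) for j in range(len(names))) for i in range(len(names))]
    seen = set(terms)
    for _ in range(max_rounds):
        new_terms = []
        for e1, e2 in combinations(terms, 2):
            t = trend(term_values(data, names, e1), term_values(data, names, e2))
            if t == 0:
                continue
            sign = -1 if t == 1 else 1  # ratio if they rise together, else product
            cand = tuple(a + sign * b for a, b in zip(e1, e2))
            if cand in seen or not any(cand):
                continue  # skip terms already tried and the trivial constant
            seen.update({cand, tuple(-c for c in cand)})
            if is_constant(term_values(data, names, cand), tol):
                return dict(zip(names, cand))
            new_terms.append(cand)
        if not new_terms:
            return None
        terms.extend(new_terms)
    return None

# Rounded values for Mercury, Venus, Earth, Mars, Jupiter.
planets = {
    "D": [0.387, 0.723, 1.000, 1.524, 5.203],
    "P": [0.241, 0.615, 1.000, 1.881, 11.862],
}
law = find_invariant(planets)
```

On this data the search passes through D/P and D^2/P before settling on exponents 3 for D and -2 for P, that is, D^3/P^2 is approximately constant. Like the programs discussed above, the sketch gives only a "how possibly" account: it shows that simple heuristics suffice to reach the regularity, not that any historical scientist reasoned this way.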

A seminal conference on computational methods for scientific discovery, whose proceedings were published in 1990 (Shrager & Langley, Eds.), is a useful source for the state of the field at that time.

Recent Work

Some of the pioneers in scientific discovery, e.g., Buchanan, Simon, and Zytkow, push ahead with their research programs. Others who contributed to the 1990 volume are still working on discovery. The American Association for Artificial Intelligence sponsored a Spring Symposium on Systematic Methods of Scientific Discovery in March, 1995. A special issue of Artificial Intelligence on computational discovery is about to appear, although fewer papers were received than the editors wished (Simon, Valdes-Perez & Sleeman, forthcoming). Data-mining in scientific databases is an active area of research, as are other computational approaches applied to individual sciences, e.g., intelligent systems in molecular biology. It is becoming more difficult to locate computational discovery work because much of it is published in scientific journals--a good sign that the methods are producing results of interest to practicing scientists.

Buchanan (e.g., Lee et al., 1996) continues work on rule induction applied to various scientific databases. Simon is studying the difficult problems of constructing diagrammatic representations (Larkin & Simon, 1987; Qin & Simon, 1995) and of modeling relations between diagrammatic and verbal reasoning (Tabachneck-Schijf, Leonardo, & Simon, 1996). Zytkow continues to work on various aspects of discovery, including analyzing the components needed for an autonomous discovery agent (e.g., Zytkow, 1995/96) and knowledge discovery in databases (e.g., Zytkow & Zembowicz, 1996).

Much of the current work in computational discovery is occurring within applications to particular sciences. According to Peter Karp, the whole field of bioinformatics is doing computational scientific discovery, but there is a gradient from computational discoveries that are not based on AI methods, to computational discoveries that are based on AI methods, to methods with a "cognitive flavor." Not much of the bioinformatics work falls into the last category. However, Karp et al. (1996) applied reasoning by analogy to predict metabolic pathways in the bacterium H. influenzae, based on the extensive knowledge base that he and Monica Riley, a bacterial geneticist, have developed for E. coli.

Larry Hunter, a frequent editor of publications in AI and molecular biology (e.g., Hunter 1993), recently informed me that there is a clear success in the application of AI technology to molecular biology: hidden Markov models (HMMs) for molecular sequence analysis. They are being applied to automatically build models of families of nucleotide and amino acid sequences. These models are useful as extremely sensitive classifiers of novel sequences, and also generate multiple sequence alignments of large numbers of sequences in a computationally efficient way. Tools based on this approach are now in wide use in the biological community. A review article is Eddy (1996). Also, AI-based qualitative reasoning technologies have produced several good applications in reasoning about metabolism. Perhaps somewhat surprising is that the work in intelligent systems in molecular biology, for the most part, does not employ discovery methods discussed at the Shrager and Langley (Eds. 1990) conference.

The extensive protein sequence database has provided a challenge for those seeking to find computational methods to predict how the linear amino acids will fold into the secondary and tertiary structures in proteins. The Human Genome Project, which is rapidly producing millions of bases of sequence information about both human and model organism genomes, presents a challenge for computational approaches. Good programs are needed for discovering genes, both coding regions and regulatory regions, in these linear sequences. Current programs are not good at finding introns, intervening sequences between the coding regions of genes. Since the genetic system has some means of detecting introns, one can expect computational systems to be able to discover the signal(s). Knowledge discovery in scientific databases (e.g., Fayyad, Haussler & Stolorz, 1996) promises to be an important area in coming years.

Raul Valdes-Perez's (1994) work in chemistry shows the power of computational systems in doing a systematic search of a hypothesis space, given certain constraints. MECHEM is able to find reaction pathways that chemists have missed.
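
The idea of a systematic, constraint-guided search of a hypothesis space can be sketched in a few lines. The following is not MECHEM's algorithm. It is a toy illustration that assumes a hypothetical list of candidate reaction steps, tries sets of steps in order of increasing size, and keeps every smallest set that converts the starting materials into the target products with no idle step. All species names and steps are invented for the example.

```python
from itertools import combinations

def closure(start, steps):
    """Fire every step whose reactants are available until nothing changes.

    Returns the set of species produced and whether every step fired.
    """
    available = set(start)
    fired = set()
    changed = True
    while changed:
        changed = False
        for i, (reactants, products) in enumerate(steps):
            if i not in fired and reactants <= available:
                available |= products
                fired.add(i)
                changed = True
    return available, len(fired) == len(steps)

def simplest_pathways(start, target, candidate_steps, max_steps=4):
    """Return all pathways of minimal size that yield the target species."""
    for k in range(1, max_steps + 1):
        found = []
        for subset in combinations(candidate_steps, k):
            available, all_fired = closure(start, subset)
            # Constraint: the target is produced and no step is superfluous.
            if all_fired and target <= available:
                found.append(subset)
        if found:
            return found  # stop at the simplest level that succeeds
    return []

def step(reactants, products):
    """Build a step from space-separated species names (hypothetical helper)."""
    return (frozenset(reactants.split()), frozenset(products.split()))

# Hypothetical candidate steps over invented species A..F.
candidates = [
    step("A B", "C"),
    step("C", "D"),
    step("A", "E"),
    step("E B", "D"),
    step("B", "F"),
]
pathways = simplest_pathways({"A", "B"}, {"D"}, candidates)
```

Here no single step yields D, and the exhaustive pass over two-step sets returns both the route through C and the route through E. A search that stopped at the first plausible pathway would report only one of them, which is the sense in which an exhaustive search can turn up alternatives that were overlooked.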

Buchanan's work on rule discovery in scientific databases and Valdes-Perez's work on systematically conjecturing chemical reaction pathways illustrate the power of designing AI systems that aim, not at realistically modeling human cognitive capacities, but at using computational methods to circumvent human limitations. Humans are not good at searching massive databases and manipulating sets of rules with many features to make predictions. Cognitive science research has shown that humans have a tendency to focus too rapidly on one hypothesis before doing a systematic search of a hypothesis space. Discovery programs that are more systematic and more thorough than humans are an aid to scientists.

Computational Discovery: Pros and Cons

My own work on reasoning in scientific change focuses on a cyclic process: discovery, assessment, revision. Given a good revision procedure, one's discovery methods can be weaker. Strategies for these processes include: strategies for producing new ideas, e.g., analogies, abstraction instantiation, interfield relations; strategies for theory assessment, e.g., prediction-testing, relations to theories in other fields; and strategies for anomaly resolution (Darden 1991, Ch. 15). After extensive historical study of the development of Mendelian genetics, I proposed hypothetical strategies of the three types. The historical evidence was inadequate to show that they are descriptive cognitive strategies actually used by geneticists. Instead, they are hypothetical strategies that could have been used in the historical development of the theory of the gene to produce the changes that did occur (Darden, 1991). One needs to show that these strategies are effective problem-solving strategies, instances of useful "compiled hindsight" (Darden, 1987), applicable to additional cases, worthy of being used by contemporary scientists or to build AI discovery systems.

I visited Joshua Lederberg's Laboratory for Molecular Genetics and Informatics and participated in episodes of anomaly resolution that exemplified some of the revision strategies I had proposed (Darden & Cook 1994). One difficulty with the live-in-the-lab approach is that little may happen while you are there; fortunately, I was able to observe some anomaly resolution strategies in use. Although I have attempted to implement some of the strategies in AI programs in order to demonstrate their efficacy (e.g., Darden & Rada, 1988; Kettler & Darden 1993; Darden, 1997), I have returned to historical-philosophical work, testing whether strategies from the Mendelian case apply to molecular biology (Darden, 1995).

Computational discovery work has advantages and disadvantages. Finding an adequate knowledge representation for a scientific case is difficult. Early work attempted to represent the relations between genes and chromosomes in part-whole hierarchies and to implement reasoning via inheritance and upward propagation of properties (Darden & Rada, 1988). A much more fruitful method for knowledge representation in genetics was the functional representation (Josephsons, Eds., 1994) for genetic processes (Darden 1997). Furthermore, when one is designing a computational system to rediscover a historical hypothesis, one must navigate between designing a system that trivially reproduces exactly what one is seeking versus designing a system that is unable to accomplish the task at all. Analogy systems often suffer these problems: either the analog is represented in such a way that the system easily finds it or there are so many analogs that the task becomes impossible (for attempts to navigate between these problems, see Kettler & Darden, 1993; Holyoak & Thagard, 1995).

An advantage of computational methods is the precision and completeness that are required to build a working system. The philosopher-historian may neglect aspects that the programmer must specify in detail if the system is to run. A computational approach forces one to reexamine aspects that may be otherwise neglected. However, this advantage is purchased at the price of much time and effort to implement even small parts of a historical case. Various aspects of human discovery, such as the use of pictorial models (e.g., the beads on a string model for genes on chromosomes), provide substantial difficulties when designing an implementation. On the plus side, once one has invested the effort in building a running system, then there is the fun of running experiments, doing "what-if" analyses, testing alternative strategies.

The approach in our TRANSGENE system (Darden, Moberg, Thadani & Josephson, 1992; Darden, 1997) was also used by Karp (1990) in his GENSIM and HYPGEN systems and points to a fruitful way to design a computational discovery system. A qualitative simulator of biological (or other) processes is built and used to make predictions. Data is supplied to test the predictions and another component of the system compares the prediction with data, detects anomalies, and uses diagnosis/redesign strategies to localize the fault in the simulator and redesign a module to remove the anomaly. Perhaps this architecture may be of use in building future AI systems or perhaps more traditional simulation models might be coupled with a revision system to do diagnosis/redesign for anomaly resolution and model improvement.
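
The predict, compare, diagnose, and redesign cycle just described can be sketched as a small loop. This is not the TRANSGENE, GENSIM, or HYPGEN implementation. It is a toy under several assumptions made only for illustration: the simulator is an ordered list of named modules, data are available for each stage's output, diagnosis picks the first stage whose prediction disagrees with the data, and redesign swaps in a replacement from a supplied list of alternatives. The two-stage "gene expression" model and its strings are invented.

```python
def simulate(modules, x):
    """Run the input through each named module, recording every stage's output."""
    trace = []
    for name, fn in modules:
        x = fn(x)
        trace.append((name, x))
    return trace

def find_anomaly(trace, observed):
    """Return the first stage whose prediction disagrees with the data, or None."""
    for name, predicted in trace:
        if name in observed and observed[name] != predicted:
            return name
    return None

def revise(modules, x, observed, alternatives, max_cycles=5):
    """Repeat predict/compare/diagnose/redesign until no anomaly remains."""
    modules = list(modules)
    for _ in range(max_cycles):
        faulty = find_anomaly(simulate(modules, x), observed)
        if faulty is None:
            return modules
        idx = [name for name, _ in modules].index(faulty)
        for candidate in alternatives.get(faulty, []):
            trial = modules[:idx] + [(faulty, candidate)] + modules[idx + 1:]
            # Accept a redesign that removes the anomaly at the faulty stage.
            if find_anomaly(simulate(trial, x), observed) != faulty:
                modules = trial
                break
        else:
            raise ValueError(f"no alternative resolves the anomaly at {faulty!r}")
    return modules

def remove_lowercase(rna):
    """Hypothetical processing step: drop segments written in lowercase."""
    return "".join(ch for ch in rna if not ch.islower())

# Toy model: the initial "process" module wrongly leaves the transcript unchanged.
model = [
    ("transcribe", lambda dna: dna.replace("T", "U")),
    ("process", lambda rna: rna),
]
dna = "ATGccTGA"
observed = {"transcribe": "AUGccUGA", "process": "AUGUGA"}
alternatives = {"process": [str.upper, remove_lowercase]}

revised = revise(model, dna, observed, alternatives)
final_output = simulate(revised, dna)[-1][1]
```

In this run the anomaly is localized to the "process" stage, the first alternative fails to remove it, and the second is adopted. Keeping the fault localization separate from the library of replacement modules is the design choice that lets the same loop be coupled to a different simulator.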

It will be exciting to see what computational scientific discovery produces in the coming years.