Saturday, September 15, 2007

Toward automated discovery in the biological sciences.

Publication: AI Magazine
Publication Date: 22-MAR-04
Format: Online - approximately 11151 words
Author: Buchanan, Bruce G.; Livingston, Gary R.

Article Excerpt
The biological sciences are rich with observational and experimental data characterized by symbolic descriptions of organisms and processes and their parts as well as numeric data from high-throughput experiments. The complexity of the data and the underlying mechanisms argue for providing computer assistance to biologists. Initially, computational methods for investigations of relationships in biological data were statistical (Sokal and Rohlf 1969). However, when the DENDRAL project demonstrated that AI methods could be used successfully for hypothesis formation in chemistry (Buchanan and Feigenbaum 1978; Buchanan, Sutherland, and Feigenbaum 1969), it was natural to ask whether AI methods would also be successful in the biological sciences. (1)

Data in the biological sciences have been growing dramatically, and much of the computational effort has gone into organizing flexible, open-ended databases that make the data available to scientists. After the initial demonstrations of the power of applying machine learning to biological databases (Harris, Hunter, and States 1992; Qian and Sejnowski 1988), such applications have multiplied, and it is now possible to carry out large-scale machine learning and data mining from biological databases. Catalysts for this research were the Intelligent Systems in Molecular Biology conferences, the first of which was held in 1993. These conferences brought together people from diverse groups, all with the realization that biological problems were large and important and that there was a need for heuristic methods able to reason with symbolic information.

Toward Automated Discovery

The end point of scientific discovery is a concept or hypothesis that is interesting and new (Buchanan 1966). Insofar as there is a distinction at all between discovery and hypothesis formation, discovery is often described as more opportunistic search in a less well-defined space, leading to a psychological element of surprise. The earliest demonstration of self-directed, opportunistic discovery was Doug Lenat's program AM (Lenat 1982), which showed that AI methods could make discoveries in a formal domain characterized by axioms (set theory) or rules (games). AM used an agenda-based framework and heuristics to evaluate existing concepts and then create new concepts from them. It continued creating and examining concepts until the "interestingness" of operating on new or existing concepts (determined using some of AM's heuristics) dropped below a threshold. Although some generalization and follow-up research with AM was performed (Lenat 1983), this research was limited to discovery in axiomatic domains (Haase 1990; Shen 1990; Sims 1987).

Our long-range goal is to develop an autonomous discovery system for discovery in empirical domains, namely, a program that peruses large collections of data to find hypotheses that are interesting enough to warrant the expenditure of laboratory resources and subsequent publication. Even longer range, we envision a scientific discovery system to be the generator of plausible hypotheses for a completely automated science laboratory in which the hypotheses can be verified experimentally by a robot that plans and executes new experiments, interprets their results, and maintains careful laboratory records with the new data.

Currently, machine learning and knowledge discovery systems require manual intervention to adjust one or more parameters, inspect hypotheses to identify interesting ones, and plan and execute new experiments. The more autonomous a discovery system becomes, the more it can save time, eliminate human error, follow multiple discovery strategies, and examine orders-of-magnitude more hypotheses in the search for interesting discoveries (Zytkow 1993).

AI research on experimental planning systems has produced numerous successful techniques that can be used in an automated laboratory. For example, Dan Hennessy has developed an experiment planner for the protein crystallization problem discussed later that uses a combination of Bayesian and case-based reasoning (Hennessy et al. 2000). Because the number of possibly interesting discoveries to be made in any large collection of data is open ended, a program needs strong heuristics to guide the selection of lines of investigation.

No published system completely combines all phases of the empirical discovery process, although planning systems for knowledge discovery in databases (KDD), such as the framework presented in Engels (1996), perform sequences of tasks for a discovery goal provided by a user. Similarly, multistrategy systems such as that developed by Klosgen (1996) perform multiple discovery operations, but again, the discovery goals are provided by a user, as is evaluation of the discovered patterns. The research presented here describes and evaluates an agenda- and justification-based framework for autonomous discovery, coupled with heuristics for deciding which of many tasks are most likely to lead to interesting discoveries.

A Framework for Discovery

It is essential that a discovery program be able to reason about its priorities because there are many lines of investigation it could pursue at any time and many considerations in selecting one. Keeping an explicit agenda allows examination of the open tasks, and keeping explicit reasons why each task is interesting allows comparison of relative levels of interest. We use an agenda- and justification-based framework, similar to that of the AM and EURISKO programs (Lenat 1983, 1982): it consists of an agenda of tasks prioritized by their plausibility. As in AM, a task on the agenda can be a call to a hypothesis generator to produce more hypotheses or to explore some of the properties of hypotheses (or objects mentioned in them) already formed. Items are the objects or hypotheses (and sets of these) examined by the discovery program, and a task is an operation on zero or more items. For example, one task might be to find patterns (using an induction engine) in a subset of the data that has an interesting common property, such as being counterexamples to a well-supported rule. Although Lenat's programs discovered interesting conjectures in axiomatic domains such as set theory and games, those programs also contained general, domain-independent heuristics of the same sort used in empirical domains.
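As a rough illustration of this bookkeeping, the sketch below shows one way the agenda, tasks, items, and reasons might be represented in Python. All names and fields are hypothetical choices for this sketch, not taken from the article's actual implementation.

from dataclasses import dataclass, field

@dataclass
class Item:
    """Something the program reasons about: an attribute, a case, a rule, or a set of these."""
    name: str
    interestingness: float = 0.0   # running estimate, revised as tasks examine the item

@dataclass
class Task:
    """An operation on zero or more items, kept on the agenda with its justifications."""
    operation: str                               # e.g. "find-patterns-in-exceptions"
    items: list = field(default_factory=list)    # the Items this task operates on
    reasons: dict = field(default_factory=dict)  # reason text -> strength

# The agenda is simply the pool of pending tasks, re-prioritized on every cycle.
agenda: list[Task] = []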

To evaluate our framework, we developed the prototype discovery program HAMB (Livingston 2001), which finds interesting, new relationships in collections of empirical data. (2) A key feature of HAMB is its domain-independent heuristics, which guide the program toward potentially interesting relationships in the data. HAMB's primary generator of plausible hypotheses is an inductive generalization program that finds patterns in the data; in our case, it is the rule-induction program RL (Provost and Buchanan 1995). RL looks for general rules in a collection of data, where each rule is a conditional sentence of the form

IF f_1 and f_2 and ... and f_n THEN class = K (with CF = c)

Each feature (f) relates an attribute (a variable) of a case to a named value, and a degree of certainty (CF) is attached to each rule as a measure of evidential support in the data; for example:

IF SEX = male and AGE

The conditional rule, which is easily understood by anyone who knows the meanings of the variable names, thus says that if a case matches all the antecedent conditions, then it is likely to be a member of the named class (K). Thus, the items in HAMB's ontology are attributes, cases, rule conjuncts, and rules, plus sets of these. The cases and the attributes used to describe them are taken directly from the database.
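To make the rule format concrete, here is a minimal sketch of how such a conditional rule might be represented and matched against a case. The attribute names echo the (truncated) example above, but the threshold, class label, and CF value are invented for illustration and are not taken from the article.

from dataclasses import dataclass

@dataclass
class Condition:
    """A single feature f_i of the form: attribute OP value."""
    attribute: str
    op: str            # "=", "<", or ">"
    value: object

    def matches(self, case: dict) -> bool:
        v = case.get(self.attribute)
        if v is None:
            return False
        if self.op == "=":
            return v == self.value
        if self.op == "<":
            return v < self.value
        if self.op == ">":
            return v > self.value
        raise ValueError(f"unknown operator {self.op!r}")

@dataclass
class Rule:
    """IF f_1 and ... and f_n THEN class = K (with CF = c)."""
    conditions: list       # conjunction of Condition objects
    predicted_class: str   # K
    cf: float              # certainty factor: evidential support in the data

    def covers(self, case: dict) -> bool:
        return all(c.matches(case) for c in self.conditions)

# Hypothetical rule in the spirit of the example above (threshold, class, and CF invented):
rule = Rule([Condition("SEX", "=", "male"), Condition("AGE", "<", 30)],
            predicted_class="K", cf=0.8)
print(rule.covers({"SEX": "male", "AGE": 25}))    # True
print(rule.covers({"SEX": "female", "AGE": 25}))  # False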

On each cycle, heuristics can create tasks that result in new items or hypotheses, or tasks that examine some of the properties of those items or hypotheses. Each task must have accompanying textual justifications for performing it, called reasons: qualitative descriptions of why a task might be worth performing (for example, sets of exceptions to general rules are likely to be interesting). Each reason must have an assigned strength, a relative measure of the reason's merit.

A task's plausibility is an estimate of the likelihood that performing the task will lead to interesting discoveries. It is calculated as the product of the sum of the strengths of the reasons assigned to the task and the sum of the interestingness of the items involved in the task, as given by the following equation:

Plausibility(T) = (Σ R_T) × (Σ Interestingness(I_T))

where T is a task, Σ R_T is the sum of the strengths of T's reasons, and Σ Interestingness(I_T) is the sum of the estimated interestingness of T's items.
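In code, the calculation is simply a product of two sums. The sketch below uses invented strengths and interestingness values to show how an extra supporting reason raises a task's plausibility.

def plausibility(reason_strengths, item_interestingness):
    """Plausibility(T) = (sum of reason strengths) * (sum of item interestingness)."""
    return sum(reason_strengths) * sum(item_interestingness)

# The same task proposed twice over the same items; the second proposal carries one more reason.
print(plausibility([0.5], [0.25, 0.75]))        # 0.5
print(plausibility([0.5, 0.25], [0.25, 0.75]))  # 0.75 -- a new supporting reason increases plausibility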

Tasks are performed using heuristics and, when executed, create new items for further exploration and place new tasks on the agenda. When proposing a new task, a heuristic must also provide reasons and corresponding strengths for performing the task. If new reasons are given for performing a task already on the agenda, then they are attached to the existing task, increasing its plausibility. Therefore, the framework provides three additional properties that Lenat (1982) identified as desirable when selecting the next task to perform:

First, the plausibility of a task increases monotonically with the strength of its reasons. Therefore, all else being equal, a task with two reasons will have greater plausibility than a task with only one of those reasons. If a new supporting reason is found, the task's value is increased. (3) The stronger that new reason, the larger the increase.

Second, if a task is reproposed for the same reason(s), its plausibility is not increased.

Third, the plausibility of a task involving an item C should increase monotonically with the estimated interestingness of C. Two similar tasks dealing with two different concepts, each supported by the same list of reasons and strengths of reasons, should be ordered by the interestingness of those two concepts.

Thus, the top-level control of the framework is a simple loop: (1) calculate the plausibilities of the tasks; (2) select the task with the greatest plausibility; and (3) perform the task, possibly resulting in the creation or examination of items, the evaluation of relationships between items, and the proposal of new tasks. At the end of each iteration of this loop (called a discovery cycle), a stopping condition is checked to determine whether further exploration is warranted. In our prototype program, HAMB, the stopping condition is that either the plausibility of every task on the agenda falls below a user-specified threshold (that is, no task is interesting enough) or the number of completed discovery cycles exceeds a user-defined limit. In cases of repeated consideration of the same task, the system detects the possible deadlock...
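Putting these pieces together, one plausible rendering of the discovery cycle is sketched below. The task representation (a plain dict with reasons, item interestingness values, and a callable that performs the operation), the default thresholds, and the use of the callable's identity to recognize "the same task" are all simplifying assumptions for this sketch, not HAMB's actual interface.

def plausibility(task):
    # product of summed reason strengths and summed item interestingness (as above)
    return sum(task["reasons"].values()) * sum(task["item_interest"])

def run_discovery(agenda, max_cycles=100, min_plausibility=0.05):
    """One discovery cycle per iteration: score tasks, pick the best, perform it, check stopping."""
    for _ in range(max_cycles):                    # user-defined limit on discovery cycles
        if not agenda:
            break
        best = max(agenda, key=plausibility)       # (1)-(2) choose the most plausible task
        if plausibility(best) < min_plausibility:  # stopping condition: nothing interesting enough
            break
        agenda.remove(best)
        for proposed in best["perform"]():         # (3) perform; may examine items and propose tasks
            # Re-proposing a task with the same reasons adds nothing; genuinely new
            # reasons are attached to the existing task, raising its plausibility.
            existing = next((t for t in agenda if t["perform"] is proposed["perform"]), None)
            if existing is not None:
                existing["reasons"].update(proposed["reasons"])
            else:
                agenda.append(proposed)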

