P → Q    P
----------
    Q
They also knew the axiom P → Q (i.e., pin → pain) and they knew from observation that Q (pain) was true. They then induced P as a hypothesis, knowing that, if it were true, they could use the above rule to prove Q.
Suppose another scenario occurs: the same person gets a pain in the head when they wear hat X. They look under the hat for no apparent reason, and see it has a pin in it. They then induce the rule that hats with pins in them cause pain. In this case, the person has induced the P → Q part of Modus Ponens for the same reason as before: they want to prove that the pain is caused by the pin. Again, this was an unsound assumption - there could have been many other rules or facts about the hat from which the person could have used Modus Ponens to determine the cause of the pain.
In both cases, the person has inverted the Modus Ponens rule of inference so that, if they induce a hypothesis in this way, then they will have the peace of mind that, if the hypothesis is true, then they will be able to prove that the theorem statement follows (that they are getting pain). Given this, we see that Modus Ponens is not an ideal choice, because using Modus Ponens alone is not a complete inference method: there are theorems in first order logic which cannot be proved using Modus Ponens alone.
Luckily, we know an inference rule which is complete: resolution. Therefore, it makes sense to work out how to invert this rule. As with Modus Ponens, there are multiple ways to invert the rule, and we look at four possibilities. The first two induce a single hypothesis in such a way that, once the new hypothesis has been induced, a single resolution step is required to carry out the verification (proof). The second two are such that three new hypotheses are induced and it takes two resolution steps to carry out the verification.
- Absorption
If we have the clauses (q ← A) and (p ← A, B), then we can use the absorption rule as depicted below. Note that we write the given clauses above the line and the induced clauses below the line. Remember also that a capital letter represents a conjunction of literals, whereas a lower case letter represents a single literal.
q ← A    p ← A, B
-----------------
q ← A    p ← q, B
We see that a new hypothesis has been introduced (the one on the right below the line). This is similar to the one on the right above the line, but the conjunction A has been replaced by the single literal q. The reason we have written the first clause again is to highlight which clauses below the line are needed in a resolution step to prove the clauses we started with above the line. We can demonstrate that our known clauses follow from our hypothesised clause by carrying out this resolution step. Instead of writing it upside down to demonstrate the deduction, which might be confusing, we write a "v" diagram as follows:
q ← A        p ← q, B
      \      /
      p ← A, B
Here, the q literal has been resolved to leave the fact that we wanted our hypotheses to prove, as required. We say that absorption is a V-operator. As an exercise, translate this into CNF and convince yourself that resolution does indeed work here.
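The CNF exercise can be checked mechanically. Below is a minimal propositional sketch (assuming, for simplicity, that the conjunctions A and B are single literals a and b): in CNF, q ← a becomes the clause {q, ¬a} and p ← q, b becomes {p, ¬q, ¬b}, and resolving on q should recover p ← a, b. The representation and function names here are illustrative, not from any ILP system.

```python
# A clause is a frozenset of literals; a literal is a (name, polarity) pair.

def resolve(c1, c2, atom):
    """Resolve two clauses on the given atom, if possible."""
    pos, neg = (atom, True), (atom, False)
    if pos in c1 and neg in c2:
        return (c1 - {pos}) | (c2 - {neg})
    if neg in c1 and pos in c2:
        return (c1 - {neg}) | (c2 - {pos})
    return None

# q :- a   and   p :- q, b   (the two induced hypotheses from absorption)
q_from_a  = frozenset({("q", True), ("a", False)})
p_from_qb = frozenset({("p", True), ("q", False), ("b", False)})

# Resolving away q should yield p :- a, b, i.e. the clause {p, not-a, not-b}.
resolvent = resolve(q_from_a, p_from_qb, "q")
assert resolvent == frozenset({("p", True), ("a", False), ("b", False)})
```

This is exactly the bottom of the "v" diagram: the invented occurrence of q cancels out, leaving the clause we started with.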
As an example, suppose we performed the following piece of (inductive) reasoning. We know that anyone who lives in the White House and is a politician is the president. We also know that anybody who is a republican politician living in the White House is male, and we want to explain this. If we make the following substitution into the absorption rule:
p = is_male(X)
q = is_president(X)
A = lives_in_whitehouse(X) ∧ is_politician(X)
B = is_a_republican(X)
Then we get this induction:
is_president(X) ← lives_in_whitehouse(X), is_politician(X)    is_male(X) ← lives_in_whitehouse(X), is_politician(X), is_a_republican(X)
-----------------------------------------------------------------------------------------------------------------------------------
is_president(X) ← lives_in_whitehouse(X), is_politician(X)    is_male(X) ← is_president(X), is_a_republican(X)
This means that, in response to the question: "why are all republican politicians living in the White House male?", we have answered: "all republican presidents are male and any politician living in the White House is the president, therefore, all republican politicians living in the White House are male." Hence, we have used inductive inference to hypothesise that republican presidents are male. In this case, the induced rule is correct.
- Identification
If we have the clauses p ← A, B and p ← A, q, we could hypothesise that these are explained by the clause q ← B. That is, we could use the following way to invert resolution:
p ← A, B    p ← A, q
--------------------
q ← B    p ← A, q
and we again show that the starting clauses can be deduced from the hypothesis with a "v" diagram:
p ← A, q        q ← B
      \        /
      p ← A, B
In this case, the q literal has been resolved away. As with absorption, we say that identification is a V-operator. Using the same substitution as above, we get a slightly different story: suppose we know that, as before, anybody who is a republican politician living in the White House is male. Suppose further we know that politician presidents who live in the White House are male, and we want to explain this. Then we can use identification to induce this rule:
is_president(X) ← is_a_republican(X)
As not all republicans are president, this induced clause is false, even though it explains the observations.
- Intra-Construction
Suppose we start with two known clauses which have the same head, p, and bodies which both contain a subset of conjoined literals, A:
p ← A, B
p ← A, C
In this case, we can induce three additional hypotheses as follows:
p ← A, B    p ← A, C
--------------------------
q ← B    p ← A, q    q ← C
Note that we have introduced a new symbol: q, which was not previously defined. q is a new predicate, and hence this is a process called predicate invention. Few machine learning techniques actually introduce new concepts not defined as part of the background knowledge in such a general way. For instance, in an illustrative (toy) example in the ILP literature, a predicate for insertion in a list is discovered while the ILP system learns the concept of insertion sort.
To prove that the given clauses follow from the hypotheses, we can portray the two resolution steps required in a "w" diagram as follows:
q ← B    p ← A, q    q ← C
    \     /     \     /
   p ← A, B    p ← A, C
We see that in both resolution steps, the invented predicate q is resolved away. We say that Intra-construction is a W-operator.
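We can check the two resolution steps of the "w" diagram in the same propositional style as before (again treating A, B and C as single literals a, b and c; the clause representation is illustrative):

```python
# Each clause is a frozenset of (atom, polarity) literals in CNF.

def resolve(c1, c2, atom):
    """Resolve c1 (containing atom positively) with c2 (containing it negatively)."""
    pos, neg = (atom, True), (atom, False)
    if pos in c1 and neg in c2:
        return (c1 - {pos}) | (c2 - {neg})
    return None

q_from_b  = frozenset({("q", True), ("b", False)})               # q :- B
q_from_c  = frozenset({("q", True), ("c", False)})               # q :- C
p_from_aq = frozenset({("p", True), ("a", False), ("q", False)})  # p :- A, q

# Resolving the invented predicate q away in each branch of the "w"
# recovers both of the clauses we started with.
left  = resolve(q_from_b, p_from_aq, "q")
right = resolve(q_from_c, p_from_aq, "q")
assert left  == frozenset({("p", True), ("a", False), ("b", False)})  # p :- A, B
assert right == frozenset({("p", True), ("a", False), ("c", False)})  # p :- A, C
```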
- Inter-Construction
If we start with two clauses with different heads, but for which the bodies both contain a common subset of literals, A:
p ← A, B
q ← A, C
then we can induce three hypotheses as follows:
p ← A, B    q ← A, C
-----------------------------
p ← r, B    r ← A    q ← r, C
In this case, predicate invention has again occurred to introduce new predicate r.
As usual, to prove that the given clauses follow from the hypotheses, we can portray the two resolution steps required in the following "w" diagram:
p ← r, B    r ← A    q ← r, C
     \     /    \     /
    p ← A, B    q ← A, C
We see that, as before, the invented predicate has been resolved away in both steps. As with Intra-construction, Inter-construction is a W-operator.
14.3 Search Considerations
As with the previous two learning methods we've looked at - in particular artificial neural networks - within the overall framework, there is a huge amount of variation in the way in which the learning is carried out. The same is true for ILP - under the general context given in section 14.1, it seems that almost every possibility for generating a suitable logic program has been tried. We look at the tip of this iceberg here.
- Search Direction
In Muggleton and de Raedt's comprehensive study of ILP theory and methods (Journal of Logic Programming, 1994), they define induction to be the inverse of deduction and introduce the following definitions:
- A hypothesis G is more general than a hypothesis S if and only if G ⊨ S. S is also said to be more specific than G.
- A deductive inference rule, r, maps a conjunction of clauses G onto a conjunction of clauses S such that G ⊨ S. r is called a specialisation rule.
- An inductive inference rule, r, maps a conjunction of clauses S onto a conjunction of clauses G such that G ⊨ S. r is called a generalisation rule.
Under this formalism, the deductive rules of inference we have looked at in lecture 7 will be thought of as specialisation rules, and the inductive rules of inference derived from inverting resolution above are thought of as generalisation rules. One big contrast between different ILP systems is the way in which they search: either from the general to specific or from the specific to general.
One class of systems work from the specific to general, and start from the examples and background knowledge. That is, they start with the most specific hypotheses, which can be obtained by grounding variables and conjoining terms (see Progol example). They then repeatedly apply generalisation (inductive) rules so that they generate a hypothesis which entails more and more positive examples. Of course, they also take care to make sure the hypothesis does not imply negative examples. Another class of systems work from the general to the specific. These start with the empty clause as the hypothesis, which entails everything. They repeatedly apply specialisation (deductive) rules so that they generate a hypothesis which entails fewer and fewer negative examples. Of course, these systems take care to make sure the hypothesis continues to imply all the positive examples.
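As a rough illustration of the specific-to-general direction (not how any particular ILP system is implemented), the sketch below represents a hypothesis as a set of attribute conditions that an example must satisfy. Dropping a condition is a generalisation step, and a generalisation is kept only if it still explains no negative example:

```python
# Illustrative representation: an example is a set of attributes, and a
# hypothesis covers an example if all of its conditions appear in it.

def covers(hypothesis, example):
    return hypothesis <= example

def specific_to_general(seed_positive, negatives):
    """Start from a most specific hypothesis (the seed example itself) and
    repeatedly drop conditions while no negative example becomes covered."""
    hypothesis = set(seed_positive)
    improved = True
    while improved:
        improved = False
        for cond in sorted(hypothesis):
            candidate = hypothesis - {cond}  # generalisation step
            if not any(covers(candidate, n) for n in negatives):
                hypothesis = candidate
                improved = True
                break
    return hypothesis

positives = [{"scales", "four_legs", "eggs", "land"},
             {"scales", "four_legs", "eggs", "water"}]
negatives = [{"fur", "four_legs", "land"},
             {"feathers", "two_legs", "eggs", "air"}]

h = specific_to_general(positives[0], negatives)
assert h == {"scales"}                              # maximally generalised
assert all(covers(h, p) for p in positives)         # still covers the positives
assert not any(covers(h, n) for n in negatives)     # and no negatives
```

A general-to-specific system would run the mirror image: start from the empty set of conditions (which covers everything) and add conditions until no negative example is covered.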
- Pruning and Sorting
Because we can test whether each hypothesis explains (entails) a particular example, we can associate with each hypothesis the set of positive examples it explains, and similarly the set of negative examples. There is also an analogy with general and specific hypotheses as described above: if a hypothesis G is more general than hypothesis S, then the examples explained by S will be a subset of those explained by G.
We will assume the following generic search strategy for an ILP system: (i) a set of current hypotheses, QH, is maintained; (ii) at each step in the search, a hypothesis H is taken from QH and some inference rules are applied to it in order to generate some new hypotheses, which are then added to the set (we say that H has been expanded); (iii) this continues until a termination criterion is met.
This leaves many questions unanswered. Looking first at the question of which hypothesis to expand at a particular stage, ILP systems associate a label with each hypothesis generated which expresses a probability of the hypothesis holding given that the background knowledge and examples are true. Then, hypotheses with a higher probability are expanded before those with a lower probability, and hypotheses with zero probability are pruned from the set QH entirely. This probability calculation is derived using Bayesian mathematics and we do not go into the derivation here. However, we hint at two aspects of the calculation in the paragraphs below.
In specific to general ILP systems, the inference rules are inductive, so each operator takes a hypothesis and generalises it. As mentioned above, this means that the hypothesis generated will explain more examples than the original hypothesis. As the search gradually makes hypotheses more general, there will come a stage when a newly formed hypothesis H is general enough to explain a negative example, e-. This should therefore score zero for the probability calculation because it cannot possibly hold given the background and examples being true. Furthermore, because the operators only generalise, there is no way by which H can be fixed to not explain e-, so pruning it from QH because of the zero probability score is a good decision.
A similar situation occurs in general to specific ILP systems, where the inference rules are deductive, hence they specialise. At some stage, a hypothesis will become so specialised that it fails to explain all the positive examples. In this case, a similar pruning operation can be imposed because further specialisation will not rectify the situation. Note that in practice, to compensate for noisy data, there is more flexibility built into the systems. In particular, the posterior conditions which specify the problem can be relaxed, and hypotheses which explain small numbers of negative examples may not be immediately pruned.
We can see how the examples could be used to choose between two non-pruned hypotheses: if performing a specific to general search, then the number of positive examples explained by a hypothesis can be taken as a value to sort the hypotheses with (more positive examples explained being better). Similarly, if performing a general to specific search, then the number of negatives still explained by a hypothesis can be taken as a value to sort the hypotheses with (fewer negatives being better).
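This coverage-based sorting can be sketched as follows, representing each hypothesis as a set of attribute conditions that a covered example must contain (an illustrative representation, not any real system's):

```python
# Score each hypothesis by how many positive examples it explains and sort
# best-first (the specific-to-general direction described above).

def covers(hypothesis, example):
    return hypothesis <= example

def sort_hypotheses(hypotheses, positives):
    return sorted(hypotheses,
                  key=lambda h: sum(covers(h, p) for p in positives),
                  reverse=True)

positives = [{"scales", "eggs"}, {"scales", "land"}, {"feathers", "eggs"}]
hyps = [{"scales"}, {"eggs"}, {"scales", "eggs"}]

ranked = sort_hypotheses(hyps, positives)
# {"scales"} and {"eggs"} each cover two positives; {"scales", "eggs"} covers one,
# so it ranks last - exactly the kind of tie the text's Occam's razor would break.
assert ranked[-1] == {"scales", "eggs"}
```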
This may, however, be a very crude measure, because many hypotheses might score the same, especially if there is a small number of examples. When all things are equal, an ILP system may employ a sophisticated version of Occam's razor, and choose between two equal scoring hypotheses according to some function derived from Algorithmic Complexity theory or some similar theory.
- Language Restrictions
Another way to reduce the search space is to be more specific about how the predicates in the hypothesis Horn clauses will look. One approach to this is to limit the number of existentially quantified variables allowed in the learned clauses. Another approach is to more explicitly specify what the learned hypotheses will look like, in terms of restrictions on both their head and their body. In particular, for each predicate to appear in the body of hypotheses, in some ILP systems, it is possible to specify whether each argument in the predicate is to be a ground term, a new variable or a variable given in the head. Such fine-tuning of the background information can reduce search times dramatically.
14.4 An Example Session with Progol
Progol is a state of the art ILP system developed by Stephen Muggleton (a Professor here at Imperial). There are 5 versions of Progol available for download here: http://www.doc.ic.ac.uk/~shm/Software/. Note that each version performs differently, so the results for the same data set may differ between systems. One of the example files supplied is the "animals" toy dataset which we have used for illustrative purposes previously. Our purpose here is to look at this input file and look at the output produced by Progol, and highlight the most important parts of both.
The purpose of this machine learning exercise is to learn a set of rules for the taxonomic classification of animals into one of the following classes: mammal, fish, reptile and bird. The rules will be defined in terms of a set of attributes that are given as background concepts. These attributes include physical properties such as the number of legs (0,1,2,4) and the covering of the animal (fur, feathers, none, scales); and other attributes such as the habitat(s) the animal lives in (land, air, water) and whether the animal produces eggs, etc. 16 animals are supplied and 7 attributes to derive rules from are also given.
- The Input File
At the top of the input file, there are a set of mode declarations which are language restrictions as described above. The first mode declaration dictates what the head of the hypothesis logic program clauses will look like. The mode declaration is written thus:
:- modeh(1,class(+animal,#class))?
This stipulates that the head of the hypothesis clauses will have predicate name class and that the first argument will take in a given animal name (specified by the +) and the second argument will return a ground instance (specified by the #) of type class. Hence this means that the learned predicate will take in the name of an animal and return a class, as required.
The other mode declarations stipulate the format of the predicates in the body of the clauses. For instance, this mode declaration:
:- modeb(1,has_eggs(+animal))?
specifies that the predicate has_eggs/1 will be available for the body of the hypothesis clauses (although it isn't necessary to include it). Also, this mode declaration:
:- modeb(1,has_legs(+animal, #nat))?
stipulates that the predicate has_legs/2 can be put in the body of the hypothesis clauses. It will appear with an instantiated natural number for the second argument.
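The effect of such mode declarations can be sketched as a validity check on candidate body literals. This is only an illustration of the idea ('+' arguments must reuse a variable from the head, '#' arguments must be ground terms of the stated type), not Progol's actual mechanism, and the helper names are hypothetical:

```python
# Hypothetical encoding of the modeb declarations from the animals file.
modeb = {
    "has_eggs": ["+animal"],
    "has_legs": ["+animal", "#nat"],
}
head_vars = {"A"}             # variables bound by modeh(1, class(+animal, #class))
ground_nats = {"0", "1", "2", "4"}  # legal leg counts from the dataset

def literal_allowed(pred, args):
    """Check a candidate body literal against the mode declarations."""
    modes = modeb.get(pred)
    if modes is None or len(modes) != len(args):
        return False
    for mode, arg in zip(modes, args):
        if mode.startswith("+") and arg not in head_vars:
            return False                 # must reuse a head variable
        if mode == "#nat" and arg not in ground_nats:
            return False                 # must be a ground natural number
    return True

assert literal_allowed("has_legs", ("A", "4"))      # has_legs(A, 4) is fine
assert not literal_allowed("has_legs", ("A", "B"))  # second arg must be ground
assert not literal_allowed("has_milk", ("A",))      # no modeb declared here
```

Filtering candidate literals this way is what shrinks the hypothesis space so dramatically.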
Following the mode declarations, there is information about the types of ground variables in the background information:
animal(dog). animal(dolphin). ... animal(penguin).
class(mammal). class(fish). class(reptile). class(bird).
covering(hair). covering(none). covering(scales). covering(feathers).
habitat(land). habitat(water). habitat(air). habitat(caves).
Following the type information comes the background concepts, instantiated for the animals available:
has_covering(dog, hair). ... has_covering(crocodile, scales). etc.
has_legs(dog, 4). ... has_legs(penguin, 2). etc.
has_milk(dog). ... has_milk(platypus). etc.
homeothermic(dog). ... homeothermic(penguin). etc.
habitat(dog, land). ... habitat(penguin, water). etc.
has_eggs(platypus). ... has_eggs(eagle). etc.
has_gills(trout). ... has_gills(eel). etc.
Finally, the input file has the positive and negative examples:
% Positives:
class(lizard, reptile). class(trout, fish). class(bat, mammal). etc.
% Negatives:
:- class(trout, mammal). :- class(herring, mammal). :- class(platypus, reptile). etc.
- The Output File
In the output file, we get some indication of how the search progressed. First of all, Progol starts with the example class(lizard,reptile). It then uses the background predicate values for lizard to come up with this most specific hypothesis for an animal being a reptile:
class(A, reptile) :- has_covering(A, scales), has_legs(A, 4), has_eggs(A), habitat(A, land).
Progol then tries various generalisations of this hypothesis such as:
class(A, reptile) :- has_covering(A, scales).
class(A, reptile) :- has_eggs(A), has_legs(A, 4).
From the 12 generalisations found, it chooses this one as the best:
class(A, reptile) :- has_covering(A, scales), has_legs(A, 4).
The generation of the most specific hypothesis followed by the generalisation routine continues in order to find rules for the animals classed as mammal, fish and bird. At the end of the file, the chosen rules are summarised thus:
class(A, reptile) :- has_covering(A, scales), has_legs(A, 4).
class(A, mammal)  :- homeothermic(A), has_milk(A).
class(A, fish)    :- has_legs(A, 0), has_eggs(A).
class(A, reptile) :- has_covering(A, scales), habitat(A, land).
class(A, bird)    :- has_covering(A, feathers).
Note that the hypothesis is designed to be used in this order, so that there are, for instance, two tries at checking whether an animal is a reptile. Note that this hypothesis scores 100% for predictive accuracy over the training set. Furthermore, the rules are easily understood.
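To see how the rules are used in the stated order, here is a sketch that tries each rule top to bottom and returns the first class whose body is satisfied. The animal descriptions are hypothetical attribute dictionaries, not the actual Progol data:

```python
# Each rule pairs a class with a body test; rules are tried in the learned order.
rules = [
    ("reptile", lambda a: a.get("covering") == "scales" and a.get("legs") == 4),
    ("mammal",  lambda a: a.get("homeothermic") and a.get("has_milk")),
    ("fish",    lambda a: a.get("legs") == 0 and a.get("has_eggs")),
    ("reptile", lambda a: a.get("covering") == "scales" and a.get("habitat") == "land"),
    ("bird",    lambda a: a.get("covering") == "feathers"),
]

def classify(animal):
    for cls, body in rules:
        if body(animal):
            return cls   # first matching rule wins
    return None

crocodile = {"covering": "scales", "legs": 4, "has_eggs": True, "habitat": "land"}
trout     = {"covering": "scales", "legs": 0, "has_eggs": True, "habitat": "water"}
eagle     = {"covering": "feathers", "legs": 2, "has_eggs": True, "habitat": "air"}

assert classify(crocodile) == "reptile"  # caught by the first reptile rule
assert classify(trout) == "fish"
assert classify(eagle) == "bird"
```

Note how the order matters: a hypothetical legless, egg-laying, scaled land animal would match the fish rule before ever reaching the second reptile rule.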
14.5 Some Applications of ILP
As noted above, the output from ILP systems is quite easy to understand, because it is in the form of logic programs. This is a distinct advantage of ILP over other learning approaches. In particular, this makes ILP a good choice for scientific applications, where it is often more important that the hypotheses discovered by the system can be understood by the scientist user, than the predictive accuracy of the hypotheses. Below are some applications which give a flavour of the real world tasks to which ILP has been applied.
- Finite Element Mesh Design
Finite element methods are used by engineers to analyse stresses in physical structures. These methods depend on modelling the structures with finite meshes at particular resolutions. Considerable expertise is required to choose the correct resolution to avoid unnecessary computational overheads (too fine a mesh), but to reduce approximation errors (not fine enough). ILP systems were used to determine rules for the mesh resolution of edges in the structure in terms of certain properties of the structure being modelled. The Golem, LINUS, FOIL and CLAUDIEN systems were used and produced novel, understandable rules which achieved around 78% predictive accuracy.
- Predictive Toxicology
As lamented in previous lectures, drug companies lose millions of pounds by developing drugs which eventually turn out to be toxic to humans. In one of a number of experiments with example drugs split into toxic and non-toxic sets, Progol was used to determine why certain chemicals were mutagenic (a property related to carcinogenicity) and others were not. It produced rules which achieved 89% predictive accuracy over a set of 188 compounds.
[Figure: the chemical substructure Progol identified as a structural predictor]
In another data set of 42 chemicals chosen because they were "regression unfriendly", Progol produced a single rule with 88% predictive accuracy. Because of the understandability of its results, the rule could easily be interpreted as a new structural predictor, which is the substructure given above, where the ring represents an aromatic carbon ring. Note the generality of this result: atoms Y and Z can be any atoms.
- Protein Secondary Structure Prediction
The structure of a protein determines to some extent its function, and given that most biological processes are influenced by protein functions, the prediction of protein structure is a very important problem. The structures have macro properties containing many atoms which are known as secondary structures. These properties are shapes such as sheets, helices and hair-pins. The application of the Golem ILP system to this prediction task produced better results than any contemporary learning approach. Also, in experiments to qualitatively model the structure-activity relationship of a series of drugs, while Golem didn't produce better accuracy than linear regression, it did provide insight into the stereochemistry of the system on each occasion.
- Generating Program Invariants
Programs written in certain languages can be formally proved to be correct (i.e., perform as per their specifications). To do so, it is necessary to find suitable conditions that always hold at given points in the program. One particular case is to find suitable conditions within loops, and these conditions are called loop invariants. An ILP system was used to generate such loop invariants and did so successfully and straightforwardly. The induction of an invariant for a parallel program was also demonstrated.