o Tim Finin o UMBC, Baltimore MD o finin@umbc.edu o http://ebiquity.umbc.edu/

Eliminating Duplicate Answers in Prolog

The Question

I was recently asked

Say I have the sister relation defined like this:
  sister(X,Y) :- female(X),parent(P,X),parent(P,Y),\+ X == Y.
And say I have a DB full pairs that Prolog can figure out are sisters. If I try to find all the sisters using a query like this:
  ?- sister(X,Y).
I get each pair in each order. Is there any way to make the query such that all of the pairs of sisters are listed only once, in whichever order Prolog finds first?

The Problem

This is a classic problem that's also common in a databases query context (SQL is also very declarative). The problem gets right to the issue of whether we want to program declaratively or procedurally. Additional complications are introduced by the desire to write Prolog predicates that can serve either to prove that a relation holds between values or to generate values for which a given relation holds. I tried to fire off a quick and hopefully helpful response to the question, but it spun out of control. To quote Piet Hein, "Problems worthy. of attack. prove their worth. by hitting back."

Let's assume a database about the Simpsons:

  parent(homer,bart). parent(marge,bart). male(bart).
  parent(homer,lisa). parent(marge,lisa). female(lisa).
  parent(homer,maggie). parent(marge,maggie). female(maggie).

There are two issues involved. One is that the sister/2 predicate can prove every ordered pair twice, since two people are sisters if the first is female and they share a common parent and a pair might share two common (biological) parents. The second is that you only want to see a given pair once in the output. Sister may be a poor example of this problem since sister(X,Y)<=/=>sister(Y,X), so knowing that Lisa is the sister of Bart tells us nothing about whether or not Bart is also the sister of Lisa.

But both of these problems are in full force if we consider the related predicate sibling/2, which is, of course, symmetric.

The road to a solution

The simple definition of sibling is:

  sib1(X,Y) :- parent(P,X), parent(P,Y), \+X=Y.

For the Simpsons database, solutions of sib1(X,Y) include sib1(bart,lisa) twice as well as sib1(lisa,bart) twice. What follows is an evolutionary path getting to a predicate that (i) generates each answer, but only once and (ii) also only includes a pair in only one order. I put the code and some test data in online and verified that it works in Sicstus Prolog.

Fix one. We first try to fix this problem with a cut to prevent generating alternate proofs:

  sib2(X,Y) :- parent(P,X), parent(P,Y), (\+X=Y), !.

but this prevents us from using it to generate pairs of siblings by calling it with two uninstantiated variables, as can be seen from this session. Note:

Fix two. The "right" way to solve this problem is to use Prolog's setof/3 predicate to first generate a set of all of the provable results, and then offer up the results, one at a time, using member/2.

  sib3(X,Y) :- setof((X,Y), P^(parent(P,X),parent(P,Y), \+X=Y), Sibs),
               member((X,Y), Sibs).

sib3 will generate all ordered pairs for which a sibling relation holds, but it still generates a pair twice, once in each order.

Fix three. A standard trick (used in databases, too) is to add a constraint that will only generate one of the pairs at the expense of the other. If the values were strings, we might only select the pair where the names were in alphabetical order. We can try this in Prolog using the term comparison operator @< which is true when one term comes before the other in 'standard order'.

  sib4(X,Y) :- setof((X,Y), P^(parent(P,X),parent(P,Y), X@<Y), Sibs),
               member((X,Y), Sibs).

Which works if sib4 is called with two variables, but fails when it shouldn't if some variables are instantiated. sib4(lisa,bart) fails as does sib4(maggie,X). writing a Prolog predicate that works efficiently for all combinations of instantiated and uninstantiated input arguments is tricky.

Final fix. So the final attempt at a solution needs to go something like this. (i) generate all results, with duplicates; (ii) remove duplicates; (iii) generate as answers those remaining pairs for which there is not also a similar pair whose args aren't in standard order.

  sib5(X,Y) :- setof((X,Y), P^(parent(P,X),parent(P,Y), \+X=Y), Sibs),
               member((X,Y), Sibs),
               \+ (Y@<X, member((Y,X), Sibs)).

sib5/2 works with any combination of instantiate/uninstantiated arguments.

Parting thoughts

sib5 is not as efficient as it could be. But making it more so would further obscure it. This is one unseemly aspect of logic programming. But maybe we're being unfair since it's generally true in most programming languages.
sib5 is fundamentally inefficient in that it needs to find all possible answers before giving you even the first. It would be very bad if, for example, you wanted to generate 10 pairs of siblings from the IRS database of 200M people. There are probably LP languages that have addressed this general problem, btw, using some notion of streams.
sib1 is elegant and sib5 is not. This is typical: simple Prolog programs can be beautiful, but it's hard to hold onto the beauty as we ask more of the programs. I think this is the world's fault, not ours. It was that way when I got here, in any case.
This is a good example, even for an overview of Prolog, that points out some of complexities that arise in Prolog and most LP languages.

12-Mar-2006 2:27 PM