Pavlov and the Prisoner's Dilemma

<-Back

```@Article{kraines-1989a,
author         = {David Kraines and Vivian Kraines},
title          = {Pavlov and the Prisoner's Dilemma},
year           = {1989},
volume         = {26},
review-dates   = {2005-02-26},
pages          = {47--79},
journal        = {Theory and Decision},
hardcopy       = {no},
key            = {kraines-1989a}
}
```

Summary

"Our Pavlov is learning by conditioned response, through rewards and punishment, to cooperate or defect."

"Keywords: Game theory, prisoner's dilemma, Markov chain, evolution of cooperation."

Bottom line:

• Reward increases the likelihood of an action
• Punishment decreases the likelihood of an action

Payoffs altered, but similar to standard payoffs. Key distinction is that CD and DC payoffs do not match original values.

This paper payoffs: CC= 1, 1; CD=-2, 2; DC= 2,-2; DD=-1,-1
Adjusted paper pay: CC= 3, 3; CD= 0, 4; DC= 4, 0; DD= 1, 1
Standard payoffs..: CC= 3, 3; CD= 0, 5; DC= 5, 0; DD= 1, 1

So R + P = T in this game, whereas normally R + P < T. I don't remember any key properties of this payoff relationship mentioned elsewhere (unlike, say, the IPD constraint that T + S < 2R).
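A quick arithmetic check of the payoff relationships, using the usual IPD labels (my mapping from the tables above: T = temptation, R = reward, P = punishment, S = sucker's payoff):

```python
# Payoff values taken from the three tables above.
paper    = {"T": 2, "R": 1, "P": -1, "S": -2}  # payoffs as printed in the paper
adjusted = {"T": 4, "R": 3, "P": 1,  "S": 0}   # the paper's payoffs shifted by +2
standard = {"T": 5, "R": 3, "P": 1,  "S": 0}   # common Axelrod values

# The adjusted table is the paper's table plus 2 in every cell.
assert all(adjusted[k] == paper[k] + 2 for k in paper)

# Here R + P = T (3 + 1 = 4), whereas with standard payoffs R + P < T.
assert adjusted["R"] + adjusted["P"] == adjusted["T"]
assert standard["R"] + standard["P"] < standard["T"]

# Both satisfy the usual IPD orderings: T > R > P > S and 2R > T + S.
for v in (adjusted, standard):
    assert v["T"] > v["R"] > v["P"] > v["S"]
    assert 2 * v["R"] > v["T"] + v["S"]
```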

[Note: But later it looks like the payoffs have been adjusted to C=0,0. Especially note page 63, where C 1.0 state leads to tie 0,0 payoff, which would seem to be ideal mutual cooperation, yet that would map to 3,3 in normal payoffs (0,0) -> (3,3).]

Names of strategies in this paper.

• Stalin* -> All-D
• Monte* -> Random
• Gandhi* -> All-C
• TT -> Tit-for-tat (TFT)
• PT -> Perfect Trainer, "who has deep psychological insight"
• NT -> Normal Trainer, "without the special talent of PT"
• Pavlov -> A clone of Pavlov
"*'d strategies can train Pavlov to cooperate with certainty."

PT knows exactly the probability that Pavlov will cooperate or defect. The analysis shows it is thus always in PT's best interest to play the same move as Pavlov, unless Pavlov is a very slow learner, with n > ~30 (learning increment x < ~1/30).

NT guesses whether p > 1/2 by looking at Pavlov's recent plays. f(p) is the probability that NT guesses p >= 1/2. The claim is that f(0) = 0, f(1/2) = 1/2, and f(1) = 1. [I don't think f(1/2) is necessarily exactly 1/2; the distribution is centered there, but if the information is only a short streak, then errors such as f(0.5) = 1 can occur with finite probability.] The assumption goes further, taking f(p) = p for simplicity. [Also not a good assumption for short runs, but I get the idea.] NT's best strategy is to approximate PT's strategy with the available information. [The problem is that fluctuations due to NT's limited sampling will make the assumption f(p) = p wrong in a significant proportion of cases and lead to longer training... or so I would assert.]
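A quick exact-binomial check of the bracketed objection. The window length k = 5 and the majority-threshold rule are my assumptions, not the paper's; they are just one concrete way NT might estimate p from recent plays:

```python
from math import comb

def f_window(p: float, k: int) -> float:
    """Probability NT concludes p >= 1/2, assuming (my assumption, not the
    paper's) NT counts cooperations in the last k plays and thresholds at k/2."""
    return sum(comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range(-(-k // 2), k + 1))  # j >= ceil(k/2)

k = 5
# Endpoints behave as claimed: f(0) = 0 and f(1) = 1.
assert f_window(0.0, k) == 0.0
assert f_window(1.0, k) == 1.0
# With an odd window and this rule, f(1/2) is exactly 1/2 by symmetry...
assert abs(f_window(0.5, k) - 0.5) < 1e-12
# ...but f(p) = p fails away from the endpoints: f(0.3) is about 0.16, not 0.3.
assert abs(f_window(0.3, k) - 0.3) > 0.1
```

So under these assumptions the endpoint claims hold, but the simplification f(p) = p is indeed off for short observation runs, which supports the skepticism above.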

In any event, the tables indicate that NT takes much longer to train Pavlov than PT.

Kraines^2 model the IPD as a finite Markov chain. States represent the probability that Pavlov will cooperate. They use matrix algebra to produce the training time and cost for the differing opponents.

• Probability to cooperate is p
• Learning increment is x, x > 0
• Paper focuses on x = 1/10, always in the range 1/30 <= x <= 1/4
• if CC, p <- min(1, p + x) (rewarded for cooperating)
• if DD, p <- min(1, p + x) (punished for defecting)
• if CD, p <- max(0, p - 2x) (punished for cooperating)
• if DC, p <- max(0, p - 2x) (rewarded for defecting)
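The update rule can be sketched in a few lines (a minimal sketch; x = 1/10 as in the paper, with p clamped to [0, 1]):

```python
import random

def pavlov_update(p: float, my_move: str, their_move: str, x: float = 0.1) -> float:
    """One Pavlov learning step. p is the current probability of cooperating,
    moves are 'C' or 'D', and x is the learning increment (paper uses x = 1/10)."""
    if my_move == "C" and their_move == "C":    # rewarded for cooperating
        p += x
    elif my_move == "D" and their_move == "D":  # punished for defecting
        p += x
    else:                                       # CD: punished for cooperating
        p -= 2 * x                              # DC: rewarded for defecting
    return min(1.0, max(0.0, p))                # clamp to [0, 1]

def pavlov_move(p: float, rng: random.Random) -> str:
    """Sample Pavlov's move from its current state."""
    return "C" if rng.random() < p else "D"

# Mutual cooperation nudges p up by x; being exploited knocks it down by 2x.
assert abs(pavlov_update(0.5, "C", "C") - 0.6) < 1e-12
assert abs(pavlov_update(0.5, "C", "D") - 0.3) < 1e-12
```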

Did Kraines^2 ever describe why the reinforcement is weighted at only 1 for CC and DD, but 2 for CD and DC? Are there other, more optimal values? No explicit description, but given the rest of the paper, it makes the analysis easier, especially when discussing the interaction with PT and the training progression w.r.t. expected reinforcement ER around p = 1/3 and p = 2/3. The relative strength of the reinforcements also bulwarks against any thought of PT swindling Pavlov and then retraining, unless Pavlov is a *very* slow learner (phase shift where n > ~30).
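One way to see where p = 1/3 and p = 2/3 come from (my derivation from the 1-vs-2 update weights, not spelled out in these notes): against All-D the expected one-step change in p is (1-p)x - 2px = x(1 - 3p), which vanishes at p = 1/3; against All-C it is px - 2(1-p)x = x(3p - 2), which vanishes at p = 2/3.

```python
def drift_vs_all_d(p: float, x: float = 0.1) -> float:
    """Expected one-step change in p against All-D:
    DD (prob 1-p) gives +x, CD (prob p) gives -2x."""
    return (1 - p) * x - p * 2 * x          # = x * (1 - 3p)

def drift_vs_all_c(p: float, x: float = 0.1) -> float:
    """Expected one-step change in p against All-C:
    CC (prob p) gives +x, DC (prob 1-p) gives -2x."""
    return p * x - (1 - p) * 2 * x          # = x * (3p - 2)

# The drift changes sign at p = 1/3 (vs All-D) and p = 2/3 (vs All-C).
assert abs(drift_vs_all_d(1/3)) < 1e-12
assert abs(drift_vs_all_c(2/3)) < 1e-12
assert drift_vs_all_d(0.2) > 0 and drift_vs_all_d(0.5) < 0
assert drift_vs_all_c(0.5) < 0 and drift_vs_all_c(0.8) > 0
```

With equal weights these thresholds would both sit at p = 1/2, so the 2x penalty is what creates the asymmetric basins the training analysis relies on.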

This Pavlov-n scheme is more than memory-one, but updates use only the most recent move, and the state is minimal (a single value).

TFT is successful against rational opponents, but inefficient against Pavlov. Authors hint here at the ability to forgive an occasional slip. "One should be more forgiving of the occasional slip by the basically cooperative opponent and more strict toward the habitual defector. An understanding of the degree of rationality as opposed to conditional response of the opponent is essential in designing an effective strategy."

Key Factors

Relations to Other Work:
• Axelrod and Hamilton, 1984
• axelrod-1984a
• J.G. Kemeny and J.L. Snell, 1976 (Markov Chains)
• A. Rapoport

Problem Addressed: Explaining how a very simple mechanism can beat other strategies, and indicating how trainable it is by other common strategies that could lead to cooperation. The aim is to provide a plausible mechanism.

Main Claim and Evidence: The authors establish that Pavlov does better than TFT when matched against it, does well against itself, and does well with trainers that understand it. Evidence comes from finite Markov chain analysis and other analysis. Some simplifying assumptions may weaken the claim w.r.t. the Normal Trainer, but Pavlov-Pavlov head-to-head match-ups indicate the strength of the approach. [Later papers prove Pavlov-n is effective in noisy environments.]

Assumptions:

• A lot of the equilibrium analysis sets the game length to n (I think), the inverse of Pavlov's learning increment (x = 1/n). For a moment the discussion here sounded like infinite games (though they earlier stated these were finite Markov chains), since the update rule leads to a Markov process that has an "equilibrium probability vector R=(r_0,r_1,...,r_n) where r_k is the proportion of plays for which Pavlov will be in state k/n"... and the initial value of p doesn't matter. (RP = R, where P is the transition matrix.)
• But the payoff is computed in terms of n itself (fast learning means low n and shorter games; both scale with n). Huh? This comes right after discussing the "ergodic" nature of the Markov chains and an equilibrium payoff vector.
• OK, given that these are the equilibria for learning increment 1/n, what use is this? (Maybe this is addressed in a discussion section later.)
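A sketch of the Markov-chain machinery under stated assumptions: states are k/n, the opponent is Monte (cooperates with probability q = 1/2; my choice of opponent), and the equilibrium vector R with RP = R is found by power iteration rather than the paper's matrix algebra:

```python
def stationary_vs_random(n: int = 10, q: float = 0.5, iters: int = 2000):
    """Stationary distribution over Pavlov's states k/n against an opponent
    that cooperates with probability q. CC and DD move Pavlov up one state
    (+x = +1/n); CD and DC move him down two states (-2x)."""
    size = n + 1
    # Build the row-stochastic transition matrix P (plain lists, stdlib only).
    P = [[0.0] * size for _ in range(size)]
    for k in range(size):
        p = k / n
        up = p * q + (1 - p) * (1 - q)     # CC or DD -> one state up
        down = 1.0 - up                    # CD or DC -> two states down
        P[k][min(n, k + 1)] += up
        P[k][max(0, k - 2)] += down
    # Power iteration converges to the equilibrium vector R with R P = R.
    R = [1.0 / size] * size
    for _ in range(iters):
        R = [sum(R[i] * P[i][j] for i in range(size)) for j in range(size)]
    return R

R = stationary_vs_random()
assert abs(sum(R) - 1.0) < 1e-9   # still a probability vector
assert R[0] == max(R)             # mass piles up near p = 0 against Monte
```

Against Monte the double-weighted downward moves dominate (expected drift is negative for any p), so the equilibrium mass concentrates near p = 0, and as the ergodicity remark above suggests, the result is independent of the initial p.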

Next Steps: None explicitly noted by the authors.

Remaining Open Questions:

• How troublesome is the payoff relation R + P = T (since the adjusted T would be only 4 in the most common valuation of the game)? [my question, not the authors']
• "Do animals in the wild behave with equal aggressiveness in competitive situation because of a conditioned response or some sort of evolutionary instinct?"

Quality

Originality is outstanding.
Contribution/Significance is outstanding.
Quality of organization is excellent.
Quality of writing is excellent.