
REPORT DOCUMENTATION PAGE                                                Form Approved OMB NO. 0704-0188

The public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to any penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number.

PLEASE DO NOT RETURN YOUR FORM TO THE ABOVE ADDRESS.


1. REPORT DATE (DD-MM-YYYY):
2. REPORT TYPE: Technical Report
3. DATES COVERED (From - To):
4. TITLE AND SUBTITLE: Behaviorally Modeling Games of Strategy Using Descriptive Q-learning
5a. CONTRACT NUMBER: W911NF-09-1-0464
5b. GRANT NUMBER:
5c. PROGRAM ELEMENT NUMBER: 611102
5d. PROJECT NUMBER:
5e. TASK NUMBER:
5f. WORK UNIT NUMBER:
6. AUTHORS: Roi Ceren, Prashant Doshi, Matthew Meisel, Adam Goodie, Dan Hall
7. PERFORMING ORGANIZATION NAMES AND ADDRESSES: University of Georgia Research Foundation, Inc., Sponsored Research Office, Athens, GA 30602
8. PERFORMING ORGANIZATION REPORT NUMBER:
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): U.S. Army Research Office, P.O. Box 12211, Research Triangle Park, NC 27709-2211
10. SPONSOR/MONITOR'S ACRONYM(S): ARO
11. SPONSOR/MONITOR'S REPORT NUMBER(S): 55749-NS.3
12. DISTRIBUTION AVAILABILITY STATEMENT: Approved for public release; distribution is unlimited.
13. SUPPLEMENTARY NOTES: The views, opinions and/or findings contained in this report are those of the author(s) and should not be construed as an official Department of the Army position, policy or decision, unless so designated by other documentation.
14. ABSTRACT: Modeling human decision making in strategic problem domains is challenging with normative game theoretic approaches. Behavioral aspects of this type of decision making, such as forgetfulness or misattribution of reward, require additional parameters to capture their effect on decisions. We propose a descriptive model utilizing aspects of behavioral game theory, machine learning, and prospect theory that replicates the behavior of humans in uncertain strategic environments. We test the predictive capabilities of this model over data from 43 participants guiding a simulated Uninhabited Aerial Vehicle (UAV) against an unknown automated opponent.
15. SUBJECT TERMS: computational modeling, strategic games, behavioral data, reinforcement learning
16. SECURITY CLASSIFICATION OF: a. REPORT: UU; b. ABSTRACT: UU; c. THIS PAGE: UU
17. LIMITATION OF ABSTRACT: UU
18. NUMBER OF PAGES:
19a. NAME OF RESPONSIBLE PERSON: Prashant Doshi
19b. TELEPHONE NUMBER: 706-583-0827

Standard Form 298 (Rev 8/98)
Prescribed by ANSI Std. Z39.18
Report Title 

Behaviorally Modeling Games of Strategy Using Descriptive Q-learning

ABSTRACT 

Modeling human decision making in strategic problem domains is challenging with normative game theoretic 
approaches. Behavioral aspects of this type of decision making, such as forgetfulness or misattribution of reward, 
require additional parameters to capture their effect on decisions. We propose a descriptive model utilizing aspects of 
behavioral game theory, machine learning, and prospect theory that replicates the behavior of humans in uncertain 
strategic environments. We test the predictive capabilities of this model over data from 43 participants guiding a 
simulated Uninhabited Aerial Vehicle (UAV) against an unknown automated opponent. 




Behaviorally Modeling Games of Strategy Using 

Descriptive Q-learning 


Roi Ceren 

Department of Computer Science 
University of Georgia 
Athens, GA 30605 
ceren@cs.uga.edu 


Prashant Doshi 

Department of Computer Science 
University of Georgia 
Athens, GA 30605 
pdoshi@cs.uga.edu 


Matthew Meisel 
Department of Psychology 
University of Georgia 
Athens, GA 30605 
mameisel@uga.edu 


Adam Goodie 
Department of Psychology 
University of Georgia 
Athens, GA 30605 
goodie@uga.edu 


Dan Hall 

Department of Statistics 
University of Georgia 
Athens, GA 30605 
danhall@uga.edu 


ABSTRACT 

Modeling human decision making in strategic problem domains is difficult with normative game theoretic approaches. Behavioral aspects of this type of decision making, such as forgetfulness or misattribution of reward, require additional parameters to capture their effect on decisions. We propose a descriptive model utilizing aspects of behavioral game theory, machine learning, and prospect theory that replicates the behavior of humans in uncertain strategic environments. We test the predictive capabilities of this model over data from 43 participants guiding a simulated Uninhabited Aerial Vehicle (UAV) against an unknown automated opponent.

Categories and Subject Descriptors 

I.2 [Artificial Intelligence]: Learning—Parameter learning

General Terms 

Human Factors, Experimentation 

Keywords 

reinforcement learning, behavioral game theory, human decision making, models

1. INTRODUCTION 

In strategic, uncertain environments, human decision making may not always adhere to normative decision theoretic models. When tasked with making decisions in these domains, humans do not always exhibit a clear memory of past experiences. In addition, rewards from neighboring strategies may have an impact on decisions, as humans tend to spill over rewards from one strategy to another [9]. Essentially, human decision making patterns include several cognitive biases which influence their chosen strategy.

Several behavioral game theory models exist for representing human decision making [1, 2, 7, 8]. Many of these models rely upon reinforcement learning and represent learning as the perceived reward of interaction within an environment. The application of these game theory models is limited to single-shot and repeated games which are represented in normal form.

In real-world strategic domains, the environments are largely sequential and uncertain. Reinforcement learning is well explored in these types of problem domains, for which the popular reinforcement learning technique, Q-learning, has been developed [5]. The Q-learning function determines the optimal set of strategies to maximize the total reward by analyzing immediate rewards and potential future rewards as a game progresses from state to state. Current applications of this technique apply to purely rational decision making.

This paper presents a study conducted with human subjects to observe decision making patterns. Participants in these studies were given the task of observing an unmanned aerial vehicle (UAV) navigate through a series of sectors (in a 4x4 grid) and assessing the likelihood of their UAV reaching a goal sector without being detected by an automated enemy UAV (whose location is largely unknown). The primary hypothesis of this experiment was to determine whether incentivizing the assessments via proper scoring rules would improve assessment quality. The secondary hypothesis, and the focus of this paper, was to discover whether participants were learning in this environment and, if so, to model the participants' learning. While the investigation into incentives did not yield a significant result, we observe remarkable learning and provide an aggregate learning model.

Reinforcement learning is a convincing model for this domain. The UAV problem, while including another agent, can be modeled as a single-player game in which the participant does not model the enemy. The enemy UAV is revealed to the participant as moving in a deterministic fashion. The participant will always lose if they follow the same trajectory and are in the same state (after the same number of moves) that caused a loss in a previous iteration of the game. Therein, the enemy is a part of the game's environment, and need not be modeled explicitly by the participant.

The task of probability assessment in human decision making is also subject to biases [6]. When a participant states their probability assessment, it may not be equivalent to their believed probability of success. When rewards are non-deterministic, such as in gambling, there is much evidence that humans, in general, underweight or overweight their assessments at the extreme cases (near 0% or 100%) [4]. Subproportional probability weighting functions map believed probabilities to expressed assessments, which is generally not the linear mapping assumed in the normative case.

While behavioral game theory, sequential reinforcement learning, and probability assessment mapping are individually well explored, combining them into a single model is a novel approach. We establish a formal model that attributes behavioral effects to sequential domains of uncertainty and augments assessment with a subproportional probability weighting function. We test its predictive capabilities over a data set of 43 participants. Our results indicate that this descriptive version of the Q-learning model shows significant gains over the respective normative version, as well as other baseline comparative models.

By utilizing a behavioral game theoretic model to predict human decision making, we can gain insight into the biases that humans suffer from when faced with strategic uncertainty. Models, such as our descriptive Q-learning model, are able to illustrate human learning and predict decisions that they make in strategic domains. Analyzing the parameters fit to these models measures the impact that these cognitive biases have.

2. EXPERIMENT: PROBABILITY ASSESSMENT FOR STRATEGIC DECISION MAKING

In a large study conducted with human participants, we investigate probability assessment elicited during a strategic, uncertain decision making game. We begin with a description of the game followed by a discussion of the methodology used to collect participant assessment data. We conclude this section with a description of the results generated in this study.

2.1 Study: UAV Game 

To test the assessment techniques of human participants, we created a strategic game of uncertainty utilizing a graphical representation of a gameboard. In this sequential game, participants observe a UAV (hereafter participant's UAV) moving through a 4 x 4 sector grid from an initial sector towards a colored goal sector. Participants are given the initial location of another UAV (hereafter enemy UAV), but no other information about its movement or successive locations. A trial (the completion of one trajectory) is considered a "win" if the participant UAV reaches the goal sector, or a "loss" if it is caught by the enemy UAV.

Fig. 1 represents the first two sectors visited (or decision points) of a trial. The gameboard grants clairvoyance of the entire trajectory for the current trial, the initial location of the enemy, and the already traveled course.

The goal of the experiment was to gather participants' assessments of the overall likelihood of a trial's success. Given the knowledge of the initial location of the enemy, as well as the growing knowledge of its movements based on losses, this game exemplifies a learning task.


Figure 1: Two decision points of a given trial in the UAV game ((a) Training 1, Decision Point 1; (b) Decision Point 2). The participant knows the enemy location only on the first decision point.


2.1.1 Participants 

43 participants were included in this study. Participants were drawn from a pool of undergraduate students taking introductory psychology courses at our university. Participants were paid via a variety of payment mechanisms for their time. As the initial hypothesis regarding incentivization techniques was inconclusive, we included all participants in this paper, regardless of incentive condition.

2.1.2 Methodology 

Participants play 20 total trials of the game. Two initial phases, representing the training phases of the game, consist of 5 trials each. At the end of each of these sets, the participant undergoes an intervention, in which the proctor of the experiment highlights participant assessments that were too high or too low.

At each decision point, the participant is required to fill out a questionnaire. In the questionnaire, the participant notes the direction the UAV will move and their estimate of the probability that the participant UAV will, without being caught, arrive in the next sector and the eventual goal sector. After filling out the questionnaire, the participant may move on to the next slide of the game.

2.1.3 Results 

Participant data was broken up into two discrete data sets: trials resulting in wins and those resulting in losses. We analyzed the data for trends within the trial (as the UAV approached the goal sector) and between trials (as participants became more familiar with the game). We expect, as a trial progresses, that a participant will assess higher likelihoods of success as they approach the goal sector. Additionally, as the game progresses, the participant should become more confident in their assessments.


Table 1: Slope analysis of results

(a) Losses
Trend                 Estimate    P-value
intercept             0.3315      <0.0001
slope within trial    0.02053     0.0016
slope across trials   -0.00486    <0.0001

(b) Wins
Trend                 Estimate    P-value
intercept             0.5392      <0.0001
slope within trial    0.05395     <0.0001
slope across trials   -0.00129    <0.0001


Table 1 above annotates the results of running a generalized linear mixed effect regression analysis over our data, with random intercept and slope at the decision point and trial level. Our results indicate that the estimates for each trend are significant.
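As a rough illustration of this kind of analysis, the following sketch fits a linear mixed-effects model with a random intercept and random slopes per participant using statsmodels; it is only an approximation of the generalized model described above, and the column names (participant, trial, decision_point, assessment) are hypothetical.

# Sketch only: linear mixed-effects analysis with random intercept and slopes,
# approximating the regression described in the text. Column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

def slope_analysis(df: pd.DataFrame):
    """Fit assessment ~ decision_point + trial with per-participant random effects."""
    model = smf.mixedlm(
        "assessment ~ decision_point + trial",  # fixed effects: within- and across-trial slopes
        data=df,
        groups=df["participant"],               # one random-effects group per participant
        re_formula="~decision_point + trial",   # random intercept and random slopes
    )
    return model.fit()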

When considering assessments as a trajectory progresses, participants generally increase their assessments as they approach the goal sector. The rate at which a participant's stated probability increases is greater for winning trajectories than for losing ones. This is to be expected: as participants become more familiar with the possible movements of the enemy, they become better at predicting eventual losses.

As participants complete trials, the slope of the change in elicited probabilities decreases significantly. This decrease in slope indicates that participants are not changing their probability assessments as much as they were in previous trials, representing a general increase in confidence of the participant's guesses for both wins and losses. The ideal case is that, as participants learn how the enemy is moving, their slope across trials will approach 0.

With the clear trend towards generally increasing assessments as trials progress and the relative growth of confidence as participants complete trials, these results provide a strong justification for the application of a learning model.

3. DESCRIPTIVE MODEL FOR REINFORCEMENT LEARNING

Our model is an extension of the popular reinforcement learning algorithm known as Q-learning. By attributing concepts derived from behavioral game theory to Q-learning, we establish a novel framework for descriptive reinforcement learning. Additionally, borrowing from concepts in prospect theory creates a better mapping of true beliefs to expressed probabilities.

3.1 Normative Q-learning 

Q-learning is a popular machine learning model for representing learning in sequential domains. It characterizes the reinforcement learning problem as a conjunction of previous information and future rewards, decayed by a discount parameter, γ. Q-learning is an algorithm that trades off exploration and exploitation, which prefer possible future payoffs and previously learned payoffs, respectively [5]. This decision is mediated by the learning parameter, α. Equation 1 shows the standard Q-learning function.

Q(s,a) = Q(s,a) + \alpha \big( r(s) + \gamma \max_{a'} Q(s',a') - Q(s,a) \big)    (1)

This function serves as a powerful mechanism to model learning with long-term optimality. However, it does not capture the behavioral aspects of human decision making. With the concepts derived from behavioral game theory, we can apply descriptive parameters to the Q-learning function.
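As a concrete reference point, the update in Equation 1 can be sketched as follows; the state and action encodings and the parameter values are illustrative assumptions rather than settings taken from the study.

# Minimal sketch of the normative Q-learning update in Equation 1.
# States, actions, and parameter values here are illustrative only.
from collections import defaultdict

Q = defaultdict(float)      # Q[(state, action)] -> estimated long-term value
alpha, gamma = 0.1, 0.9     # learning rate and discount factor (example values)

def q_update(state, action, reward, next_state, actions):
    """Apply Equation 1 for one observed transition."""
    best_future = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_future - Q[(state, action)])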

3.2 Behavioral Q-learning 

The inspiration for the descriptive model is derived from behavioral game theory. Several game theoreticians [2, 9, 3] have investigated human biases associated with problems of decision making. Their investigations, however, are confined to the context of single-shot and repeated games.

3.2.1 Behavioral Reinforcement Learning 

Game theory seeks to analyze and explain the mechanisms by which decisions are made [1]. Assuming that participants understand the game and the environment, and make decisions in a purely rational manner, applicable game theoretic models will be able to predict the behavior of a human. This is rarely the case in reality, however. Cognitive biases plague the human decision making process, leading to seemingly subrational decisions. Behavioral game theory models learning with these biases taken into consideration.

Several models exist that attempt to express learning within decision making domains. The reinforcement learning algorithm portrays learning as a function of interaction with an environment and the immediate rewards. As an individual moves through the world, it experiences stimuli that it attributes to doing a particular action. Algorithmically, the reinforcement learning algorithm can be characterized as:

A_c(t) = A_c(t-1) + r    (2)

The attraction to a strategy c at time step t is the sum of the previous attraction to strategy c and its immediate reward. An attraction may be implemented in many ways, but it is essentially a concept representing the desirability of taking a particular action.

Insights from behavioral game theory have provided parameters that better explain the irrational behavior that arises in human decision making [2]. Such concepts include forgetfulness (the event of previous information degrading in its effect on future decisions) and spillover (the phenomenon of humans attributing rewards to neighboring strategies). Behavioral reinforcement learning can be expressed as:

A_c(t) = \phi A_c(t-1) + (1 - \epsilon) r    (3)

A_n(t) = \phi A_n(t-1) + \epsilon r    (4)

φ represents the forgetfulness parameter, ε represents the spillover parameter, and A_n represents the attraction to strategy n, which is a neighboring strategy of c. Both parameters are bounded between 0 and 1.
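A minimal sketch of the updates in Equations 3 and 4 follows; the dictionary representation of attractions, the neighbors helper, and the default parameter values are illustrative assumptions.

# Sketch of the behavioral attraction updates (Equations 3 and 4).
# The neighbor relation and parameter values are illustrative assumptions.
def update_attractions(A, chosen, neighbors, reward, phi=0.9, eps=0.1):
    """A maps strategy -> attraction; chosen is the strategy that was played."""
    A[chosen] = phi * A[chosen] + (1 - eps) * reward   # Equation 3
    for n in neighbors(chosen):                        # Equation 4: spillover to neighbors
        A[n] = phi * A[n] + eps * reward
    return A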

Forgetfulness in the context of our domain would imply that the experience from a previous trial has a diminished effect on current experiences. Spillover generally involves the misattribution (or "generalization") of rewards to neighboring strategies. An illustrative example is that of the roulette player who places a large bet on a particular number, only to have the ball land on a nearby number [9]. The player may feel his guess was confirmed, since the ball was near his bet, even though he lost the bet.

The implementation of the spillover parameter can be conceptualized in a few different ways for our UAV domain. Neighboring strategies can be viewed as nearby sectors, directly adjacent to the sector arrived at. Since the enemy moves in a deterministic pattern, the number of moves that have transpired is directly related to the current location of the enemy. With this in mind, spillover can also occur between these time steps. Figure 2 exemplifies the various models that could represent spillover in this domain.
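The sketch below enumerates candidate spillover targets on the 4x4 grid; the sector indexing, the (time step, sector) encoding, and the toggle between the local and full-spillover variants of Figure 2 are assumptions made for illustration only.

# Sketch of one way to enumerate spillover targets on the 4x4 grid.
# Sector indexing and the treatment of time steps are illustrative assumptions.
def adjacent_sectors(sector):
    """Sectors one move away (up/down/left/right) on the 4x4 board."""
    r, c = sector
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    return [(r + dr, c + dc) for dr, dc in steps
            if 0 <= r + dr < 4 and 0 <= c + dc < 4]

def spillover_targets(sector, time_step, num_steps, full_spillover=True):
    """Full time step and location spillover (Fig. 2.e) spreads the reward over
    neighboring sectors at every time step; full_spillover=False restricts it to
    the current time step only (closer to Fig. 2.b)."""
    steps = range(num_steps) if full_spillover else [time_step]
    return [(t, n) for t in steps for n in adjacent_sectors(sector)]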

With Camerer et al.'s introduction of behavioral parameters in human decision making, we now introduce our Q-learning function as inspired by these concepts.

3.2.2 Modified Q-learning Function 

Q(s,a) = \phi Q(s,a) + \alpha \big( (1 - \epsilon) r(s) + \gamma \max_{a'} Q(s',a') - \phi Q(s,a) \big)    (5)

Q(s_n,a) = \phi Q(s_n,a) + \alpha \big( \epsilon r(s) - \phi Q(s_n,a) \big)    (6)

φ, as with its behavioral game theory counterpart, represents the forgetfulness parameter, which decays the value of previous information associated with a state (in our case, a waypoint sector). α mediates between exploration and exploitation and additionally decays future payoffs, so that current information about the state is valued more as α approaches 1. If ε is greater than 0, the neighboring states (denoted s_n and including all sectors that are one move away) gain a fraction of the observed reward [2, 7].

The future payoff calculation in the Q-learning function is of questionable applicability to our problem domain, however. In essence, the max over a' assumes that future state-action pairs will be chosen optimally. Participants in our problem domain do not select the movements of the UAV, however. With clairvoyance over the trajectory that the UAV will travel, participants are likely to base their assessment on the path revealed to them.

Q(s,a) = \phi Q(s,a) + \alpha \big( (1 - \epsilon) r(s) + \gamma Q(s', \pi(s')) - \phi Q(s,a) \big)    (7)

Equation 7 alters the future payoff term to use the payoff of the next sector, determined from the path revealed to the participant. π(s') represents the action taken in state s', which, in our case, leads to the next sector in the trajectory for the given trial.
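Putting Equations 6 and 7 together, one update step of the descriptive model might be sketched as follows; the state encoding, the neighbor list, and the argument names are assumptions made for illustration, not the authors' implementation.

# Sketch of the descriptive Q-learning update (Equations 6 and 7).
# State encoding and the neighbor list are illustrative assumptions.
def descriptive_q_update(Q, state, action, reward, next_state, next_action,
                         neighbor_states, alpha, gamma, phi, eps):
    """Update the visited state along the revealed path, plus spillover to neighbors."""
    # Equation 7: the future payoff follows the revealed trajectory pi(s')
    # rather than a max over all actions.
    Q[(state, action)] = (phi * Q[(state, action)]
                          + alpha * ((1 - eps) * reward
                                     + gamma * Q[(next_state, next_action)]
                                     - phi * Q[(state, action)]))
    # Equation 6: neighboring states absorb a fraction eps of the observed reward.
    for s_n in neighbor_states:
        Q[(s_n, action)] = (phi * Q[(s_n, action)]
                            + alpha * (eps * reward - phi * Q[(s_n, action)]))
    return Q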

4. PERFORMANCE EVALUATION 

The data collected from the 43 participants in this study were broken into 5 folds, with 8-9 participants per fold. Utilizing the Nelder-Mead method^1, parameters are trained over 4 folds and then, to test the predictive capabilities of the model, tested over the remaining fold. For a baseline comparison, fits were generated for the normative model^2 and compared with the descriptive model, along with the random model^3 and pathological cases^4.

^1 The Nelder-Mead method is a downhill simplex method for minimizing an objective function.

^2 The normative model does not include any descriptive parameters: φ = 1 and ε = 0, while α is still trained.
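A rough sketch of this fitting loop is given below using SciPy's Nelder-Mead optimizer; the fit_error function, the fold layout, and the initial parameter guess are hypothetical stand-ins for the study's actual pipeline.

# Sketch of the 5-fold fitting procedure with Nelder-Mead.
# fit_error and the fold structure are hypothetical stand-ins.
import numpy as np
from scipy.optimize import minimize

def cross_validate(participant_folds, fit_error, x0=(0.5, 0.5, 0.5)):
    """fit_error(params, participants) returns the aggregate squared error of the model."""
    test_fits = []
    for i, test_fold in enumerate(participant_folds):
        train = [p for j, fold in enumerate(participant_folds) if j != i for p in fold]
        res = minimize(lambda x: fit_error(x, train), np.asarray(x0),
                       method="Nelder-Mead")            # downhill simplex search
        test_fits.append(fit_error(res.x, test_fold))   # evaluate on the held-out fold
    return test_fits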

Prior to calculating the fit of the descriptive model, we must convert the Q-value generated by the descriptive model into a probability assessment that can be compared with the participant data. Q-values for all states are initialized to 0, with a Q-value of 1 allocated to the goal state and -1 to all loss states. Q-values approaching -1, then, represent a path likely to lead to a loss, whereas those approaching 1 indicate a possible win from that path. Converting these values to assessments therefore involves normalizing the Q-value to between 0 and 1. The resulting conversion is then used as the Q-learning function's assessment.

Fits were generated by taking the squared distance between the participant's stated probability and the model's generated probability at each decision point in the game. The model was subjected to a simulation of the game, where it was presented with the same trajectories and experienced the same outcomes as the participants. At each point where a Q-value was updated (following the simulation of one leg of a trajectory), the distances between all participants' probability assessments and the estimated Q-value were squared, aggregated, and added to the total fit.
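Since the Q-values lie in [-1, 1], a linear rescaling maps them onto the [0, 1] assessment scale, and the fit is then a sum of squared distances. The sketch below illustrates this; the flat, aligned lists of model Q-values and participant assessments are an assumed data layout.

# Sketch of the Q-value-to-assessment conversion and the squared-error fit.
# The flat, aligned lists are an assumed data layout.
def q_to_assessment(q_value):
    """Linearly rescale a Q-value in [-1, 1] to a probability assessment in [0, 1]."""
    return (q_value + 1.0) / 2.0

def total_fit(model_q_values, participant_assessments):
    """Sum of squared distances between model and participant assessments."""
    return sum((q_to_assessment(q) - p) ** 2
               for q, p in zip(model_q_values, participant_assessments))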


Table 2: Spillover fits

Fig. 2.b    Fig. 2.c    Fig. 2.d    Fig. 2.e
415.534     415.924     416.122     409.254


In generating the results, we found the best fits to adhere to Figure 2.e, as annotated in Table 2. This indicates that participants considered negative and positive payouts irrespective of the decision point. Essentially, if a participant were to lose in sector [1,2] at the 3rd decision point, they would evaluate sector [1,2] negatively at the 2nd and 4th decision points as well, while also avoiding the neighboring sectors ([1,1], [1,3], and [2,2]) at those time steps.

Table 3: Descriptive model: parameters and fits

α        φ        ε        Fit
0.819    0.591    0.537    409.254


Table 3 annotates the results from optimizing our Q-learning function utilizing the Nelder-Mead method. Our comparative analysis between models is described in Table 4. The descriptive model outperformed the normative model with p < 0.01. Additionally, the descriptive model had a better test fit than the random and pathological models.

4.1 Improving Model Predictions 

Although our results are significant, improvements can be made to the predictive capabilities of our model. Humans not only exhibit cognitive biases in generating their probabilities, but they additionally misrepresent those probabilities [6]. By including a theoretically sound probability weighting function, we improve our descriptive model by replicating this behavior.

^3 The probability estimations are completely random for each decision point within a trial.

^4 Pathological cases include the categorical optimist and pessimist (who always guess 100% and 0%, respectively).






Figure 2: (a) No spillover, (b) local spillover, (c) time step spillover, (d) fractional time step and location spillover, (e) full time step and location spillover.

Table 4: Fit comparison for all models

Descriptive    Normative    Random     Optimist     Pessimist
409.254        416.409      891.182    1052.723     1971.333


4.1.1 Probability Weighting 

Prospect theory notes that the weight given to probability assertions and the associated payoff values are usually not linear. That is, humans tend to under- or over-weight probability assessments in domains of chance. In our domain, participants are queried for their assessment of the overall success of their current trial as it progresses, which is subject to non-linear assessment mappings. To this end, we included a subproportional function in the mapping of Q-values to probability assessments [6].

w(p) = \exp\big( -(-\ln p)^{\beta} \big)    (8)

Equation 8 defines the subproportional function for a given probability, p. For values between 0 and 1, the exponent β causes the curve to be inverse sigmoidal. This indicates that probabilities are overweighted when low and underweighted when high. Conversely, if β is above 1, the curve becomes sigmoidal. At β = 1, the curve is linear, which is the normative case. Figure 3 illustrates the curves generated from example values.
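A minimal sketch of Equation 8 is given below; clamping p away from 0 and 1 is an implementation assumption to keep the logarithm finite, not part of the model.

# Sketch of the Prelec probability weighting function (Equation 8).
# The clamp on p is an implementation assumption to avoid log(0).
import math

def prelec_weight(p, beta):
    """w(p) = exp(-(-ln p)^beta): inverse sigmoidal for beta < 1, linear at beta = 1."""
    p = min(max(p, 1e-9), 1.0 - 1e-9)
    return math.exp(-((-math.log(p)) ** beta))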

4.1.2 Results 

We ran the same simulation from the original descriptive model on the probability weighting descriptive model. As in the original model, we also compared the augmented model with the 43 participants from the UAV study, aggregating fits by squaring the distance from the probability weighting descriptive model's Q-values to the participants' probability assessments.

Table 5: Descriptive model: parameters and fits

α        φ        ε        β        Fit
0.677    0.378    0.273    0.573    401.36


Including Prelec's probability weighting function improved the performance of the descriptive model. Table 5 describes the averages of the parameters across folds and the fit generated by the model. Both α and φ decreased as a result of the inclusion.

Table 6: Comparative fits

Descriptive (Weighted)    Descriptive (Unweighted)
401.36                    409.254

Table 6 shows a side-by-side comparison of the descriptive model's fit with and without the probability weighting function. A two-tailed t-test over the distances between each model version's generated probabilities and the participant assessments resulted in a significant p-value of less than 0.01. Since the weighted model is a significant improvement over the unweighted model, it is, transitively, an improvement over the normative model as well.

5. ANALYSIS 

5.1 Parameters 

Analysis of the test fits for the descriptive model illuminated some behaviors of human participants in sequential strategic games. The first observation we made is that the higher value for β is representative of a decision making pattern that may be characteristic of win-or-lose strategic games. Traditionally, in betting games, participants tend to avoid extreme estimations [6]. However, in the unknown environment of our particular domain, a cursory glance at the raw data indicates a predilection towards extreme probability assessments, which our model corroborates.

The results also indicate a higher preference for exploitation of knowledge in our domain. φ values converged, on average, near 0.5, with slightly higher α values. A φ value towards 0.38 would indicate that participants' previous knowledge is deteriorating at a rate of about a third of the reward from the last time the state was visited. α tuned around 0.677 would indicate a higher rate of exploration as participants move through the game. That is, participants are valuing new information at 68% of its actual reward.

The observation of the ε parameter bears discussion as well. A spillover rate of 27% is relatively high in comparison to other implementations of this parameter in reinforcement learning [2]. This would indicate that participants were attributing around a quarter of the received reward for a sector to its neighboring sectors.

5.2 Projected probabilities 

As with our cursory analysis of the data received from participants, plots of the models' probability estimations were categorized by wins and losses when compared with the estimates made by participants.

Figure 4 plots the average probabilities for trials generated from the various models (descriptive with weighting, descriptive without weighting, and the normative model) and the data. Figure 4.a shows a relatively similar curve between the models and the data, with the descriptive model with weighting being the closest in overall distance.








Figure 3: (a) β = 0.56, (b) β = 1 (linear), (c) β = 1.6.




Figure 4: Trial averages. (a) Wins, (b) Losses.


For Figure 4.b, the shape is also similar to the data, but the descriptive model with weighting is no longer the closest. As we will see in the later plot analysis, the models are less accurate on trials that result in a loss, indicative of a different type of learning and probability assessment in those cases.

Figure 5 shows the plots of the average probability assessments for individual decision points made by participants and generated by the models for trials that resulted in a win. Trials that result in wins can be categorized into 3 different trajectory lengths: if the participant's UAV eventually reaches the goal sector, it will do so in 4, 6, or 8 moves. Figure 5.a shows the overall plot of decision point averages regardless of trial type. While the overall fit for the descriptive model with weighting is the closest, the plot has a strange shape. This is due to the differing numbers of data points for trials of different lengths (e.g., there are only 3 trials of length 8, but there are 11 total trials that result in a win) and the different types of behavior in the various trial lengths. Figures 5.b, 5.c, and 5.d show the underlying behavior for trials of each length, with the descriptive model with weighting outperforming the other models in each case.

Figure 6 shows the plots of the average probability assessments for the data and models over trials that consist of losses. These trials break down into 3- and 5-point trials and are categorized accordingly. As with the plot for the loss trial averages, the models tend to perform worse on decision point averages for loss trials. Participants, on average, start with much lower assessments than in trials that result in a win. This indicates that participants are better at identifying eventual losses and retain their pessimism as trials progress. The models, on the other hand, become progressively more pessimistic. The data for the 5-point trial is completely flat, as there is only one trial that is 5 points in length (and results in a loss), and the model is not able to acquire enough information to give an accurate assessment.


Figure 5: Decision point averages (wins). (a) All points, (b) 4 points, (c) 6 points, (d) 8 points.



5.3 Discussion 

The results of the fitting of this model are illuminating. They are indicative of the relative power of behavioral game theoretic parameters in a sequential learning model. The addition of a probability weighting curve further improved our results.

Though the analysis of reinforcement learning in this domain indicates a significant gain from the inclusion of behavioral parameters, other competing learning models could be compared as a baseline for the effectiveness of reinforcement learning in this domain. Several behavioral approaches to belief-based learning may be applicable to the sequential strategic game utilized in this paper. Camerer et al. have proposed alternative models to reinforcement learning in behavioral game theory that merit further investigation.
























Figure 6: Decision point averages (losses). (a) All points, (b) 3 points, (c) 5 points.




6. RELATED WORK 

Several other models exist that seek to express descriptive learning in human decision making domains. Besides reinforcement learning, belief learning, experience-weighted attraction learning, imitation, and direction learning also represent other approaches to behavioral game theory [1].

Belief learning represents learning as a process of basing future considerations on behavior observed in the last round [3]. In our domain, it is possible for participants to consider their rewards as dependent on the movement of the enemy, but, considering the lack of information associated with the enemy, it is likely that their wins and losses are modeled as an aspect of the environment.

Erev and Roth also investigate descriptive reinforcement learning, but it is examined in repeated stage games, not the sequential domain [8]. Many of the applications of our model are present in their work, but the concept of uncertainty and generalizations of strategy are not implemented in their analysis.

Our work extends observations from Camerer et al.'s Experience-Weighted Attraction model, though it has shortcomings similar to the Erev and Roth model [2]. This model is contextual to stage games, as opposed to the sequential environment of the UAV and other strategic problems. Additionally, the applicability of the law of simulated effect^5 is less pronounced in our model, as the payouts for foregone strategies are unknown.


Acknowledgments

This work was supported by a grant from the Army RDECOM, grant #W911NF-09-1-0464, to Prashant Doshi, Adam Goodie and Dan Hall. We thank Xia Qu for providing help and support during the conduct of this research.

7. REFERENCES 

[1] C. Camerer. Behavioral Game Theory. Princeton University Press, Princeton, New Jersey, 2003.

[2] C. Camerer and T. Ho. Experience-weighted attraction learning in normal form games. Econometrica, 67(4):827-874, July 1999.

[3] A. Cournot. Recherches sur les principes mathematiques de la theorie des richesses. Haffner, London, 1960. Translated by N. Bacon as Researches into the Mathematical Principles of the Theory of Wealth.

[4] R. Gonzalez and G. Wu. On the shape of the probability weighting function. Cognitive Psychology, 38(1):129-166, February 1999.

[5] L. Kaelbling, M. Littman, and A. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, May 1996.

[6] D. Prelec. The probability weighting function. Econometrica, 66(3):497-527, May 1998.

[7] A. Roth and I. Erev. Learning in extensive-form games: Experimental data and simple dynamic models in the intermediate term. Games and Economic Behavior, 8:164-212, 1995.

[8] I. Erev and A. Roth. Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria. The American Economic Review, 88(4):848-881, 1998.

[9] W. Wagenaar. Paradoxes of Gambling Behavior. Lawrence Erlbaum, Mahwah, New Jersey, 1984.


^5 The law of simulated effect states that foregone strategies that are known to have produced better results if chosen will have a higher attraction in subsequent games.