The **datafsm** package implements a method for automatically generating models
of dynamic processes that both have strong predictive power and are
interpretable in human terms. We use an efficient model representation and a
genetic algorithm-based estimation process. This paper offers a brief overview
of the software and demonstrates how to use it.

This package implements our method for automatically generating models of
dynamic decision-making that both have strong predictive power and are
interpretable in human terms. We use an efficient model representation and a
genetic algorithm-based estimation process to generate simple deterministic
approximations that explain most of the structure of complex stochastic
processes. The genetic algorithm is implemented with the **GA** package
(Scrucca 2013). Our method, implemented in
C++ and R, scales well to large data sets. We have applied the package to
empirical data, and demonstrated the method's ability to recover known
data-generating processes by simulating data with agent-based models and
correctly deriving the underlying decision models for multiple agent models and
degrees of stochasticity.

A user of our package can estimate models by providing their data in a common "panel data" format. The package is designed to estimate time series classification models that use a small number of binary predictor variables and move back and forth between the values of the outcome variable over time. Larger sets of predictor variables can be reduced to smaller sets with cross-validation. Although the predictor variables must be binary, a quantitative variable can be converted into binary by division of the observed values into high/low classes. Future releases of the package may include additional estimation methods to complement genetic algorithm optimization.

# Load and attach datafsm into your R session, making its functions available: library(datafsm)

Please cite the package if you use it with the text generated by:

citation("datafsm")

rounds <- 10 reps <- 1000 noise_level <- 0.05

To quickly show it works, we can create fake data. Here, we generate `r reps`

repetitions of a `r rounds`

-round game in which each player starts by making a
random move, and in subsequent rounds, each player follows a "noisy tit-for-tat"
strategy that's equivalent to tit-for-tat, except that with a
`r scales::percent(noise_level, accuracy = 1)`

probability the player will make
a random move.

seed1 <- 1372456261 seed2 <- 597693057 seed3 <- 805078823 set.seed(seed1) cdata <- data.frame(outcome = NA, period = rep(seq(rounds), reps), my.decision1 = NA, other.decision1 = NA) # # Prisoner's dilemma # pd_outcome <- function(player_1, player_2) { # # 1 = C # 2 = D # player_1 + 1 } tit_for_tat <- function(last_round_self, last_round_opponent) { last_round_opponent } noisy_tit_for_tat <- function(last_round_self, last_round_opponent, noise_level) { if (runif(1,0,1) <= noise_level) { sample(0:1,1) } else { last_round_opponent } } for (i in seq_along(cdata$period)) { if (cdata$period[i] == 1) { my.decision <- sample(0:1,1, prob = c(1 - noise_level, noise_level)) other.decision <- sample(0:1,1, prob = c(1 - noise_level, noise_level)) cdata[i, "outcome"] <- pd_outcome(my.decision, other.decision) } else{ my.last <- my.decision other.last <- other.decision my.decision <- noisy_tit_for_tat(my.last, other.last, noise_level) other.decision <- noisy_tit_for_tat(other.last, my.last, noise_level) cdata[i,c("outcome", "my.decision1", "other.decision1")] <- c(pd_outcome(my.decision, other.decision), my.last, other.last) } }

The only required argument of the main function of the package, `evolve_model`

,
is a `data.frame`

object, which must have 3-5 columns. The first two columns
must be named `period`

and `outcome`

(`period`

is the time period that the
`outcome`

action was taken). The remaining one to three columns are predictors,
and may have arbitrary names. Each row of the `data.frame`

is an observational
unit, an action taken at a particular time and any relevant variables for that
time. All of the (3-5 columns) should be named. The period and outcome columns
should be integer vectors --- e.g. `c(1,2,1)`

--- and the columns with the
predictor variable data should be logical vectors --- e.g.
`c(TRUE, FALSE, FALSE)`

--- or vectors that can be coerced to logical with
`as.logical()`

.

Here are the first eleven rows of this fake data:

knitr::kable(head(cdata, 11))

We can estimate a model with that data. `evolve_model`

uses a genetic algorithm
to estimate a finite-state machine (FSM) model, primarily for understanding and
predicting decision-making. This is the main function of the `datafsm`

package.
It relies on the `GA`

package for genetic algorithm optimization. We chose to
use a GA because GAs perform well in rugged search spaces to solve integer
optimization problems, are a natural complement to our binary representation of
FSMs, and are easily parallelized.

`evolve_model`

takes data on predictors and data on the outcome and
automatically creates a fitness function that takes training data, an action
vector that `evolve_model`

generates, and a state matrix `evolve_model`

generates as input and returns numeric vector whose length is the number of
rows in the training data. `evolve_model`

then computes a fitness score for
that potential solution FSM by comparing it to the provided `outcome`

in the
real training data. This is repeated for every FSM in the population and then
the probability of selection for the next generation is proportional to the
fitness scores. If the argument `cv`

is set to `TRUE`

the function will call
itself recursively while varying the number of states inside a cross-validation
loop in order to estimate the optimal number of states, then it will set the
number of states to that optimal number and estimate the model on the full
training set.

If the argument `parallel`

is set to `TRUE`

, then these evaluations are
distributed across the available processors of the computer using the
`doParallel`

functions; otherwise, the evaluations of fitness are conducted
sequentially. Because this fitness function that `evolve_model`

creates must
loop through all the training data every time it is evaluated and we need to
evaluate many possible solution FSMs, the fitness function is implemented in
C++ to improve its performance.

set.seed(seed2) res <- evolve_model(cdata, seed = seed3)

`evolve_model`

evolves the models on training data and then, if a test set is
provided, uses the best solution to make predictions on test data. Finally, the
function returns the GA object and the decoded version of the best string in
the population. Formally, the function return an `S4`

object with slots for:

`call`

Language from the call of the function.`actions`

Numeric vector with the number of actions.`states`

Numeric vector with the number of states.`GA`

S4 object created by`ga()`

from the`GA`

package.`state_mat`

Numeric matrix with rows as states and columns as predictors.`action_vec`

Numeric vector indicating what action to take for each state.`predictive`

Numeric vector of length one with test data accuracy if test data was supplied; otherwise, a character vector with a message that the user should provide test data for better estimate of generalizable performance.`varImp`

Numeric vector with length equal to the number of columns in the state matrix, containing relative importance scores for each predictor.`timing`

Numeric vector length one containing the total elapsed time it took`evolve_model`

to execute.`diagnostics`

Character vector length one, designed to be printed with`base::cat()`

.

Use the summary and plot methods on the output of the `evolve_model()`

function:

summary(res) plot(res, action_label = ifelse(action_vec(res)==1, "C", "D"), transition_label = c('cc','dc','cd','dd'))

The diagram shows that `evolve_model`

recovered a tit-for-tat model in which
the player in question ("me") mimics the last action of the opponent.

Use the `estimation_details`

method on the output of the `evolve_model()`

function:

suppressMessages(library(GA)) plot(estimation_details(res))

`evolve_model`

Check out the documentation for the main function of the package to learn about all the options:

```
?evolve_model
```

This work is supported by U.S. National Science Foundation grants EAR-1416964 and EAR-1204685.

```
sessionInfo()
```

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.