Trying out different versions of binomial simulations for p. 174

Aaron Ellison wrote:

On the footnote to p. 174, you talk about simulating 1000 samples (binomial p = 0.75, n = 10, size = 4) and comparing the ML estimate to the observed value (-7.57). You find that the binomial p-value = 0.22. So I tried to duplicate this, and silly me, tried to use the mle2 function, from which I couldn't easily extract the negative log-likelihood (how do I get individual elements from an S4 object?). So instead I used the optim function which, being S3, let me extract the wished-for value as optim(…)$value.
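(As an aside: individual elements of an S4 object are extracted with `@` rather than `$`, or better, with accessor functions such as logLik. A minimal sketch using base R's stats4::mle, which returns an S4 fit much like mle2's; the data here are just the Reed frog counts used below:)

```r
library(stats4)  # base-R S4 maximum likelihood fitting; bbmle's mle2 is similar

k <- c(7, 5, 9, 9)  # survivors in each of the 4 trials (see below)
nll <- function(p) -sum(dbinom(k, prob = p, size = 10, log = TRUE))
fit <- mle(nll, start = list(p = 0.5))

slotNames(fit)  # list all the slots of the S4 object
fit@min         # direct slot access: the minimized negative log-likelihood
logLik(fit)     # the accessor function (usually preferable)
```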

I looked at this and realized that I had made a mistake in the way I ran the simulation. Define a binomial negative log-likelihood function:

binomNLL1 <- function(p, k, N) {
    -sum(dbinom(k, prob = p, size = N, log = TRUE))
}
Load the data and pull out the number of survivors in each trial:
data(ReedfrogPred, package = "emdbook")
x <- subset(ReedfrogPred, pred == "pred" & density == 10 & 
    size == "small")
k <- x$surv

We know analytically (eq 6.2.6, p. 171) that the maximum likelihood estimate of $p$ is the total number of tadpoles surviving (30) divided by the total number exposed (40):
(p.mle <- sum(k)/40)
[1] 0.75
(llik <- sum(dbinom(k, size = 10, prob = p.mle, log = TRUE)))
[1] -7.571315

Just as a side note - we will get the same maximum likelihood estimate, but a different (much higher) log-likelihood value, if we compute the results as though we had done a single big experiment, rather than 4 separate ones:
> dbinom(sum(k), size = 40, prob = p.mle, log = TRUE)
[1] -1.935415
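The gap between the two log-likelihoods is purely combinatorial: the pooled version contains the single coefficient $\binom{40}{30}$, the trial-by-trial version contains $\prod_i \binom{10}{k_i}$, and the $p$-dependent terms are identical. A quick check, using only quantities already computed above:

```r
k <- c(7, 5, 9, 9)  # survivors per trial, as above

# pooled minus trial-by-trial log-likelihood ...
dbinom(30, size = 40, prob = 0.75, log = TRUE) -
    sum(dbinom(k, size = 10, prob = 0.75, log = TRUE))

# ... equals the log ratio of the binomial coefficients (both print 5.6359)
lchoose(40, 30) - sum(lchoose(10, k))
```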

This computation doesn't take into account the division of successes and failures into particular trials. Here's what I did:
ptry <- function() {
    x <- rbinom(4, prob = 0.75, size = 10)
    sum(dbinom(x, prob = 0.75, size = 10, log = TRUE))
}
rprobs <- replicate(1000, ptry())
(binom.pval <- mean(rprobs < llik))
[1] 0.225

This computes the log-likelihood of each simulated data set at the maximum likelihood estimate for the original data (p = 0.75), rather than at each simulated data set's own MLE. This is wrong.
Here's (a version of) what Aaron Ellison did:
ptry2 <- function() {
    simdata <- rbinom(4, prob = 0.75, size = 10)
    -1 * optim(fn = binomNLL1, par = c(p = 0.75), N = 10, k = simdata,
        method = "BFGS")$value
}
time1 <- system.time(sim.mle <- replicate(1000, ptry2()))

A more efficient way to do the same thing: since in this case the MLE of the per capita probability is just the total number of survivors divided by the total number exposed, we can skip optim entirely.
ptry3 <- function() {
    simdata <- rbinom(4, prob = 0.75, size = 10)
    sum(dbinom(simdata, prob = sum(simdata)/40, size = 10, log = TRUE))
}
time2 <- system.time(sim.mle2 <- replicate(1000, ptry3()))

The second way takes 0.024 seconds instead of 1.58. The answers aren't identical, but they're pretty close:
max(abs(sim.mle - sim.mle2))
[1] 5.893136e-08

The p-value we get in this way is
mean(sim.mle < llik)
[1] 0.124

Arguably, both of these are wrong: what we really want to do is to take multinomial samples of the 30 surviving tadpoles, with equal probability of each tadpole landing in any of the 4 trials. Actually, it's worse than that: we can't take an arbitrary multinomial sample, because we also have to condition on there being no more than 10 successes in each trial … Pick multinomial samples and transpose, so that each row is one simulated sample with one value per trial (picking more than we think we will need, because we will have to throw out rows with $>10$ tadpoles in any trial):

simvals <- t(rmultinom(10000, size = 30, prob = rep(0.25, 4)))

Keep only rows where all trials have $\leq 10$ tadpoles eaten:
simvals2 <- simvals[apply(simvals <= 10, 1, all), ]

Calculate likelihoods and p-values:
sim.mle3 <- rowSums(dbinom(simvals2, prob = 0.75, size = 10, 
     log = TRUE))
mean(sim.mle3 < llik)
[1] 0.5385784

Just out of curiosity, what are the possible outcomes? To figure this out, sort each row, paste it together as a comma-separated string, and count the unique strings.
simchr <- apply(t(apply(simvals2, 1, sort)), 1, paste, collapse = ",")
length(unique(simchr))
[1] 22

Now try it just with the values that are more likely than the observed values:
simchr2 <- apply(t(apply(simvals2[sim.mle3 > llik, ], 1, sort)), 
     1, paste, collapse = ",")
length(unique(simchr2))
[1] 6
sort(unique(simchr2))
[1] "5,8,8,9" "6,6,9,9" "6,7,8,9" "6,8,8,8" "7,7,7,9" "7,7,8,8"

Other less efficient (?) but equivalent (?) ways of doing this include simulating binomial data as above but throwing out results where $k \neq 30$, or sampling without replacement from the remaining tadpoles, one trial at a time. There might also be a way to do this analytically - go back and look at the derivation of Fisher's exact test??? For this case it might also be possible to enumerate the possible cases exhaustively (since there are only 22), calculate the absolute likelihood of each, and correct by the sum of the likelihoods (this is obviously not feasible for even slightly larger samples). I'm still not sure whether conditioning on the observed number of tadpoles eaten is correct or not. Arguably Aaron's (simpler to calculate!) answer is correct, but it's interesting to think about …
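Here, for the record, is a sketch of the exhaustive-enumeration idea: list every way of splitting the 30 survivors among the 4 trials (each trial holding at most 10), weight each split by its multinomial probability (renormalizing is the conditioning step), and add up the weight of the outcomes less likely than the observed data. Everything below uses only the values from above (p = 0.75, trials of size 10, observed counts 7, 5, 9, 9). One wrinkle: enumeration actually finds 23 distinct unordered outcomes rather than the 22 that happened to appear in the 10000 simulations, presumably because a very rare outcome failed to show up in the sample.

```r
# all ordered ways to put 30 survivors into 4 trials of size 10
grid <- as.matrix(expand.grid(k1 = 0:10, k2 = 0:10, k3 = 0:10, k4 = 0:10))
grid <- grid[rowSums(grid) == 30, ]
nrow(unique(t(apply(grid, 1, sort))))  # number of distinct unordered outcomes

# conditional multinomial weights, proportional to 30!/(k1! k2! k3! k4!)
logw <- lfactorial(30) - rowSums(lfactorial(grid))
w <- exp(logw - max(logw))
w <- w/sum(w)  # renormalizing = conditioning on every trial having <= 10

# binomial log-likelihood of each outcome, as in the simulations above
ll <- rowSums(dbinom(grid, prob = 0.75, size = 10, log = TRUE))
llik <- sum(dbinom(c(7, 5, 9, 9), prob = 0.75, size = 10, log = TRUE))
sum(w[ll < llik])  # exact counterpart of the simulated p-value
```

The result should sit close to the simulated value of about 0.54, up to floating-point ties among outcomes exactly as likely as the observed one.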