- if your network connection is wonky (we were working in a room with spotty wireless connectivity) you will have trouble choosing a mirror (if R is trying to retrieve the full list of mirrors from `cran.r-project.org`), retrieving a list of available packages, or downloading packages.
- after you use `install.packages()` to install a package (or use the Packages menu), you still need to use `library()` to load it before you can use any of the functions in the package or access help on any of those functions.
- math notation: you must use `*`, not juxtaposition, to indicate multiplication; you must use `^`, not superscripting, to raise a number to a power.
- when in doubt, try it! if something has gone wrong at the end of a long list of commands, go back and look at the intermediate results. It is very difficult to break R by running a bad command, so if you're wondering what a particular command would do, go ahead and see what happens.
- Be careful with parentheses. Continuation characters (+) most often indicate that you have forgotten to close a parenthesis. However, just because the expression is complete doesn't mean it is correct.
- I have a note, "dynamic", scribbled to myself, but I don't know what it means.
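A minimal R sketch of the points above about installing vs. loading and math notation (the package name `vioplot` is just an example; the install/load lines are commented out because they need a network connection):

```r
## Installing vs. loading (run interactively; "vioplot" is just an example):
## install.packages("vioplot")  # one-time download+install; needs a network connection
## library(vioplot)             # must be repeated in every new R session before use

## Math notation: use * and ^ explicitly
2 * 3  # multiplication: juxtaposition such as 2(3) is a syntax error
2^3    # exponentiation: ^ raises a number to a power
```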

by bbolker

Here are some quick (and hopefully, interesting) thoughts on two separate kinds of pronoun trouble that have come up in my graduate teaching recently, (1) the ever-controversial "singular they" (i.e., using "they" to denote a single person of indeterminate gender) in referring to author(s) of a scientific paper and (2) use of the first-person singular in single-authored papers.


Grammarians have disagreed for decades whether (when?) it is acceptable to refer to a person of unknown gender as "they", or whether in this situation you should

- say "he" [the 'default' gender in English],
- alternate haphazardly (?) between "he" and "she", or
- reword the sentence to refer to multiple people, so you can use the pronoun "they" in its plural sense. (See Wikipedia's entry on the subject for a brief introduction.)

Anne Fadiman has a whole essay on the subject (compromises between style and inclusiveness) called "The His'Er Problem" in her lovely book of essays [1], and the linguistics blog Language Log has a whole section devoted to the topic (they're OK with it).

The (rather tenuous) connection with my book is that on p. 16 of the book I wanted to say

a Bayesian would say that there was good evidence against the hypothesis: even more strongly, they could say …

but the copy editor made me change it to

Bayesians would say … they could say …

She was unconvinced by my citation of various web pages that argued in favor of singular "they". I was disappointed because the singular form feels stronger — the reader gets the impression of a single (gender-indeterminate) Bayesian analyzing data, rather than an indeterminate group of Bayesians who might not even be sitting in the same room together. (If I had known that some of the Language Loggers have published a dead-tree version of their blog entries [2], I could have used it for support.)

In classroom discussions, students often use singular "they" to refer to the authors of (single-authored) papers. Ironically (since I prefer singular "they" in my own writing), this bothers me — not because of grammar, but because I often know the authors of the articles, and using "they" for someone whose gender is (or could be) known depersonalizes them. The author is not some mysterious "they"; they (sic) are a human being, with the usual array of strengths, weaknesses, prejudices, blind spots, etc. as other human beings. Referring to the author as "they" rather than by (his/her) appropriate gender is one more barrier to putting yourself in (his/her/their?) shoes, and to seeing that you could in principle do the same kind of work — that real science is done by (reasonably) normal people.

In a more gender-related vein, I also have pronoun troubles in referring to the early work of Joan Roughgarden; Joan is a brilliant transgendered theoretical ecologist/evolutionary biologist. What pronoun should I use to refer to the person who wrote *Theory of Population Genetics and Evolutionary Ecology* (1995) [3]?

Another solution in this determinate-person-with-indeterminate-gender case, which extends to gender-uncertain foreign authors with unfamiliar names, or Anglo authors named Robin, Kim, or Lynn (or who are cited by their initials), is to avoid pronouns altogether: "Roughgarden says", although this becomes awkward when repeated too often (that's why we have pronouns in the first place).

Another grammar-related, but conceptual issue: what voice should one use in a single-authored paper? There are again a variety of unsatisfactory compromises:

- Most scientific style guides recommend against the obvious choice of first-person singular (because science is objective?)
- It seems wrong to use the first-person plural when you're a single person: according to Mark Twain, 'Only kings, presidents, editors, and people with tapeworms have the right to use the editorial "we."'

Charles Van Way writes:

First is what we might call the "scientific we." It is sort of like the "royal we" used by the Queen of England and other monarchs. I find it amusing that most scientific writers will not use the word I. The justification is that most papers are written by multiple authors. Actually, use of the first person plural is very acceptable. It often allows the author to say something more simply and directly than otherwise …

Use of the third person is universal in scientific papers. There is nothing wrong with this. It is the appropriate person to use for expository prose, after all. It is not the use of the third person per se that makes writing tedious. But the third person can be used in either the active voice or the passive voice, and this makes a great difference. The passive voice is the source of much bad writing in the scientific literature.

- In my book, and in some mathematical writing, "we" implicitly refers to the author and the reader together. On p. 16 I say "We’ll explore Bayes’ Rule and revisit Bayesian statistics in future chapters." This voice is very effective, drawing in the reader as a participant in the process, when you can make it work. Maybe it's easier in mathematics because the reader could in principle follow along with the work, doing the derivations in parallel, if they wanted to, whereas that's pretty much impossible with lab or field work.
- The passive voice generally weakens the force of writing: my own feeling is that it's just fine (indeed preferred) in Materials & Methods sections (where you want to avoid agency), but otherwise you should avoid it when possible.
- Rewriting everything to avoid the first person in more general ways is less ugly than substituting the passive voice, but it still reduces agency. You did the work in the paper: why not say so? (This overlaps with Larry Weinstein's discussion of agency in *Grammar for the Soul* [4], which you can read on Google Books — scroll down to p. 32 if necessary: there's a delicate balance between taking credit, and responsibility, for your ideas and writing in a way that says "me me me".)

Still, I'd like it if we could weaken the virtual prohibition on the first-person singular in scientific writing. (If I were braver I would take steps in this direction by using it in my own single-authored papers.) It would open more options for single authors of scientific articles.

Bibliography

1. Fadiman, Anne. 1998. Ex Libris. Farrar, Straus and Giroux.

2. Liberman, Mark, and Geoffrey K. Pullum. 2006. Far from the Madding Gerund and Other Dispatches from Language Log. 1st ed. William, James & Company.

3. Roughgarden, Jonathan. 1995. Theory of Population Genetics and Evolutionary Ecology: An Introduction. Facsimile. Benjamin Cummings.

4. Weinstein, Lawrence A. 2008. Grammar for the Soul: Using Language for Personal Change. Quest Books, April 25.

by bbolker


Well, Mazerolle is right (that Anderson says weights are probabilities): I don't have the book (although I should probably buy it), but I was able to look the page up in Google books. Scary as it is for me to criticize such a smart guy/big shot, I think Anderson is being sloppy here (which is unusual): he starts out by saying

These weights are also Bayesian posterior model probabilities (under the assumption of savvy model priors) …

(which is correct) but then slides into the sentence quoted above,

A given $w_i$ is the probability that model $i$ is the expected K-L best model

As far as I know, this statement is only true with that particular choice of "savvy" priors (which IMO are fairly odd — you can read about them in Burnham and Anderson's work), and is only true asymptotically (i.e. when the data set is very large: I'm not sure about this, but I think so, by analogy with other criteria like the BIC).

In thinking about the bottom line, I thought of the phrase "you can't have your cake and eat it too" — which as it turns out is *exactly* what I said in the original footnote referring to this mistake:

Taking AIC weights as actual probabilities is trying to have one’s cake and eat it too; the only rigorous way to compute such probabilities of models is to use Bayesian inference, with its associated complexities (Link and Barker, 2006).

So there!
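For concreteness, the Akaike weights under discussion are defined by $\Delta_i = \mathrm{AIC}_i - \min_j \mathrm{AIC}_j$ and $w_i = \exp(-\Delta_i/2) / \sum_j \exp(-\Delta_j/2)$. A quick R sketch, with made-up AIC values:

```r
## Akaike weights from a vector of AIC values (the values here are made up)
aic <- c(m1 = 100.0, m2 = 101.2, m3 = 107.5)
delta <- aic - min(aic)                      # AIC differences from the best model
w <- exp(-delta / 2) / sum(exp(-delta / 2))  # Akaike weights; they sum to 1
round(w, 3)
```

Whether a given $w_i$ can then be read as "the probability that model $i$ is the K-L best model" is exactly the point at issue above.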

Bibliography

1. Mazerolle, Marc J. 2006. Improving data analysis in herpetology: using Akaike's Information Criterion (AIC) to assess the strength of biological hypotheses. Amphibia-Reptilia 27 (March): 169-180. doi:10.1163/156853806777239922.

2. Anderson, David Raymond. 2008. Model Based Inference in the Life Sciences: A Primer on Evidence. Springer. http://www.springer.com/life+sci/ecology/book/978-0-387-74073-7.

by bbolker

I will review the criticisms of dynamite plots, which I generally agree with, but then want to put forward a couple of their advantages, and suggest that the generally favored Tukey box-and-whisker plot is not a universal solution to graphical problems.


Criticisms

(these are mostly taken from the Vanderbilt wiki page on the topic)

- they have a low "data-to-ink" ratio
- they hide the raw data — the plot shows only the means and standard errors (or 95% confidence intervals) of the groups.
- they assume symmetric confidence intervals
- the previous two issues both stem from dynamite plots' strong parametric assumptions about the data. This document by Frank Harrell gives an example where a dynamite plot conceals strong bimodality (the existence of "responders" and "non-responders" within a single treatment group).
- dynamite plots that show only an upper whisker make it harder to compare groups
- This page says dynamite plots "cause an optical illusion in which the reader adds some of the error bar to the height of the main bar when trying to judge the heights of the main bars" (unfortunately it doesn't provide a reference)

Advantages

- they are comfortingly familiar to most ecologists and other biologists
- they anchor the data at zero (but see below about whether this is a good thing or not)
- they provide more area for displaying colors or gray scales than plots that use a single point for each value: this may (?) make it easier to distinguish groups of treatments in the data

Broader questions

- Should plots be anchored at zero? If the differences among group means, and/or the standard errors/confidence interval ranges, are small relative to the absolute magnitudes of the responses (e.g. means of two groups equal to 10 $\pm$ 0.01, 10.05 $\pm$ 0.02) it becomes very hard to see the differences. (If the point you want to make is that the results are statistically significant but very small in magnitude, perhaps this is a good thing …) In my opinion, if the data are not naturally anchored at zero (e.g. data shown on a logarithmic scale, or data in a very small range) then dynamite plots just don't make sense.
- Would you prefer to display the structure of the data, or the structure of statistical inferences on the data? The advantage of the generally recommended dot plot (for small data sets), boxplot (for medium data sets), or violin plot (for large data sets), is that they show what's going on in the data. However, they don't show anything about the confidence intervals on the mean (for example), although notched box plots (McGill et al. 1978: see figure above) give an approximate indication of significantly different medians.
- Do you want your plot to match the structure of the statistical test(s) you carried out? Standard ANOVA, for example, assumes normal errors with equal variances across groups. In that case, you presumably already did the exploratory data analysis (including graphing) to confirm these assumptions are reasonable. The assumptions of the dynamite plot (symmetric confidence intervals, reasonable distribution of points within groups) match the assumptions of your analysis. On the other hand, maybe you should show the data in the exploratory (dot/box/violin) format so that your readers can assess the validity of the assumptions for themselves rather than taking your word for it … or at least show the per-group standard errors rather than the pooled estimate of the standard error (as is done in the figure above). This problem gets even harder for complex statistical designs. How do you represent (for example) spatial dependencies in the data in a plot of the means?

I used the `OrchardSprays` data set from R (originally from Finney 1947, by way of McNeil 1977), which is used as an example in the `stripchart` documentation.

Some thoughts:

- the data are actually plotted on a log scale (see the code), which automatically (to my mind) rules out dynamite plots as a good idea. On the other hand, the data appear fairly well behaved (so we may not really need to see the details of their distribution)
- the data set is small (8 points per treatment), so boxplots and violin plots are probably not a good idea; you can see that some of the notches are turned inside out because the notch extends beyond the lower and upper "fences" (approx. 1st and 3d quartiles)
- the first row shows more "inferential" plots (based on means and standard errors), the second shows descriptive or exploratory plots; in the third plot, I tried to add more information by mimicking the Bayesian posterior plots from Gelman and Hill's book or their `arm` package for R. Their plots show the 50% and 95% (I think) credible intervals for each group; I show $\pm$ 1 and 2 standard errors (could also do this with 50% and 95% frequentist confidence intervals based on the t distribution).
- these plots are all done in base R graphics (with the help of the `vioplot` and `gplots` packages from CRAN). They could all be done, with more or less effort, in lattice graphics instead.

Here's the code.
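The linked code isn't reproduced here; the following base-graphics sketch (my reconstruction, not the original code) shows the basic contrast between a dynamite plot and the raw data for `OrchardSprays` on a log scale:

```r
## Dynamite plot vs. raw data for the built-in OrchardSprays data (log scale).
## This is a minimal sketch, not the original post's code.
data(OrchardSprays)
means <- tapply(log(OrchardSprays$decrease), OrchardSprays$treatment, mean)
ses   <- tapply(log(OrchardSprays$decrease), OrchardSprays$treatment,
                function(x) sd(x) / sqrt(length(x)))

op <- par(mfrow = c(1, 2))
## dynamite plot: bars anchored at zero, upper error whiskers only
b <- barplot(means, ylim = c(0, max(means + ses) * 1.1),
             ylab = "log(decrease)", xlab = "treatment")
arrows(b, means, b, means + ses, angle = 90, length = 0.05)
## raw data: one (jittered) point per observation
stripchart(log(decrease) ~ treatment, data = OrchardSprays,
           vertical = TRUE, method = "jitter", pch = 16,
           ylab = "log(decrease)")
par(op)
```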

*Updates*: Frank Harrell mentions `panel.bpplot` from the `Hmisc` package (e.g. see the R graphics gallery). These "box-percentile" plots offer a range of options combining the advantages of dotplots (showing the raw data) and violin plots (showing various amounts of detail about the shape of the distribution). (Still doesn't address the issues of description vs. inference, though.)

*More updates:*

Here are the same plots coded with ggplot:
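The ggplot versions didn't survive the export; a rough `ggplot2` equivalent (my sketch, assuming a current version of `ggplot2`, not the original code) of the jittered-points-plus-mean-and-SE panel would be:

```r
## Sketch of one panel with ggplot2 (assumed installed): raw jittered points
## plus mean +/- 1 SE per treatment group, on a log-scaled axis.
library(ggplot2)
data(OrchardSprays)
ggplot(OrchardSprays, aes(treatment, decrease)) +
  geom_jitter(width = 0.1, alpha = 0.5) +              # raw data
  stat_summary(fun.data = mean_se, colour = "red") +   # mean +/- 1 SE
  scale_y_log10()
```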

See also: this blog post comparing box & violin plots.

by bbolker

On the other hand, B&A make a compelling argument that BIC was developed to identify the "dimension" or "true number of parameters" of a model, and that this is rarely sensible in ecological modeling contexts because of what B&A call *tapering effects*. That is, if you have a large number of predictor variables some of which have non-zero effect sizes (e.g. regression coefficients) and some of which have zero coefficients, the BIC is trying to tell you how many are non-zero. B&A point out that it is more common that there will really be a few coefficients with large magnitude, more with smaller magnitude, even more with tiny magnitudes … and that all of the predictor variables really have **some** effect on the response, albeit very small. What we should be trying to do, they say, is identify how many parameters are useful for prediction rather than how many are non-zero. (This also agrees in general with the Bayesian argument against point null hypotheses, i.e. that parameters are never exactly zero — somewhat ironic, since it suggests that Bayesians would prefer AIC over the "Bayesian" Information Criterion.)

The bottom line: I would say the AIC is generally the right choice for ecological questions, over BIC, unless you're really trying to identify a specific number of components. (People do this in time-series analysis, to try to identify the number of time lags or interacting species, although I think they probably shouldn't — the "tapering effects" argument really applies here.)
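The contrast can be seen directly in the penalty terms: AIC charges $2k$ for $k$ parameters, while BIC charges $k \log n$, which is harsher whenever $n > e^2 \approx 7.4$. A toy R sketch (simulated data of my own, with a tiny "tapering" effect on `x2`):

```r
## AIC vs. BIC on simulated data: x2 has a real but tiny effect.
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.5 * x1 + 0.05 * x2 + rnorm(n)  # small, nonzero coefficient on x2
m1 <- lm(y ~ x1)
m2 <- lm(y ~ x1 + x2)
AIC(m1, m2)  # penalty 2 per extra parameter: more willing to retain x2
BIC(m1, m2)  # penalty log(100) ~ 4.6 per extra parameter: favors m1 more
```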

by bbolker

Brian Dennis commented to me at ESA today that he doesn't agree with my categorization of AIC etc.-based methods (what I call "model selection") as not being frequentist. I see what he means, but I think the distinction is a bit subtle. Here's a fuller explanation of his point and my feeling about it:

- AIC is actually an estimate of the *expected* difference in the "K-L discrepancy" between two models, that is, of the difference in the distances between those models and the true model (there are further deep waters here about whether we really need to assume that a "true" model exists at all, but I'm going to skim over them). Therefore, it is an estimate of the change in distance **on average, across many hypothetical realizations of the real world** — that is, it is really a frequentist point estimate of the change in distance.
- However, when you actually use AIC you don't say anything about probabilities, or think about averages across many realizations — you just say "this model is estimated to be better than this other model". So it definitely doesn't feel like a frequentist method, since you aren't making statements about the fraction of the time that some specified outcome would happen in many repeated trials (i.e., probability, according to the frequentist definition). So I thought it was easier when introducing it not to call it a frequentist method.
- The challenge with all these sorts of things is to give a definition that (1) most people can understand and (2) is *technically* correct even if it glosses over some details. I think I might have failed on criterion #2 here; I will try to find a better way to describe it that still gets the point across.

(Updated 6 August, BD kindly suggested some changes.)

by bbolker

I missed Aaron Ellison and Brian Dennis's talk today at ESA about "what literate ecologists should know about statistics" (I opted for Gordon Fox's talk about diversity gradients instead). It would have been at least culturally interesting since BD is a staunch frequentist and AE is a hard-core Bayesian. (Interestingly, most statisticians have got beyond having these arguments. I apologize if either of the authors disagree with my characterization of them.) I mostly agreed with what I saw in the abstract, although I don't draw as bright a line as they appear to between "model-based" and "design-based" statistics. Here's an excerpt:

we suggest that literate ecologists at a minimum should master core statistical concepts, including probability and likelihood, principles of data visualization and reduction, fundamentals of sampling and experimental design, the difference between design-based and model-based inference, model formulation and construction, and basic programming. Because mathematics is the language of statistics, familiarity with essential mathematical tools – matrix algebra and especially calculus – is a must and will facilitate collaborations between ecologists and statisticians.

Maybe I'll post more here if they would like.

[Update August 5: Aaron said something today about publishing it in *Frontiers in Ecology* (IIRC).]

by bbolker

I really enjoyed reading Breiman 2001 (Statistical Science 16:199-231), in which he says (approximately) that statisticians are too concerned with "true models" and that they should focus on descriptive, flexible, nonparametric models (random forests, generalized additive models, etc.) instead. I really can't do these topics justice here, but it just reminds me that there's a whole other school (or something) of statistical thought. Philosophically, it ties in with Peters' *A Critique for Ecology* and (perhaps) Jim Clark's recent claims that we should give up on "low-dimensional" models and embrace nature in its noise and complexity [ok, perhaps a bit unfair but I'm in a hurry]. Reasons I haven't focused on such topics in the book: (1) I don't know them that well, so I can't really teach them to others; (2) they are much more complicated computationally, thus leading to a "user" perspective rather than a "builder" perspective (not necessarily a bad thing but not what I was aiming for); (3) they are typically very data-hungry, unsuited to many (but certainly not all) ecological data sets; (4) I like tying models to ecological theory. John Drake at UGA is from the Breimanish (Breimanesque) school; I don't know if he has any interesting perspectives written on the topic …

*Further thoughts* (a few hours later): one way to compare the two statistical approaches is that the "classical" approach is trying to *explain* or *test*, while the more "modern" (all in quotation marks) approach wants to *describe* and *predict*: as I comment in the book, while these questions often have similar answers, they imply different statistical approaches — and the answers are not always the same. Another related topic is *semi-mechanistic* models, which Steve Ellner and Simon Wood have variously championed. More later (?)

by bbolker
