Power in acceptability judgment experiments and the reliability of data in syntax


The literature on acceptability judgment methodology has recently been dominated by two trends: criticisms of traditional informal judgment experiments as coarse and unreliable, and endorsement of larger, formal judgment experiments as more sensitive and reliable. To investigate these claims empirically, we present a systematic comparison of the statistical power of two types of judgment experiments: the Forced-Choice task, which is traditionally used in informal syntactic experiments, and the Magnitude Estimation task, which is traditionally used in formal syntactic experiments. We tested 48 pairwise phenomena spanning the full range of effect sizes found in a recent large-scale empirical survey of the core phenomena of syntactic theory (Sprouse & Almeida, submitted), deriving estimates (via resampling simulations) of statistical power for each phenomenon at sample sizes from 5 to 100 participants. The results show that (i) contrary to recent criticisms, Forced-Choice experiments are generally more powerful than formal Magnitude Estimation experiments at detecting differences between sentence types, and that (ii) even under the most conservative assumptions, Forced-Choice experiments with small sample sizes achieve the “best practice” guideline of 80% statistical power (established for experimental psychology and the social sciences) for 95% of the phenomena in syntactic theory. We also compare the standardized effect sizes of the syntactic phenomena with those of phenomena in other domains of experimental psychology, and show that the former are, on average, four times larger than the latter. These results suggest that well-constructed, small-scale, informal syntactic experiments may in fact be among the most powerful experiments in experimental psychology.
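The resampling approach to estimating power can be illustrated with a minimal sketch. The abstract does not specify the statistical test or the resampling details, so everything below is an assumption made for illustration: the function names `sign_test_p` and `estimate_power`, the use of a one-sided exact sign test for Forced-Choice responses, and the invented response pool are all hypothetical and are not taken from the manuscript.

```python
import random
from math import comb


def sign_test_p(successes, n):
    """One-sided exact binomial p-value, P(X >= successes), under chance (p = 0.5)."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n


def estimate_power(pool, sample_size, n_sims=1000, alpha=0.05, seed=0):
    """Estimate power by drawing many resamples of `sample_size` responses.

    `pool` holds one Forced-Choice response per participant: 1 if the
    participant chose the sentence predicted to be more acceptable, else 0.
    Power is the share of resamples in which the sign test is significant.
    """
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_sims):
        # Sample participants with replacement from the observed pool.
        sample = rng.choices(pool, k=sample_size)
        if sign_test_p(sum(sample), sample_size) < alpha:
            significant += 1
    return significant / n_sims


if __name__ == "__main__":
    # Hypothetical pool (invented for illustration, not data from the study):
    # 80 of 100 participants chose the predicted sentence.
    pool = [1] * 80 + [0] * 20
    for n in (5, 10, 20, 50, 100):
        print(n, estimate_power(pool, n))
```

Running the loop over several sample sizes yields a power curve for one phenomenon; the 80% guideline mentioned above corresponds to the smallest sample size at which the estimate reaches 0.8. A Magnitude Estimation analogue would resample numerical ratings and apply a different test, which this sketch does not attempt.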

Unpublished manuscript