This article is the first in the occasional “Distinguished Educator” series appearing in the 2003-04 volume year of The Reading Teacher. Other articles in this series will be reprinted in Reading Online as they are published.


A Few Things Reading Educators Should Know About Instructional Experiments

Michael Pressley



I am confronted almost daily by the 2001 No Child Left Behind (NCLB) legislation and its demand that reading instruction be scientifically evidence based. Assertions are also being made that reading instruction provided to children in the United States should be evaluated in randomized experiments, especially reading instruction supported by public funds, such as those provided by the Reading First program.

As an experimental psychologist, I have contributed many "true" experiments (i.e., experiments with random assignment to condition) to the scientific literature. I am proud that I have mastered the ability to conduct research experiments. I am also proud that as part of my education I learned a great deal about the strengths and weaknesses of experimentation—when experiments are credible and useful and when they are not.

Many of the contemporary assertions about experimentation and reading instruction concern me. Many claims are debatable at best and sometimes just plain wrong. I spoke about my concerns at the 2002 International Reading Association Convention. At the conclusion of that address, the editors of The Reading Teacher requested that I prepare an article instructive to teachers about experimentation and evidence-based reading instruction.

I present in this article 12 points about experimentation that I think all teachers concerned with evidence-based reading instruction should keep in mind as they evaluate the flurry of assertions that are overflowing the marketplace of ideas in literacy education. After presenting these points, I reflect briefly on how responsible experimenters who wish to inform literacy instruction should communicate research results to classroom practitioners.

Twelve Points to Remember

In reflecting on these 12 points, keep in mind that whole books are written about experimentation, and whole books could be written about experimentation in literacy education. I think that literacy educators should read much more than this article to become informed about how to interpret and use experimental research to enlighten their classroom practices. Perhaps what follows will whet the appetites of at least a few readers to do exactly that.

Experimentation provides cause-and-effect conclusions not possible with other research methods

There is no doubt that experimentation, when it is possible to randomly assign participants to differing forms of instruction, permits insights on instruction as a cause of achievement better than any other methodology. There is no substitute for the randomized experiment as a powerful window on cause-and-effect relationships.

For example, if a researcher is interested in testing some new type of phonemic awareness instruction against an existing type and can randomly assign kindergarten classrooms to receive either the new type of instruction or the existing one, it is possible to decide whether the new type of instruction is more effective. To do this, we assess phonemic awareness at the beginning of the kindergarten year, before students have experienced any instruction, with some formal instrument. If at that point phonemic awareness is equivalent in the two classrooms, an interpretable experiment can begin. After students receive instruction (i.e., in the late spring of the kindergarten year), if phonemic awareness is greater in classrooms receiving the new type of instruction than in ones receiving the existing instruction, the best bet is that the new type of phonemic awareness instruction caused the difference. If different researchers try differing versions of the new intervention and it consistently produces better achievement, confidence in the new intervention increases. Replication often is a good thing, but not always.
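The logic of the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the classroom labels, the fall scores, and the helper names `assign_classrooms` and `mean_difference` are all invented here and do not come from any study.

```python
import random
import statistics

def assign_classrooms(classroom_ids, seed=0):
    """Randomly split classrooms into 'new' and 'existing' instruction groups."""
    rng = random.Random(seed)
    shuffled = list(classroom_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"new": shuffled[:half], "existing": shuffled[half:]}

def mean_difference(scores, groups):
    """Mean score of the 'new' group minus mean score of the 'existing' group."""
    new_mean = statistics.mean(scores[c] for c in groups["new"])
    existing_mean = statistics.mean(scores[c] for c in groups["existing"])
    return new_mean - existing_mean

# Hypothetical classrooms and invented fall phonemic awareness scores.
classrooms = ["A", "B", "C", "D", "E", "F"]
fall_scores = {"A": 11, "B": 12, "C": 10, "D": 11, "E": 12, "F": 10}

groups = assign_classrooms(classrooms, seed=42)
# A difference near zero suggests the groups started out equivalent.
print("Fall difference (new - existing):", mean_difference(fall_scores, groups))
```

The same `mean_difference` call applied to spring scores would give the end-of-year comparison. Because chance alone decided which classrooms got which instruction, a clear spring difference is most plausibly due to the instruction itself.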

Replications of the effects of an instructional intervention are not always good

The National Reading Panel (NRP) did a good job of reviewing the research on phonemic awareness instruction. The NRP identified 58 experiments and concluded that such instruction increases students' phonemic awareness and subsequent beginning reading achievement. Even so, the panel urged more such studies. Its members seem not to have noticed that there comes a point of diminishing returns: after a few replications of an effect, not much is learned from additional tries. Perhaps we could live with such diminishing returns if experiments were not costly, but they are. If what is being tested is a form of instruction that takes weeks or months to implement, such as most forms of phonemic awareness instruction, the costs can be quite high. That would not be such a problem if there were an abundance of funded research support in reading and reading education, but that is not the case.

A preferred methodology for summarizing research, meta-analysis, depends on many replications. To perform a meta-analysis, there must be a large number of experimental comparisons to put into the analysis, with each replication estimating how big an effect one form of instruction produces. A meta-analysis produces an arithmetical average of the individual effect sizes from the many replications. That meta-analysis depends on many replications, however, could be interpreted as a fatal flaw of the methodology. It seems increasingly so to me, because the most important conclusions to be drawn from meta-analyses are almost always ones already known to anyone who has read the literature or even part of the literature about the instruction being assessed. That was certainly the case with respect to the NRP's findings on phonemic awareness instruction. Anyone who had contact with that literature knew that phonemic awareness could be affected through instruction in ways that promoted later reading achievement.
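The averaging step described above can be sketched as follows. The sketch assumes a standardized mean difference (Cohen's d) as the effect size and a simple unweighted mean, matching the "arithmetical average" wording. Published meta-analyses often weight studies by sample size or precision, so treat this as a simplification. The three studies and the function names are invented for illustration.

```python
import math
import statistics

def cohens_d(mean_new, mean_old, sd_new, sd_old, n_new, n_old):
    """Standardized mean difference between two groups, using a pooled SD."""
    pooled_var = ((n_new - 1) * sd_new**2 + (n_old - 1) * sd_old**2) / (n_new + n_old - 2)
    return (mean_new - mean_old) / math.sqrt(pooled_var)

def mean_effect_size(effect_sizes):
    """Unweighted arithmetic average of the per-study effect sizes."""
    return statistics.mean(effect_sizes)

# Three hypothetical replications: (mean_new, mean_old, sd_new, sd_old, n_new, n_old).
studies = [
    (55, 50, 10, 10, 30, 30),
    (53, 50, 10, 10, 40, 40),
    (58, 50, 10, 10, 25, 25),
]
effects = [cohens_d(*study) for study in studies]
print("Per-study effect sizes:", effects)
print("Average effect size:", mean_effect_size(effects))
```

Each added replication changes the average only a little once a handful of studies agree, which is the diminishing-returns point made in this section.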

Determining that instruction is effective only if its effects are replicated across a number of studies (i.e., there are enough replications to do a meta-analysis) is a decision to ignore the many instructional effects that have been studied in only one or a few investigations. Such a decision can also lead to the misconception that there are only a few forms of instruction that enjoy scientific validation through the rigorous application of experimentation.

Many forms of reading instruction enjoy some experimental validation

Often those shouting most loudly about the benefits of experimental data do so to support only a few reading instructional interventions—usually only those reviewed by the National Reading Panel (e.g., phonemic awareness instruction, phonics, vocabulary teaching, comprehension instruction). There are many more reading instructional interventions that have promoted literacy achievement that enjoy at least some experimental support. These include interventions during the preschool years—some as focused and universally available as Sesame Street (e.g., Ball & Bogatz, 1970; Bogatz & Ball, 1971; Rice, Huston, Truglio, & Wright, 1990), others more diffuse and scarcer, such as high-quality preschool (e.g., Consortium for Longitudinal Studies, 1983; Lazar, Darlington, Murray, Royce, & Sipper, 1982).

Once children make it to elementary school, every classroom must be managed, and those classrooms that are managed better have a positive impact on reading achievement (Emmer, Evertson, & Anderson, 1980). Not every classroom includes cooperative learning, but those that do so have a more favorable impact on literacy achievement than those that do not (Sharan, 1980; Slavin, 1990). Expensive one-on-one tutoring programs staffed by professionals affect beginning reading achievement (Pinnell, 2000), but so do other less-expensive volunteer programs and even peer-tutoring approaches (Invernizzi, 2001). Small class size matters in beginning reading achievement (e.g., Achilles, Finn, & Bain, 1997). I could go on and on. Anyone who advances the argument that scientifically supported reading instruction boils down to only the instruction highlighted by the NRP (i.e., phonemic awareness instruction, phonics, vocabulary teaching, comprehension strategies instruction) should go to the library and begin reading the research literature. There are many reading instructional interventions that enjoy some support in true experiments.

Sometimes one experiment is all we will have in the foreseeable future

There are some studies of important instructional interventions that are never going to be replicated because of the cost. For example, consider my own work on transactional comprehension strategies instruction, a long-term form of comprehension strategies instruction. This form of instruction starts with teacher modeling and explanation of a small repertoire of comprehension strategies (e.g., prediction based on prior knowledge, self-questioning, construction of mental images, summarization). Students in small groups practice using their repertoire of comprehension strategies in a think-aloud fashion. The theory is that eventually students internalize the strategies and actively use them when reading in the ways that advanced readers do (Pressley & Afflerbach, 1995).

Teachers require substantial training before they can deliver transactional comprehension strategies instruction, and its delivery occurs over semesters and years. My colleagues and I (Brown, Pressley, Van Meter, & Schuder, 1996) managed to do one very well-controlled quasi-experiment of transactional strategies instruction (TSI) at the grade 2 level. The study took more than a year to plan and a year to execute, and it took more than a year to analyze and write up the results. I will be surprised if there is even a single replication of the study, and I am unapologetic in arguing that such carefully conducted efforts should be taken seriously in discussions aimed at reforming and improving primary-grades reading instruction. Fortunately, the NRP seemed to agree and indicated that TSI is a promising new direction in comprehension strategy instruction. Just think for a moment how many forms of instruction require long-term teacher development before teachers can deliver them well. There are many, and to decide to ignore or discount well-conducted experimental analyses of such interventions because they are few is negligence we can ill afford.

Experimentation is not a good way to test published reading programs

Despite many policymakers' conviction that reading instruction should be evaluated in true experiments, few, if any, adopted school reading programs have been evaluated. To my knowledge, none of the published, comprehensive reading instruction programs have been subjected to evaluation in true experiments. Rather, most comprehensive reading instructional programs that can legitimately claim to be evidence based include elements of instruction that have been evaluated in true experiments and have been found to produce reading achievement gains. Thus, the instruction in many published comprehensive reading instructional programs is placed into the program, in part, because similar instruction evaluated in true experiments has produced gains in reading.

Critics of published comprehensive reading instructional programs have begun to argue that being evidence based in this sense is not good enough. If comprehensive reading instructional programs want to be considered scientifically based, they should be evaluated in randomized true experiments. I do not agree. The design, execution, and reporting of such experiments would take a minimum of two to three years, maybe longer. By the time such experiments were concluded, the specific published programs under evaluation would be out of print, replaced by revised and new comprehensive reading instructional programs. The net effect would be well-controlled evaluations of products that were disappearing from the marketplace.

It makes sense to conduct experiments on interventions and instructional practices that are likely to endure. When it is certain that a comprehensive reading instructional program will be replaced with the next state adoption cycle, there is no justification for expending the resources required to perform a true experiment. That said, from time to time an experiment is conducted using one of the comprehensive programs as a treatment. It is ironic that, when that has occurred in the past, the scientists were not really interested in evaluating the program as a whole but were interested in a single instructional component found within the program, and they claimed that the program benefits were due to the single instructional component of interest to them. Do not believe claims about single instructional components that emerge from evaluations in which whole programs are tested against one another.

Comprehensive literacy programs combine a number of instructional components

One such study comes to mind again and again. It involved a comparison of four beginning reading instructional programs, each of which included some form of word-recognition instruction, literature, writing, and so on. Only one program included synthetic phonics. When that program produced greater effects than the other three, the researchers concluded it was due to the synthetic phonics. The problem, of course, was that the students in that treatment also received literature and writing instruction unique to that instructional program, along with a number of other bells and whistles not in the other packages. There was no way to isolate the effects of the synthetic phonics component. That is typical when a large program is evaluated in an experiment: It is impossible to know which of the program's many components produced the effects that are observed.

Researchers should be knowledgeable about the instruction under evaluation

There are researchers performing experiments on instructional programs about which they know very little. They are professional experimenters rather than informed instructional researchers. I harbor deep concerns about whether such professional experimenters can do a credible job evaluating reading instruction, especially complex reading instruction. It takes thorough familiarity with a complex form of instruction to design a credible, analytical experiment to evaluate it. Again, consider my work on transactional strategies instruction, described earlier (Brown et al., 1996).

That comparative quasi-experiment was informed by several years of ethnographic research in schools and classrooms. Immersion in the instruction in advance of designing a formal quasi-experiment permitted us to identify teachers who delivered the instruction well versus control teachers who delivered excellent instruction that did not include transactional comprehension strategies. In addition, it allowed my colleagues and me to identify a range of dependent variables that tapped a variety of effects produced by the intervention. The result was a compelling experiment, one substantially more informative than if we had attempted to do a study of transactional strategies instruction without the years of immersion in the intervention. When evaluating the quality of an experimental study, it is important to ask whether the experimenter was familiar enough with the instruction in advance of the experiment to design a good study. Such familiarity makes it more likely that the instruction will be delivered with integrity. Such work in advance of instructional experimentation is also essential if instructional research is to have external validity.

External validity matters

An externally valid study has real-world credibility. There are many instructional experiments that fall short on this criterion. Perhaps the most obvious failures of external validity involve instruction that is just not like the instruction that would occur in school. Failure can occur for a number of reasons. Sometimes, as suggested in the last section, researchers may not be sufficiently familiar with the intervention being studied. An underinformed experimenter cannot implement such instruction as well as those who designed it, or those who have experience delivering it in classrooms and schools. Other times, experimenters decide to simplify the instruction, perhaps for reasons of cost. For example, researchers who cannot afford a whole year of transactional strategies instruction, which would teach students to use and coordinate a half dozen or so comprehension strategies, might compromise by teaching only two strategies over a couple of months, providing students with a few weeks of practice using the strategies together.

When someone other than students' teachers provides instruction (e.g., the experimenters themselves), experimental external validity is threatened. In my experience, such teaching rarely resembles what goes on in actual classrooms.

When instruction is provided by teachers who are poorly prepared to deliver it, experimental external validity is once again threatened. Often, such a decision is made in the service of internal validity—that is, having a truly randomized experiment. The problem with randomly assigning teachers to instructional conditions is that it often results in instruction provided by teachers who (a) do not understand the form of instruction they are delivering or (b) are not committed to delivering the instruction because they do not believe in it.

External validity is important enough that sometimes internal validity should be compromised for it. In particular, when a form of instruction can be delivered only by teachers who know the instruction well, it makes little sense to randomly assign teachers to instructional conditions. If it is possible to assign students randomly to teachers, the internal validity problem usually is solved. If it is not possible, then the best way to do a comparative study might be to conduct a quasi-experiment—a study comparing achievement produced by teachers who are familiar with the target instruction and committed to it versus achievement produced by teachers who use a different strategy (Cook & Campbell, 1979). Such a quasi-experiment can work if, before the study begins, there is careful matching of classes with respect to achievement (i.e., achievement is the same in the classes receiving the target instruction as in the comparison classes receiving another form of instruction).
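One way to picture the careful matching described above is a simple nearest-score pairing of classes on a pretest. This is a minimal sketch under stated assumptions, not the procedure used in any particular study: the function name `match_classes`, the `max_gap` tolerance, and the class labels and scores are all hypothetical.

```python
def match_classes(target_pretest, comparison_pretest, max_gap=2.0):
    """Greedily pair each target class with the closest unused comparison class.

    Both arguments map a class label to its mean pretest score. A target
    class is left unmatched if no remaining comparison class lies within
    max_gap points of it.
    """
    available = dict(comparison_pretest)
    pairs = []
    for target, score in sorted(target_pretest.items(), key=lambda kv: kv[1]):
        if not available:
            break
        best = min(available, key=lambda name: abs(available[name] - score))
        if abs(available[best] - score) <= max_gap:
            pairs.append((target, best))
            del available[best]
    return pairs

# Invented mean pretest scores for classes taught with the target
# instruction (T*) and candidate comparison classes (C*).
target = {"T1": 40.0, "T2": 50.0}
comparison = {"C1": 49.0, "C2": 41.0, "C3": 70.0}
print(match_classes(target, comparison))
```

Here class C3 is never used because its starting achievement is far from both target classes. Leaving out poorly matched classes is the price of a fair comparison when random assignment is not possible.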

Another threat to the external validity of an experiment deserves attention. Strong assertions are currently being made about how instructional effects should be measured in instructional experiments.

Standardized tests are not a gold standard

Given the importance of standardized tests in the current accountability environment, they often show up as the measures of dependent variables in many instructional experiments and quasi-experiments. My perspective, however, is that standardized achievement or reading survey tests do not reveal much. Standardized instruments are an aggregation of measures, a collapsing across a variety of items tapping many different processes. When an instructional intervention affects a standardized test score, little is really found out about how reading was affected—how the reading process was changed by instruction. To determine specific instructional effects, more direct measures of process are required. Thus, studies focusing on word-recognition processes should include measures tapping those processes as directly as possible (e.g., if synthetic phonics is being taught, reading of nonsense words can reveal whether children can sound out and blend sounds). If comprehension strategies are being taught, measures of comprehension strategies make sense (e.g., think-alouds during oral reading of a story). Thus, in the comparative study of transactional strategies instruction, standardized reading performance measures were used to satisfy those concerned with conventional accountability. Measures tapping the comprehension strategies of readers as well as the effects of the instruction on the richness of retellings were also used. Such measures of process illuminated how the instruction affected reading processes more directly than standardized test data. I do not see enough of such analytical and specific measurement in many experiments on reading instruction.

Of course, diverse and specific measurement is just one way to increase the external validity of a study. In evaluating the external validity of experiments, it is crucial to examine who is studied.

Experiments should include students who are the intended targets of the instruction being evaluated

Researchers sometimes claim their interventions work with populations of students they have never studied. Although I argued earlier that too many replications should be avoided, it makes sense to replicate intervention effects across populations who might benefit from the type of instruction being studied. For example, we know much more about the effects of phonics interventions on struggling readers than we know about their effects on strong beginning readers. We really do need to study the possible effects of such interventions on nonstruggling readers, because in many U.S. schools today the full range of beginning readers is receiving a lot of phonics instruction. Might that instruction bore better readers, perhaps reducing their motivation for reading and artificially compressing their reading achievement? Might an overemphasis on phonics instruction slow down nonstruggling readers' progress toward fluency, forcing attention to sounding out long after they really need to attend to the separate sounds and their blending? As far as I can tell, we have not really documented what happens when excellent beginning readers experience heavy doses of phonics instruction. We simply assume that what is good for weaker readers must also be good for the best beginning readers. Maybe it is, but I'm not so sure. Claims of broad applicability of an intervention require studies of it with a variety of populations.

Experimenters sometimes overstate their case

What can an experimenter conclude when a new type of instruction produces positive achievement relative to existing instruction? As I indicated earlier, the best bet is that the new form of instruction caused the achievement difference. Sometimes, however, that claim is broadened—either by the experimenter or someone else—to state that the new form of instruction is the best one. It is impossible to decide that some form of instruction is the best unless it is tested against all possible alternative forms of instruction that exist now or might exist in the future. When an experimenter claims that a form of instruction produces superior achievement, demand a comparison. If a claim is made that the instruction is superior to every other form of instruction, then reject the claim as impossible.

Demand also to know just how big an effect the instruction under investigation had. Be particularly careful in interpreting claims that instructional effects are significant. The word significant is ambiguous when used to describe a research outcome. It can be used to state that the finding is statistically significant (e.g., there is less than a 5% or 1% or one tenth of 1% chance that the difference obtained is a chance difference). In this sense, a finding is considered to be highly significant when there is an exceptionally low probability that the difference observed was due to chance. Thus, when there is less than one tenth of 1% chance that a difference is due to chance, the finding is often referred to as highly significant, even though such a statistical difference can be quite small in absolute or practical terms, so small as to affect performance hardly at all. Most often, when a small practical or absolute effect has high statistical significance, it is because there were a lot of participants in each of the instructional conditions. When that is the case, a difference of high statistical significance can be inconsequential in terms of educational or practical classroom significance. That is, the students receiving the new intervention do only a little bit better than the students receiving the conventional instruction.
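The role of sample size can be illustrated with a rough calculation. The sketch below assumes two equal-sized groups and a normal (z) approximation, and the effect size of 0.05 standard deviations is an invented value chosen to represent a very small practical difference. The function name is hypothetical.

```python
import math

def two_sided_p_value(effect_size_d, n_per_group):
    """Approximate two-sided p-value for a standardized mean difference.

    Assumes two groups of equal size and a normal approximation, so the
    test statistic is z = d * sqrt(n / 2).
    """
    z = effect_size_d * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))

tiny_effect = 0.05  # invented: one twentieth of a standard deviation
for n in (20, 200, 2000, 20000):
    print(n, "students per group -> p =", two_sided_p_value(tiny_effect, n))
```

The effect size is identical on every line. Only the number of students changes, yet the p-value moves from far above the 5% threshold to well below one tenth of 1%.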

Those favoring meta-analysis have heightened our sensitivity to this issue because such analyses emphasize the size of an effect rather than its statistical significance. Still, there are many who are willing to praise interventions they favor because they produce effects of high statistical significance that are really very small absolute, practical, or educational effects. Demand to know (a) just how big statistically significant effects are in terms of practical or absolute effect sizes and (b) when the experiment and resulting conclusions were produced.

Conclusions can change with the times

There are some interventions in use that were tested a few decades ago in schools that might have been very different from schools today. I'm wary of accepting results produced years ago. Sometimes instruction works differently with different generations because of changes in the population or the general culture or school environment. In addition, conclusions about particular forms of instruction generated in 1953 were determined relative to control and comparison conditions that would not make sense in 2003.

This is not to say that approaches created years ago are necessarily ineffective, but they do need to be reevaluated from time to time. For example, Chall (1967) was able to make a pretty good case that synthetic phonics produced better word recognition than analytic phonics, at least in general. That conclusion was based on research up until the mid-1960s. When the National Reading Panel looked at phonics instruction in more recent research, however, it was not willing to provide the same endorsement for synthetic phonics approaches relative to other phonics approaches. The NRP concluded that a variety of systematic phonics approaches were equally effective. It is risky to draw strong conclusions about what works best today based on data produced in an earlier era.

Why try to provide evidence-based reading instruction?

I think evidence-based reading instruction is a good thing, although I am concerned about how narrowly it is being construed in many policy conversations. I urge educators to embrace the evidence-based direction and to be informed about and use the types of instruction that have been validated in experiments, but not just because there are experiments supporting such instruction.

Instructional interventions and practices validated in true experiments do show up in the teaching of excellent teachers. In recent work documenting what goes on in engaging and effective primary-grades classrooms, my colleagues and I watched literacy instruction that was filled with practices that enjoy support in true experiments (Pressley, Allington, Wharton-McDonald, Block, & Morrow, 2001; Pressley, Wharton-McDonald, et al., 2001; Wharton-McDonald, Pressley, & Hampston, 1998). Excellent primary-grades teachers flood their classrooms with motivational practices based on experimental research about how to motivate academic engagement (e.g., encouraging students to make effort attributions, making learning tasks interesting, engaging in cooperative learning). Their literacy curriculum and instruction are also filled with practices that have been validated in studies reported in literacy research journals (e.g., Pressley, 2002; Wood, Bruner, & Ross, 1976). These teachers' classroom management is consistent with the best validated management practices (e.g., Evertson, Emmer, Clements, & Worsham, 2000).

It is sad that, at best, we have found only about 20% of primary classrooms are engaging and effective. By comparison, instruction in less engaging and less effective classrooms also tends to be less consistent with instructional practices that have been validated in experimental research (Pressley, Allington, et al., 2001; Pressley, Wharton-McDonald, et al., 2001; Wharton-McDonald et al., 1998). Although our work portraying the best of reading instruction has filled us with optimism that evidence-based, consistent teaching can be very good, the same work has heightened our awareness that policymakers are justifiably concerned about the quality of teaching in the United States. The research advancements in the fields of literacy and literacy education have been magnificent during the past half-century. Educators and policymakers need to spend time looking at scientific evidence to determine how to reform literacy teaching, but neither group has digested this body of work enough to make the best use of it. I hope the 12 tips I have offered about how to understand and use experiments in reading education are helpful to those educators and policymakers who are responsible for teaching our children to read.

References

Achilles, C.M., Finn, J.D., & Bain, H.P. (1997). Using class size to reduce the equity gap. Educational Leadership, 55, 40–43.

Ball, S., & Bogatz, G.A. (1970). The first year of "Sesame Street": An evaluation. Princeton, NJ: Educational Testing Service.

Bogatz, G.A., & Ball, S. (1971). The second year of "Sesame Street": A continuing evaluation. Princeton, NJ: Educational Testing Service.

Brown, R., Pressley, M., Van Meter, P., & Schuder, T. (1996). A quasi-experimental validation of transactional strategies instruction with low-achieving second grade readers. Journal of Educational Psychology, 88, 18–37.

Chall, J.S. (1967). Learning to read: The great debate. New York: McGraw-Hill.

Consortium for Longitudinal Studies. (1983). As the twig is bent. Hillsdale, NJ: Erlbaum.

Cook, T.D., & Campbell, D.T. (1979). Quasi-experimentation: Design and analysis issues for field studies. New York: Rand-McNally.

Emmer, E., Evertson, C., & Anderson, L. (1980). Effective classroom management at the beginning of the school year. Elementary School Journal, 80, 219–231.

Evertson, C.M., Emmer, E.T., Clements, B.S., & Worsham, M.E. (2000). Classroom management for elementary teachers. Boston: Allyn & Bacon.

Invernizzi, M.A. (2001). The complex world of one-on-one tutoring. In S.B. Neuman & D.K. Dickinson (Eds.), Handbook of early literacy research (pp. 459–470). New York: Guilford.

Lazar, I., Darlington, R., Murray, H., Royce, J., & Sipper, A. (1982). Lasting effects of early childhood education: A report from the consortium for longitudinal studies. Monographs of the Society for Research in Child Development, 47(2–3, Whole No. 195).

National Reading Panel. (2000). Teaching children to read: An evidence-based assessment of the scientific research literature on reading and its implications for reading instruction: Reports of the subgroups. Washington, DC: National Institute of Child Health and Human Development.

Pinnell, G.S. (2000). Reading Recovery: An analysis of a research-based intervention. Columbus, OH: Reading Recovery Council of North America.

Pressley, M. (2002). Effective beginning reading instruction: A paper commissioned by the National Reading Conference. Journal of Literacy Research, 34, 165–188.

Pressley, M., & Afflerbach, P. (1995). Verbal protocols of reading: The nature of constructively responsive reading. Hillsdale, NJ: Erlbaum.

Pressley, M., Allington, R., Wharton-McDonald, R., Block, C.C., & Morrow, L.M. (2001). Learning to read: Lessons from exemplary first grades. New York: Guilford.

Pressley, M., Wharton-McDonald, R., Allington, R., Block, C.C., Morrow, L., Tracey, D., et al. (2001). A study of effective grade-1 literacy instruction. Scientific Studies of Reading, 5, 35–58.

Rice, M.L., Huston, A.C., Truglio, R., & Wright, L.C. (1990). Words from Sesame Street: Learning vocabulary while viewing. Developmental Psychology, 26, 421–428.

Sharan, S. (1980). Cooperative learning in small groups: Recent methods and effects on achievement, attitudes, and ethnic relations. Review of Educational Research, 50, 241–271.

Slavin, R.E. (1990). Cooperative learning. Review of Educational Research, 50, 315–342.

Wharton-McDonald, R., Pressley, M., & Hampston, J.M. (1998). Outstanding literacy instruction in first grade: Teacher practices and student achievement. Elementary School Journal, 99, 101–128.

Wood, S.S., Bruner, J.S., & Ross, G. (1976). The role of tutoring in problem solving. Journal of Child Psychology and Psychiatry, 17, 89–100.


About the Author

Pressley teaches at Michigan State University (635 Applegate Lane, East Lansing, MI 48823, USA).






Citation: Pressley, M. (2003, September). A few things reading educators should know about instructional experiments. Reading Teacher, 57(1). Available: http://www.readingonline.org/articles/art_index.asp?HREF=RT/9-03_column/index.html




Reading Online, www.readingonline.org
Published September 2003 in The Reading Teacher
Posted simultaneously in Reading Online
September 2003
© 2003 International Reading Association, Inc.   ISSN 1096-1232