About the Original Article's Tone
This is a peer-reviewed empirical journal article published in Instructional Science (2024) — a cognitive psychology / learning sciences journal, not a math education or ESOL-specific one. The intended audience is researchers in instructional design, cognitive load theory, and mathematics education. It is not written for classroom teachers.
It uses:
- Heavy statistical reporting — multiple regression tables, Cohen's f², VIF and Durbin-Watson diagnostic values, Shapiro-Wilk normality tests — reporting that assumes a reader comfortable with quantitative methods
- Cognitive load theory (CLT) framing throughout — Sweller's working memory / long-term memory architecture is the theoretical scaffolding, not linguistic or sociocultural theory
- Relatively dry, precise academic prose — sentences are long and loaded; the authors write to document, not to persuade
- A crossover structure — the waitlist design means each teacher was both treatment and control at different points; the article spends considerable space explaining this, which is unusual and slows reading
- Honest limitations section — the authors name their sample size, COVID conditions, and self-reported proficiency data as genuine constraints; this is refreshingly candid for an empirical study
The vibe: This feels like a careful lab report that wandered into a real classroom. You get a sense of researchers who genuinely tried to do this right under difficult conditions (during COVID, in urban sheltered-algebra classes, with a small N) — but the writing is designed to satisfy peer reviewers, not to make you want to run this in your classroom on Monday.
What it glosses over: The article says sentence frames were used but doesn't show you what they looked like at scale. The appendices have four sample WEPs (Worked Example Pairs) but no systematic description of how many frames, what linguistic structures, or how the phrase bank was organized. You get the forest — "sentence frames help ELs organize responses" — but not the trees. For MASL purposes, this gap is exactly where the original contribution lives.
Visual Metaphor
A songbird hatches in a nest it didn't choose. Before it can form a single note, it spends weeks hearing the same pattern from its parent — the same phrase, the same rhythm, the same specific call. It cannot produce it yet. But something is being inscribed: the neural template that will govern its voice for the rest of its life. Then one morning it opens its bill and out comes the song. Not a rough draft. The actual song. Learned before it could be spoken. The scaffold wasn't training wheels. It was the template itself.
What This Is Really About
You've heard the argument before: kids learn by doing, not by being shown. Worked examples are "passive." Sentence frames are "crutches." Let students figure it out themselves and they'll develop real understanding. It's an appealing idea. It's also, for students juggling a second language and algebra at the same time, often wrong.
Ke and Newton took the existing research on comparing worked examples — which has a solid track record in mainstream algebra classrooms — and asked the obvious question no one had gotten around to asking: does this work for English Learners (ELs) too? And if so, what modifications do you actually need to make?
The Core Idea
Modified for Language Support–Worked Example Pairs (MLS-WEPs) are the standard Worked Example Pair (WEP) curriculum adapted specifically for ELs in sheltered algebra classes. The core structure is unchanged: you put two worked examples side by side, and students compare them. What changed is everything around that comparison:
- Sentence frames were added — structured language prompts that scaffold how students write and speak about the comparison (e.g., "The similarity between the two methods is _____.") without reducing the cognitive demand of the mathematics itself
- Simplified prompt language — question wording was adjusted for language complexity without dumbing down the math
- Example-before-definition protocol — before asking students to define or explain a concept, they first provided a mathematical example of it (e.g., name examples of like terms before explaining what like terms are)
- Native language permission — students could use their home language to reason through problems, with the expectation of presenting in English. This isn't code-switching as weakness; it's cognitive scaffolding as strength
The four types of WEPs used in this study — which correspond directly to different comparison purposes — are:
- Which is better? — Two correct methods, one more efficient; students determine which is better in which circumstances
- Why does it work? — Two correct methods, one showing the conceptual rationale; students explain the why, not just the how
- Which is correct? — One correct and one incorrect method; students identify the error and explain it
- How do they differ? — Two different problem types solved similarly; students identify the structural mathematical features they often conflate
What They Found
The study ran across two algebra units — Linear Equations (Unit 1) and Functions (Unit 2) — using a waitlist design: Teacher 1's students got the MLS-WEPs intervention in Unit 1 while Teacher 2's served as the control; the roles then switched for Unit 2. This meant both teachers taught in both conditions, and both groups eventually received the intervention.
The main findings, controlling for prior knowledge (effect sizes are unpacked in a short note after this list):
- MLS-WEPs students scored approximately 0.48–0.50 standard deviations higher on calculation than control students (p < .01, Cohen's f² = 0.96–1.0)
- MLS-WEPs students scored approximately 0.70–0.73 standard deviations higher on written explanation than control students (p < .001, Cohen's f² = 1.44–2.03)
- The improvement in explanation quality transferred: students who received the Unit 1 intervention scored significantly higher on the Unit 2 explanation pretest (before receiving any Unit 2 intervention), suggesting the skill generalized across mathematical concepts
- The effectiveness of MLS-WEPs generally did not vary by English language proficiency level — students at proficiency levels 1 through 5 benefited comparably (with one exception in Unit 2 conceptual explanation, where Level 5 students outperformed Levels 1 and 3)
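For readers who don't work with Cohen's f² regularly, here is a quick reference using the standard definitions and benchmarks (nothing below is computed in the article itself):

$$
f^2 = \frac{R^2}{1 - R^2}
\qquad \text{or, for the contribution of the condition term,} \qquad
f^2 = \frac{R^2_{\text{full}} - R^2_{\text{reduced}}}{1 - R^2_{\text{full}}}
$$

Cohen's conventional benchmarks are roughly 0.02 (small), 0.15 (medium), and 0.35 (large), so the 0.96 to 2.03 values reported here sit far beyond the "large" cutoff.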
Why This Challenges the Status Quo
Math teachers who work with ELs are frequently told to "simplify" — reduce language demands, provide more computation, skip the explanation. The implicit assumption is that language is the obstacle and mathematics is the real goal. Ke and Newton's data suggest the opposite: structured language about mathematical procedures is itself the mechanism through which ELs develop both procedural skill AND conceptual understanding. The sentence frames didn't carry the students over the mathematical wall. They were how students scaled it.
There's also the proficiency-independence finding, which matters enormously for classroom organization. If MLS-WEPs works approximately equally well for Level 1 and Level 5 ELs (controlling for prior math knowledge), then teachers don't need to run separate interventions for different proficiency groups. One well-designed activity serves the room.
The Cognitive Load Story
The theoretical engine here is cognitive load theory (CLT). When ELs learn algebra in English, they're running two cognitively demanding processes simultaneously: (1) solving math problems and (2) decoding, comprehending, and producing mathematical language in a non-native tongue. Working memory is small and shared. Worked examples reduce the problem-solving load by showing the solution, freeing cognitive resources for the comparison task. Sentence frames reduce the language-production load, freeing resources for the mathematical reasoning. The two scaffolds work on different dimensions of cognitive demand — that's not a coincidence; it's the design logic.
The Big Picture
This is the first study to test worked example comparison in a sheltered ESOL algebra setting. Before this paper, you could argue (from the literature) that worked examples help with algebra, and that sentence frames help ELs, but you couldn't cite direct evidence that their combination works in actual ESOL sheltered algebra classes. Now you can. That's not nothing — it's the empirical foundation for any structured language-in-mathematics intervention aimed at secondary ELs. MASL sits squarely in this research space.
🔬 Evidence Audit
Study Snapshot
| Item | Details |
| --- | --- |
| Study Type | Quasi-experimental (waitlist crossover design — not a true RCT; assignment to condition was by teacher/class, not by individual student randomization) |
| Population | N = 78 ELs in sheltered algebra classes (47 from Teacher 1, 31 from Teacher 2); grades 9–11, predominantly 10th grade; WIDA proficiency levels 1–5; primary languages Spanish (47%), Portuguese (11%), French (20%), Chinese (9%), other; large urban K–12 district in the Philadelphia area; study conducted during COVID-19 (all virtual instruction), estimated school year 2020–2021 |
| Intervention | MLS-WEPs (Modified for Language Support–Worked Example Pairs) — supplemental curriculum with sentence frames, simplified prompt language, example-before-definition sequencing, and native language permission; implemented by classroom teachers after topics were introduced; 2 units: Linear Equations and Functions; variable implementation timing (opening activity, example, or closing activity at teachers' discretion) |
| Control / Comparison | Active comparison — NOT a no-instruction control. Control group received the same mathematical examples and the same language supports as the treatment group, taught via standard teacher-modeled step-by-step instruction ("business as usual"). The key difference was the comparison structure and sentence frames, not the presence of language support. |
| Outcome Measures | Researcher-designed pre/post unit assessments with two components: (1) calculation accuracy (1 point per correct item) and (2) explanation quality (0–1 scale: fully correct = 1, partially correct = 0.5, incorrect = 0). Secondary qualitative coding: 6-category explanation rubric (fully correct, partially correct, conceptually relevant but incorrect, irrelevant, uninterpretable, blank). Cronbach's α = 0.70–0.84 across units/components. Inter-rater reliability > 85% on one-third of data. (A sketch of the α computation appears just below the table.) |
| Duration + Follow-up | Two algebra units (Linear Equations, Functions); no long-term follow-up after intervention ended; the Unit 2 pretest provided one transfer measure for Unit 1 intervention students |
| Funding / COI | Funding source not disclosed in the article. No conflicts of interest declared. One author (Ke) is from the participating School District of Philadelphia; the other (Newton) is from Temple University. Ke's institutional affiliation with the district is a potential source of bias worth noting. |
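The article reports only summary reliability values; no item-level data are published. For reference, Cronbach's α is the variance-ratio statistic sketched below. The function is standard, but the data are simulated stand-ins, not the study's items or scores:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_students, n_items) score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # per-item sample variance
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of students' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated stand-in: 78 students x 6 items scored 0/1, correlated through a latent ability
rng = np.random.default_rng(0)
ability = rng.normal(size=(78, 1))
items = (ability + rng.normal(scale=0.8, size=(78, 6)) > 0).astype(float)
print(round(cronbach_alpha(items), 2))   # should land somewhere in the 0.6-0.9 range
```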
Evidence Quality
- ⚠️ Sample size adequate — N = 78 is small for regression analysis across two units with multiple predictors. The authors acknowledge this as a limitation and call results "preliminary." Effect sizes are large (f² = 0.96–2.03), which helps, but replication with larger N is essential before treating these as stable estimates.
- ⚠️ Groups comparable at baseline — Not randomized; condition was assigned by teacher. The control group started with significantly higher explanation scores at Unit 1 pretest (p = .020, d = 0.585), which the authors controlled for statistically. Appropriate remedy, but the baseline imbalance means the treatment and control teachers' classrooms were genuinely different — and those differences may extend beyond measured prior knowledge.
- ✅ Attrition handled properly — Two students (from Teacher 1's class) who missed more than 50% of instructional time were excluded from analysis; this is disclosed and the exclusion rule is pre-specified and reasonable. No other reported attrition. Final N = 78 from 80 original participants — minimal and transparent.
- ⚠️ Outcome measures validated — Researcher-designed assessments, not externally validated measures. Internal consistency is acceptable to good (α = 0.70–0.84). No test of construct validity beyond internal consistency. For preliminary research this is acceptable; for policy-level claims it would need external validation.
- ✅ Effect sizes reported — Cohen's f² and power are reported for all regression analyses. Effect sizes are large by any convention (f² = 0.96 is very large). The explanation component shows especially strong effects (f² = 2.03 for Unit 1). Statistical significance and practical significance align here. (A minimal sketch of how these values are computed follows this list.)
- ❌ Pre-registration or published protocol — No pre-registration mentioned. Given the COVID context and crossover design, this is understandable but still a gap. We cannot rule out that some analytical choices (e.g., the decision not to use HLM, the choice of CLT threshold) were made after observing data patterns.
- ⚠️ Funding independent of findings — Funding source undisclosed. One author is affiliated with the school district that hosted the study. Not a disqualifying conflict, but worth flagging: district employees have institutional reasons to report positive outcomes from district-supported programs.
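To make the analytic logic concrete, here is a minimal sketch of the kind of model the article describes: posttest regressed on condition while controlling for pretest, with an incremental Cohen's f² computed from the two R² values. The variable names and simulated data are hypothetical illustrations, not the authors' code or dataset:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data shaped like the study: 78 students, a binary condition flag,
# and pre/post scores (values are simulated, not the real dataset)
rng = np.random.default_rng(42)
n = 78
df = pd.DataFrame({
    "condition": rng.integers(0, 2, n),   # 1 = MLS-WEPs, 0 = control
    "pre": rng.normal(10, 2, n),          # pretest score (prior knowledge covariate)
})
df["post"] = 0.8 * df["pre"] + 1.5 * df["condition"] + rng.normal(0, 1.5, n)

# ANCOVA-style regression: posttest on condition, controlling for pretest
full = smf.ols("post ~ condition + pre", data=df).fit()
reduced = smf.ols("post ~ pre", data=df).fit()

# Incremental Cohen's f^2 for the condition term
f2 = (full.rsquared - reduced.rsquared) / (1 - full.rsquared)
print(f"condition coefficient = {full.params['condition']:.2f}, f^2 = {f2:.2f}")
```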
⚑ Red Flags & Questionable Logic
What happened: The waitlist/crossover design means each teacher served as both treatment and control, at different times for different units. This is explicitly acknowledged (p. 839), but it carries a genuine threat: the teacher who served as treatment in Unit 1 may have transferred some comparison techniques into her Unit 2 control instruction. The lesson recordings were used to verify this did NOT happen (p. 840). The authors address the issue head-on, which is good, but the verification rests on informal review of teacher behavior in the recordings, not blind coding of pedagogical strategy.
Why it matters: If the Unit 2 "control" teacher had absorbed some comparison orientation from Unit 1, it would deflate the apparent treatment effect in Unit 2 — meaning the true effect might actually be larger than reported. But it could also mean contamination in the other direction. The design doesn't allow us to cleanly separate teacher effect from intervention effect.
The correct approach: Future replications should use a parallel design (different teachers in treatment and control simultaneously for the same unit) rather than a crossover. The crossover was likely a pragmatic choice given school constraints, but it confounds teacher and condition.
What happened: English proficiency was measured via student self-report, not official ACCESS (WIDA assessment) scores (p. 875). The authors explain that ACCESS data was unavailable due to COVID staffing issues, and that teacher review and adjustment were used to validate the self-reports.
Why it matters: The second research question — whether effectiveness varies by English proficiency — relies entirely on the accuracy of this measure. Self-reported proficiency on a 1–5 scale likely conflates multiple dimensions and may reflect students' confidence rather than their actual proficiency level. The proficiency-independence finding (one of the paper's strongest claims) rests on this compromised variable.
The correct approach: Replication should use official WIDA ACCESS scores or equivalent standardized proficiency measures. The authors are transparent about this limitation; it doesn't invalidate the finding but substantially weakens confidence in the proficiency-independence claim specifically.
Where More Evidence Is Needed
- Replication: This is a single study with N = 78, two teachers, one district, conducted during COVID (all virtual). It needs replication with larger samples, in-person conditions, and multiple districts before any strong causal claims hold.
- Population gaps: The study used mixed-proficiency sheltered classes. Newcomers (Level 1 only), long-term ELs, and students with interrupted formal education were all present but not analyzed separately. Effects may vary substantially across these subgroups.
- Duration: No follow-up data beyond the Unit 2 pretest transfer measure. Do the written explanation gains persist at the end of the school year? Into the following year? The intervention's long-term value is entirely unknown.
- Mechanism: Was it the sentence frames? The comparison structure? The native language permission? The example-before-definition sequencing? The study shows the package works; it does not isolate which ingredient is the active one. For MASL design purposes, this is a critical gap — if the frames don't drive the effect and the comparison does, that has design implications.
- Implementation fidelity: Teachers had flexibility to implement at any point in the lesson (opening, example, or closing) and did not use small groups (unusual for WEPs). Real-world implementation with prescribed timing and partner work might produce different effects.
- Spoken vs. written language: All assessment was written. The study measured written explanations and calculation accuracy. Whether MLS-WEPs improves spoken mathematical language — which is MASL's specific target — remains completely unmeasured and unknown.
Key Vocabulary
Terms used centrally in the article, sorted A–Z.
🎯 MASL Connection
This Study Supports:
- Worked Example: Language Frames (most direct support): Ke & Newton provide the only existing empirical study of worked examples combined with sentence frames for secondary ELs in algebra — the exact population and the exact structural design of MASL's Language Frames activity. The effect on written explanation quality (d ≈ 0.70, f² = 1.44–2.03) directly supports MASL's claim that structured language prompts during worked examples improve ELs' ability to articulate mathematical reasoning. The sentence frames in MLS-WEPs scaffold comparative discourse about procedure ("The similarity between the two methods is ___"); MASL's frames scaffold notation-register discourse ("I read this symbol as ___ because ___").
- The proficiency-independence claim: MLS-WEPs worked approximately equally for ELs at proficiency levels 1–5 (controlling for prior math knowledge). This supports MASL's position that notation-language instruction is appropriate for all secondary ELs, not just those with higher English proficiency — and more broadly, that structured language frames are not remedial but a mechanism all students benefit from.
- Transfer of explanation skills across mathematical concepts: Students who received the Unit 1 intervention showed significantly higher explanation scores on the Unit 2 pretest for a completely different mathematical topic. This supports MASL's design assumption that language frames for notation are generative skills, not topic-specific drills — once students have the register for describing symbolic operations, that skill is available across content domains.
Design Implications:
- Phrase bank framing matters more than size alone: MLS-WEPs sentence frames were discourse-oriented ("The similarity between...," "I chose this method because...") — they scaffold comparative reasoning, not just fill-in-the-blank vocabulary. MASL's phrase bank of 9 terms should similarly foreground reasoning structure. The MASL capstone already responds to Barko-Alva & Chang-Bacon's overframing critique by targeting the reasoning register, not the conclusion register; Ke & Newton provide indirect support for that design choice.
- Example-before-definition sequencing: MLS-WEPs explicitly sequenced a student-generated example before the conceptual explanation (e.g., "give an example of like terms" before "define like terms"). MASL could adopt this for notation-language instruction: ask students to write how they'd say a symbol before showing the MASL standard form. This primes the irregular-form correction and creates the "disorienting dilemma" Mezirow describes.
- Native language permission: MLS-WEPs allowed and encouraged students to reason in their home language before presenting in English. MASL's Language Frames activity should explicitly permit this — not as accommodation but as cognitive strategy. Students reasoning through why "f(x) is NOT f times x" may do that reasoning most efficiently in Spanish, Portuguese, or Mandarin.
- Supplemental, not replacement: MLS-WEPs were implemented after the corresponding mathematical topic was introduced. MASL's Language Frames should follow the same logic — introduce the notation in context first, then deploy the frame activity to consolidate the spoken register. The frame is not the introduction; it's the consolidation mechanism.
Evidence Strength for MASL:
This is the strongest single empirical study in MASL's research base — population match is excellent (secondary ELs in algebra), intervention structure is directly analogous (worked examples + sentence frames), and the outcomes include both procedural performance and explanation quality. However, the key gap is critical: MLS-WEPs targets mathematical discourse broadly (explaining procedures, comparing methods, describing concepts) — not spoken notation specifically. None of the sentence frames in the appendices target the spoken register of algebraic symbols. There is no measurement of how students name "f(x)" or "x²" or "±." The evidence bridges to MASL's Language Frames activity at the structural level (worked examples + sentence frames + EL population = positive outcomes) but does not bridge to MASL's specific claim about spoken notation being a distinct, teachable target. That gap is MASL's original contribution — and Ke & Newton's silence on it is exactly what justifies a new study.
Connections to MASL Framework
- MASL Trio (Math / We Say / Meaning cards): The WEP comparison structure — particularly "Why does it work?" and "How do they differ?" types — activates the same analogical reasoning the card sort requires. Students in WEPs had to explain why two methods work; card sort students explain why "x squared" and "x to the second power" mean the same thing. Structural parallel, different linguistic target.
- Sentence frames: Most direct parallel. MLS-WEPs sentence frames scaffold comparative discourse; MASL frames scaffold notation-reading discourse. Both share the design logic of providing grammatical structure so cognitive resources go to content reasoning. The key difference: MLS-WEPs frames are about procedure, MASL frames are about symbol-to-speech mapping.
- Irregular forms instruction: MLS-WEPs does not address notation irregularity. None of the four appendix examples show frames targeting f(x) vs. "f times x" or x² vs. "x to the 2." This is the gap MASL fills — Ke & Newton give you the method; MASL gives you a target the method has never been aimed at.
- Scaffolding fading: MLS-WEPs does not describe a fading plan. Sentence frames were present throughout both units. MASL's fading protocol (full → partial → blank over Lessons N to N+4) is a design upgrade Ke & Newton did not test, though the expertise reversal effect literature supports it strongly.
💬 Key Quotes
Copy-paste ready quotes for papers, discussions, and the MASL capstone.
📚 References & Further Reading
Key sources from the paper's reference list, assessed for MASL relevance.
Core Worked Example Research (WEPs Lineage)
What it is: The original experimental study establishing that comparing two worked examples in algebra produces learning gains beyond sequentially studying each. Tone: Technical experimental report. Why it matters: The entire WEP lineage — including MLS-WEPs — rests on this finding. Buzz: Highly cited (800+); foundational to all subsequent WEP research. Verdict: Required background for any worked example citation; dense but the results section is readable and the implications are clear.
What it is: The most recent update to the WEP curriculum that MLS-WEPs adapted. Tone: Accessible experimental report with practical implications. Why it matters: If you're citing Ke & Newton, you should also know what the base curriculum looks like — this is it. Verdict: Read before presenting MLS-WEPs in a capstone literature review.
What it is: Recent study extending worked examples to error anticipation in real algebra classrooms. Tone: Standard empirical report. Why it matters: Part of the "AlgebraByExample" strand directly relevant to MASL's Suggest Improvements activity. Verdict: Worth reading if building the evidence base for erroneous examples.
Cognitive Load Theory (CLT) Core References
What it is: The paper that introduced cognitive load theory — working memory limits, intrinsic vs. extraneous vs. germane load. Tone: Dense theoretical; early 1980s cognitive psychology style. Why it matters: The theoretical engine for all worked example research. Verdict: Skim the introduction and conclusions; the specific findings on problem-solving vs. worked example studies are what matter for MASL.
What it is: A brief accessible synthesis of CLT principles for instructional designers. Tone: Relatively accessible overview. Why it matters: Good entry point for CLT if you need to explain the theoretical basis for MASL scaffolding without reading the dense originals. Verdict: 8 pages — read this before citing CLT.
Mathematics Learning for English Learners
What it is: Moschkovich's synthesis of Academic Literacy in Mathematics (ALM) framework — one of MASL's eight core theoretical anchors. Tone: Accessible theoretical synthesis. Why it matters: Establishes that mathematical proficiency = mathematical practices + discourse + content inseparably; grounds the claim that language instruction benefits all students. Verdict: Required for MASL capstone.
What it is: Comprehensive practitioner-oriented book on math instruction for ELs; heavily cited in Ke & Newton. Tone: Practitioner-friendly with research backing. Why it matters: Provides the EL math instruction framework underlying MLS-WEPs design decisions. Verdict: Worth having as a reference; not necessary to read cover-to-cover for MASL.
What it is: Short practitioner-facing article on sentence frames for EL academic language. Tone: Accessible; designed for classroom teachers. Why it matters: Primary citation Ke & Newton use to justify sentence frame design — useful if you need to defend the sentence frame choice in MASL. Verdict: 5-minute read; worth having in your citation toolkit.
What it is: EDC practitioner guide on sentence frames specifically for mathematics discourse. Tone: Practitioner guide. Why it matters: The most practice-oriented source on math-specific sentence frame design — directly applicable to MASL phrase bank construction. Verdict: Read if designing the specific sentence frames for MASL Language Frames activity.
🧠 Quiz — Test Your Understanding
Six conceptual questions about the ideas — not the statistics.
1. Why did Ke and Newton add sentence frames to the standard Worked Example Pairs (WEPs) design for their EL version?
2. One of the study's most striking findings was that MLS-WEPs effectiveness "generally did not vary by English language proficiency." What does this mean for instructional design?
3. Students who received the Unit 1 MLS-WEPs intervention scored significantly higher on the Unit 2 explanation pretest — before any Unit 2 instruction. What does this transfer finding suggest?
4. The four types of Worked Example Pairs (WEPs) — "Which is better?", "Why does it work?", "Which is correct?", and "How do they differ?" — each serve different purposes. Which type is MOST focused on developing conceptual understanding rather than procedural flexibility?
5. The study found that English language proficiency DID significantly affect explanation scores in Unit 2 (Functions) but NOT in Unit 1 (Equations). The authors hypothesize this is because Unit 2 assessed conceptual knowledge while Unit 1 assessed mostly procedural knowledge. What's the instructional implication?
6. The study found that the quality of written explanations improved substantially in the MLS-WEPs treatment group: "blank" responses dropped from 45% to 9% in Unit 1. But the "uninterpretable explanation" category increased in the treatment group and decreased in the control group. What's the most likely explanation for this unexpected pattern?
🔬 Research Quiz
Six questions about the study design — not the content. Can you read past the authors' framing?
1. The study uses a "waitlist crossover design." What type of study is this, and what does that mean for causal claims?
2. What was the actual population studied, and who is notably absent from this sample?
3. What did the control group actually receive? This is important because it determines how large a comparative advantage the treatment really represents.
4. English language proficiency was a key variable in the study's second research question. How was it measured, and why does this matter?
5. The regression analysis for Unit 1 found Cohen's f² = 0.96 (calculation) and f² = 2.03 (explanation). How do you interpret these effect sizes in practical terms?
6. [Red Flag] At the beginning of Unit 1, the control group scored significantly higher on the explanation pretest than the treatment group (p = .020, d = 0.585). The authors controlled for this statistically and found treatment effects at posttest. What's the methodological concern, and why does it matter beyond the statistical control?
🃏 Match the Concepts
Drag each term from the left column to its matching description on the right.
✅ What They Got Right
- Active, not passive, comparison condition. The control group received the same language supports and the same mathematical examples — just via traditional instruction. This design isolates the comparison structure as the active ingredient, rather than confounding "any language support vs. none." This is methodologically stronger than most intervention studies that compare against truly unsupported instruction.
- Transparent about limitations. The small N, self-reported proficiency, COVID conditions, and virtual-only implementation are all named explicitly as constraints. The authors call findings "preliminary" and explicitly invite replication. This intellectual honesty is rare and should be cited when using this study — the claims stay within the data.
- Transfer measure built into the design. The waitlist structure accidentally created a transfer test: Unit 1 intervention students' performance on the Unit 2 explanation pretest provides a cross-topic transfer measure that most worked example studies don't include. This produced one of the study's most valuable findings at no additional cost.
- Six-category explanation rubric with inter-rater reliability. Rather than binary correct/incorrect, the explanation coding captured qualitative variation in student responses — blank, irrelevant, uninterpretable, concept-relevant-but-incorrect, partially correct, fully correct. The > 85% inter-rater reliability and resolution-through-discussion protocol are appropriate for this type of rubric.
- Pilot study with student feedback before finalizing design. The MLS-WEPs modifications (sentence frames, simplified prompts, native language permission) were based on a pilot study with EL students in one-on-one tutorials, not researcher assumptions. This grounding in actual student experience is the right design sequence.
🔧 Suggested Improvements
- Use official WIDA ACCESS scores, not self-reported proficiency. — Why: The entire second research question (does effectiveness vary by proficiency?) depends on the validity of the proficiency measure. Self-reported scores during COVID are the weakest possible proxy; ACCESS scores would allow genuine proficiency-subgroup analysis and make the proficiency-independence finding publishable as an established finding rather than a preliminary one.
- Implement partner/small-group work as the WEP design originally specifies. — Why: The standard WEP model involves paired student comparison and discussion — this was abandoned because teachers were uncomfortable with virtual small groups. MASL and most real-world implementations will use partner work; a replication under in-person conditions with partner structures would better represent how the intervention actually functions.
- Add a spoken language outcome measure alongside the written one. — Why: MLS-WEPs targets four language domains (reading, writing, listening, speaking) but only writing was measured in the assessments. Spoken mathematical explanation — the specific target of MASL's Language Frames activity — remains unmeasured. An audio or video-coded oral explanation task would capture whether the benefits extend to spoken register.
- Use parallel groups (same unit, same time, different teachers) rather than crossover. — Why: The crossover design confounds teacher identity with condition — Teacher 1's and Teacher 2's classrooms were not equivalent at baseline (d = 0.585 on explanation pretest). A parallel design with multiple teacher-pairs randomly assigned to condition within the same instructional unit would provide cleaner causal inference.
- Systematically document and vary frame design features (number of frames, frame type, fading schedule). — Why: The current study treats sentence frames as a single ingredient; it doesn't tell us how many frames, what linguistic structures, or whether fading frames after initial exposure changes outcomes. For MASL design, these are the critical parameters — a dismantling study would isolate which frame features drive the explanation quality gains.