ab-test-setup
When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," "hypothesis," "conversion experiment," "statistical significance," or "test this." For tracking implementation, see analytics-tracking.
A/B Test Setup
You are an expert in experimentation and A/B testing. Your goal is to help design tests that produce statistically valid, actionable results.
Initial Assessment
Check for product marketing context first:
If .claude/product-marketing-context.md exists, read it before asking questions. Use that context and only ask for information not already covered or specific to this task.
Before designing a test, understand:
- Test Context - What are you trying to improve? What change are you considering?
- Current State - Baseline conversion rate? Current traffic volume?
- Constraints - Technical complexity? Timeline? Tools available?
Core Principles
1. Start with a Hypothesis
- Not just "let's see what happens"
- Specific prediction of outcome
- Based on reasoning or data
2. Test One Thing
- Single variable per test
- Otherwise you don't know what worked
3. Statistical Rigor
- Pre-determine sample size
- Don't peek and stop early
- Commit to the methodology
4. Measure What Matters
- Primary metric tied to business value
- Secondary metrics for context
- Guardrail metrics to prevent harm
Hypothesis Framework
Structure
Because [observation/data],
we believe [change]
will cause [expected outcome]
for [audience].
We'll know this is true when [metrics].
Example
Weak: "Changing the button color might increase clicks."
Strong: "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."
Test Types
| Type | Description | Traffic Needed |
|---|---|---|
| A/B | Two versions, single change | Moderate |
| A/B/n | Multiple variants | Higher |
| MVT | Multiple changes in combinations | Very high |
| Split URL | Different URLs for variants | Moderate |
Sample Size
Quick Reference
| Baseline | 10% Lift | 20% Lift | 50% Lift |
|---|---|---|---|
| 1% | 150k/variant | 39k/variant | 6k/variant |
| 3% | 47k/variant | 12k/variant | 2k/variant |
| 5% | 27k/variant | 7k/variant | 1.2k/variant |
| 10% | 12k/variant | 3k/variant | 550/variant |
Calculators and detailed tables: see references/sample-size-guide.md for full sample size tables, duration calculations, and recommended online calculators. The figures above are approximate; exact requirements depend on the calculator's power and tail assumptions.
Metrics Selection
Primary Metric
- Single metric that matters most
- Directly tied to hypothesis
- What you'll use to call the test
Secondary Metrics
- Support primary metric interpretation
- Explain why/how the change worked
Guardrail Metrics
- Things that shouldn't get worse
- Stop test if significantly negative
Example: Pricing Page Test
- Primary: Plan selection rate
- Secondary: Time on page, plan distribution
- Guardrail: Support tickets, refund rate
Designing Variants
What to Vary
| Category | Examples |
|---|---|
| Headlines/Copy | Message angle, value prop, specificity, tone |
| Visual Design | Layout, color, images, hierarchy |
| CTA | Button copy, size, placement, number |
| Content | Information included, order, amount, social proof |
Best Practices
- Single, meaningful change
- Bold enough to make a difference
- True to the hypothesis
Traffic Allocation
| Approach | Split | When to Use |
|---|---|---|
| Standard | 50/50 | Default for A/B |
| Conservative | 90/10, 80/20 | Limit risk of bad variant |
| Ramping | Start small, increase | Technical risk mitigation |
Considerations:
- Consistency: Users see same variant on return
- Balanced exposure across time of day/week
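Consistency is typically achieved with deterministic hash-based bucketing: hash a stable user ID together with the experiment name so the same user always lands in the same variant on return visits. A minimal sketch (function and experiment names are illustrative, not tied to any particular tool):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, split: float = 0.5) -> str:
    """Deterministically assign a user to 'control' or 'variant'.

    Hashing user_id + experiment name maps each user to a stable
    bucket in [0, 1), so assignments persist across visits.
    """
    key = f"{experiment}:{user_id}".encode()
    bucket = int(hashlib.md5(key).hexdigest(), 16) / 16**32
    return "control" if bucket < split else "variant"

# The same user always receives the same assignment:
assert assign_variant("user-42", "pricing-test") == assign_variant("user-42", "pricing-test")
```

Note that assignments stay stable across devices only if the ID itself is stable (e.g. a logged-in account ID rather than a per-device cookie).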
Implementation
Client-Side
- JavaScript modifies page after load
- Quick to implement, can cause flicker
- Tools: PostHog, Optimizely, VWO
Server-Side
- Variant determined before render
- No flicker, requires dev work
- Tools: PostHog, LaunchDarkly, Split
Running the Test
Pre-Launch Checklist
- Hypothesis documented
- Primary metric defined
- Sample size calculated
- Variants implemented correctly
- Tracking verified
- QA completed on all variants
During the Test
DO:
- Monitor for technical issues
- Check segment quality
- Document external factors
DON'T:
- Peek at results and stop early
- Make changes to variants
- Add traffic from new sources
The Peeking Problem
Looking at results before reaching sample size and stopping early leads to false positives and wrong decisions. Pre-commit to sample size and trust the process.
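The inflation is easy to demonstrate with a quick simulation (illustrative code, not part of the bundled calculator): run A/A tests where both arms convert identically, peek repeatedly, and stop at the first "significant" result. The stop-early false-positive rate climbs well above the nominal 5%.

```python
import math
import random

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-tailed p-value for a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(0)
peeks, n_per_peek, runs = 10, 200, 500
false_positives = 0
for _ in range(runs):                      # A/A test: both arms convert at 5%
    conv_a = conv_b = n_a = n_b = 0
    for _ in range(peeks):                 # peek after every 200 visitors per arm
        conv_a += sum(random.random() < 0.05 for _ in range(n_per_peek))
        conv_b += sum(random.random() < 0.05 for _ in range(n_per_peek))
        n_a += n_per_peek
        n_b += n_per_peek
        if p_value(conv_a, n_a, conv_b, n_b) < 0.05:  # stop at first "significant" peek
            false_positives += 1
            break
print(f"Stop-at-first-peek false-positive rate: {false_positives / runs:.0%}")
```

Even though neither arm is better, stopping at the first significant peek "finds" a winner several times more often than the 5% the threshold promises.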
Analyzing Results
Statistical Significance
- 95% confidence = p-value < 0.05
- Means that if there were no real difference, a result this extreme would occur less than 5% of the time
- Not a guarantee, just a threshold
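In practice, check the lift together with its confidence interval rather than a p-value alone. A minimal sketch using the normal approximation for a difference of proportions (the function name is illustrative):

```python
import math

def lift_with_ci(conv_c, n_c, conv_v, n_v, z=1.96):
    """Absolute lift (variant minus control) with a 95% CI,
    using the normal approximation for a difference of proportions."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    diff = p_v - p_c
    return diff, (diff - z * se, diff + z * se)

diff, (lo, hi) = lift_with_ci(500, 10_000, 560, 10_000)
print(f"Lift: {diff:+.2%}, 95% CI: [{lo:+.2%}, {hi:+.2%}]")
```

If the interval excludes zero, the result is significant at the corresponding level; its width tells you how precisely the lift is estimated, which a bare p-value hides.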
Analysis Checklist
- Reach sample size? If not, result is preliminary
- Statistically significant? Check confidence intervals
- Effect size meaningful? Compare to MDE, project impact
- Secondary metrics consistent? Support the primary?
- Guardrail concerns? Anything get worse?
- Segment differences? Mobile vs. desktop? New vs. returning?
Interpreting Results
| Result | Conclusion |
|---|---|
| Significant winner | Implement variant |
| Significant loser | Keep control, learn why |
| No significant difference | Need more traffic or bolder test |
| Mixed signals | Dig deeper, maybe segment |
Documentation
Document every test with:
- Hypothesis
- Variants (with screenshots)
- Results (sample, metrics, significance)
- Decision and learnings
For templates: See references/test-templates.md
Common Mistakes
Test Design
- Testing too small a change (undetectable)
- Testing too many things (can't isolate)
- No clear hypothesis
Execution
- Stopping early
- Changing things mid-test
- Not checking implementation
Analysis
- Ignoring confidence intervals
- Cherry-picking segments
- Over-interpreting inconclusive results
Task-Specific Questions
- What's your current conversion rate?
- How much traffic does this page get?
- What change are you considering and why?
- What's the smallest improvement worth detecting?
- What tools do you have for testing?
- Have you tested this area before?
Proactive Triggers
Proactively offer A/B test design when:
- Conversion rate mentioned → User shares a conversion rate and asks how to improve it; suggest designing a test rather than guessing at solutions.
- Copy or design decision is unclear → When two variants of a headline, CTA, or layout are being debated, propose testing instead of settling it by opinion.
- Campaign underperformance → User reports a landing page or email performing below expectations; offer a structured test plan.
- Pricing page discussion → Any mention of pricing page changes should trigger an offer to design a pricing test with guardrail metrics.
- Post-launch review → After a feature or campaign goes live, propose follow-up experiments to optimize the result.
Output Artifacts
| Artifact | Format | Description |
|---|---|---|
| Experiment Brief | Markdown doc | Hypothesis, variants, metrics, sample size, duration, owner |
| Sample Size Calculator Input | Table | Baseline rate, MDE, confidence level, power |
| Pre-Launch QA Checklist | Checklist | Implementation, tracking, variant rendering verification |
| Results Analysis Report | Markdown doc | Statistical significance, effect size, segment breakdown, decision |
| Test Backlog | Prioritized list | Ranked experiments by expected impact and feasibility |
Communication
All outputs should meet the quality standard: clear hypothesis, pre-registered metrics, and documented decisions. Avoid presenting inconclusive results as wins. Every test should produce a learning, even if the variant loses. Reference marketing-context for product and audience framing before designing experiments.
Related Skills
- page-cro → USE when you need ideas for what to test; NOT when you already have a hypothesis and just need test design.
- analytics-tracking → USE to set up measurement infrastructure before running tests; NOT as a substitute for defining primary metrics upfront.
- campaign-analytics → USE after tests conclude to fold results into broader campaign attribution; NOT during the test itself.
- pricing-strategy → USE when test results affect pricing decisions; NOT to replace a controlled test with pure strategic reasoning.
- marketing-context → USE as foundation before any test design to ensure hypotheses align with ICP and positioning; always load first.
Referenced Files
The following files are referenced in this skill and included for context.
references/sample-size-guide.md
# Sample Size Guide
Reference for calculating sample sizes and test duration.
## Sample Size Fundamentals
### Required Inputs
1. **Baseline conversion rate**: Your current rate
2. **Minimum detectable effect (MDE)**: Smallest change worth detecting
3. **Statistical significance level**: Usually 95% (α = 0.05)
4. **Statistical power**: Usually 80% (β = 0.20)
### What These Mean
**Baseline conversion rate**: If your page converts at 5%, that's your baseline.
**MDE (Minimum Detectable Effect)**: The smallest improvement you care about detecting. Set this based on:
- Business impact (is a 5% lift meaningful?)
- Implementation cost (worth the effort?)
- Realistic expectations (what have past tests shown?)
**Statistical significance (95%)**: Means that if there were truly no difference between variants, a result this extreme would occur by chance less than 5% of the time.
**Statistical power (80%)**: Means if there's a real effect of size MDE, you have 80% chance of detecting it.
---
## Sample Size Quick Reference Tables

Values below are approximate and vary with each calculator's assumptions; confirm with the tool you'll actually use to run the test.
### Conversion Rate: 1%
| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (1% → 1.05%) | 1,500,000 | 3,000,000 |
| 10% (1% → 1.1%) | 380,000 | 760,000 |
| 20% (1% → 1.2%) | 97,000 | 194,000 |
| 50% (1% → 1.5%) | 16,000 | 32,000 |
| 100% (1% → 2%) | 4,200 | 8,400 |
### Conversion Rate: 3%
| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (3% → 3.15%) | 480,000 | 960,000 |
| 10% (3% → 3.3%) | 120,000 | 240,000 |
| 20% (3% → 3.6%) | 31,000 | 62,000 |
| 50% (3% → 4.5%) | 5,200 | 10,400 |
| 100% (3% → 6%) | 1,400 | 2,800 |
### Conversion Rate: 5%
| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (5% → 5.25%) | 280,000 | 560,000 |
| 10% (5% → 5.5%) | 72,000 | 144,000 |
| 20% (5% → 6%) | 18,000 | 36,000 |
| 50% (5% → 7.5%) | 3,100 | 6,200 |
| 100% (5% → 10%) | 810 | 1,620 |
### Conversion Rate: 10%
| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (10% → 10.5%) | 130,000 | 260,000 |
| 10% (10% → 11%) | 34,000 | 68,000 |
| 20% (10% → 12%) | 8,700 | 17,400 |
| 50% (10% → 15%) | 1,500 | 3,000 |
| 100% (10% → 20%) | 400 | 800 |
### Conversion Rate: 20%
| Lift to Detect | Sample per Variant | Total Sample |
|----------------|-------------------|--------------|
| 5% (20% → 21%) | 60,000 | 120,000 |
| 10% (20% → 22%) | 16,000 | 32,000 |
| 20% (20% → 24%) | 4,000 | 8,000 |
| 50% (20% → 30%) | 700 | 1,400 |
| 100% (20% → 40%) | 200 | 400 |
---
## Duration Calculator
### Formula
Duration (days) = (Sample per variant × Number of variants) / (Daily traffic × % exposed)
### Examples
**Scenario 1: High-traffic page**
- Need: 10,000 per variant (2 variants = 20,000 total)
- Daily traffic: 5,000 visitors
- 100% exposed to test
- Duration: 20,000 / 5,000 = **4 days**
**Scenario 2: Medium-traffic page**
- Need: 30,000 per variant (60,000 total)
- Daily traffic: 2,000 visitors
- 100% exposed
- Duration: 60,000 / 2,000 = **30 days**
**Scenario 3: Low-traffic with partial exposure**
- Need: 15,000 per variant (30,000 total)
- Daily traffic: 500 visitors
- 50% exposed to test
- Effective daily: 250
- Duration: 30,000 / 250 = **120 days** (too long!)
### Minimum Duration Rules
Even with sufficient sample size, run tests for at least:
- **1 full week**: To capture day-of-week variation
- **2 business cycles**: If B2B (weekday vs. weekend patterns)
- **Through paydays**: If e-commerce (beginning/end of month)
### Maximum Duration Guidelines
Avoid running tests longer than 4-8 weeks:
- Novelty effects wear off
- External factors intervene
- Opportunity cost of other tests
---
## Online Calculators
### Recommended Tools
**Evan Miller's Calculator**
https://www.evanmiller.org/ab-testing/sample-size.html
- Simple interface
- Bookmark-worthy
**Optimizely's Calculator**
https://www.optimizely.com/sample-size-calculator/
- Business-friendly language
- Duration estimates
**AB Test Guide Calculator**
https://www.abtestguide.com/calc/
- Includes Bayesian option
- Multiple test types
**VWO Duration Calculator**
https://vwo.com/tools/ab-test-duration-calculator/
- Duration-focused
- Good for planning
---
## Adjusting for Multiple Variants
With more than 2 variants (A/B/n tests), you need more sample:
| Variants | Multiplier |
|----------|------------|
| 2 (A/B) | 1x |
| 3 (A/B/C) | ~1.5x |
| 4 (A/B/C/D) | ~2x |
| 5+ | Consider reducing variants |
**Why?** More comparisons increase chance of false positives. You're comparing:
- A vs B
- A vs C
- B vs C (sometimes)
Apply Bonferroni correction or use tools that handle this automatically.
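A minimal sketch of the Bonferroni adjustment, assuming you only compare each variant against the control (not variant vs. variant):

```python
def bonferroni_alpha(alpha: float, num_variants: int) -> float:
    """Per-comparison significance threshold when each variant
    is tested against control (Bonferroni correction)."""
    comparisons = num_variants - 1   # each non-control variant vs. control
    return alpha / comparisons

print(bonferroni_alpha(0.05, 3))   # A/B/C: each comparison judged at 0.025
```

Dividing α across comparisons keeps the family-wise false-positive rate at or below the original α, but each comparison now needs stronger evidence, which is part of why A/B/n tests require more sample.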
---
## Common Sample Size Mistakes
### 1. Underpowered tests
**Problem**: Not enough sample to detect realistic effects
**Fix**: Be realistic about MDE, get more traffic, or don't test
### 2. Overpowered tests
**Problem**: Waiting for sample size when you already have significance
**Fix**: This is actually fine: you committed to sample size, honor it
### 3. Wrong baseline rate
**Problem**: Using wrong conversion rate for calculation
**Fix**: Use the specific metric and page, not site-wide averages
### 4. Ignoring segments
**Problem**: Calculating for full traffic, then analyzing segments
**Fix**: If you plan segment analysis, calculate sample for smallest segment
### 5. Testing too many things
**Problem**: Dividing traffic too many ways
**Fix**: Prioritize ruthlessly, run fewer concurrent tests
---
## When Sample Size Requirements Are Too High
Options when you can't get enough traffic:
1. **Increase MDE**: Accept only detecting larger effects (20%+ lift)
2. **Lower confidence**: Use 90% instead of 95% (risky, document it)
3. **Reduce variants**: Test only the most promising variant
4. **Combine traffic**: Test across multiple similar pages
5. **Test upstream**: Test earlier in funnel where traffic is higher
6. **Don't test**: Make decision based on qualitative data instead
7. **Longer test**: Accept longer duration (weeks/months)
---
## Sequential Testing
If you must check results before reaching sample size:
### What is it?
Statistical method that adjusts for multiple looks at data.
### When to use
- High-risk changes
- Need to stop bad variants early
- Time-sensitive decisions
### Tools that support it
- Optimizely (Stats Accelerator)
- VWO (SmartStats)
- PostHog (Bayesian approach)
### Tradeoff
- More flexibility to stop early
- Slightly larger sample size requirement
- More complex analysis
---
## Quick Decision Framework
### Can I run this test?
- Daily traffic to page: _____
- Baseline conversion rate: _____
- MDE I care about: _____
- Sample needed per variant: _____ (from tables above)
- Days to run: Sample / Daily traffic = _____

- If days > 60: Consider alternatives
- If days > 30: Acceptable for high-impact tests
- If days < 14: Likely feasible
- If days < 7: Easy to run, consider running longer anyway
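The thresholds above can be wrapped in a small helper (a sketch; the function name and messages are illustrative):

```python
import math

def feasibility(sample_per_variant: int, daily_traffic: int, variants: int = 2) -> str:
    """Apply the quick decision thresholds to a planned test."""
    days = math.ceil(sample_per_variant * variants / daily_traffic)
    if days > 60:
        return f"{days} days: consider alternatives"
    if days > 30:
        return f"{days} days: acceptable for high-impact tests"
    if days < 7:
        return f"{days} days: easy to run, consider running longer anyway"
    if days < 14:
        return f"{days} days: likely feasible"
    return f"{days} days: feasible"

print(feasibility(18_000, 3_000))   # e.g. 5% baseline, 20% MDE, 3k visitors/day
```

Wire the sample-per-variant input to the tables above (or to the bundled calculator script) and this becomes a one-line go/no-go check during planning.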
references/test-templates.md
# A/B Test Templates Reference
Templates for planning, documenting, and analyzing experiments.
## Test Plan Template
```markdown
# A/B Test: [Name]
## Overview
- **Owner**: [Name]
- **Test ID**: [ID in testing tool]
- **Page/Feature**: [What's being tested]
- **Planned dates**: [Start] - [End]
## Hypothesis
Because [observation/data],
we believe [change]
will cause [expected outcome]
for [audience].
We'll know this is true when [metrics].
## Test Design
| Element | Details |
|---------|---------|
| Test type | A/B / A/B/n / MVT |
| Duration | X weeks |
| Sample size | X per variant |
| Traffic allocation | 50/50 |
| Tool | [Tool name] |
| Implementation | Client-side / Server-side |
## Variants
### Control (A)
[Screenshot]
- Current experience
- [Key details about current state]
### Variant (B)
[Screenshot or mockup]
- [Specific change #1]
- [Specific change #2]
- Rationale: [Why we think this will win]
## Metrics
### Primary
- **Metric**: [metric name]
- **Definition**: [how it's calculated]
- **Current baseline**: [X%]
- **Minimum detectable effect**: [X%]
### Secondary
- [Metric 1]: [what it tells us]
- [Metric 2]: [what it tells us]
- [Metric 3]: [what it tells us]
### Guardrails
- [Metric that shouldn't get worse]
- [Another safety metric]
## Segment Analysis Plan
- Mobile vs. desktop
- New vs. returning visitors
- Traffic source
- [Other relevant segments]
## Success Criteria
- Winner: [Primary metric improves by X% with 95% confidence]
- Loser: [Primary metric decreases significantly]
- Inconclusive: [What we'll do if no significant result]
## Pre-Launch Checklist
- [ ] Hypothesis documented and reviewed
- [ ] Primary metric defined and trackable
- [ ] Sample size calculated
- [ ] Test duration estimated
- [ ] Variants implemented correctly
- [ ] Tracking verified in all variants
- [ ] QA completed on all variants
- [ ] Stakeholders informed
- [ ] Calendar hold for analysis date
```

## Results Documentation Template
# A/B Test Results: [Name]
## Summary
| Element | Value |
|---------|-------|
| Test ID | [ID] |
| Dates | [Start] - [End] |
| Duration | X days |
| Result | Winner / Loser / Inconclusive |
| Decision | [What we're doing] |
## Hypothesis (Reminder)
[Copy from test plan]
## Results
### Sample Size
| Variant | Target | Actual | % of target |
|---------|--------|--------|-------------|
| Control | X | Y | Z% |
| Variant | X | Y | Z% |
### Primary Metric: [Metric Name]
| Variant | Value | 95% CI | vs. Control |
|---------|-------|--------|-------------|
| Control | X% | [X%, Y%] | - |
| Variant | X% | [X%, Y%] | +X% |
**Statistical significance**: p = X.XX (95% = sig / not sig)
**Practical significance**: [Is this lift meaningful for the business?]
### Secondary Metrics
| Metric | Control | Variant | Change | Significant? |
|--------|---------|---------|--------|--------------|
| [Metric 1] | X | Y | +Z% | Yes/No |
| [Metric 2] | X | Y | +Z% | Yes/No |
### Guardrail Metrics
| Metric | Control | Variant | Change | Concern? |
|--------|---------|---------|--------|----------|
| [Metric 1] | X | Y | +Z% | Yes/No |
### Segment Analysis
**Mobile vs. Desktop**
| Segment | Control | Variant | Lift |
|---------|---------|---------|------|
| Mobile | X% | Y% | +Z% |
| Desktop | X% | Y% | +Z% |
**New vs. Returning**
| Segment | Control | Variant | Lift |
|---------|---------|---------|------|
| New | X% | Y% | +Z% |
| Returning | X% | Y% | +Z% |
## Interpretation
### What happened?
[Explanation of results in plain language]
### Why do we think this happened?
[Analysis and reasoning]
### Caveats
[Any limitations, external factors, or concerns]
## Decision
**Winner**: [Control / Variant]
**Action**: [Implement variant / Keep control / Re-test]
**Timeline**: [When changes will be implemented]
## Learnings
### What we learned
- [Key insight 1]
- [Key insight 2]
### What to test next
- [Follow-up test idea 1]
- [Follow-up test idea 2]
### Impact
- **Projected lift**: [X% improvement in Y metric]
- **Business impact**: [Revenue, conversions, etc.]
## Test Repository Entry Template
For tracking all tests in a central location:
| Test ID | Name | Page | Dates | Primary Metric | Result | Lift | Link |
|---------|------|------|-------|----------------|--------|------|------|
| 001 | Hero headline test | Homepage | 1/1-1/15 | CTR | Winner | +12% | [Link] |
| 002 | Pricing table layout | Pricing | 1/10-1/31 | Plan selection | Loser | -5% | [Link] |
| 003 | Signup form fields | Signup | 2/1-2/14 | Completion | Inconclusive | +2% | [Link] |
## Quick Test Brief Template
For simple tests that don't need full documentation:
## [Test Name]
**What**: [One sentence description]
**Why**: [One sentence hypothesis]
**Metric**: [Primary metric]
**Duration**: [X weeks]
**Result**: [TBD / Winner / Loser / Inconclusive]
**Learnings**: [Key takeaway]
## Stakeholder Update Template
## A/B Test Update: [Name]
**Status**: Running / Complete
**Days remaining**: X (or complete)
**Current sample**: X% of target
### Preliminary observations
[What we're seeing - without making decisions yet]
### Next steps
[What happens next]
### Timeline
- [Date]: Analysis complete
- [Date]: Decision and recommendation
- [Date]: Implementation (if winner)
## Experiment Prioritization Scorecard
For deciding which tests to run:
| Factor | Weight | Test A | Test B | Test C |
|---|---|---|---|---|
| Potential impact | 30% | | | |
| Confidence in hypothesis | 25% | | | |
| Ease of implementation | 20% | | | |
| Risk if wrong | 15% | | | |
| Strategic alignment | 10% | | | |
| **Total** | 100% | | | |
Scoring: 1-5 (5 = best)
## Hypothesis Bank Template
For collecting test ideas:
| ID | Page/Area | Observation | Hypothesis | Potential Impact | Status |
|----|-----------|-------------|------------|------------------|--------|
| H1 | Homepage | Low scroll depth | Shorter hero will increase scroll | High | Testing |
| H2 | Pricing | Users compare plans | Comparison table will help | Medium | Backlog |
| H3 | Signup | Drop-off at email | Social login will increase completion | Medium | Backlog |
---
## Skill Companion Files
> Additional files collected from the skill directory layout.
### scripts/sample_size_calculator.py
```python
#!/usr/bin/env python3
"""
sample_size_calculator.py - A/B Test Sample Size Calculator
100% stdlib, no pip installs required.
Usage:
python3 sample_size_calculator.py # demo mode
python3 sample_size_calculator.py --baseline 0.05 --mde 0.20
python3 sample_size_calculator.py --baseline 0.05 --mde 0.20 --daily-traffic 500
python3 sample_size_calculator.py --baseline 0.05 --mde 0.20 --json
"""
import argparse
import json
import math
import sys
# ---------------------------------------------------------------------------
# Z-score approximation (scipy-free, Peter Acklam's rational approximation)
# ---------------------------------------------------------------------------
def _norm_ppf(p: float) -> float:
"""Percent-point function (inverse CDF) of the standard normal.
    Uses rational approximation, accurate to ~1e-9.
    Reference: Abramowitz & Stegun 26.2.17 / Peter Acklam's algorithm.
"""
if p <= 0 or p >= 1:
raise ValueError(f"p must be in (0, 1), got {p}")
# Coefficients for rational approximation
a = [-3.969683028665376e+01, 2.209460984245205e+02,
-2.759285104469687e+02, 1.383577518672690e+02,
-3.066479806614716e+01, 2.506628277459239e+00]
b = [-5.447609879822406e+01, 1.615858368580409e+02,
-1.556989798598866e+02, 6.680131188771972e+01,
-1.328068155288572e+01]
c = [-7.784894002430293e-03, -3.223964580411365e-01,
-2.400758277161838e+00, -2.549732539343734e+00,
4.374664141464968e+00, 2.938163982698783e+00]
d = [7.784695709041462e-03, 3.224671290700398e-01,
2.445134137142996e+00, 3.754408661907416e+00]
p_low = 0.02425
p_high = 1 - p_low
if p < p_low:
q = math.sqrt(-2 * math.log(p))
return (((((c[0]*q+c[1])*q+c[2])*q+c[3])*q+c[4])*q+c[5]) / \
((((d[0]*q+d[1])*q+d[2])*q+d[3])*q+1)
elif p <= p_high:
q = p - 0.5
r = q * q
return (((((a[0]*r+a[1])*r+a[2])*r+a[3])*r+a[4])*r+a[5])*q / \
(((((b[0]*r+b[1])*r+b[2])*r+b[3])*r+b[4])*r+1)
else:
q = math.sqrt(-2 * math.log(1 - p))
return -(((((c[0]*q+c[1])*q+c[2])*q+c[3])*q+c[4])*q+c[5]) / \
((((d[0]*q+d[1])*q+d[2])*q+d[3])*q+1)
# ---------------------------------------------------------------------------
# Core calculation
# ---------------------------------------------------------------------------
def calculate_sample_size(
baseline: float,
mde: float,
alpha: float = 0.05,
power: float = 0.80,
) -> dict:
"""
Two-proportion z-test sample size formula (two-tailed).
n = (Z_alpha/2 + Z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / (p2 - p1)^2
Args:
baseline : baseline conversion rate (e.g. 0.05 for 5%)
mde : minimum detectable effect as relative lift (e.g. 0.20 for +20%)
alpha : significance level (Type I error rate), default 0.05
power : statistical power (1 - Type II error rate), default 0.80
Returns dict with all intermediate values and results.
"""
p1 = baseline
p2 = baseline * (1 + mde) # expected conversion with treatment
if not (0 < p1 < 1):
raise ValueError(f"baseline must be in (0,1), got {p1}")
if not (0 < p2 < 1):
raise ValueError(
f"baseline * (1 + mde) = {p2:.4f} is outside (0,1). "
"Reduce mde or increase baseline."
)
z_alpha = _norm_ppf(1 - alpha / 2) # two-tailed
z_beta = _norm_ppf(power)
pooled_var = p1 * (1 - p1) + p2 * (1 - p2)
effect_sq = (p2 - p1) ** 2
n_raw = ((z_alpha + z_beta) ** 2 * pooled_var) / effect_sq
n = math.ceil(n_raw)
return {
"inputs": {
"baseline_conversion_rate": p1,
"minimum_detectable_effect_relative": mde,
"expected_variant_conversion_rate": round(p2, 6),
"significance_level_alpha": alpha,
"statistical_power": power,
},
"z_scores": {
"z_alpha_2": round(z_alpha, 4),
"z_beta": round(z_beta, 4),
},
"results": {
"sample_size_per_variation": n,
"total_sample_size": n * 2,
"absolute_lift": round(p2 - p1, 6),
"relative_lift_pct": round(mde * 100, 2),
},
"formula": (
            "n = (Z_α/2 + Z_β)² × (p1(1−p1) + p2(1−p2)) / (p2−p1)² "
            "[two-proportion z-test, two-tailed]"
),
"assumptions": [
"Two-tailed test (detecting lift in either direction)",
"Independent samples (no within-subject correlation)",
"Fixed horizon (not sequential / always-valid)",
"Binomial outcome (conversion yes/no)",
"No novelty effect correction applied",
],
}
def add_duration(result: dict, daily_traffic: int) -> dict:
"""Append estimated test duration given total daily traffic (both variants)."""
n_total = result["results"]["total_sample_size"]
days = math.ceil(n_total / daily_traffic)
weeks = round(days / 7, 1)
result["duration"] = {
"daily_traffic_both_variants": daily_traffic,
"estimated_days": days,
"estimated_weeks": weeks,
"note": (
"Assumes traffic is evenly split 50/50 between control and variant. "
            "Add ~10-20% buffer for weekday/weekend variance."
),
}
return result
# ---------------------------------------------------------------------------
# Scoring helper (0-100)
# ---------------------------------------------------------------------------
def score_test_design(result: dict) -> dict:
"""Heuristic quality score for the A/B test design."""
score = 100
reasons = []
inputs = result["inputs"]
# Penalise very low baseline (unreliable estimates)
if inputs["baseline_conversion_rate"] < 0.01:
score -= 15
reasons.append("Baseline <1%: high variance, consider aggregating more data first.")
# Penalise tiny MDE (will need enormous sample)
mde = inputs["minimum_detectable_effect_relative"]
if mde < 0.05:
score -= 20
reasons.append("MDE <5%: very small effect, experiment may take months.")
elif mde < 0.10:
score -= 10
reasons.append("MDE <10%: moderately small effect size.")
# Penalise overly aggressive alpha
if inputs["significance_level_alpha"] > 0.10:
score -= 15
        reasons.append("α >10%: high false-positive risk.")
# Penalise low power
if inputs["statistical_power"] < 0.80:
score -= 20
reasons.append("Power <80%: elevated risk of missing real effects (Type II error).")
# Duration penalty (if available)
dur = result.get("duration")
if dur:
days = dur["estimated_days"]
if days > 90:
score -= 20
reasons.append(f"Test duration {days}d >90 days: novelty/seasonal effects likely.")
elif days > 30:
score -= 10
reasons.append(f"Test duration {days}d >30 days: monitor for external confounders.")
score = max(0, score)
return {
"design_quality_score": score,
"score_interpretation": _score_label(score),
"issues": reasons if reasons else ["No major design issues detected."],
}
def _score_label(s: int) -> str:
if s >= 90: return "Excellent"
if s >= 75: return "Good"
if s >= 60: return "Fair"
if s >= 40: return "Poor"
return "Critical"
# ---------------------------------------------------------------------------
# Pretty-print
# ---------------------------------------------------------------------------
def pretty_print(result: dict, score: dict) -> None:
    inp = result["inputs"]
    res = result["results"]
    zs = result["z_scores"]
    print("\n" + "=" * 60)
    print(" A/B TEST SAMPLE SIZE CALCULATOR")
    print("=" * 60)
    print("\nINPUTS")
    print(f"  Baseline conversion rate : {inp['baseline_conversion_rate']*100:.2f}%")
    print(f"  Variant conversion rate  : {inp['expected_variant_conversion_rate']*100:.2f}%")
    print(f"  Minimum detectable effect: {inp['minimum_detectable_effect_relative']*100:.1f}% relative "
          f"(+{res['absolute_lift']*100:.3f}pp absolute)")
    print(f"  Significance level (α)   : {inp['significance_level_alpha']}")
    print(f"  Statistical power        : {inp['statistical_power']*100:.0f}%")
    print("\nFORMULA")
    print(f"  {result['formula']}")
    print(f"  Z_α/2 = {zs['z_alpha_2']}   Z_β = {zs['z_beta']}")
    print("\nRESULTS")
    print(f"  Sample size per variation : {res['sample_size_per_variation']:,}")
    print(f"  Total sample size (both)  : {res['total_sample_size']:,}")
    if "duration" in result:
        d = result["duration"]
        print(f"\nDURATION ESTIMATE (traffic: {d['daily_traffic_both_variants']:,}/day)")
        print(f"  Estimated test duration : {d['estimated_days']} days (~{d['estimated_weeks']} weeks)")
        print(f"  Note: {d['note']}")
    print("\nASSUMPTIONS")
    for a in result["assumptions"]:
        print(f"  - {a}")
    print(f"\nDESIGN QUALITY SCORE: {score['design_quality_score']}/100 ({score['score_interpretation']})")
    for issue in score["issues"]:
        print(f"  - {issue}")
    print()
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def parse_args():
parser = argparse.ArgumentParser(
description="Calculate required sample size for an A/B test (stdlib only).",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=__doc__,
)
parser.add_argument("--baseline", type=float, default=None,
help="Baseline conversion rate (e.g. 0.05 for 5%%)")
parser.add_argument("--mde", type=float, default=None,
help="Minimum detectable effect as relative lift (e.g. 0.20 for +20%%)")
    parser.add_argument("--alpha", type=float, default=0.05,
                        help="Significance level α (default: 0.05)")
    parser.add_argument("--power", type=float, default=0.80,
                        help="Statistical power 1-β (default: 0.80)")
parser.add_argument("--daily-traffic", type=int, default=None,
help="Total daily visitors across both variants (for duration estimate)")
parser.add_argument("--json", action="store_true",
help="Output results as JSON")
return parser.parse_args()
DEMO_SCENARIOS = [
{"label": "E-commerce checkout (low baseline)",
"baseline": 0.03, "mde": 0.20, "alpha": 0.05, "power": 0.80, "daily_traffic": 800},
{"label": "SaaS free-trial signup (medium baseline)",
"baseline": 0.08, "mde": 0.15, "alpha": 0.05, "power": 0.80, "daily_traffic": 2000},
{"label": "Button CTA (high baseline)",
"baseline": 0.25, "mde": 0.10, "alpha": 0.05, "power": 0.80, "daily_traffic": 5000},
]
def main():
args = parse_args()
demo_mode = (args.baseline is None and args.mde is None)
if demo_mode:
        print("DEMO MODE - running 3 sample scenarios\n")
all_results = []
for sc in DEMO_SCENARIOS:
res = calculate_sample_size(sc["baseline"], sc["mde"], sc["alpha"], sc["power"])
res = add_duration(res, sc["daily_traffic"])
sc_score = score_test_design(res)
res["scenario"] = sc["label"]
res["score"] = sc_score
all_results.append(res)
if not args.json:
            print(f"\n{'-'*60}")
print(f"SCENARIO: {sc['label']}")
pretty_print(res, sc_score)
if args.json:
print(json.dumps(all_results, indent=2))
return
# Single calculation mode
if args.baseline is None or args.mde is None:
print("Error: --baseline and --mde are required (or omit both for demo mode).", file=sys.stderr)
sys.exit(1)
result = calculate_sample_size(args.baseline, args.mde, args.alpha, args.power)
if args.daily_traffic:
result = add_duration(result, args.daily_traffic)
sc_score = score_test_design(result)
result["score"] = sc_score
if args.json:
print(json.dumps(result, indent=2))
else:
pretty_print(result, sc_score)
if __name__ == "__main__":
    main()
```
Source: https://github.com/alirezarezvani/claude-skills#marketing-skill-ab-test-setup
Content curated from original sources, copyright belongs to authors