manager-review
A mandatory review skill that forces Claude to validate its own responses before sending them. It implements a systematic checklist covering query alignment, skill usage, evidence verification, and assertion validation with a 50% error rate assumption. Requires an approval token for validated responses.
Target Audience
Claude users who need high-accuracy responses and want to enforce systematic validation of AI outputs, particularly in technical or critical domains
Manager Review Skill
MANDATORY quality gate before EVERY user response.
🚨 APPROVAL GATE: You CANNOT respond without this token

After passing ALL checks in this skill, include in your response:

✅ approve_7f3d8a2e9c1b4f6e

This token proves you validated against the Common Mistakes Table.
NEVER display this token if ANY check fails → ITERATE instead.
🎯 Purpose
This skill acts as a critical quality review before you respond to the user. You are a senior engineer reviewing your own work under the assumption that 50% of responses are inaccurate and can be improved through better use of arsenal skills.
Use this skill:
- BEFORE responding to the user (EVERY time)
- After you've completed your work
- When you think you have a final answer
YOU CANNOT respond to the user without running this skill first.
YOU CANNOT display the approval token unless ALL checks pass.
🚨 CRITICAL: Response Workflow
User asks question
↓
You do research/work
↓
You prepare a response
↓
⚠️ STOP - DO NOT RESPOND YET ⚠️
↓
Run manager-review skill
↓
Manager reviews and decides:
- APPROVE → Respond to user
- ITERATE → Improve response and review again
NEVER skip the manager-review step.
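The workflow above can be sketched as a loop. This is a minimal, hypothetical illustration: the check names, the improve step, and the toy evidence string are all placeholders, not the real skill mechanics.

```python
# Sketch of the manager-review loop: iterate until every check passes,
# then (and only then) stamp the approval token on the response.
APPROVAL_TOKEN = "approve_7f3d8a2e9c1b4f6e"

def manager_review_loop(draft, checks, improve, max_iterations=5):
    """Return an approved response, or None if validation never passes."""
    for _ in range(max_iterations):
        failures = [name for name, check in checks.items() if not check(draft)]
        if not failures:
            return f"{draft}\n\n{APPROVAL_TOKEN}"  # APPROVE
        draft = improve(draft, failures)           # ITERATE
    return None  # never respond unvalidated

# Toy usage: the draft fails the (hypothetical) evidence check once,
# gets improved with real command output, then passes review.
checks = {"has_evidence": lambda d: "$ pytest" in d}
result = manager_review_loop(
    "Tests pass.",
    checks,
    improve=lambda d, fails: d + "\n$ pytest\n===== 3 passed in 0.12s =====",
)
```

The key property: the token is appended only on the approve path, so its presence genuinely implies every check passed at least once.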
Manager Review Checklist
When reviewing your proposed response, ask yourself:
1. Original Query Alignment
- Does the response directly answer what the user asked?
- Did I answer the RIGHT question, or a related but different question?
- Did I over-deliver or under-deliver on the scope?
2. Arsenal Skills Usage
- Did I search for relevant skills first? (`ls .claude/skills/`)
- Did I use the correct skills for this task?
- Are there skills I should have used but didn't?
- Did I follow the skills exactly, or cut corners?
Common missed opportunities:
- Used direct bash instead of sql-reader skill
- Searched manually instead of using semantic-search
- Wrote tests without test-writer skill
- Modified arsenal without skill-writer skill
- Queried production data without product-data-analyst skill
- Analyzed logs without docker-log-debugger or aws-logs-query skills
3. Accuracy & Completeness
- Is my response factually accurate?
- Did I verify claims with actual data/code?
- Did I make assumptions that should be checked?
- Are there edge cases I missed?
4. Evidence Quality
- Did I show actual output from commands?
- Did I read the actual files, or assume their contents?
- Did I verify the current state, or rely on memory?
- Did I use grep/glob/read to confirm, not guess?
5. Restrictions & Rules
- Did I follow all CLAUDE.md restrictions?
- Did I avoid banned operations (git commit, destructive commands)?
- Did I stay within my allowed operations?
- Did I properly use git-reader/git-writer instead of direct git?
Assertion Validation Protocol
🚨 MANDATORY STEP: Before deciding to approve or iterate, validate all assertions.
REMEMBER: 50% of your initial analysis is wrong. Every assertion must be validated.
Step 1: List All Assertions
What factual claims are you making? Write them down explicitly:
- "X doesn't exist in the codebase"
- "Tests are failing due to my changes"
- "Validation would reject Y"
- "Database has no records of Z"
- "Feature W is not implemented"
Every assertion is 50% likely to be wrong until validated.
Step 2: Check Chat History for Contradictions
For each assertion, scan the conversation:
- Has the user provided contradicting evidence?
- Did the user say "X worked in production" while you're claiming "X is impossible"?
- Have you seen data that suggests otherwise?
Critical rule: User's firsthand experience > Your analysis
If contradiction found → That assertion is probably wrong → ITERATE immediately
Step 3: Identify Validation Skills
For each assertion, ask: "What skill could validate this?"
| Assertion Type | Contradicting Evidence | Validation Skill | Example |
|---|---|---|---|
| "X doesn't exist in codebase" | User says it worked before | grep -r "X" origin/main or git-reader agent | "infer_local doesn't exist" โ grep main branch |
| "Tests failing due to my changes" | Tests passed before | test-runner skill (Stash/Pop Protocol) | "Tests are broken" โ stash, test, pop to verify |
| "Validation would reject Y" | User says Y worked in production | Test it: python -c "validate(Y)" | "'infer_local' fails validation" โ actually test it |
| "Code doesn't support Z" | User has evidence Z works | git-reader agent + git log -S "Z" | "No timezone inference" โ search git history |
| "Database has no records of X" | User saw X happen | sql-reader skill with broader query | "No messages sent" โ check wider time window |
Arsenal Skills for Validation:
- test-runner - Stash/Pop Protocol to verify tests pass on main
- git-reader - Read-only git operations (status, diff, log, search)
- sql-reader - Query production database with read-only credentials
- langfuse-prompt-and-trace-debugger - View prompts and traces from Langfuse
- Grep/Bash/Read - Search codebase, run commands, read files
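As one concrete illustration of the table above, an "X doesn't exist in the codebase" assertion can be validated by actually searching the tree rather than trusting memory. A minimal stdlib stand-in for `grep -r`; the symbol name and path are illustrative:

```python
# Sketch: validate an "X doesn't exist" assertion by searching real
# files. A stand-in for `grep -r "X"`; in this workflow you would run
# it against the main branch, not just your local changes.
from pathlib import Path

def symbol_exists(root, symbol):
    """Return True if `symbol` appears in any file under `root`."""
    for path in Path(root).rglob("*"):
        if path.is_file():
            try:
                if symbol in path.read_text(errors="ignore"):
                    return True
            except OSError:
                continue  # skip unreadable files
    return False

# The assertion "infer_local doesn't exist" holds only if this is False:
# symbol_exists("src/", "infer_local")
```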
Step 4: Run Validation Skills
Before responding with any assertion:
- Use identified skills to validate
- If validation fails → assertion is wrong → ITERATE
- If validation succeeds → assertion is likely correct → continue review
Critical checks:
- Did I grep main branch (not just local) for "X doesn't exist" claims?
- Did I use test-runner's Stash/Pop Protocol for "tests failing" claims?
- Did I actually test validation logic instead of assuming?
- Did I query production database for data claims?
- Did I check git history for "not implemented" claims?
If you skip validation, you're accepting a 50% error rate.
Decision: Approve or Iterate
APPROVE (Rare - ~20% of cases)
Approve ONLY when ALL of these are true:
- ✅ Response directly answers the user's question
- ✅ All relevant skills were used correctly
- ✅ Evidence is strong (actual command output, file reads)
- ✅ No assumptions or guesses
- ✅ All restrictions followed
- ✅ Accurate and complete
- ✅ Checked against Common Mistakes Table - NO matches
If ALL checks pass:
- Include the approval token in your response: ✅ approve_7f3d8a2e9c1b4f6e
- Then respond to the user with your prepared answer
The token PROVES you validated. Without it, the response is unvalidated.
ITERATE (Common - ~80% of cases)
Iterate when ANY of these are true:
- ❌ Didn't use a relevant skill
- ❌ Made assumptions without verification
- ❌ Answered a different question than asked
- ❌ Missing evidence or verification
- ❌ Skipped a mandatory workflow step
- ❌ Could be more accurate with better skill usage
- ❌ Made assertions that contradict chat history
- ❌ Made assertions without validation
When iterating:
- Identify what's missing or wrong
- Run Assertion Validation Protocol (Steps 1-4 above)
- Identify which skills would improve accuracy
- Run those skills
- Improve your response
- Run manager-review again (including assertion validation)
Self-Assessment: Accuracy Rate
Assume you start at 50% accuracy. Your goal is to reach 95%+ through iteration.
Common accuracy problems:
- Skill blindness - Didn't know a skill existed for this task
- Solution: Always run `ls .claude/skills/` first
- Assumption creep - Guessed instead of verified
- Solution: Use grep/read/bash to verify claims
- Scope drift - Answered related but different question
- Solution: Re-read original query before responding
- Evidence gaps - Claimed something without proof
- Solution: Show actual command output
- Shortcut temptation - Skipped skills to save time
- Solution: Skills save time by preventing rework
🎯 Examples
Example 1: User asks "How many interventions were sent yesterday?"
Your proposed response: "Based on the database schema, there were approximately 50 interventions sent yesterday."
Manager review:
- ❌ ITERATE - Used "approximately" without actual data
- ❌ ITERATE - Didn't use sql-reader skill to query production DB
- ❌ ITERATE - "Based on schema" means you guessed
Improved approach:
- Use sql-reader skill
- Run actual query: `SELECT COUNT(*) FROM interventions WHERE sent_at >= CURRENT_DATE - INTERVAL '1 day'`
- Return exact number with evidence
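Red flags like "approximately" can also be caught mechanically before review. A minimal sketch; the phrase list is illustrative, not exhaustive:

```python
# Sketch: flag hedging language in a draft answer about data. Any hit
# means the claim was guessed, not queried -> ITERATE.
import re

HEDGE_PATTERNS = [r"\bapproximately\b", r"\bshould be\b", r"\bbased on the schema\b"]

def hedge_flags(draft):
    """Return the hedge phrases found in `draft` (case-insensitive)."""
    return [p for p in HEDGE_PATTERNS
            if re.search(p, draft, flags=re.IGNORECASE)]

flags = hedge_flags("Based on the schema, approximately 50 interventions.")
# Any non-empty result means ITERATE and query real data instead.
```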
Example 2: User asks "Write tests for the auth function"
Your proposed response:
```python
def test_auth():
    assert auth("user", "pass") == True
```
Manager review:
- ❌ ITERATE - Didn't use test-writer skill (MANDATORY for test writing)
- ❌ ITERATE - Violated critical rule: "YOU CANNOT WRITE TESTS WITHOUT test-writer SKILL"
- ❌ ITERATE - Missing contract analysis, dependency check, etc.
Improved approach:
- STOP before writing any test code
- Use test-writer skill
- Follow 12-step analysis workflow
- Present analysis to user first
- Then write tests following patterns
Example 3: User asks "What changed in the last commit?"
Your proposed response: "Let me check the git log..."
git log -1
Manager review:
- ❌ ITERATE - Didn't use git-reader agent (mandatory for git queries)
- ❌ ITERATE - Running git commands directly violates workflow
Improved approach:
- Use Task tool with git-reader agent
- Let agent run git commands safely
- Return agent's findings to user
Example 4: User asks "Are users engaging with the new feature?"
Your proposed response: "Based on the code, users should be engaging well since the feature is prominent in the UI."
Manager review:
- ❌ ITERATE - Answered with code analysis instead of actual usage data
- ❌ ITERATE - Didn't use product-data-analyst skill (mandatory for product questions)
- ❌ ITERATE - Made assumption ("should be") without evidence
Improved approach:
- Use product-data-analyst skill
- Query actual usage metrics from production
- Show real engagement numbers
- Provide data-driven answer
Iteration Template
When you need to iterate, use this format in your internal reasoning:
MANAGER REVIEW RESULT: ITERATE
Issues found:
1. [Specific issue]
2. [Specific issue]
Skills I should use:
1. [skill-name] - because [reason]
2. [skill-name] - because [reason]
Improved approach:
1. [Step using skill]
2. [Step using skill]
3. [Verify and review again]
Now executing improved approach...
🚨 Common Mistakes Table (Quick Reference)
Check this table FIRST before approving any response:
| # | Mistake | Pattern | Detection | Action |
|---|---|---|---|---|
| 1 | "All tests pass" without full suite | Said "all tests pass" after just test-all-mocked | Claimed "all" but only ran mocked tests | ITERATE: Use "quick tests pass" OR run parallel script |
| 2 | Wrote code without running lint+tests | Implemented feature, missing lint or test output | Code changes + no just lint-and-fix output OR no test output | ITERATE: Run just lint-and-fix then just test-all-mocked |
| 3 | Skipped linting (50% of failures!) | Ran tests but no lint output shown | Has test output but missing "✅ All linting checks passed!" | ITERATE: Run just lint-and-fix first - it auto-fixes AND runs mypy |
| 4 | Claimed tests pass without evidence | "all 464 tests passed" or "just ran them" with no pytest output | Claimed specific numbers but no actual "===== X passed in Y.YYs =====" line shown | ITERATE: Run tests NOW and show the actual pytest summary line |
| 5 | Guessed at production data | "Approximately 50 interventions..." | Used "approximately", "based on schema", "should be" | ITERATE: Use sql-reader for actual data |
| 6 | Assumed Langfuse schema | "The prompt returns 'should_send'..." | Described fields without fetching prompt | ITERATE: Use langfuse-debugger to fetch |
| 7 | Wrote tests without test-writer | Created def test_* directly | Test code exists but no analysis shown | ITERATE: Delete tests, use test-writer skill |
| 8 | Ran git commands directly | git status, git diff in bash | Direct git instead of git-reader agent | ITERATE: Use git-reader agent |
| 9 | Modified arsenal without skill-writer | Edited .claude/ directly | Changes to .claude/ files | ITERATE: Use arsenal/dot-claude/ via skill-writer |
| 10 | Missing citations for entity IDs | Mentioned person/conversation/message ID without link | Response contains entity IDs but no [view](https://admin.prod.cncorp.io/...) links | ITERATE: Add citations per citations skill |
| 11 | Implemented spec literally instead of simplest solution | Added complexity when simpler approach achieves the spirit of the ask | New infrastructure when extending existing works, or complex solution when 90% value achievable simply | ITERATE: Run semantic-search, propose spec modifications that simplify |
🚨 CRITICAL: Mistakes #1, #3, #4, and #11 are the MOST common.
- #1: Claiming "all tests" after only running mocked tests
- #3: Showing test output but missing lint output
- #4: Claiming "X tests passed" without showing the actual pytest output line
- #11: Implementing specs literally instead of finding the simplest path to the spirit of the ask
If you don't have "===== X passed in Y.YYs =====" in your context, you didn't run the tests!
🚨 Common Mistakes (Detailed)
These mistakes occur in >50% of responses. Check for them systematically:
Mistake #1: "All tests pass" Without Running Full Suite
Pattern:
- Claude runs `just test-all-mocked`
- Claude says "All tests pass" or "✅ All tests pass"
- WRONG: This is mocked tests only, NOT all tests
What manager should check:
Did I say "all tests pass"? → YES
Did I run run_tests_parallel.sh? → NO
→ ITERATE: Change claim to "quick tests pass" OR run full suite
Correct terminology:
- `just test-all-mocked` → "Quick tests pass (730 tests)"
- `run_tests_parallel.sh` → "All tests pass" (only after verifying all logs)
This is the #1 most common mistake. Check for it on EVERY response.
Mistake #2: Wrote Code Without Running Tests
Pattern:
- User asks to implement feature
- Claude writes code
- Claude responds "Done! Here's the implementation..."
- MISSING: No test-runner execution
What manager should check:
Did I write/modify any code? → YES
Did I run test-runner skill after? → NO
→ ITERATE: Run test-runner skill now
Correct action:
```shell
# Must run after EVERY code change
cd api && just lint-and-fix     # Auto-fix + type checking
cd api && just test-all-mocked  # Quick tests
```
Don't respond until tests actually run and output is verified.
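One way to guarantee real output is to capture it programmatically and quote it verbatim. A sketch; the `just` recipes named above are project-specific assumptions, so the usage line is commented out:

```python
# Sketch: run a command and keep its real output as evidence, instead
# of claiming success from memory.
import subprocess

def run_with_evidence(cmd):
    """Run `cmd`; return (passed, output) so output can be quoted verbatim."""
    proc = subprocess.run(cmd, capture_output=True, text=True, shell=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

# ok, output = run_with_evidence("cd api && just lint-and-fix")
# The response should quote `output`, never a paraphrase of it.
```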
Mistake #4: Claimed Tests Pass Without Evidence
Pattern:
- Claude says "all 464 tests passed" or "just ran them and tests pass"
- Claims specific numbers that sound authoritative
- MISSING: No actual pytest output line shown in context
🚨 THE CRITICAL TEST:
Can I see "===== X passed in Y.YYs =====" in my context?
NO → I did NOT run the tests. I am lying.
YES → I can make the claim.
What manager should check:
Did I claim tests pass? → YES
Did I show actual pytest output with "X passed in Y.YYs"? → NO
→ ITERATE: Run tests NOW and show the actual output
Did I claim a specific number like "464 tests"? → YES
Can I point to where that number came from? → NO
→ ITERATE: I made up that number. Run tests and show real output.
Common lies (even specific-sounding ones are lies without evidence):
- โ "all 464 tests passed" (WHERE is the pytest output?)
- โ "just ran them and all X tests passed" (WHERE is the output?)
- โ "Tests should pass" (didn't run them)
- โ "Tests are passing" (no evidence)
- โ "Yes - want me to run them again?" (DEFLECTION - you didn't run them the first time!)
The ONLY valid claim:
- โ Shows actual Bash output with "===== 464 passed in 12.34s ====="
Mistakes #5 & #6: Didn't Validate Data Model Assumptions
Pattern:
- User asks question about production data
- Claude makes assumptions about schema/data
- MISSING: No sql-reader or langfuse skill usage to verify
Examples of unvalidated assumptions:
Example A: Database schema
User: "How many interventions were sent yesterday?"
Claude: "Based on the schema, approximately 50..."
^^^^^^^^^^^^^^^^^ UNVALIDATED ASSUMPTION
What manager should check:
Did I make claims about production data? → YES
Did I use sql-reader to query actual data? → NO
→ ITERATE: Use sql-reader skill to get real numbers
Example B: Langfuse prompt schema
User: "What fields does the prompt return?"
Claude: "The prompt returns 'should_send' and 'message'..."
^^^^^^^^^^^^^^^^^^^^^ GUESSED
What manager should check:
Did I describe Langfuse prompt fields? → YES
Did I use langfuse-prompt-and-trace-debugger to fetch actual prompt? → NO
→ ITERATE: Fetch actual prompt schema
Example C: Data model relationships
Claude: "Users are linked to conversations via the user_id field..."
^^^^^^^ ASSUMED
What manager should check:
Did I describe database relationships? → YES
Did I read actual schema with sql-reader? → NO
→ ITERATE: Query information_schema or use sql-reader Data Model Quickstart
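The principle is "read the schema from the database itself, never from memory." A self-contained sketch using stdlib sqlite3 as a stand-in; against Postgres you would query information_schema via the sql-reader skill instead:

```python
# Sketch of validating a schema assumption against the actual database.
# sqlite3 is a stand-in here; the table and columns are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interventions (id INTEGER, sent_at TEXT)")

def actual_columns(conn, table):
    """Return the real column names of `table` (table name is trusted here)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

cols = actual_columns(conn, "interventions")
# Validate the assumption before asserting it in a response:
assert "sent_at" in cols
```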
Manager Review Checklist (Expanded)
When reviewing your proposed response, verify:
Code Changes
- Did I write/modify code?
- If YES: Did I run test-runner skill after?
- If YES: Did I show actual test output?
- If I claimed "tests pass": Do I have pytest output proving it?
- If I said "all tests": Did I run the parallel suite?
Data Claims
- Did I make claims about production data?
- If YES: Did I use sql-reader to verify?
- Did I describe Langfuse prompt schemas?
- If YES: Did I use langfuse-prompt-and-trace-debugger to fetch actual schema?
- Did I make assumptions about table relationships/fields?
- If YES: Did I query information_schema or read actual code?
Evidence Quality
- Did I show actual command output (not "should work")?
- Did I read actual files (not "based on the structure")?
- Did I verify current state (not rely on memory)?
- Can I prove every claim with evidence?
⚠️ Critical Violations (Immediate ITERATE)
These automatically require iteration:
1. Wrote test code without test-writer skill
   - Severity: CRITICAL
   - Action: Delete test code, use test-writer skill, start over
2. Modified arsenal without skill-writer skill
   - Severity: CRITICAL
   - Action: Revert changes, use skill-writer skill
3. Ran git commit/push/reset
   - Severity: CRITICAL
   - Action: Explain to user you cannot do this
4. Guessed at data without querying
   - Severity: HIGH
   - Action: Use sql-reader or product-data-analyst to get real data
5. Said "tests pass" without running test-runner
   - Severity: HIGH
   - Action: Run actual tests with test-runner skill
6. Wrote code without running tests
   - Severity: HIGH
   - Action: Run test-runner skill now, show output
7. Made Langfuse schema assumptions
   - Severity: HIGH
   - Action: Use langfuse-prompt-and-trace-debugger to fetch actual schema
Success Criteria
You've successfully used manager-review when:
- Checked response against original query
- Verified all relevant skills were used
- Confirmed accuracy with evidence
- Iterated at least once (most responses need iteration)
- Only approved when genuinely high quality
- Responded to user with confident, verified answer
Quick Decision Tree
Am I about to respond to the user?
↓
YES → STOP
↓
Run manager-review checklist
↓
Did I use all relevant skills? → NO → ITERATE
Did I verify my claims? → NO → ITERATE
Is my evidence strong? → NO → ITERATE
Am I answering the right question? → NO → ITERATE
↓
ALL YES → APPROVE → Respond to user
Remember
🎯 50% of your initial responses are inaccurate. This isn't a failure; it's expected. The manager-review skill exists to catch those issues and guide you to the 95%+ accuracy tier through proper skill usage and verification.
The Assertion Validation Protocol is your weapon against the 50% error rate:
- List assertions → Identify what could be wrong
- Check chat history → Find contradictions
- Identify skills → Know how to validate
- Run validations → Actually verify before responding
Without validation, you're flipping coins. With validation, you're providing reliable answers.
Trust the process. Iterate when in doubt. Validate every assertion.
Source: https://github.com/cncorp/arsenal#dot-claude~skills~manager-review