Sanitize Git Repository
This skill provides guidance for systematically identifying and replacing sensitive information in git repositories, including API keys, tokens, passwords, and other credentials.
When to Use This Skill
- Sanitizing a repository before sharing or open-sourcing
- Removing accidentally committed secrets from a codebase
- Replacing hardcoded credentials with placeholders
- Auditing a repository for sensitive information
- Preparing code for security review
Recommended Approach
Phase 1: Comprehensive Discovery
Build a complete inventory of all sensitive values before making any changes. This prevents the common mistake of discovering additional secrets mid-process.
Common Secret Patterns to Search:
| Type | Pattern/Prefix | Example |
|---|---|---|
| AWS Access Keys | AKIA[A-Z0-9]{16} | AKIAIOSFODNN7EXAMPLE |
| AWS Secret Keys | 40-char base64 strings | Often near access keys |
| GitHub Tokens | ghp_, gho_, ghs_, ghr_ | ghp_xxxxxxxxxxxx |
| Hugging Face Tokens | hf_ | hf_xxxxxxxxxxxx |
| Generic API Keys | api[_-]?key, apikey | Varies |
| Bearer Tokens | bearer, token | Varies |
| Private Keys | -----BEGIN.*PRIVATE KEY----- | RSA/EC keys |
| Database URLs | postgres://, mysql://, mongodb:// | Connection strings |
| Password Fields | password, passwd, pwd | Varies |
Discovery Strategy:
- Search for known prefixes and patterns using regex
- Search for common variable names (API_KEY, SECRET, TOKEN, PASSWORD, CREDENTIAL)
- Check configuration files (.env, config.*, settings.*, *.json, *.yaml, *.yml)
- Examine test fixtures and mock data files
- Check embedded data within other formats (JSON encoded in strings, base64 encoded content)
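The first two strategy items can be sketched as a pair of shell functions. This is a minimal illustration, not a complete scanner: the function names `scan_secrets` and `scan_secret_names` are hypothetical, the pattern list only mirrors the table above, and the `-I` and `--exclude-dir` options assume GNU or BSD grep rather than strict POSIX grep.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and pattern lists are illustrative assumptions.

# Search for known secret prefixes and formats (case-sensitive).
scan_secrets() {
  local dir="${1:-.}"
  # -r recurse, -n line numbers, -I skip binary files, -E extended regex
  grep -rnIE --exclude-dir=.git \
    -e 'AKIA[A-Z0-9]{16}' \
    -e '(ghp|gho|ghs|ghr|hf)_[A-Za-z0-9]+' \
    -e '-----BEGIN.*PRIVATE KEY-----' \
    -e '(postgres|mysql|mongodb)://' \
    "$dir"
}

# Search for suspicious variable names (case-insensitive, so expect noise).
scan_secret_names() {
  local dir="${1:-.}"
  grep -rnIiE --exclude-dir=.git \
    -e 'api[_-]?key|secret|token|passw(or)?d|pwd|credential' \
    "$dir"
}
```

Running `scan_secrets .` from the repository root prints one `file:line:content` entry per hit. The name-based search is deliberately broad and will report false positives that need manual review.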
Important Locations to Check:
- Configuration files and environment templates
- Test fixtures and sample data
- Documentation and README files
- CI/CD configuration files
- Docker and deployment configurations
- JSON/YAML data files (secrets may be embedded in structured data)
- Git diff outputs or patch files stored in the repository
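To make sure these locations are not overlooked, it can help to list candidate files before searching their contents. The sketch below assumes a hypothetical helper named `list_candidate_files`; the file name globs are examples drawn from the lists above and should be extended for the repository at hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list files that commonly hold credentials.
# The globs are illustrative assumptions, not an exhaustive set.
list_candidate_files() {
  local dir="${1:-.}"
  # Prune .git so git internals are never listed.
  find "$dir" -name .git -prune -o -type f \( \
      -name '.env*' -o -name 'config.*' -o -name 'settings.*' \
      -o -name '*.json' -o -name '*.yaml' -o -name '*.yml' \
      -o -name 'Dockerfile*' -o -name '*.patch' -o -name '*.diff' \
    \) -print
}
```

The output is one path per line, which can be reviewed by hand or fed to a content search.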
Phase 2: Inventory Documentation
Before making changes, create a documented list of:
- Each unique sensitive value found
- All file locations where each value appears
- The exact string to match (including surrounding context if needed for uniqueness)
- The placeholder to use for replacement
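One way to produce such a list is to let grep print only the matched text and then group it by value. The following is a sketch under assumptions: `build_inventory` is a hypothetical name, the output format (value, tab, `file:line`) is just one possible layout, and file paths are assumed not to contain colons.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit "value<TAB>file:line" for every match of a pattern.
# Assumes file paths contain no ':' characters.
build_inventory() {
  local dir="$1" pattern="$2"
  # -o prints only the matched text, -H forces the file name prefix.
  grep -rnoHIE --exclude-dir=.git -e "$pattern" "$dir" |
    awk -F: '{
      value = $0
      sub(/^[^:]*:[^:]*:/, "", value)   # strip the "file:line:" prefix
      print value "\t" $1 ":" $2
    }' |
    sort
}
```

For example, `build_inventory . 'AKIA[A-Z0-9]{16}' > ../inventory.tsv` writes the inventory outside the repository, so the inventory file itself does not become another place where the secrets live.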
Placeholder Conventions:
- Use descriptive placeholders that indicate the type: <AWS_ACCESS_KEY>, <GITHUB_TOKEN>, <DATABASE_PASSWORD>
- Maintain consistency across all replacements
- Match the format specified by the user if provided
Phase 3: Systematic Replacement
Execute replacements methodically:
- Work through the inventory one secret at a time
- For each secret, replace ALL occurrences across ALL files
- Verify each replacement immediately after making it
- Use exact string matching to avoid unintended modifications
Replacement Best Practices:
- Read files before editing to ensure exact string matching
- Handle whitespace and formatting precisely
- Consider secrets embedded in JSON or other structured formats (may require escaping)
- Use batch replacements when the same value appears multiple times in one file
Phase 4: Verification
After all replacements:
- Re-run all original discovery searches to confirm no secrets remain
- Search for partial matches of sensitive values
- Run any provided test suites to validate the sanitization
- Check that placeholders are properly formatted
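The re-check can be automated against the inventory. The sketch below assumes a hypothetical `verify_clean` function and a plain text file with one sensitive value per line, stored outside the repository and containing no blank lines (a blank line in a grep pattern file matches every line).

```shell
#!/usr/bin/env bash
# Hypothetical helper: fail if any value from the inventory list is still present.
# Assumes the list file lives outside the scanned directory and has no blank lines.
verify_clean() {
  local dir="$1" list="$2"
  # -F fixed strings, -f read one pattern per line from the list file.
  if grep -rnIF --exclude-dir=.git -f "$list" "$dir"; then
    echo "FAIL: sensitive values still present" >&2
    return 1
  fi
  echo "OK: no listed values found"
}
```

Because the function returns a non-zero status on failure, it can gate later steps, for example `verify_clean . ../secrets.txt && echo "ready for review"`.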
Common Pitfalls
1. Incomplete Initial Discovery
Problem: Secrets are discovered incrementally during the replacement process, requiring backtracking.
Prevention: Invest time upfront in comprehensive searching using multiple patterns and checking all file types, including data files and embedded content.
2. Secrets in Unexpected Locations
Problem: Secrets appear in JSON files, embedded strings, test fixtures, or encoded data that aren't found by simple searches.
Prevention: Search recursively through all file types. Check JSON and YAML files specifically. Look for base64-encoded content that might contain secrets.
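A rough way to look inside base64-encoded content is to extract long base64-looking runs, decode each one, and search the decoded text. The sketch below is a heuristic under several assumptions: `scan_base64` is a hypothetical name, the 24-character minimum is arbitrary, `base64 -d` follows the GNU coreutils spelling, and a run that is preceded directly by other base64 characters will decode misaligned and be missed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report base64 blobs whose decoded form contains a known prefix.
# Heuristic only; assumes GNU-style `base64 -d`.
scan_base64() {
  local dir="${1:-.}" blob decoded
  # -o print only the match, -h omit file names, so each line is one blob.
  grep -rhoIE --exclude-dir=.git -e '[A-Za-z0-9+/]{24,}={0,2}' "$dir" | sort -u |
    while IFS= read -r blob; do
      # Decode, ignoring invalid input; drop NUL bytes the shell cannot store.
      decoded=$(printf '%s' "$blob" | base64 -d 2>/dev/null | tr -d '\0')
      if printf '%s' "$decoded" |
          LC_ALL=C grep -aqE 'AKIA[A-Z0-9]{16}|(ghp|gho|ghs|ghr|hf)_'; then
        printf '%s\n' "$blob"
      fi
    done
}
```

Each reported blob is an encoded string to add to the inventory. Locating the files that contain it is then a fixed-string search for that blob.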
3. Exact String Matching Failures
Problem: Edit operations fail due to whitespace differences or character encoding issues.
Prevention: Always read the target file first to understand exact formatting. Copy strings exactly as they appear in the file, including whitespace.
4. Missing Occurrences
Problem: Multiple occurrences of the same secret exist, but only some are replaced.
Prevention: After each replacement, immediately verify by searching for the original value again. Use replace-all functionality when available.
5. Git History Considerations
Problem: Secrets remain in git history even after being removed from current files.
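Prevention: Clarify with the user whether git history sanitization is required. If so, a history-rewriting tool such as git filter-repo or BFG Repo-Cleaner is needed; the Git documentation discourages the older git filter-branch in favor of git filter-repo. This is a separate, more complex operation.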
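Before deciding whether a history rewrite is needed, it can be useful to check whether a given value ever appeared in a commit. The sketch below uses git's pickaxe option (`git log -S`), which lists commits that change the number of occurrences of a string; the function name `secret_in_history` is a hypothetical choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list commits that added or removed a given string.
secret_in_history() {
  local repo="$1" secret="$2"
  # --all covers every ref, not only the current branch.
  git -C "$repo" log --all --oneline -S"$secret"
}
```

Empty output suggests the value never appeared in a reachable commit. Any listed commit means the value is still recoverable from history even after the working tree is clean.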
6. Tool Availability Assumptions
Problem: Specialized search tools (like ripgrep) may not be available in all environments.
Prevention: Have fallback approaches ready. Standard grep -r works in most environments. Test tool availability before relying on it.
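A fallback can be expressed as a small wrapper that tests for ripgrep and otherwise uses grep. This is a sketch with a hypothetical function name, `search_tree`. Note that ripgrep skips hidden files and files listed in .gitignore by default, which would hide files such as .env, so the sketch turns both behaviors off. Pattern syntax also differs slightly between the two tools, so only simple patterns are portable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: use ripgrep when present, otherwise fall back to grep.
search_tree() {
  local pattern="$1" dir="${2:-.}"
  if command -v rg >/dev/null 2>&1; then
    # --hidden and --no-ignore make rg cover dotfiles and gitignored files.
    rg -n --hidden --no-ignore --glob '!.git' -e "$pattern" "$dir"
  else
    grep -rnIE --exclude-dir=.git -e "$pattern" "$dir"
  fi
}
```

With either tool, `search_tree 'ghp_[A-Za-z0-9]+' .` prints `file:line:content` entries when the output is captured or piped.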
Verification Checklist
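Before considering the task complete:
- Every discovery search from Phase 1 has been re-run and returns no secrets
- No partial matches of the inventoried sensitive values remain
- Every inventoried value has been replaced in all files where it appeared
- Placeholders are consistently named and properly formatted
- Any provided test suites pass
- The user has confirmed whether git history sanitization is required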
Search Commands Reference
Standard grep (widely available):
# Search for AWS access keys
grep -rn "AKIA[A-Z0-9]\{16\}" . --exclude-dir=.git
# Search for common token prefixes
grep -rn "ghp_\|gho_\|ghs_\|ghr_\|hf_" . --exclude-dir=.git
# Search for password-related strings
grep -rn -i "password\|passwd\|pwd" . --exclude-dir=.git
# Search for API key patterns
grep -rn -i "api[_-]\?key\|apikey" . --exclude-dir=.git
Always exclude the .git directory from searches to avoid noise from git internals.