Web to Markdown
Overview
A general-purpose web scraping tool that supports:
- Converting web content to clean Markdown
- Extracting image URLs from any website
- Batch downloading images from web pages
Suitable for content reading, image collection, and data organization.
Features
1. Web to Markdown
Converts a web page URL into clean Markdown text, removing ads, navigation bars, and other irrelevant content.
URL Prefix Services:
| Service | Prefix | Notes |
|---|---|---|
| markdown.new | https://markdown.new/ | Preferred, fast |
| defuddle | https://defuddle.md/ | Fallback |
| r.jina.ai | https://r.jina.ai/ | Good for dynamic content |
Usage:
curl -s "https://markdown.new/https://example.com/article"
curl -s "https://r.jina.ai/https://example.com/article"
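The same prefix convention can be expressed in Python. This is a minimal sketch, not part of the shipped scripts: the `PREFIXES` mapping and the `prefixed_url` helper are hypothetical names, and the only assumption is that each service accepts the target URL appended directly to its prefix, as in the curl commands.

```python
# Prefixes copied from the table of URL prefix services.
PREFIXES = {
    "markdown.new": "https://markdown.new/",
    "defuddle": "https://defuddle.md/",
    "r.jina.ai": "https://r.jina.ai/",
}


def prefixed_url(service: str, url: str) -> str:
    """Return the conversion URL for `url` using the named service (hypothetical helper)."""
    if service not in PREFIXES:
        raise ValueError(f"unknown service: {service}")
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    # The target URL is appended to the prefix unchanged.
    return PREFIXES[service] + url
```

Calling `prefixed_url("markdown.new", "https://example.com/article")` yields the same URL that the first curl command requests.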
2. Extract Images from Web Pages
Extracts all image URLs from any web page.
General Extraction:
# Extract all image URLs ([[:space:]] is used because \s is not portable inside a bracket expression)
curl -s "https://r.jina.ai/<url>" | grep -oE 'https://[^)[:space:]"]+\.(jpg|jpeg|png|gif|webp|avif)'
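For readers who prefer Python over a shell pipeline, the grep step can be approximated as below. This is a sketch only: `extract_image_urls` is a hypothetical name, not necessarily what `scripts/extract_images.py` does internally, and it assumes the page text has already been fetched. Python's `re` module treats `\s` as whitespace inside a character class, so the pattern can be written directly.

```python
import re

# Same scheme and extension list as the grep one-liner.
IMAGE_URL_RE = re.compile(r'https://[^)\s"]+\.(?:jpg|jpeg|png|gif|webp|avif)')


def extract_image_urls(text: str) -> list[str]:
    """Return image URLs found in `text`, de-duplicated in first-seen order."""
    matches = (m.group(0) for m in IMAGE_URL_RE.finditer(text))
    # dict.fromkeys keeps insertion order while dropping repeats.
    return list(dict.fromkeys(matches))
```

Like the grep version, a URL such as `https://cdn.example.com/logo.png?w=200` is cut at the extension, so query parameters are dropped from the result.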
Using the Script:
python scripts/extract_images.py <url> [--output urls.txt]
3. Batch Download Images
Extracts images from web pages and downloads them in batch to local storage.
Using the Script:
python scripts/download_images.py <url> [--output <dir>] [--limit <n>] [--min-size <bytes>] [--ext <format>]
Parameters:
url: Web page URL
--output: Output directory (default: ~/.openclaw/images)
--limit: Max downloads (default: 50)
--min-size: Min file size to filter out small icons (default: 10KB)
--ext: Only download specific formats (jpg/png/gif/webp)
Examples:
# Download all large images from a page
python scripts/download_images.py "https://example.com/gallery" --output ~/Downloads/images
# Download only PNGs, max 20
python scripts/download_images.py "https://example.com" --ext png --limit 20
# Pinterest (auto-converts to original size)
python scripts/download_images.py "https://www.pinterest.com/search/pins/?q=architecture"
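To make the documented flags concrete, here is one way such a command line could be declared with `argparse`. It is a sketch under assumptions, not the actual source of `download_images.py`: `build_parser` is a hypothetical name, and interpreting the 10KB default as 10 * 1024 bytes is a guess, since the text does not say whether KB means 1000 or 1024 bytes.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented options (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="download_images.py")
    parser.add_argument("url", help="Web page URL")
    # Kept as a string; expand with os.path.expanduser before use.
    parser.add_argument("--output", default="~/.openclaw/images",
                        help="Output directory")
    parser.add_argument("--limit", type=int, default=50,
                        help="Max downloads")
    # Assumption: 10KB is taken to mean 10 * 1024 bytes.
    parser.add_argument("--min-size", type=int, default=10 * 1024,
                        help="Min file size in bytes, filters out small icons")
    parser.add_argument("--ext", choices=["jpg", "png", "gif", "webp"],
                        default=None, help="Only download this format")
    return parser


# Mirrors the "only PNGs, max 20" example.
args = build_parser().parse_args(
    ["https://example.com", "--ext", "png", "--limit", "20"]
)
```

Using `choices` for `--ext` rejects formats outside the documented list at parse time rather than during the download loop.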
Workflow
Web Content Scraping
- Prefer markdown.new/
- If that fails, fall back to defuddle.md/
- If that also fails, try r.jina.ai/
- As a last resort, use the local Scrapling script
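The fallback order above can be sketched as a small loop. Everything here is illustrative: `fetch_markdown` and `http_get` are hypothetical helpers, and treating an empty response body as a failure is an assumption, since the text does not define what counts as a failed conversion.

```python
import urllib.request
from typing import Callable, Optional, Sequence

# Services in the documented order of preference.
SERVICE_PREFIXES = (
    "https://markdown.new/",
    "https://defuddle.md/",
    "https://r.jina.ai/",
)


def http_get(url: str, timeout: float = 20.0) -> str:
    """Fetch `url` and return the body as text; raises OSError subclasses on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_markdown(
    url: str,
    fetch: Callable[[str], str] = http_get,
    prefixes: Sequence[str] = SERVICE_PREFIXES,
) -> Optional[str]:
    """Try each prefix service in turn; return None if all of them fail."""
    for prefix in prefixes:
        try:
            text = fetch(prefix + url)
        except OSError:  # URLError, HTTPError, and timeouts are OSError subclasses
            continue
        if text.strip():  # assumption: an empty body counts as a failure
            return text
    return None  # the caller would now run the local Scrapling script
```

Passing `fetch` as a parameter keeps the ordering logic testable without network access; a `None` result is the signal to run `python scripts/scrape.py <url>`.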
Image Extraction & Download
- Use r.jina.ai to fetch page content
- Extract all image URLs via regex
- Filter out small images (icons, emojis, etc.)
- Smart naming and download
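The filtering and naming steps can be illustrated as follows. This is a sketch under stated assumptions rather than the script's real logic: `plan_downloads` and `file_name_for` are hypothetical names, each candidate's size in bytes is assumed to be known already (for example from a `Content-Length` header), and the naming scheme (URL basename, with a numeric suffix on collisions) is only one plausible reading of "smart naming".

```python
import posixpath
from urllib.parse import urlparse


def file_name_for(url: str, index: int) -> str:
    """Use the last path segment as the file name, or a counter if it has no extension."""
    base = posixpath.basename(urlparse(url).path)
    # No extension is guessed when the URL does not provide one.
    return base if "." in base else f"image_{index:03d}"


def plan_downloads(candidates, min_size=10 * 1024, limit=50):
    """Map (url, size_in_bytes) pairs to (url, file_name), skipping small files."""
    planned, used = [], set()
    for url, size in candidates:
        if size < min_size:  # drop icons, emojis, and other small images
            continue
        name = file_name_for(url, len(planned) + 1)
        stem, ext = posixpath.splitext(name)
        counter = 1
        while name in used:  # avoid overwriting an earlier download
            name = f"{stem}_{counter}{ext}"
            counter += 1
        used.add(name)
        planned.append((url, name))
        if len(planned) >= limit:
            break
    return planned
```

Separating the plan from the actual download keeps the size threshold and the limit easy to check before any file is written.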
Special Website Support
Pinterest
Automatically detects Pinterest URLs and converts thumbnails to original size:
236x → originals
564x → originals
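A rewrite of this kind could look like the sketch below. Two things are assumptions not stated in the text: that Pinterest thumbnails are served from a `pinimg.com` host, and that the size token appears as its own path segment (for example `https://i.pinimg.com/236x/ab/cd/ef/name.jpg`). `to_original` is a hypothetical helper name.

```python
import re

# Assumption: the size token directly follows the pinimg.com host.
_PINTEREST_SIZE_RE = re.compile(r"(?<=pinimg\.com)/(?:236x|564x)/")


def to_original(url: str) -> str:
    """Swap a 236x or 564x size segment for 'originals'; other URLs pass through unchanged."""
    return _PINTEREST_SIZE_RE.sub("/originals/", url, count=1)
```

Anchoring the pattern to the host keeps unrelated sites that happen to use a `/236x/` path segment untouched.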
Other Common Websites
The scripts automatically handle various image URL formats, including:
- CDN links
- URLs with query parameters
- Lazy-loaded images
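How the scripts handle these cases is not documented here, so the following is only a sketch of one common approach using the standard library. It works on raw HTML rather than on r.jina.ai output, and the attribute names in `LAZY_ATTRS` are widespread lazy-loading conventions chosen as an assumption; `ImageSrcParser` and `image_urls_from_html` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: common lazy-loading attributes, checked before plain src.
LAZY_ATTRS = ("data-src", "data-original", "data-lazy-src")


class ImageSrcParser(HTMLParser):
    """Collect one absolute image URL per <img> tag."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        for name in LAZY_ATTRS + ("src",):
            value = attr_map.get(name)
            # Skip inline data: placeholders that lazy loaders put in src.
            if value and not value.startswith("data:"):
                # urljoin resolves relative paths and keeps query parameters.
                self.urls.append(urljoin(self.base_url, value))
                break


def image_urls_from_html(html: str, base_url: str) -> list:
    parser = ImageSrcParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.urls
```

Preferring the lazy attribute over `src` avoids collecting the tiny placeholder image that many lazy loaders leave in `src`.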
Script Reference
scripts/scrape.py
Local web scraping script, used as a fallback for online services.
python scripts/scrape.py <url>
scripts/extract_images.py
Extracts image URLs from a web page and outputs them as a list.
python scripts/extract_images.py <url> [--output urls.txt]
scripts/download_images.py
Batch downloads images from a web page.
python scripts/download_images.py <url> [options]
Dependencies
extract_images.py and download_images.py use only the Python standard library, so no extra installation is needed.
scrape.py requires scrapling (local scraping fallback):
pip install scrapling
Notes
- Respect the website's robots.txt and terms of use
- Consider server load before mass downloading
- Some sites have hotlink protection and may block direct downloads
- Dynamically loaded images may require r.jina.ai
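Two of these notes lend themselves to a short illustration with the standard library. The sketch below is not taken from the scripts: `allowed_by_robots` and `image_request` are hypothetical helpers, the User-Agent string is a placeholder, and sending the page URL as `Referer` is a common workaround for hotlink protection that is not guaranteed to work on every site.

```python
import urllib.request
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_lines, url: str, user_agent: str = "*") -> bool:
    """Check `url` against the lines of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def image_request(image_url: str, page_url: str) -> urllib.request.Request:
    """Build a request that names the source page as the referrer."""
    return urllib.request.Request(
        image_url,
        headers={
            "Referer": page_url,
            # Placeholder identifier; replace with something descriptive.
            "User-Agent": "image-downloader-sketch/0.1",
        },
    )
```

The request object can be passed to `urllib.request.urlopen`; adding a short pause between downloads is a simple way to limit server load.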