Non-Techy Tutorial · Works great for site audits, rewrites, backups

From Sitemap to Word Docs: The Easiest Way to Save a Whole Site for Editing (Using Zigma.ca as an Example)
Want to copy a website’s content into Word—titles, headings, and paragraphs—without the design clutter? This guide shows you how to fetch every page listed in a sitemap and turn it into clean Word documents you can edit, share, or audit.
Table of Contents
- What You’ll Build
- Step 1 — Find the Sitemap
- Step 2 — Install Python
- Step 3 — Create a Project Folder & Virtual Environment
- Step 4 — Install the Tools
- Step 5 — Copy the Script
- Step 6 — Run It (Using Zigma.ca)
- Step 7 — Include/Exclude Sections (Optional)
- Step 8 — Optional Tweaks
- FAQ & Troubleshooting
- Need Help? We’ll Do It For You
What You’ll Build
By the end, you’ll have a folder full of .docx files—one per page listed in a sitemap. Each Word file starts with:
- URL of the page
- Title tag: the page’s <title>
- Description tag: the meta description (if present)
- Then the main body content with proper H1/H2/H3/H4 headings in Word (so they appear in Word’s Navigation Pane)
Ethics note: only scrape domains you own or have permission to export. Always respect the site’s policies.
Step 1 — Find the Sitemap
Most websites publish a sitemap for search engines. Try:
- https://yourdomain.com/sitemap.xml
- https://yourdomain.com/sitemap (sometimes an HTML sitemap page)
For our example, we’ll use Zigma.ca. If you try /sitemap.xml and don’t find it, try just /sitemap, or search Google for site:zigma.ca sitemap.
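Not sure which one exists? Once you’ve installed the tools in Step 4, you can run a tiny check like the sketch below. The file name quick_sitemap_check.py is just a suggestion, and the two paths are the common ones mentioned above:
# quick_sitemap_check.py -- a rough sketch to see which sitemap URL responds
import requests

domain = "https://zigma.ca"  # swap in your own domain
for path in ("/sitemap.xml", "/sitemap"):
    url = domain + path
    try:
        r = requests.get(url, timeout=10)
        print(url, "->", r.status_code)  # 200 means you found it
    except requests.RequestException as e:
        print(url, "-> error:", e)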
Step 2 — Install Python
Install Python 3.10+ from python.org. On Windows, check “Add Python to PATH.” Confirm your version:
python --version
Step 3 — Create a Project Folder & Virtual Environment
Windows (PowerShell):
mkdir website-to-word
cd website-to-word
python -m venv .venv
.venv\Scripts\Activate

macOS/Linux:
mkdir website-to-word
cd website-to-word
python -m venv .venv
source .venv/bin/activate
Step 4 — Install the Tools
We’ll use friendly libraries to fetch pages, parse HTML, and write Word files:
- requests – download pages
- beautifulsoup4 + lxml – parse HTML and XML
- readability-lxml – fallback cleanup if a page is messy
- python-docx – write .docx files
pip install requests beautifulsoup4 lxml readability-lxml python-docx
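To confirm everything installed correctly, you can run this quick one-liner (a simple sanity check, not part of the script). It should print a confirmation message instead of an ImportError:
python -c "import requests, bs4, lxml, readability, docx; print('All tools installed')"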
Step 5 — Copy the Script
Create a new file named sitemap_to_word.py and paste the script below. It:
- finds URLs in a sitemap (XML, gzipped XML, or an HTML /sitemap page),
- fetches each page and extracts the Title, Description, and main content,
- preserves H1–H4 as real Word headings,
- saves one Word document per page in your output folder.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
sitemap_to_word.py (beginner-friendly)

- Reads a sitemap (XML, gz, or a simple HTML "/sitemap") and collects page URLs
- Fetches each page and extracts:
    * URL line
    * Title tag and Description tag
    * Main body content (prefers the page's 'main' or 'article' region)
- Preserves H1–H4 as Word headings and paragraphs as normal text
- Writes one .docx file per page into your chosen folder
"""
import argparse, os, re, sys, gzip, time
from urllib.parse import urlparse, urljoin

import requests
from bs4 import BeautifulSoup
from lxml import etree, html as lxml_html
from readability import Document
from docx import Document as DocxDocument
from docx.shared import Pt
from docx.oxml.ns import qn

HEADERS = {
    "User-Agent": "ZigmaContentFetcher/1.0 (+https://zigma.ca)",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
}


def safe_request(url, timeout=20):
    """GET a URL with up to three attempts and a short backoff between retries."""
    for attempt in range(3):
        try:
            return requests.get(url, headers=HEADERS, timeout=timeout, allow_redirects=True)
        except Exception:
            if attempt == 2:
                raise
            time.sleep(0.8 * (2 ** attempt))


def normalize_url(u: str) -> str:
    p = urlparse(u.strip())
    return p._replace(fragment="").geturl()


def extract_links_from_html_sitemap(content: bytes, base_url: str) -> list:
    """Collect same-host links from an HTML sitemap page."""
    soup = BeautifulSoup(content, "lxml")
    base_host = urlparse(base_url).netloc
    links = []
    for a in soup.select("a[href]"):
        href = (a.get("href") or "").strip()
        if not href or href.startswith(("mailto:", "tel:", "javascript:", "#")):
            continue
        abs_url = urljoin(base_url, href)
        p = urlparse(abs_url)
        if p.scheme in ("http", "https") and p.netloc == base_host:
            links.append(normalize_url(abs_url))
    seen, out = set(), []
    for u in links:
        if u not in seen:
            out.append(u); seen.add(u)
    return out


def parse_sitemap_xml(content: bytes, base_url: str) -> tuple[list, list]:
    """Return (child sitemap URLs, page URLs) found in an XML sitemap."""
    child_maps, urls = [], []
    try:
        parser = etree.XMLParser(recover=True, huge_tree=True)
        root = etree.fromstring(content, parser=parser)
        ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
        for loc in root.findall(".//sm:sitemap/sm:loc", namespaces=ns):
            if loc.text:
                child_maps.append(urljoin(base_url, loc.text.strip()))
        for loc in root.findall(".//sm:url/sm:loc", namespaces=ns):
            if loc.text:
                urls.append(urljoin(base_url, loc.text.strip()))
    except Exception:
        pass
    if not child_maps and not urls:
        # Fallback: namespace-agnostic parse with BeautifulSoup
        try:
            soup = BeautifulSoup(content, "xml")
            for smap in soup.find_all("sitemap"):
                loc = smap.find("loc")
                if loc and loc.text:
                    child_maps.append(urljoin(base_url, loc.text.strip()))
            for u in soup.find_all("url"):
                loc = u.find("loc")
                if loc and loc.text:
                    urls.append(urljoin(base_url, loc.text.strip()))
        except Exception:
            pass

    def dedupe(seq):
        seen, out = set(), []
        for x in seq:
            if x not in seen:
                out.append(x); seen.add(x)
        return out

    return dedupe(child_maps), dedupe(urls)


def gather_urls_from_sitemap(sitemap_url: str) -> list:
    """Walk the sitemap (and any nested sitemaps) and return every page URL found."""
    to_process = [sitemap_url]
    seen, found = set(), []
    while to_process:
        current = to_process.pop()
        if current in seen:
            continue
        seen.add(current)
        r = safe_request(current, timeout=25)
        if not r or r.status_code != 200:
            continue
        ctype = (r.headers.get("Content-Type") or "").lower()
        content = r.content
        if current.endswith(".gz") or "gzip" in ctype:
            try:
                content = gzip.decompress(content)
                ctype = "application/xml"
            except Exception:
                pass
        if "text/html" in ctype or current.endswith("/sitemap"):
            found.extend(extract_links_from_html_sitemap(content, current))
            continue
        children, urls = parse_sitemap_xml(content, current)
        found.extend(urls)
        to_process.extend(children)
    # dedupe + keep http(s)
    seen, out = set(), []
    for u in found:
        if u not in seen and urlparse(u).scheme in ("http", "https"):
            out.append(u); seen.add(u)
    return out


def extract_meta(doc):
    """Pull the <title>, meta description, and canonical URL from an lxml document."""
    title, descr, canonical = "", "", None
    t = doc.xpath("//title")
    if t and t[0].text:
        title = t[0].text.strip()
    md = doc.xpath('//meta[translate(@name,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz")="description"]/@content')
    if md:
        descr = md[0].strip()
    if not title:
        ogt = doc.xpath('//meta[translate(@property,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz")="og:title"]/@content')
        if ogt:
            title = ogt[0].strip()
    if not descr:
        ogd = doc.xpath('//meta[translate(@property,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz")="og:description"]/@content')
        if ogd:
            descr = ogd[0].strip()
    can = doc.xpath('//link[translate(@rel,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz")="canonical"]/@href')
    if can:
        canonical = can[0].strip()
    return title, descr, canonical


def clean_html_to_content(html_bytes: bytes, base_url: str) -> str:
    """Reduce a page to a simple HTML fragment of headings, paragraphs, and lists."""
    # Prefer original main/article/body
    try:
        soup = BeautifulSoup(html_bytes, "lxml")
        root = soup.find("main") or soup.find("article") or soup.body or soup
        if root:
            for tag in root.find_all(["script", "style", "noscript", "iframe", "svg", "video", "picture", "source", "form"]):
                tag.decompose()
            for img in root.find_all("img"):
                img.decompose()
            for tag in root.find_all(["a", "span", "div", "section", "header", "footer", "em", "strong", "b", "i", "u", "small", "sup", "sub"]):
                tag.unwrap()
            allowed = {"h1", "h2", "h3", "h4", "p", "ul", "ol", "li", "br"}
            for t in list(root.find_all(True)):
                if t.name not in allowed and t.name not in ("body", "html"):
                    t.unwrap()
            for t in list(root.find_all(True)):
                if t.name in {"p", "li", "h1", "h2", "h3", "h4"} and not t.get_text(strip=True):
                    t.decompose()
            return "\n" + "".join(str(c) for c in root.children) + "\n"
    except Exception:
        pass

    # Fallback: Readability
    try:
        doc = Document(html_bytes)
        html_clean = doc.summary(html_partial=True)
        soup = BeautifulSoup(html_clean, "lxml")
        for tag in soup(["script", "style", "noscript", "iframe", "svg", "video", "picture", "source"]):
            tag.decompose()
        for img in soup.find_all("img"):
            img.decompose()
        allowed = {"h1", "h2", "h3", "h4", "p", "ul", "ol", "li", "br"}
        for t in list(soup.find_all(True)):
            if t.name not in allowed and t.name not in ("body", "html"):
                t.unwrap()
        body = soup.body or soup
        return "\n" + "".join(str(c) for c in body.children) + "\n"
    except Exception:
        pass

    # Last resort: plain text paragraphs
    try:
        soup = BeautifulSoup(html_bytes, "lxml")
        for tag in soup(["script", "style", "noscript", "iframe", "svg"]):
            tag.decompose()
        text = soup.get_text(separator="\n")
        paras = [f"<p>{line.strip()}</p>" for line in text.splitlines() if line.strip()]
        return "<div>" + "\n".join(paras) + "</div>"
    except Exception:
        return "<div></div>"


def drop_duplicate_leading_h1(body_html: str, page_title: str) -> str:
    """Drop the first H1 if it just repeats the page title (the doc already starts with it)."""
    if not body_html or not page_title:
        return body_html
    soup = BeautifulSoup(body_html, "lxml")
    h1 = soup.find("h1")
    if h1 and h1.get_text(" ", strip=True).strip() == page_title.strip():
        h1.decompose()
    return str(soup)


def add_html_to_docx(doc: DocxDocument, html: str):
    """Map the cleaned HTML fragment onto Word headings, paragraphs, and list items."""
    soup = BeautifulSoup(html, "lxml")
    for el in soup.find_all(["h1", "h2", "h3", "h4", "p", "ul", "ol"], recursive=True):
        name = el.name
        if name in ("h1", "h2", "h3", "h4"):
            level = {"h1": 1, "h2": 2, "h3": 3, "h4": 4}[name]
            text = el.get_text(" ", strip=True)
            if text:
                doc.add_heading(text, level=level)
        elif name == "p":
            text = el.get_text(" ", strip=True)
            if text:
                doc.add_paragraph(text)
        elif name in ("ul", "ol"):
            ordered = name == "ol"
            for li in el.find_all("li", recursive=False):
                t = li.get_text(" ", strip=True)
                if not t:
                    continue
                p = doc.add_paragraph(t)
                try:
                    p.style = "List Number" if ordered else "List Bullet"
                except Exception:
                    pass


def docx_write_one(url: str, title: str, descr: str, body_html: str, out_path: str):
    """Write one Word document: H1 title, URL/meta lines, then the body content."""
    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
    doc = DocxDocument()
    style = doc.styles["Normal"]
    style.font.name = "Calibri"
    style._element.rPr.rFonts.set(qn('w:eastAsia'), "Calibri")
    style.font.size = Pt(11)
    doc.add_heading(title or "(Untitled Page)", level=1)
    doc.add_paragraph(f"URL: {url}")
    if title:
        doc.add_paragraph(f"Title tag: {title}")
    if descr:
        doc.add_paragraph(f"Description tag: {descr}")
    cleaned = drop_duplicate_leading_h1(body_html, title)
    add_html_to_docx(doc, cleaned)
    doc.save(out_path)


def slugify(value: str) -> str:
    value = value.strip().lower()
    value = re.sub(r"[^a-z0-9._-]+", "_", value)
    return value[:80] or "page"


def filename_for(url: str, title: str, idx: int) -> str:
    if title:
        name = slugify(title)
    else:
        path = urlparse(url).path or "/"
        name = slugify(path.strip("/").replace("/", "_") or "home")
    return f"{idx:04d}_{name}.docx"


def main():
    ap = argparse.ArgumentParser(description="Export a website's pages (from a sitemap) into one Word doc per page.")
    ap.add_argument("--sitemap", required=True, help="Sitemap URL (XML, gz, or HTML /sitemap).")
    ap.add_argument("--outdir", required=True, help="Folder to save Word docs.")
    ap.add_argument("--include-regex", default=None, help="Only include URLs that match this regex.")
    ap.add_argument("--exclude-regex", default=None, help="Exclude URLs that match this regex.")
    ap.add_argument("--max-urls", type=int, default=None, help="Limit number of pages (for testing).")
    args = ap.parse_args()

    print("[1/3] Reading sitemap…")
    urls = gather_urls_from_sitemap(args.sitemap)
    print(f"  Found {len(urls)} URLs.")

    if args.include_regex:
        inc = re.compile(args.include_regex)
        urls = [u for u in urls if inc.search(u)]
    if args.exclude_regex:
        exc = re.compile(args.exclude_regex)
        urls = [u for u in urls if not exc.search(u)]
    if args.max_urls is not None:
        urls = urls[: args.max_urls]
    if not urls:
        print("No URLs to process after filtering.")
        sys.exit(0)

    print("[2/3] Fetching pages & writing Word docs…")
    os.makedirs(args.outdir, exist_ok=True)
    for i, url in enumerate(urls, start=1):
        try:
            r = safe_request(url, timeout=25)
            if not r:
                print(f"  SKIP {url} (no response)")
                continue
            if r.status_code >= 400:
                print(f"  SKIP {url} (HTTP {r.status_code})")
                continue
            ctype = (r.headers.get("Content-Type") or "").lower()
            if "text/html" not in ctype and "application/xhtml+xml" not in ctype:
                print(f"  SKIP {url} (non-HTML: {ctype})")
                continue
            doc = lxml_html.fromstring(r.content)
            doc.make_links_absolute(url)
            title, descr, _canon = extract_meta(doc)
            body_html = clean_html_to_content(r.content, url)
            out_name = filename_for(url, title, i)
            out_path = os.path.join(args.outdir, out_name)
            docx_write_one(url, title, descr, body_html, out_path)
            print(f"  OK {url} -> {out_name}")
        except Exception as e:
            print(f"  ERR {url} ({e})")

    print("[3/3] Done.")
    print(f"Saved Word docs in: {os.path.abspath(args.outdir)}")


if __name__ == "__main__":
    main()
Step 6 — Run It (Using Zigma.ca)
Assuming the sitemap is at /sitemap.xml (switch to /sitemap if needed):
Windows (PowerShell):
python .\sitemap_to_word.py `
  --sitemap https://zigma.ca/sitemap.xml `
  --outdir .\exports `
  --max-urls 50

macOS/Linux:
python ./sitemap_to_word.py \
  --sitemap https://zigma.ca/sitemap.xml \
  --outdir ./exports \
  --max-urls 50
Open the exports folder and you’ll see files like:
0001_home.docx
0002_about.docx
0003_services_digital_marketing.docx
...
Step 7 — Include/Exclude Sections (Optional)
Want only certain sections? Use --include-regex or --exclude-regex (you can combine them):
--include-regex '^https://(www\.)?zigma\.ca/(blog|services)/'
--exclude-regex '^https://(www\.)?zigma\.ca/blog/'
Tip for PowerShell: wrap regex in single quotes so backslashes aren’t “eaten” by the shell.
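If you’re unsure whether your pattern matches the URLs you expect, you can test it in Python before running the full export. This is just an illustrative sketch with made-up URLs:
import re

urls = [
    "https://zigma.ca/blog/sample-post/",
    "https://zigma.ca/services/digital-marketing/",
    "https://zigma.ca/contact/",
]
include = re.compile(r"^https://(www\.)?zigma\.ca/(blog|services)/")
print([u for u in urls if include.search(u)])
# keeps the blog and services URLs and drops /contact/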
Step 8 — Optional Tweaks
- Headings: The script maps H1–H4 to Word Heading 1–4. You can add H5/H6 mapping in add_html_to_docx if your site uses them heavily.
- Inline bold/italic: To keep things simple, inline formatting is not preserved. If you want it, you can extend the function to parse inline tags and add Word “runs.”
- One giant doc: Prefer a single Word file? Instead of writing per page, create one DocxDocument() at the start and append each page with a doc.add_page_break() (see the sketch after this list).
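Here’s a rough sketch of that last idea, assuming you adapt the loop inside main() so each page is appended to one shared document instead of saved separately. Names like title and body_html come from the existing loop; treat this as a starting point, not a drop-in replacement:
combined = DocxDocument()                # create once, before the loop over URLs
for i, url in enumerate(urls, start=1):
    # ...fetch, extract title/descr, and build body_html exactly as before, then:
    combined.add_heading(title or "(Untitled Page)", level=1)
    combined.add_paragraph(f"URL: {url}")
    add_html_to_docx(combined, drop_duplicate_leading_h1(body_html, title))
    combined.add_page_break()            # start the next page on a fresh sheet
combined.save("all_pages.docx")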
FAQ & Troubleshooting
Will this download images? No. It intentionally strips images, scripts, iframes, and other non-text elements.
The output looks short / hero text missing. The script prefers the actual main/article region of the HTML. Most sites work great. If a page injects text via JavaScript, you may need a browser-based scraper (e.g., Playwright), which is more advanced.
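If you do need the browser route, the general shape looks something like this sketch. It assumes you’ve run pip install playwright and playwright install first, and it is not wired into the script above; you’d swap it in where the script currently fetches HTML with requests:
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    # Launch a headless browser, let the page's JavaScript run, and return the final HTML
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html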
“No URLs to process after filtering.” Your include/exclude regex might be too strict. Try without filters first.
Where did my files save? In the folder you passed to --outdir. Use absolute paths if unsure.
Is this legal? Only export content you own or have permission to copy. Follow the site’s robots.txt and terms of service.
Need Help? We’ll Do It For You.
Whether you’re planning a full site rewrite, an SEO audit, or a content migration, we can automate exports, preserve advanced formatting, generate CSV reports, and even handle JavaScript-heavy pages.
Talk to Zigma — Digital Marketing
If you have any projects coming up, contact us. We’re happy to help.