
Non-Techy Tutorial · Works great for site audits, rewrites, backups

From Sitemap to Word Docs: The Easiest Way to Save a Whole Site for Editing (Using Zigma.ca as an Example)

Want to copy a website’s content into Word—titles, headings, and paragraphs—without the design clutter? This guide shows you how to fetch every page listed in a sitemap and turn each one into a clean Word document you can edit, share, or audit.

Table of Contents

  1. What You’ll Build
  2. Step 1 — Find the Sitemap
  3. Step 2 — Install Python
  4. Step 3 — Create a Project Folder & Virtual Environment
  5. Step 4 — Install the Tools
  6. Step 5 — Copy the Script
  7. Step 6 — Run It (Using Zigma.ca)
  8. Step 7 — Include/Exclude Sections (Optional)
  9. Step 8 — Optional Tweaks
  10. FAQ & Troubleshooting
  11. Need Help? We’ll Do It For You

What You’ll Build

By the end, you’ll have a folder full of .docx files—one per page listed in a sitemap. Each Word file starts with:

  • URL of the page
  • Title tag: the page’s <title>
  • Description tag: the meta description (if present)
  • Then the main body content with proper H1/H2/H3/H4 headings in Word (so they appear in Word’s Navigation Pane)

Ethics note: only scrape domains you own or have permission to export. Always respect the site’s policies.

Step 1 — Find the Sitemap

Most websites publish a sitemap for search engines. Try:

  • https://yourdomain.com/sitemap.xml
  • https://yourdomain.com/sitemap (sometimes an HTML sitemap page)

For our example, we’ll use Zigma.ca. If you try /sitemap.xml and don’t find it, try just /sitemap, or search Google for site:zigma.ca sitemap.
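
Not sure which location exists? Once you’ve installed the tools in Step 4, you can probe the usual spots with a tiny Python check. This is only a sketch: the candidate paths below are common conventions, not a guarantee of where any particular site keeps its sitemap.

Python
# Sketch: probe the usual sitemap locations for a domain (uses the `requests` library from Step 4).
import requests

def find_sitemap(domain: str) -> str | None:
    # These paths are just common conventions; adjust if your site differs.
    for path in ("/sitemap.xml", "/sitemap_index.xml", "/sitemap"):
        url = f"https://{domain}{path}"
        try:
            r = requests.get(url, timeout=10)
            if r.status_code == 200:
                return url
        except requests.RequestException:
            continue
    return None

print(find_sitemap("zigma.ca"))  # prints the first location that responds, or None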

Step 2 — Install Python

Install Python 3.10+ from python.org. On Windows, check “Add Python to PATH.” (On macOS/Linux you may need to type python3 instead of python.) Confirm your version:

Shell
python --version

Step 3 — Create a Project Folder & Virtual Environment

Windows PowerShell
mkdir website-to-word
cd website-to-word
python -m venv .venv
.venv\Scripts\Activate
macOS / Linux
mkdir website-to-word
cd website-to-word
python -m venv .venv
source .venv/bin/activate

Step 4 — Install the Tools

We’ll use friendly libraries to fetch pages, parse HTML, and write Word files:

  • requests – download pages
  • beautifulsoup4 + lxml – parse HTML and XML
  • readability-lxml – fallback cleanup if a page is messy
  • python-docx – write .docx files
Shell
pip install requests beautifulsoup4 lxml readability-lxml python-docx
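
Optional: confirm the installation worked with a quick import check. This is just a sanity-check sketch; if any library is missing, Python will tell you which one.

Python
# Sanity check: all of these should import without errors after Step 4.
import requests, bs4, lxml, readability, docx
print("All libraries installed.")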

Step 5 — Copy the Script

Create a new file named sitemap_to_word.py and paste the script below. It:

  • finds URLs in a sitemap (XML, gzipped XML, or an HTML /sitemap page),
  • fetches each page and extracts the Title, Description, and main content,
  • preserves H1–H4 as real Word headings,
  • saves one Word document per page in your output folder.
Python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
sitemap_to_word.py (beginner-friendly)
- Reads a sitemap (XML, gz, or a simple HTML "/sitemap") and collects page URLs
- Fetches each page and extracts:
  * URL line
  * Title tag and Description tag
  * Main body content (prefers the page's 'main' or 'article' region)
- Preserves H1–H4 as Word headings and paragraphs as normal text
- Writes one .docx file per page into your chosen folder
"""

import argparse, os, re, sys, gzip, time
from urllib.parse import urlparse, urljoin

import requests
from bs4 import BeautifulSoup
from lxml import etree, html as lxml_html
from readability import Document
from docx import Document as DocxDocument
from docx.shared import Pt
from docx.oxml.ns import qn

HEADERS = {
    "User-Agent": "ZigmaContentFetcher/1.0 (+https://zigma.ca)",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
}

def safe_request(url, timeout=20):
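    # Try up to three times, waiting 0.8s then 1.6s between attempts; re-raise if the third try also fails.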
    for attempt in range(3):
        try:
            return requests.get(url, headers=HEADERS, timeout=timeout, allow_redirects=True)
        except Exception:
            if attempt == 2:
                raise
            time.sleep(0.8 * (2 ** attempt))

def normalize_url(u: str) -> str:
    p = urlparse(u.strip())
    return p._replace(fragment="").geturl()

def extract_links_from_html_sitemap(content: bytes, base_url: str) -> list:
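    # An HTML /sitemap page is just a list of links: keep same-host http(s) URLs, skip mailto/tel/anchors, dedupe.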
    soup = BeautifulSoup(content, "lxml")
    base_host = urlparse(base_url).netloc
    links = []
    for a in soup.select("a[href]"):
        href = (a.get("href") or "").strip()
        if not href or href.startswith(("mailto:", "tel:", "javascript:", "#")):
            continue
        abs_url = urljoin(base_url, href)
        p = urlparse(abs_url)
        if p.scheme in ("http","https") and p.netloc == base_host:
            links.append(normalize_url(abs_url))
    seen, out = set(), []
    for u in links:
        if u not in seen:
            out.append(u); seen.add(u)
    return out

def parse_sitemap_xml(content: bytes, base_url: str) -> tuple[list,list]:
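    # Returns (child sitemap URLs, page URLs); tries namespaced lxml parsing first, then a forgiving BeautifulSoup pass.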
    child_maps, urls = [], []
    try:
        parser = etree.XMLParser(recover=True, huge_tree=True)
        root = etree.fromstring(content, parser=parser)
        ns = {"sm":"http://www.sitemaps.org/schemas/sitemap/0.9"}
        for loc in root.findall(".//sm:sitemap/sm:loc", namespaces=ns):
            if loc.text:
                child_maps.append(urljoin(base_url, loc.text.strip()))
        for loc in root.findall(".//sm:url/sm:loc", namespaces=ns):
            if loc.text:
                urls.append(urljoin(base_url, loc.text.strip()))
    except Exception:
        pass
    if not child_maps and not urls:
        try:
            soup = BeautifulSoup(content, "xml")
            for smap in soup.find_all("sitemap"):
                loc = smap.find("loc")
                if loc and loc.text:
                    child_maps.append(urljoin(base_url, loc.text.strip()))
            for u in soup.find_all("url"):
                loc = u.find("loc")
                if loc and loc.text:
                    urls.append(urljoin(base_url, loc.text.strip()))
        except Exception:
            pass
    def dedupe(seq):
        seen, out = set(), []
        for x in seq:
            if x not in seen:
                out.append(x); seen.add(x)
        return out
    return dedupe(child_maps), dedupe(urls)

def gather_urls_from_sitemap(sitemap_url: str) -> list:
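    # Walk the starting sitemap plus any nested sitemaps (XML, gzipped, or HTML) and collect each page URL once.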
    to_process = [sitemap_url]
    seen, found = set(), []
    while to_process:
        current = to_process.pop()
        if current in seen: continue
        seen.add(current)
        r = safe_request(current, timeout=25)
        if not r or r.status_code != 200:
            continue
        ctype = (r.headers.get("Content-Type") or "").lower()
        content = r.content
        if current.endswith(".gz") or "gzip" in ctype:
            try:
                content = gzip.decompress(content)
                ctype = "application/xml"
            except Exception:
                pass
        if "text/html" in ctype or current.endswith("/sitemap"):
            found.extend(extract_links_from_html_sitemap(content, current))
            continue
        children, urls = parse_sitemap_xml(content, current)
        found.extend(urls)
        to_process.extend(children)
    # dedupe + keep http(s)
    seen, out = set(), []
    for u in found:
        if u not in seen and urlparse(u).scheme in ("http","https"):
            out.append(u); seen.add(u)
    return out

def extract_meta(doc):
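    # Pull the <title>, meta description (with Open Graph fallbacks), and the canonical link if present.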
    title, descr, canonical = "", "", None
    t = doc.xpath("//title")
    if t and t[0].text:
        title = t[0].text.strip()
    md = doc.xpath('//meta[translate(@name,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz")="description"]/@content')
    if md: descr = md[0].strip()
    if not title:
        ogt = doc.xpath('//meta[translate(@property,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz")="og:title"]/@content')
        if ogt: title = ogt[0].strip()
    if not descr:
        ogd = doc.xpath('//meta[translate(@property,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz")="og:description"]/@content')
        if ogd: descr = ogd[0].strip()
    can = doc.xpath('//link[translate(@rel,"ABCDEFGHIJKLMNOPQRSTUVWXYZ","abcdefghijklmnopqrstuvwxyz")="canonical"]/@href')
    if can: canonical = can[0].strip()
    return title, descr, canonical

def clean_html_to_content(html_bytes: bytes, base_url: str) -> str:
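    # Reduce the page to simplified HTML (h1-h4, p, ul/ol/li, br) before conversion to Word.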
    # Prefer original main/article/body
    try:
        soup = BeautifulSoup(html_bytes, "lxml")
        root = soup.find("main") or soup.find("article") or soup.body or soup
        if root:
            for tag in root.find_all(["script","style","noscript","iframe","svg","video","picture","source","form"]):
                tag.decompose()
            for img in root.find_all("img"):
                img.decompose()
            for tag in root.find_all(["a","span","div","section","header","footer","em","strong","b","i","u","small","sup","sub"]):
                tag.unwrap()
            allowed = {"h1","h2","h3","h4","p","ul","ol","li","br"}
            for t in list(root.find_all(True)):
                if t.name not in allowed and t.name not in ("body","html"):
                    t.unwrap()
            for t in list(root.find_all(True)):
                if t.name in {"p","li","h1","h2","h3","h4"} and not t.get_text(strip=True):
                    t.decompose()
            return "\n" + "".join(str(c) for c in root.children) + "\n"
    except Exception:
        pass

    # Fallback: Readability
    try:
        doc = Document(html_bytes)
        html_clean = doc.summary(html_partial=True)
        soup = BeautifulSoup(html_clean, "lxml")
        for tag in soup(["script","style","noscript","iframe","svg","video","picture","source"]):
            tag.decompose()
        for img in soup.find_all("img"):
            img.decompose()
        allowed = {"h1","h2","h3","h4","p","ul","ol","li","br"}
        for t in list(soup.find_all(True)):
            if t.name not in allowed and t.name not in ("body","html"):
                t.unwrap()
        body = soup.body or soup
        return "\n" + "".join(str(c) for c in body.children) + "\n"
    except Exception:
        pass

    # Last resort: plain text paragraphs
    try:
        soup = BeautifulSoup(html_bytes, "lxml")
        for tag in soup(["script","style","noscript","iframe","svg"]):
            tag.decompose()
        text = soup.get_text(separator="\n")
        paras = [f"<p>{line.strip()}</p>" for line in text.splitlines() if line.strip()]
        return "<div>" + "\n".join(paras) + "</div>"
    except Exception:
        return "<div></div>"

def drop_duplicate_leading_h1(body_html: str, page_title: str) -> str:
    # If the body starts with an H1 identical to the title tag, drop it so it isn't written twice.
    if not body_html or not page_title:
        return body_html
    soup = BeautifulSoup(body_html, "lxml")
    h1 = soup.find("h1")
    if h1 and h1.get_text(" ", strip=True).strip() == page_title.strip():
        h1.decompose()
    return str(soup)

def add_html_to_docx(doc: DocxDocument, html: str):
    # Map simplified HTML onto Word: h1-h4 become Heading 1-4, p becomes a paragraph, ul/ol become list items.
    soup = BeautifulSoup(html, "lxml")
    for el in soup.find_all(["h1","h2","h3","h4","p","ul","ol"], recursive=True):
        name = el.name
        if name in ("h1","h2","h3","h4"):
            level = {"h1":1,"h2":2,"h3":3,"h4":4}[name]
            text = el.get_text(" ", strip=True)
            if text:
                doc.add_heading(text, level=level)
        elif name == "p":
            text = el.get_text(" ", strip=True)
            if text:
                doc.add_paragraph(text)
        elif name in ("ul","ol"):
            ordered = name == "ol"
            for li in el.find_all("li", recursive=False):
                t = li.get_text(" ", strip=True)
                if not t:
                    continue
                p = doc.add_paragraph(t)
                try:
                    p.style = "List Number" if ordered else "List Bullet"
                except Exception:
                    pass

def docx_write_one(url: str, title: str, descr: str, body_html: str, out_path: str):
    # Build one Word document: title heading, URL and meta lines, then the converted body content.
    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
    doc = DocxDocument()
    style = doc.styles["Normal"]
    style.font.name = "Calibri"
    style._element.rPr.rFonts.set(qn("w:eastAsia"), "Calibri")
    style.font.size = Pt(11)
    doc.add_heading(title or "(Untitled Page)", level=1)
    doc.add_paragraph(f"URL: {url}")
    if title:
        doc.add_paragraph(f"Title tag: {title}")
    if descr:
        doc.add_paragraph(f"Description tag: {descr}")
    cleaned = drop_duplicate_leading_h1(body_html, title)
    add_html_to_docx(doc, cleaned)
    doc.save(out_path)

def slugify(value: str) -> str:
    value = value.strip().lower()
    value = re.sub(r"[^a-z0-9._-]+", "_", value)
    return value[:80] or "page"

def filename_for(url: str, title: str, idx: int) -> str:
    # Prefer a slug of the title; otherwise fall back to the URL path.
    if title:
        name = slugify(title)
    else:
        path = urlparse(url).path or "/"
        name = slugify(path.strip("/").replace("/", "_") or "home")
    return f"{idx:04d}_{name}.docx"

def main():
    ap = argparse.ArgumentParser(description="Export a website's pages (from a sitemap) into one Word doc per page.")
    ap.add_argument("--sitemap", required=True, help="Sitemap URL (XML, gz, or HTML /sitemap).")
    ap.add_argument("--outdir", required=True, help="Folder to save Word docs.")
    ap.add_argument("--include-regex", default=None, help="Only include URLs that match this regex.")
    ap.add_argument("--exclude-regex", default=None, help="Exclude URLs that match this regex.")
    ap.add_argument("--max-urls", type=int, default=None, help="Limit number of pages (for testing).")
    args = ap.parse_args()

    print("[1/3] Reading sitemap…")
    urls = gather_urls_from_sitemap(args.sitemap)
    print(f"  Found {len(urls)} URLs.")

    if args.include_regex:
        inc = re.compile(args.include_regex)
        urls = [u for u in urls if inc.search(u)]
    if args.exclude_regex:
        exc = re.compile(args.exclude_regex)
        urls = [u for u in urls if not exc.search(u)]
    if args.max_urls is not None:
        urls = urls[: args.max_urls]

    if not urls:
        print("No URLs to process after filtering.")
        sys.exit(0)

    print("[2/3] Fetching pages & writing Word docs…")
    os.makedirs(args.outdir, exist_ok=True)
    for i, url in enumerate(urls, start=1):
        try:
            r = safe_request(url, timeout=25)
            if not r:
                print(f"  SKIP {url} (no response)")
                continue
            if r.status_code >= 400:
                print(f"  SKIP {url} (HTTP {r.status_code})")
                continue
            ctype = (r.headers.get("Content-Type") or "").lower()
            if "text/html" not in ctype and "application/xhtml+xml" not in ctype:
                print(f"  SKIP {url} (non-HTML: {ctype})")
                continue
            doc = lxml_html.fromstring(r.content)
            doc.make_links_absolute(url)
            title, descr, _canon = extract_meta(doc)
            body_html = clean_html_to_content(r.content, url)
            out_name = filename_for(url, title, i)
            out_path = os.path.join(args.outdir, out_name)
            docx_write_one(url, title, descr, body_html, out_path)
            print(f"  OK   {url} -> {out_name}")
        except Exception as e:
            print(f"  ERR  {url} ({e})")

    print("[3/3] Done.")
    print(f"Saved Word docs in: {os.path.abspath(args.outdir)}")

if __name__ == "__main__":
    main()

Step 6 — Run It (Using Zigma.ca)

Assuming the sitemap is at /sitemap.xml (switch to /sitemap if needed):

Windows PowerShell
python .\sitemap_to_word.py `
  --sitemap https://zigma.ca/sitemap.xml `
  --outdir .\exports `
  --max-urls 50
macOS / Linux
python ./sitemap_to_word.py \
  --sitemap https://zigma.ca/sitemap.xml \
  --outdir ./exports \
  --max-urls 50

Open the exports folder and you’ll see files like:

Example output

0001_home.docx
0002_about.docx
0003_services_digital_marketing.docx
...

Step 7 — Include/Exclude Sections (Optional)

Want only certain sections? Use --include-regex or --exclude-regex (you can combine them):

Only include blog & services
--include-regex '^https://(www\.)?zigma\.ca/(blog|services)/'
Exclude the blog
--exclude-regex '^https://(www\.)?zigma\.ca/blog/'

Tip for PowerShell: wrap regex in single quotes so backslashes aren’t “eaten” by the shell.
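
Not sure your pattern will match what you expect? You can dry-run it in Python against a few URLs before committing to a full export. The sample URLs below are made up for illustration; swap in real ones from your sitemap.

Python
# Sketch: test an include pattern against sample URLs before the real run.
import re

include = re.compile(r"^https://(www\.)?zigma\.ca/(blog|services)/")
samples = [
    "https://zigma.ca/blog/example-post/",   # hypothetical URL, should be kept
    "https://zigma.ca/contact/",             # hypothetical URL, should be skipped
]
for url in samples:
    print(url, "->", "KEEP" if include.search(url) else "SKIP")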

Step 8 — Optional Tweaks

  • Headings: The script maps H1–H4 to Word Heading 1–4. You can add H5/H6 mapping in add_html_to_docx if your site uses them heavily.
  • Inline bold/italic: To keep things simple, inline formatting is not preserved. If you want it, you can extend the function to parse inline tags and add Word “runs.”
  • One giant doc: Prefer a single Word file? Instead of writing per page, create one DocxDocument() at the start and append each page with a doc.add_page_break(). See the sketch below this list.
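
Here is a rough sketch of that single-file variant, reusing the script’s own helpers (safe_request, extract_meta, clean_html_to_content, drop_duplicate_leading_h1, add_html_to_docx). Treat it as a starting point, not a drop-in replacement for main():

Python
# Sketch: append every page to one Word file instead of writing one file per page.
# Assumes this function lives inside sitemap_to_word.py so it can reuse the helpers defined there.
def export_single_doc(urls, out_path="site_export.docx"):
    doc = DocxDocument()
    for url in urls:
        r = safe_request(url)
        if not r or r.status_code >= 400:
            continue
        page = lxml_html.fromstring(r.content)
        title, descr, _ = extract_meta(page)
        doc.add_heading(title or url, level=1)
        doc.add_paragraph(f"URL: {url}")
        if descr:
            doc.add_paragraph(f"Description tag: {descr}")
        body = drop_duplicate_leading_h1(clean_html_to_content(r.content, url), title)
        add_html_to_docx(doc, body)
        doc.add_page_break()  # one page break between site pages
    doc.save(out_path)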

FAQ & Troubleshooting

Will this download images? No. It intentionally strips images, scripts, iframes, and other non-text elements.

The output looks short / hero text missing. The script prefers the actual main/article region of the HTML. Most sites work great. If a page injects text via JavaScript, you may need a browser-based scraper (e.g., Playwright), which is more advanced.
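
If you do hit a JavaScript-heavy site, only the fetching step needs to change; the parsing and Word-writing stay the same. A rough sketch with Playwright (a separate install: pip install playwright, then playwright install chromium) could look like this:

Python
# Sketch: fetch fully rendered HTML with Playwright instead of requests.
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> bytes:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for scripts to finish injecting content
        html = page.content()
        browser.close()
    return html.encode("utf-8")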

“No URLs to process after filtering.” Your include/exclude regex might be too strict. Try without filters first.

Where did my files save? In the folder you passed to --outdir. Use absolute paths if unsure.

Is this legal? Only export content you own or have permission to copy. Respect robots.txt and the site’s terms and conditions.

Need Help? We’ll Do It For You.

Whether you’re planning a full site rewrite, an SEO audit, or a content migration, we can automate exports, preserve advanced formatting, generate CSV reports, and even handle JavaScript-heavy pages.

Talk to Zigma — Digital Marketing

If you have any projects coming up, contact us. We’re happy to help.