Replacing a €6k/yr SaaS with an open-source crawl auditor.
Most SEO teams either pay between €300 and €1,000 a month for a tool that ships their raw access logs off to a vendor cloud, or they fly blind on crawl budget. Both are wrong. So I built seo-log-auditor: a free, open-source, local-first Streamlit dashboard that turns a 30-day nginx export into seven concrete crawl-audit views in under 30 seconds, without a single byte leaving your laptop.
Log analysis was a captive market.
Every serious technical SEO audit needs the same input: real server access logs. The questions you can answer from them are the highest-leverage ones in the field. Where is Googlebot actually spending its crawl budget? Which orphan URLs is it hitting? How much crawl is being burned on redirects and 4xx? Which sitemap pages has Google not visited in 30 days? Which paths are exploding into hundreds of parameterised variants?
And yet the entire category is gated behind €300–€1,000-a-month SaaS tools that require you to upload your raw production access logs to a vendor cloud. That is a privacy and compliance request every SRE I have ever worked with pushes back on, and rightly so. The result: most teams quietly skip log analysis altogether, which is the same as deciding crawl budget waste does not exist.
Seven questions, one local screen.
I sat down with a year of audit notes and listed every question I had ever asked of an access log file. The list collapsed cleanly into seven recurring views. Each one answers a single question that maps to a single recommendation. Each one has a chart, a table, and a download.
- Crawl Budget. Where is Googlebot spending hits, vs. where do your URLs actually live? Large positive delta = over-crawled page type. Large negative = neglected.
- Orphan Pages. URLs Googlebot is hitting that aren’t in your sitemap. Often old marketing landers or deleted sections still earning backlinks.
- Status Waste. Share of crawl traffic burned on 3xx, 4xx, 5xx, broken down by page type.
- Stale Pages. Sitemap URLs Google has not visited in N days. Prime candidates for the “deep crawl leakage” problem.
- Performance. How page size and latency correlate with crawl frequency. Find the size-based inflection point where Google starts skipping.
- Bot Verification. Real Googlebot (verified against Google’s published IP ranges or rDNS) vs. spoofed user-agents.
- Parameter Traps. Paths whose query strings explode into hundreds of unique variants: faceted nav, session IDs, tracking params, sort orders.
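The last of those is the easiest to show in code. A minimal sketch of the parameter-trap check, assuming a parsed DataFrame with `path` and `query` columns (illustrative column names, not the tool's actual schema):

```python
import pandas as pd

def parameter_traps(df: pd.DataFrame, threshold: int = 100) -> pd.DataFrame:
    """Paths whose query strings explode into many unique crawlable variants."""
    with_query = df[df["query"].notna() & (df["query"] != "")]
    variants = (
        with_query.groupby("path")["query"]
        .nunique()
        .rename("unique_variants")
        .reset_index()
    )
    # Anything with hundreds of variants is usually faceted nav,
    # session IDs, or tracking params eating crawl budget.
    return variants[variants["unique_variants"] >= threshold].sort_values(
        "unique_variants", ascending=False
    )
```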
A pipeline with no vendor.
- Loki / Grafana export. JSON, JSONL, NDJSON, CSV, or raw nginx. Auto-detected.
- Python parser. Vectorised pandas. 430k rows in under 30 seconds on a laptop (sketched just below).
- Sitemap + Googlebot IPs. Two outbound calls, both to public endpoints. That is it.
- Streamlit dashboard. Seven multipage views, custom Plotly theme, served on localhost.
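A sketch of what that vectorised parsing stage can look like, assuming the default nginx combined log format (the real parser auto-detects more formats than this one regex covers):

```python
import pandas as pd

# nginx "combined" format: one named-group regex applied to every line
# at once via str.extract — no Python-level loop over rows.
LOG_RE = (
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "[^"]*" "(?P<user_agent>[^"]*)"'
)

def parse_nginx(lines: pd.Series) -> pd.DataFrame:
    df = lines.str.extract(LOG_RE)
    df["time"] = pd.to_datetime(df["time"], format="%d/%b/%Y:%H:%M:%S %z")
    df["status"] = pd.to_numeric(df["status"], errors="coerce").astype("Int64")
    df["bytes"] = pd.to_numeric(df["bytes"], errors="coerce")  # "-" becomes NaN
    return df

df = parse_nginx(pd.Series(open("access.log").read().splitlines()))
```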
The whole tool is around 1,500 lines of Python. The seven analysis modules are pure functions: they take a parsed DataFrame and return a DataFrame. No Streamlit dependency, so they import cleanly into a notebook or a CI script. The Streamlit layer is intentionally thin.
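In practice that looks something like the snippet below. The import paths are placeholders I am using to show the shape of the API, not guaranteed names; check the repo for the real ones:

```python
# Placeholder imports — illustrative names, not the package's guaranteed API.
# The contract is the point: parsed DataFrame in, result DataFrame out.
from seo_log_auditor import analysis, parsing  # hypothetical module names

df = parsing.parse_file("access.log")        # hypothetical helper
report = analysis.crawl_budget(df)           # pure function, no Streamlit import
report.to_csv("crawl_budget.csv", index=False)
```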
```bash
# the entire onboarding
$ uvx seo-log-auditor
# dashboard opens at localhost:8501
# drop in your log file, paste your sitemap, click Load
```
v0.1, in public.
- PyPI package with console-script entry point. Works with `uvx`, `pipx`, or plain `pip`.
- Double-click launchers for macOS and Windows so non-CLI users can run it without ever opening a terminal.
- Auto-detecting parser for Loki / Grafana JSON, JSONL, NDJSON, CSV, and raw nginx access-log lines.
- Editorial Tech theme matching this site, with a bundled Plotly template for consistent dark-mode charts.
- Verified Googlebot detection via Google's published IP-range JSON and optional reverse-DNS lookups (sketched after this list).
- YAML-driven URL classifier with a starter config bundled, so users can tag pagination, faceted filters, and tracking beacons as their own page types (also sketched below).
- 31 unit tests, GitHub Actions CI on Python 3.10 / 3.11 / 3.12, MIT license, contribution and security policies.
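One of the tool's two outbound calls is the Googlebot IP-range fetch, and the verification itself is a plain containment check. A minimal sketch against Google's published ranges (reverse DNS, the optional second signal, is omitted here):

```python
import ipaddress
import json
from urllib.request import urlopen

# Google's published Googlebot ranges.
GOOGLEBOT_RANGES = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

def load_googlebot_networks() -> list:
    data = json.load(urlopen(GOOGLEBOT_RANGES))
    return [
        ipaddress.ip_network(p.get("ipv4Prefix") or p["ipv6Prefix"])
        for p in data["prefixes"]
    ]

def is_verified_googlebot(ip: str, networks: list) -> bool:
    # A "Googlebot" user-agent from an IP outside these ranges is a spoofer.
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)
```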
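As for the classifier: the bundled starter config defines the real schema, but the mechanism is just ordered regex rules. A minimal sketch with a hypothetical config shape:

```python
import re
import yaml

# Hypothetical config shape — the bundled starter config is authoritative.
STARTER = """
page_types:
  pagination: '[?&]page=\\d+'
  tracking:   '[?&](utm_|gclid|fbclid)'
  product:    '^/p/'
"""

RULES = yaml.safe_load(STARTER)["page_types"]

def classify(url: str) -> str:
    # First matching rule wins; everything else falls through to "other".
    for page_type, pattern in RULES.items():
        if re.search(pattern, url):
            return page_type
    return "other"

assert classify("/p/blue-widget") == "product"
assert classify("/c/widgets?page=7") == "pagination"
```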
What this proves.
- Local-first is a feature, not a constraint. Removing the “upload your logs” step removed the single biggest objection from every SRE / compliance review I have ever seen on log tooling.
- Seven views beats seventy. Every analysis module maps 1:1 to an audit recommendation. Tools that surface 50 dashboards and let you figure it out are doing the wrong job.
- Distribution beats marketing. One `uvx` command beats a landing page with a free-trial form. Every paywall added between user and tool kills adoption.
- Open source is a portfolio asset. A public repo with tests, CI, a license, and a README is more legible to engineers than a case-study deck. This page is mostly a pointer to the repo.
Roadmap: direct Loki API streaming, SQLite cache for week-over-week comparison, Cloudflare and AWS log parsers, a `--demo` flag with bundled synthetic data, and internal-link-depth correlation via Screaming Frog and Sitebulb importers. PRs are welcome; start with CONTRIBUTING.md.