◇ open source · 2024 · python

Seven SEO reports from a log file.

Most SEO teams rely on crawlers and Google Search Console (GSC) to understand what search bots actually do on a site. Both have gaps. Server access logs don’t. I kept doing the same log analysis by hand on client projects, so I packaged it as seo-log-auditor, a Streamlit app anyone can launch with one command.


7

audit reports

1

command to run

0 €

tooling cost

MIT

open source license

01 · problem

Crawlers don’t tell the full story.

Screaming Frog and Botify are excellent for understanding what a site looks like from a crawler’s perspective. But they can only see what they’re allowed to see. Server access logs are the ground truth: every single HTTP request that hit the server, from every bot, regardless of robots.txt.

On every technical SEO audit I’ve run, the biggest leaks were invisible to crawlers. Budget wasted on redirect chains that resolved outside the crawl window. Bots hammering URLs with session parameters that no crawler would follow. Pages marked indexable but never touched by Googlebot in months.

The fix was always the same: open the logs, write some pandas, repeat. So I built the tool.

02 · architecture

Drop a log, get a dashboard.

The app accepts a raw log file via the Streamlit file uploader, parses it with regex, and pipes the resulting dataframe into seven tabbed analysis views. No database, no cloud dependency. Runs on a laptop.

# run locally in under 30 seconds
git clone https://github.com/hitensangani/seo-log-auditor
cd seo-log-auditor
pip install -r requirements.txt
streamlit run app.py
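
As a rough illustration of the regex parsing step, the sketch below turns one line in the Apache/Nginx "combined" log format into a dictionary. The pattern, the field names, and the `parse_line` helper are assumptions made for this example, not the tool's actual code. A list of such dictionaries can be passed straight to `pandas.DataFrame`.

```python
import re
from datetime import datetime

# Assumed input: Apache/Nginx "combined" log format. Real logs may differ.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])
    # The size field is "-" when no body was sent.
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    row["time"] = datetime.strptime(row["time"], "%d/%b/%Y:%H:%M:%S %z")
    return row

sample = (
    '66.249.66.1 - - [10/Mar/2024:13:55:36 +0000] '
    '"GET /products?sort=price HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
row = parse_line(sample)
print(row["url"], row["status"])
```

Returning `None` for lines that do not match keeps malformed entries from aborting a whole upload; they can be counted and reported separately.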

03 · reports

Seven views, one log file.

01 · crawl budget

Bot vs human traffic

Splits requests by user-agent, breaks down which bots are consuming budget, and highlights bot-to-human ratios that signal crawl inefficiency.
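
A minimal sketch of the user-agent split, assuming simple substring matching on the declared user-agent. The `BOT_TOKENS` map and both helper names are hypothetical, and the labels are unverified at this stage, so a spoofed user-agent would still count as a bot here.

```python
from collections import Counter

# Hypothetical substring -> label map; a real audit would use a longer list.
BOT_TOKENS = {
    "googlebot": "Googlebot",
    "bingbot": "Bingbot",
    "yandexbot": "YandexBot",
    "duckduckbot": "DuckDuckBot",
}

def classify_agent(user_agent):
    """Label a request by its declared user-agent (unverified, so spoofable)."""
    ua = user_agent.lower()
    for token, label in BOT_TOKENS.items():
        if token in ua:
            return label
    return "human/other"

def bot_share(user_agents):
    """Return (counts per label, fraction of requests attributed to bots)."""
    counts = Counter(classify_agent(ua) for ua in user_agents)
    total = sum(counts.values())
    bots = total - counts["human/other"]
    return counts, (bots / total if total else 0.0)

agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/122.0 Safari/537.36",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]
counts, share = bot_share(agents)
print(counts.most_common(), round(share, 2))
```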

02 · orphan urls

Pages bots find, humans don’t

Cross-references bot-accessed URLs against known sitemap and internal link signals to surface URLs that receive crawl attention but no organic equity.
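
The cross-reference can be pictured as a set difference. In this sketch the sitemap and internal-link URL lists are assumed to come from outside the log (for example a sitemap file and a crawl export), and `find_orphan_candidates` is a hypothetical name.

```python
def find_orphan_candidates(bot_urls, sitemap_urls, linked_urls):
    """URLs bots requested that appear in neither the sitemap nor internal links."""
    known = set(sitemap_urls) | set(linked_urls)
    return sorted(set(bot_urls) - known)

# Illustrative data only.
bot_urls = ["/", "/pricing", "/old-campaign", "/tag/legacy", "/pricing"]
sitemap_urls = ["/", "/pricing", "/blog"]
linked_urls = ["/", "/blog", "/about"]
print(find_orphan_candidates(bot_urls, sitemap_urls, linked_urls))
# prints ['/old-campaign', '/tag/legacy']
```

The result is a list of candidates rather than confirmed orphans: URL normalisation (trailing slashes, case, query strings) would need to be applied consistently to all three inputs first.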

03 · status waste

4xx and 5xx crawl cost

Ranks the most-crawled non-200 URLs by bot hits, so you can see exactly how much budget is being spent on broken or server-error responses.
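
One way to sketch this ranking in plain Python, assuming each parsed row carries `url`, `status`, and a hypothetical boolean `is_bot` field set by an earlier classification step:

```python
from collections import Counter

def status_waste(rows, top_n=10):
    """Rank (url, status) pairs with non-200 responses by bot hit count."""
    hits = Counter(
        (row["url"], row["status"])
        for row in rows
        # Use `row["status"] >= 400` instead to restrict this to 4xx/5xx only.
        if row["is_bot"] and row["status"] != 200
    )
    return hits.most_common(top_n)

# Illustrative rows only.
rows = [
    {"url": "/old", "status": 404, "is_bot": True},
    {"url": "/old", "status": 404, "is_bot": True},
    {"url": "/api/feed", "status": 500, "is_bot": True},
    {"url": "/", "status": 200, "is_bot": True},
    {"url": "/old", "status": 404, "is_bot": False},
]
for (url, status), count in status_waste(rows):
    print(url, status, count)
```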

04 · stale pages

Crawled but forgotten

Flags URLs that no major crawler has requested in 30+ days. A useful signal for pages that may still be indexed but have fallen outside Googlebot’s active crawl cycle.
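
A minimal sketch of the staleness check, assuming a list of `(url, timestamp)` pairs for bot requests. The `stale_urls` name and the sample dates are illustrative. Note that a log can only flag URLs that appear in it at least once; pages never crawled during the log window would need an external URL list to detect.

```python
from datetime import datetime, timedelta, timezone

def stale_urls(bot_hits, now, max_age_days=30):
    """Return URLs whose most recent bot hit is older than max_age_days."""
    last_seen = {}
    for url, seen_at in bot_hits:
        if url not in last_seen or seen_at > last_seen[url]:
            last_seen[url] = seen_at
    cutoff = now - timedelta(days=max_age_days)
    return sorted(url for url, seen_at in last_seen.items() if seen_at < cutoff)

utc = timezone.utc
now = datetime(2024, 3, 31, tzinfo=utc)
hits = [
    ("/a", datetime(2024, 3, 30, tzinfo=utc)),
    ("/b", datetime(2024, 2, 1, tzinfo=utc)),
    ("/b", datetime(2024, 2, 20, tzinfo=utc)),
    ("/c", datetime(2024, 1, 5, tzinfo=utc)),
    ("/a", datetime(2024, 1, 1, tzinfo=utc)),
]
print(stale_urls(hits, now))
# prints ['/b', '/c']
```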

05 · performance

Slow URL identification

Aggregates server response times per URL and surfaces the slowest pages. Slow server responses are a rough proxy for Core Web Vitals (CWV) problems, worth checking before pulling up Lighthouse or CrUX data.
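
The aggregation could look roughly like the sketch below. It assumes the log records a response time per request, which the default combined format does not: Nginx needs `$request_time` and Apache needs `%D` or `%T` added to the log format. The `slowest_urls` helper and its `min_hits` threshold are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median

def slowest_urls(samples, top_n=5, min_hits=2):
    """Aggregate response times per URL and rank by mean, slowest first."""
    by_url = defaultdict(list)
    for url, seconds in samples:
        by_url[url].append(seconds)
    stats = [
        {"url": url, "hits": len(times), "mean": mean(times), "median": median(times)}
        for url, times in by_url.items()
        if len(times) >= min_hits  # skip URLs with too few samples to trust
    ]
    return sorted(stats, key=lambda s: s["mean"], reverse=True)[:top_n]

# Illustrative (url, seconds) samples.
samples = [("/search", 1.5), ("/search", 2.5), ("/", 0.1), ("/", 0.3), ("/rare", 9.0)]
for entry in slowest_urls(samples):
    print(entry["url"], entry["hits"], round(entry["mean"], 2))
```

The `min_hits` filter is a design choice: a single slow request to a rarely hit URL would otherwise top the ranking.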

06 · bot verification

Fake vs legitimate bots

Checks the IP address behind each declared user-agent against the reverse-DNS hostnames that genuine crawlers resolve to, identifying scrapers and shadow crawlers masquerading as Googlebot or Bingbot.
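
A hedged sketch of reverse-DNS verification with forward confirmation, the approach Google and Bing document for their crawlers. This is not necessarily how the tool implements it: `verify_bot`, the `TRUSTED_SUFFIXES` table, and the injectable resolver arguments are assumptions made so the example can run offline.

```python
import socket

# Hostname suffixes that Google and Bing document for their crawlers.
TRUSTED_SUFFIXES = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}

def verify_bot(ip, user_agent,
               reverse=socket.gethostbyaddr,
               forward=socket.gethostbyname_ex):
    """Return True/False for a claimed bot, or None if no known bot is claimed."""
    ua = user_agent.lower()
    suffixes = next((s for token, s in TRUSTED_SUFFIXES.items() if token in ua), None)
    if suffixes is None:
        return None
    try:
        hostname = reverse(ip)[0]
        if not hostname.lower().endswith(suffixes):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP.
        # Note: gethostbyname_ex only handles IPv4 addresses.
        return ip in forward(hostname)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False

# Offline demo with stub resolvers (hypothetical IP/hostname pairs).
fake_reverse = lambda ip: ("crawl-66-249-66-1.googlebot.com", [], [ip])
fake_forward = lambda host: (host, [], ["66.249.66.1"])
print(verify_bot("66.249.66.1", "Googlebot/2.1", fake_reverse, fake_forward))
print(verify_bot("203.0.113.9", "Googlebot/2.1", fake_reverse, fake_forward))
```

DNS lookups are slow, so in practice results would be cached per IP rather than resolved per log line.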

07 · parameter traps

Infinite URL space detection

Surfaces URL patterns with high parameter variance — session IDs, filters, sort orders — that fragment crawl budget across hundreds of near-duplicate URLs.
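
Parameter variance can be approximated by counting distinct query strings per path. The sketch below is one possible approach, with a hypothetical `parameter_traps` helper and an arbitrary `min_variants` threshold.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

def parameter_traps(urls, min_variants=3):
    """Group URLs by path and report paths with many distinct query strings."""
    variants = defaultdict(set)
    param_names = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        if not parts.query:
            continue
        variants[parts.path].add(parts.query)
        param_names[parts.path].update(
            name for name, _ in parse_qsl(parts.query, keep_blank_values=True)
        )
    report = [
        (path, len(queries), sorted(param_names[path]))
        for path, queries in variants.items()
        if len(queries) >= min_variants
    ]
    return sorted(report, key=lambda item: item[1], reverse=True)

# Illustrative URLs only.
urls = [
    "/shoes?sort=price",
    "/shoes?sort=name",
    "/shoes?sessionid=abc",
    "/shoes?sessionid=def&sort=price",
    "/about?ref=nav",
    "/shoes",
]
print(parameter_traps(urls))
# prints [('/shoes', 4, ['sessionid', 'sort'])]
```

Listing the parameter names alongside the variant count makes it easier to tell a session-ID leak from a legitimate filter or sort facet.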

04 · context

Why I built it in public.

Log analysis is one of the highest-leverage SEO activities and one of the least democratised. Enterprise teams use Botify or Oncrawl. Everyone else exports a CSV from their hosting panel and opens Excel. That gap felt fixable.

Building in public also forced me to write cleaner code — the kind that runs on someone else’s machine with a pip install and no tribal knowledge. That discipline translates directly to the production PRs I write on engineering teams.

The project is MIT-licensed. Contributions welcome.
