CLAUDE.md — Project Guide for Claude Code

What This Project Does

This is a scraping pipeline that collects Tasmanian planning development applications (DAs) from 29 council websites, geocodes them, and serves them via a PHP search portal. The code is a mix of Ruby (scrapers, enrichment), PHP (web portal), and shell (orchestration), all wired together with Docker Compose.

Key Files

File	Role
`lib/db.rb`	DB client, `ensure_table!`, `upsert` (dynamic columns, write-once semantics)
`lib/http.rb`	HTTP client — retries, cookie jar, browser-fingerprint headers, 403/406 warmup, curl fallback
`lib/geocode.rb`	Google Maps geocoding with SHA1 cache in `geo_cache` table
`lib/enrich.rb`	`enrich_after_upsert!` — geocoding + property lookup after each DB write
`lib/util.rb`	`parse_aus_date`, council-name/table-name mappings
`lib/scraper_helpers.rb`	Shared helpers: `abs_url`, `text_or`, `upsert_and_enrich!`
`run_all.sh`	Discovers `scrapers/*.rb`, filters by `ONLY`/`SKIP`, runs each with `TABLE_NAME` set
`entrypoint.sh`	Docker entry; waits for DB then runs `run_all.sh` (looping if `SCRAPE_EVERY_MINUTES` is set)
`scrapers/*.rb`	One scraper per council — parses HTML, upserts rows, calls `enrich_after_upsert!`
`web/index.php`	Search portal — dynamic UNION across all `da_*` tables

Running Things Locally

# Full stack
docker compose up -d

# Run all scrapers once
docker compose run --rm scraper /app/run_all.sh

# Run a single scraper
TABLE_NAME=da_brighton DEBUG=1 ruby scrapers/brighton.rb

# Run a subset
ONLY=meandervalley,kent docker compose run --rm scraper /app/run_all.sh

# Geocode backfill (batch, all tables)
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="..." \
  scraper ruby /app/tools/backfill_geocode.rb

# Geocode backfill (single table)
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="..." \
  -e ONLY_TABLE=da_brighton \
  scraper ruby /app/tools/backfill_geocode.rb

Architecture Conventions

Each scraper follows this pattern:

TABLE = ENV.fetch("TABLE_NAME") — set by run_all.sh from the filename
DB.ensure_table!(TABLE) — idempotent schema setup (all columns already included)
Fetch HTML via Http.get(url) (handles retries, cookies, WAF warmup, browser-fingerprint headers)
Parse with Nokogiri
DB.upsert(TABLE, row) — upserts on (council_reference, address), write-once for date_received
enrich_after_upsert!(table:, council_reference:, address:) — geocodes and enriches

WAF / Cloudflare handling:

lib/http.rb sends a full browser fingerprint on every request: User-Agent, sec-ch-ua*, Sec-Fetch-*, Upgrade-Insecure-Requests. This satisfies most WAF header checks automatically.
For sites that also need a warm cookie state (e.g. Burnie, King Island, Latrobe, Derwent Valley), the scraper implements a proactive homepage warmup before fetching the target page — see burnie.rb as the reference implementation.
Some councils (Kentish, Derwent Valley via direct site) use a Cloudflare JS challenge, which cannot be solved without a real browser. Their scrapers exit cleanly with a warning. Where a PlanBuild equivalent exists (council code in COUNCIL_MAP), data is still collected via planbuild.rb.
The warmup pattern (custom CookieJar + http_get with redirect handling) is self-contained in scrapers that need it and does not depend on lib/http.rb.
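
The redirect handling in the warmup pattern has one control-flow subtlety: next inside a block passed to Net::HTTP.start leaves only that block. The sketch below illustrates the redirect_to approach without making any network calls. RESPONSES, with_connection, and follow_redirects are hypothetical stand-ins for illustration, not code taken from burnie.rb.

```ruby
# Canned responses keyed by URL (hypothetical stand-in for a real site).
RESPONSES = {
  "https://example.org/"     => { code: 302, location: "https://example.org/home" },
  "https://example.org/home" => { code: 200, body: "<html>ok</html>" }
}.freeze

# Stand-in for Net::HTTP.start: yields once and returns the block's value.
def with_connection
  yield
end

# Follows redirects up to max_hops and returns the final body.
def follow_redirects(url, max_hops: 5)
  hops = 0
  while hops <= max_hops
    redirect_to = nil
    body = with_connection do
      res = RESPONSES.fetch(url)
      if res[:code].between?(300, 399)
        redirect_to = res[:location]
        next # leaves only this block; the while loop is unaffected
      end
      res[:body]
    end

    if redirect_to
      url = redirect_to
      hops += 1
      next # this one does restart the while loop
    end
    return body
  end
  raise "too many redirects"
end

puts follow_redirects("https://example.org/")
```

Keeping the redirect decision in a local variable means the connection block always finishes normally, and the loop control stays in one place after it.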

Write-once fields (in `DB.upsert`):

date_received — never overwritten once set
date_received_raw — never overwritten once non-blank
document_url / local_document_url — new value only replaces if existing is NULL
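
The write-once rules above can be sketched in plain Ruby on hashes. This only illustrates the merge semantics; it is not the SQL that DB.upsert runs. merge_write_once and the description column are hypothetical, and treating a whitespace-only string as blank is an assumption.

```ruby
require "date"

# Columns that keep their existing value unless it is nil (NULL).
WRITE_ONCE_UNLESS_NIL = %i[date_received document_url local_document_url].freeze
# Columns that keep their existing value unless it is nil or blank.
WRITE_ONCE_UNLESS_BLANK = %i[date_received_raw].freeze

# Returns the row that would result from upserting `incoming` over `existing`.
def merge_write_once(existing, incoming)
  merged = existing.dup
  incoming.each do |column, value|
    current = existing[column]
    if WRITE_ONCE_UNLESS_NIL.include?(column)
      merged[column] = value if current.nil?
    elsif WRITE_ONCE_UNLESS_BLANK.include?(column)
      merged[column] = value if current.nil? || current.to_s.strip.empty?
    else
      merged[column] = value # ordinary columns are always overwritten
    end
  end
  merged
end

existing = { description: "Shed", date_received: Date.new(2024, 3, 1),
             date_received_raw: "", document_url: nil }
incoming = { description: "Shed and carport", date_received: Date.new(2024, 4, 9),
             date_received_raw: "01/03/2024", document_url: "https://example.org/da.pdf" }

row = merge_write_once(existing, incoming)
# date_received stays 2024-03-01; the blank raw date and the nil URL are filled in.
```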

Table names:

Always derived from the scraper filename: scrapers/foo.rb → da_foo
run_all.sh sets TABLE_NAME=da_<basename> before invoking each scraper
The COUNCIL_MAP in lib/util.rb maps internal council keys to table names (used by PlanBuild integration)
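
The naming rule is simple enough to express in one line. run_all.sh does this in shell; the Ruby version below is an equivalent sketch, and table_name_for is a hypothetical helper name.

```ruby
# Derives the table name from a scraper path: scrapers/foo.rb -> da_foo.
def table_name_for(scraper_path)
  "da_#{File.basename(scraper_path, '.rb')}"
end

puts table_name_for("scrapers/brighton.rb")
```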

Error Handling Conventions

After a refactor, the project follows these rules:

URI building (URI.join, URI.parse) → rescue URI::InvalidURIError
DB operations (prepare/execute) → rescue Mysql2::Error => e; warn "[scraper] ..."
Zlib decompression → rescue Zlib::Error
Date parsing (Date.strptime, Date.parse) → rescue ArgumentError, Date::Error
JSON parsing → rescue JSON::ParserError
Network/HTTP → rescue Net::HTTPError, Net::ReadTimeout, Net::OpenTimeout, OpenSSL::SSL::SSLError, Errno::ECONNRESET, EOFError
Enrichment failures always warn to stderr — do not gate them behind ENRICH_DEBUG
No bare rescue — always specify the exception class(es)
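
A minimal sketch of three of these rules in practice follows. The helper names are hypothetical, and the dd/mm/yyyy format string is an assumption about the input; Util.parse_aus_date may handle more formats.

```ruby
require "date"
require "json"
require "uri"

# Parses a dd/mm/yyyy date; returns nil and warns on bad input.
def parse_date_or_nil(raw)
  Date.strptime(raw.to_s.strip, "%d/%m/%Y")
rescue ArgumentError, Date::Error => e
  warn "[scraper] bad date #{raw.inspect}: #{e.message}"
  nil
end

# Parses a JSON string; returns nil and warns on malformed input.
def parse_json_or_nil(raw)
  JSON.parse(raw)
rescue JSON::ParserError => e
  warn "[scraper] bad JSON: #{e.message}"
  nil
end

# Resolves href against base; returns nil and warns on an invalid URI.
def join_url_or_nil(base, href)
  URI.join(base, href).to_s
rescue URI::InvalidURIError => e
  warn "[scraper] bad URL #{href.inspect}: #{e.message}"
  nil
end

p parse_date_or_nil("09/04/2024")
p parse_date_or_nil("31/02/2024")
p join_url_or_nil("https://example.org/das/", "123.pdf")
```

Each helper rescues only the classes its operation is expected to raise, so an unrelated bug (for example a NoMethodError) still surfaces instead of being swallowed.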

Adding or Modifying a Scraper

When a council changes its website markup, only that scraper needs updating. The typical failure modes are:

Found 0 rows — CSS selector no longer matches; inspect the live page and update the selector
HTTP 403/406 — Council site added WAF; check Http.get options or add a proactive warmup step (see burnie.rb)
Cloudflare JS challenge ("Just a moment" in body) — cannot be solved in Ruby; exit cleanly with a warning
date_received all nil — Date format changed; update the format string passed to Util.parse_aus_date or Date.strptime
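
For the Cloudflare case, detection can be as small as a substring check on the response body. The sketch below assumes the "Just a moment" marker mentioned above is sufficient. cloudflare_challenge? and exit_if_challenged! are hypothetical helper names, and exiting with status 0 is an assumption about what "exit cleanly" means in this project.

```ruby
# Returns true when the body looks like a Cloudflare JS challenge page.
def cloudflare_challenge?(body)
  body.to_s.include?("Just a moment")
end

# Warns and stops the scraper without signalling failure to run_all.sh.
def exit_if_challenged!(body, council:)
  return unless cloudflare_challenge?(body)

  warn "[#{council}] Cloudflare JS challenge detected; skipping this run"
  exit 0
end

p cloudflare_challenge?("<title>Just a moment...</title>")
p cloudflare_challenge?("<table class='das'></table>")
```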

Template choice:

Simple HTML list/table → copy glamorgan.rb
Link/PDF listing → copy centralhighlands.rb
WAF-protected site needing homepage warmup → copy kingisland.rb (minimal) or burnie.rb (full-featured with PDF download)
Multi-hop redirect to detail pages → copy derwentvalley.rb

The shared infrastructure (Http, DB, enrich_after_upsert!) handles everything else.

Database Notes

MariaDB 10.11, utf8mb4 encoding throughout
Schema is created on-the-fly — CREATE TABLE IF NOT EXISTS + ALTER TABLE ... ADD COLUMN IF NOT EXISTS
Schema changes go in lib/migrate.rb (new migration at end of MIGRATIONS array) or lib/db.rb (ensure_table!) for columns every new table gets
The geo_cache table stores geocoding results keyed by SHA1 of the normalised query string — avoids redundant Google API calls
The UNIQUE KEY uniq_ref_addr (council_reference, address) constraint drives the upsert behaviour
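
To illustrate the geo_cache keying, the sketch below hashes a normalised address with SHA1. The normalisation shown (trim, lowercase, collapse whitespace) is an assumption; lib/geocode.rb may normalise differently. Both method names are hypothetical.

```ruby
require "digest/sha1"

# Assumed normalisation: trim, lowercase, collapse runs of whitespace.
def normalise_query(address)
  address.to_s.strip.downcase.gsub(/\s+/, " ")
end

# Cache key: 40-character hex SHA1 of the normalised query string.
def geo_cache_key(address)
  Digest::SHA1.hexdigest(normalise_query(address))
end

a = geo_cache_key("12  Main St, Hobart ")
b = geo_cache_key("12 main st, hobart")
puts a == b
```

Because trivially different spellings of the same address hash to the same key, they share one cached result and one Google API call.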

Web Portal Notes

web/index.php dynamically discovers all da_* tables and builds a UNION query
It handles missing columns gracefully (not all tables have every column)
web/backfill_pid_title.php is a legacy admin tool — it should not be publicly accessible; consider moving it out of the web root or placing it behind authentication

Common Gotchas

TABLE constant conflicts: Each scraper defines TABLE = ENV.fetch("TABLE_NAME") at the top level. If you require two scrapers in the same Ruby process you'll get a constant redefinition warning. Each scraper is designed to be run as a standalone script.
COUNCIL_FILTER / COUNCIL_WHITELIST: The docker-compose.yml has a COUNCIL_WHITELIST env var that is passed to the scraper container but is not wired into run_all.sh, so setting it has no effect. Filter with the ONLY / SKIP variables that run_all.sh reads instead.
PlanBuild scrapers: planbuild.rb handles councils on the state-run PlanBuild portal. It writes to per-council tables using Util.ref_to_table. These run alongside the council-specific scrapers.
PDF downloads: Only happen when DOWNLOAD_ATTACHMENTS=1. Files land in DOWNLOAD_DIR/<councilname>/. The web portal serves them from /downloads/ via an Apache alias.
Non-ASCII in PDF URLs: Some council sites embed Unicode characters (e.g. en-dash –) directly in PDF filenames. Always percent-encode hrefs before passing to URI.join — see burnie.rb first_pdf_on_detail for the pattern.
Redirect loops in Net::HTTP.start blocks: next inside a Net::HTTP.start block exits only the block; it does not continue the enclosing while loop. Set a redirect_to variable inside the block, then call next on the while loop after the block returns (see burnie.rb http_get_with_cookies).
Cloudflare JS challenge vs IP block: A JS challenge ("Just a moment") may be passable from a residential IP but always blocks requests from a datacenter/Docker IP. Detect it and exit cleanly. Sites confirmed blocked in Docker: derwentvalley.tas.gov.au, latrobe.tas.gov.au.
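
For the non-ASCII PDF URL gotcha, one possible way to percent-encode an href before URI.join is sketched below. encode_href is a hypothetical helper and is not necessarily how burnie.rb does it. It encodes spaces, control characters, and every non-ASCII character as UTF-8 bytes, and leaves existing %XX escapes untouched.

```ruby
require "uri"

# Percent-encodes everything outside printable, non-space ASCII.
def encode_href(href)
  href.gsub(/[^\x21-\x7E]/) do |ch|
    ch.bytes.map { |b| format("%%%02X", b) }.join
  end
end

href = "Plan \u2013 Site.pdf" # contains an en-dash
url  = URI.join("https://example.org/docs/", encode_href(href)).to_s
puts url
```

Encoding before joining means URI.join only ever sees ASCII input, so the URI::InvalidURIError rescue becomes a safety net rather than the normal path.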
