# CLAUDE.md: Project Guide for Claude Code

## What This Project Does

This is a scraping pipeline that collects Tasmanian planning development applications (DAs) from 29 council websites, geocodes them, and serves them via a PHP search portal. The code is a mix of Ruby (scrapers, enrichment), PHP (web portal), and shell (orchestration), all wired together with Docker Compose.

---

## Key Files

| File | Role |
|---|---|
| `lib/db.rb` | DB client, `ensure_table!`, `upsert` (dynamic columns, write-once semantics) |
| `lib/http.rb` | HTTP client: retries, cookie jar, browser-fingerprint headers, 403/406 warmup, curl fallback |
| `lib/geocode.rb` | Google Maps geocoding with SHA1 cache in `geo_cache` table |
| `lib/enrich.rb` | `enrich_after_upsert!`: geocoding + property lookup after each DB write |
| `lib/util.rb` | `parse_aus_date`, council-name/table-name mappings |
| `lib/scraper_helpers.rb` | Shared helpers: `abs_url`, `text_or`, `upsert_and_enrich!` |
| `run_all.sh` | Discovers `scrapers/*.rb`, filters by `ONLY`/`SKIP`, runs each with `TABLE_NAME` set |
| `entrypoint.sh` | Docker entry; waits for DB then runs `run_all.sh` (looping if `SCRAPE_EVERY_MINUTES` is set) |
| `scrapers/*.rb` | One scraper per council: parses HTML, upserts rows, calls `enrich_after_upsert!` |
| `web/index.php` | Search portal: dynamic UNION across all `da_*` tables |

---

## Running Things Locally

```bash
# Full stack
docker compose up -d

# Run all scrapers once
docker compose run --rm scraper /app/run_all.sh

# Run a single scraper
TABLE_NAME=da_brighton DEBUG=1 ruby scrapers/brighton.rb

# Run a subset
ONLY=meandervalley,kent docker compose run --rm scraper /app/run_all.sh

# Geocode backfill (batch, all tables)
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="..." \
  scraper ruby /app/tools/backfill_geocode.rb

# Geocode backfill (single table)
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="..." \
  -e ONLY_TABLE=da_brighton \
  scraper ruby /app/tools/backfill_geocode.rb
```

---

## Architecture Conventions

### Each scraper follows this pattern:

1. `TABLE = ENV.fetch("TABLE_NAME")`: set by `run_all.sh` from the filename
2. `DB.ensure_table!(TABLE)`: idempotent schema setup (all columns already included)
3. Fetch HTML via `Http.get(url)` (handles retries, cookies, WAF warmup, browser-fingerprint headers)
4. Parse with Nokogiri
5. `DB.upsert(TABLE, row)`: upserts on `(council_reference, address)`, write-once for `date_received`
6. `enrich_after_upsert!(table:, council_reference:, address:)`: geocodes and enriches

### WAF / Cloudflare handling:

- `lib/http.rb` sends a full browser fingerprint on every request: `User-Agent`, `sec-ch-ua*`, `Sec-Fetch-*`, `Upgrade-Insecure-Requests`. This satisfies most WAF header checks automatically.
- For sites that also need a **warm cookie state** (e.g. Burnie, King Island, Latrobe, Derwent Valley), the scraper implements a proactive homepage warmup before fetching the target page; see `burnie.rb` as the reference implementation.
- Some councils (Kentish, Derwent Valley via direct site) use a Cloudflare JS challenge, which cannot be solved without a real browser. These scrapers exit cleanly with a warning. Where a PlanBuild equivalent exists (council code in `COUNCIL_MAP`), data is still collected via `planbuild.rb`.
- The warmup pattern (custom `CookieJar` + `http_get` with redirect handling) is self-contained in scrapers that need it and does **not** depend on `lib/http.rb`.
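
As a rough illustration of the "exit cleanly with a warning" behaviour for Cloudflare JS challenges, the Ruby sketch below checks a response body for the challenge page text. The helper names `cloudflare_challenge?` and `skip_if_challenged` are hypothetical and are not part of `lib/http.rb`; matching on the string `Just a moment` assumes the challenge page contains that text, and the real scrapers may detect it differently.

```ruby
# Hypothetical helpers for illustration only; not part of lib/http.rb.

# Text assumed to appear in the body of a Cloudflare JS challenge page.
CHALLENGE_MARKER = "Just a moment"

# Returns true when the body looks like a Cloudflare JS challenge page.
def cloudflare_challenge?(body)
  return false if body.nil?

  body.include?(CHALLENGE_MARKER)
end

# Warns to stderr and returns true when the scraper should stop early.
# Returns false when the body looks like a normal page.
def skip_if_challenged(body, council:)
  return false unless cloudflare_challenge?(body)

  warn "[#{council}] Cloudflare JS challenge detected; skipping (needs a real browser)"
  true
end

sample = "<html><head><title>Just a moment...</title></head></html>"
# A standalone scraper script would typically call `exit 0` at this point.
puts "would exit cleanly here" if skip_if_challenged(sample, council: "kentish")
```

Returning a boolean instead of calling `exit` inside the helper keeps the check easy to exercise outside a scraper; the calling script decides how to stop.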
### Write-once fields (in `DB.upsert`):

- `date_received`: never overwritten once set
- `date_received_raw`: never overwritten once non-blank
- `document_url` / `local_document_url`: new value only replaces if existing is NULL

### Table names:

- Always derived from the scraper filename: `scrapers/foo.rb` → `da_foo`
- `run_all.sh` sets `TABLE_NAME=da_` before invoking each scraper
- The `COUNCIL_MAP` in `lib/util.rb` maps internal council keys to table names (used by PlanBuild integration)

---

## Error Handling Conventions

After a refactor, the project follows these rules:

- **URI building** (`URI.join`, `URI.parse`) → `rescue URI::InvalidURIError`
- **DB operations** (prepare/execute) → `rescue Mysql2::Error => e; warn "[scraper] ..."`
- **Zlib decompression** → `rescue Zlib::Error`
- **Date parsing** (`Date.strptime`, `Date.parse`) → `rescue ArgumentError, Date::Error`
- **JSON parsing** → `rescue JSON::ParserError`
- **Network/HTTP** → `rescue Net::HTTPError, Net::ReadTimeout, Net::OpenTimeout, OpenSSL::SSL::SSLError, Errno::ECONNRESET, EOFError`
- **Enrichment failures** always `warn` to stderr; do not gate them behind `ENRICH_DEBUG`
- **No bare `rescue`**: always specify the exception class(es)

---

## Adding or Modifying a Scraper

When a council changes its website markup, only that scraper needs updating.
The typical failure modes are:

- `Found 0 rows`: CSS selector no longer matches; inspect the live page and update the selector
- HTTP 403/406: Council site added WAF; check `Http.get` options or add a proactive warmup step (see `burnie.rb`)
- Cloudflare JS challenge (`"Just a moment"` in body): cannot be solved in Ruby; exit cleanly with a warning
- `date_received` all nil: Date format changed; update the format string passed to `Util.parse_aus_date` or `Date.strptime`

**Template choice:**

- Simple HTML list/table → copy `glamorgan.rb`
- Link/PDF listing → copy `centralhighlands.rb`
- WAF-protected site needing homepage warmup → copy `kingisland.rb` (minimal) or `burnie.rb` (full-featured with PDF download)
- Multi-hop redirect to detail pages → copy `derwentvalley.rb`

The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everything else.

---

## Database Notes

- MariaDB 10.11, `utf8mb4` encoding throughout
- Schema is created on the fly: `CREATE TABLE IF NOT EXISTS` + `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`
- Schema changes go in `lib/migrate.rb` (new migration at end of `MIGRATIONS` array) or `lib/db.rb` (`ensure_table!`) for columns every new table gets
- The `geo_cache` table stores geocoding results keyed by SHA1 of the normalised query string, which avoids redundant Google API calls
- The `UNIQUE KEY uniq_ref_addr (council_reference, address)` constraint drives the upsert behaviour

## Web Portal Notes

- `web/index.php` dynamically discovers all `da_*` tables and builds a UNION query
- It handles missing columns gracefully (not all tables have every column)
- `web/backfill_pid_title.php` is a legacy admin tool; it should not be publicly accessible. Consider moving it out of the web root or placing it behind authentication

---

## Common Gotchas

- **`TABLE` constant conflicts**: Each scraper defines `TABLE = ENV.fetch("TABLE_NAME")` at the top level.
  If you `require` two scrapers in the same Ruby process you'll get a constant redefinition warning. Each scraper is designed to be run as a standalone script.
- **`COUNCIL_FILTER` / `COUNCIL_WHITELIST`**: The `docker-compose.yml` has a `COUNCIL_WHITELIST` env var that is passed to the scraper container but is not wired into `run_all.sh`. Use `ONLY` / `SKIP` in `run_all.sh` instead.
- **PlanBuild scrapers**: `planbuild.rb` handles councils on the state-run PlanBuild portal. It writes to per-council tables using `Util.ref_to_table`. These run alongside the council-specific scrapers.
- **PDF downloads**: Only happen when `DOWNLOAD_ATTACHMENTS=1`. Files land in `DOWNLOAD_DIR//`. The web portal serves them from `/downloads/` via an Apache alias.
- **Non-ASCII in PDF URLs**: Some council sites embed Unicode characters (e.g. en-dash `–`) directly in PDF filenames. Always percent-encode hrefs before passing them to `URI.join`; see `burnie.rb` `first_pdf_on_detail` for the pattern.
- **Redirect loops in `Net::HTTP.start` blocks**: `next` inside a `Net::HTTP.start` block only returns from the block; it does not advance the enclosing `while` loop. Use a `redirect_to` variable set inside the block and call `next` on the `while` loop after the block returns; see `burnie.rb` `http_get_with_cookies`.
- **Cloudflare JS challenge vs IP block**: A JS challenge (`"Just a moment"`) may pass from a residential IP but always blocks requests from a datacenter/Docker IP. Detect it and exit cleanly. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`.
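
The redirect gotcha above can be sketched without any network access. In this hedged example, `with_connection` is a hypothetical stand-in for `Net::HTTP.start` (it just yields), and `FAKE_RESPONSES` is made-up data. It is not the code in `burnie.rb`; it only shows a `redirect_to` variable being set inside the block and `next` being called on the `while` loop afterwards.

```ruby
# Hypothetical stand-in for Net::HTTP.start: yields once, returns the block's value.
def with_connection(url)
  yield url
end

# Made-up responses keyed by URL: [status, redirect_target_or_body].
FAKE_RESPONSES = {
  "/a" => [302, "/b"],
  "/b" => [302, "/c"],
  "/c" => [200, "final page"]
}.freeze

def fetch_following_redirects(start_url, max_hops: 5)
  url = start_url
  hops = 0
  while hops < max_hops
    hops += 1
    redirect_to = nil
    body = nil

    with_connection(url) do |current|
      status, payload = FAKE_RESPONSES.fetch(current)
      if status == 302
        # Record the target only. A `next` here would just return from the
        # block, and execution would carry on after it in the same iteration.
        redirect_to = payload
      else
        body = payload
      end
    end

    if redirect_to
      url = redirect_to
      next # this `next` belongs to the while loop
    end

    return body
  end
  raise "too many redirects from #{start_url}"
end

puts fetch_following_redirects("/a")
```

Capping the loop with `max_hops` (an assumed parameter name) means a misconfigured site that redirects forever raises an error rather than hanging the scraper.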