# CLAUDE.md — Project Guide for Claude Code

## What This Project Does

This is a scraping pipeline that collects Tasmanian planning development applications (DAs) from 29 council websites, geocodes them, and serves them via a PHP search portal. The code is a mix of Ruby (scrapers, enrichment), PHP (web portal), and shell (orchestration), all wired together with Docker Compose.

---

## Key Files

| File | Role |
|---|---|
| `lib/db.rb` | DB client, `ensure_table!`, `upsert` (with write-once semantics for some fields) |
| `lib/http.rb` | HTTP client — retries, cookie jar, 403/406 warmup, curl fallback |
| `lib/geocode.rb` | Google Maps geocoding with SHA1 cache in `geo_cache` table |
| `lib/enrich.rb` | `enrich_after_upsert!` — geocoding + property lookup after each DB write |
| `lib/util.rb` | `parse_aus_date`, council-name/table-name mappings |
| `run_all.sh` | Discovers `scrapers/*.rb`, filters by `ONLY`/`SKIP`, runs each with `TABLE_NAME` set |
| `entrypoint.sh` | Docker entry; waits for DB then runs `run_all.sh` (looping if `SCRAPE_EVERY_MINUTES` is set) |
| `scrapers/*.rb` | One scraper per council — parses HTML, upserts rows, calls `enrich_after_upsert!` |
| `web/index.php` | Search portal — dynamic UNION across all `da_*` tables |

---

## Running Things Locally

```bash
# Full stack
docker compose up -d

# Run all scrapers once
docker compose run --rm scraper /app/run_all.sh

# Run a single scraper
TABLE_NAME=da_brighton DEBUG=1 ruby scrapers/brighton.rb

# Run a subset
ONLY=meandervalley,kent docker compose run --rm scraper /app/run_all.sh

# Geocode backfill (batch, all tables)
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="..." \
  scraper ruby /app/tools/backfill_geocode.rb

# Geocode backfill (single table)
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="..." \
  -e ONLY_TABLE=da_brighton \
  scraper ruby /app/tools/backfill_geocode.rb
```

---

## Architecture Conventions

### Each scraper follows this pattern:

1. `TABLE = ENV.fetch("TABLE_NAME")` — set by `run_all.sh` from the filename
2. `DB.ensure_table!(TABLE)` — idempotent schema setup (all columns already included)
3. Fetch HTML via `Http.get(url)` (handles retries, cookies, WAF warmup)
4. Parse with Nokogiri
5. `DB.upsert(TABLE, row)` — upserts on `(council_reference, address)`, write-once for `date_received`
6. `enrich_after_upsert!(table:, council_reference:, address:)` — geocodes and enriches

### Write-once fields (in `DB.upsert`):

- `date_received` — never overwritten once set
- `date_received_raw` — never overwritten once non-blank
- `document_url` / `local_document_url` — new value only replaces if existing is NULL

### Table names:

- Always derived from the scraper filename: `scrapers/foo.rb` → `da_foo`
- `run_all.sh` sets `TABLE_NAME=da_` before invoking each scraper
- The `COUNCIL_MAP` in `lib/util.rb` maps internal council keys to table names (used by PlanBuild integration)

---

## Error Handling Conventions

After a refactor, the project follows these rules:

- **URI building** (`URI.join`, `URI.parse`) → `rescue URI::InvalidURIError`
- **DB operations** (prepare/execute) → `rescue Mysql2::Error => e; warn "[scraper] ..."`
- **Zlib decompression** → `rescue Zlib::Error`
- **Date parsing** (`Date.strptime`, `Date.parse`) → `rescue ArgumentError, Date::Error`
- **JSON parsing** → `rescue JSON::ParserError`
- **Network/HTTP** → `rescue Net::HTTPError, Net::ReadTimeout, Net::OpenTimeout, OpenSSL::SSL::SSLError, Errno::ECONNRESET, EOFError`
- **Enrichment failures** always `warn` to stderr — do not gate them behind `ENRICH_DEBUG`
- **No bare `rescue`** — always specify the exception class(es)

---

## Adding or Modifying a Scraper

When a council changes its website markup, only that scraper needs updating.
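
When changing a scraper, it helps to remember what a re-run can and cannot overwrite. The write-once rules listed under Architecture Conventions can be modelled in plain Ruby roughly as follows. This is only a sketch under assumptions: `merge_write_once` and `blank_value?` are hypothetical helpers that work on hashes, while the real logic lives in `DB.upsert`, runs against MariaDB, and may be implemented differently (for example in SQL).

```ruby
# Hypothetical model of the write-once rules; the real DB.upsert may differ.

# Fields that are locked as soon as the stored value is non-NULL.
WRITE_ONCE_IF_SET = %i[date_received document_url local_document_url].freeze
# Fields that are locked as soon as the stored value is non-blank.
WRITE_ONCE_IF_NONBLANK = %i[date_received_raw].freeze

# True for nil and for strings that are empty or whitespace only.
def blank_value?(value)
  value.nil? || value.to_s.strip.empty?
end

# Returns a new hash: incoming values win, except for locked write-once fields.
def merge_write_once(existing, incoming)
  merged = existing.dup
  incoming.each do |key, value|
    locked =
      if WRITE_ONCE_IF_SET.include?(key)
        !existing[key].nil?
      elsif WRITE_ONCE_IF_NONBLANK.include?(key)
        !blank_value?(existing[key])
      else
        false
      end
    merged[key] = value unless locked
  end
  merged
end

# Illustrative rows (made-up values, not real data).
existing = { date_received: "2024-01-05", date_received_raw: "",
             document_url: nil, description: "Shed" }
incoming = { date_received: "2024-02-01", date_received_raw: "1/2/2024",
             document_url: "https://example.org/a.pdf",
             description: "Shed and carport" }

merged = merge_write_once(existing, incoming)
```

In this model, `merged[:date_received]` keeps the stored value, while the blank `date_received_raw`, the NULL `document_url`, and the ordinary `description` field all take the incoming values.
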
Typical failure modes are:

- `Found 0 rows` — CSS selector no longer matches; inspect the live page and update the selector
- HTTP 403/406 — Council site added WAF; check `Http.get` options or add a warmup step
- `date_received` all nil — Date format changed; update the format string passed to `Util.parse_aus_date` or `Date.strptime`

To add a new scraper, copy a structurally similar one (e.g. `glamorgan.rb` for table-based sites, `centralhighlands.rb` for link/PDF-based sites) and adapt the parsing logic. The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everything else.

---

## Database Notes

- MariaDB 10.11, `utf8mb4` encoding throughout
- Schema is created on-the-fly — `CREATE TABLE IF NOT EXISTS` + `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`
- Schema changes go in `lib/migrate.rb` (new migration at end of `MIGRATIONS` array) or `lib/db.rb` (`ensure_table!`) for columns every new table gets
- The `geo_cache` table stores geocoding results keyed by SHA1 of the normalised query string — avoids redundant Google API calls
- The `UNIQUE KEY uniq_ref_addr (council_reference, address)` constraint drives the upsert behaviour

## Web Portal Notes

- `web/index.php` dynamically discovers all `da_*` tables and builds a UNION query
- It handles missing columns gracefully (not all tables have every column)
- `web/backfill_pid_title.php` is a legacy admin tool — it should not be publicly accessible; consider moving it out of the web root or placing it behind authentication

---

## Common Gotchas

- **`TABLE` constant conflicts**: Each scraper defines `TABLE = ENV.fetch("TABLE_NAME")` at the top level. If you `require` two scrapers in the same Ruby process you'll get a constant redefinition warning. Each scraper is designed to be run as a standalone script.
- **`COUNCIL_FILTER` / `COUNCIL_WHITELIST`**: The `docker-compose.yml` has a `COUNCIL_WHITELIST` env var that is passed to the scraper container but is not wired into `run_all.sh`.
  Use `ONLY` / `SKIP` in `run_all.sh` instead.
- **PlanBuild scrapers**: `planbuild.rb` and `planbuild_fetch.js` handle councils on the state-run PlanBuild portal. They write to per-council tables using `Util.ref_to_table`. These are separate from the council-specific scrapers.
- **PDF downloads**: Only happen when `DOWNLOAD_ATTACHMENTS=1`. Files land in `DOWNLOAD_DIR//`. The web portal serves them from `/downloads/` via an Apache alias.
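
To make the `ONLY` / `SKIP` gotcha above more concrete, here is a rough Ruby model of how scrapers are selected and mapped to table names. The real logic is shell code in `run_all.sh` and may differ. `select_scrapers` and `parse_list` are hypothetical names, and the sketch assumes that both variables hold comma-separated scraper basenames and that `SKIP` is applied after `ONLY`.

```ruby
# Hypothetical Ruby model of scraper selection; the real logic is shell in run_all.sh.

# Splits a comma-separated value such as "meandervalley,kent" into clean names.
def parse_list(value)
  value.to_s.split(",").map(&:strip).reject(&:empty?)
end

# Returns [path, table_name] pairs for the scrapers that should run.
# Assumption: an empty ONLY means "run everything"; SKIP always removes names.
def select_scrapers(paths, only: nil, skip: nil)
  only_names = parse_list(only)
  skip_names = parse_list(skip)

  paths.sort.filter_map do |path|
    name = File.basename(path, ".rb")
    next if !only_names.empty? && !only_names.include?(name)
    next if skip_names.include?(name)

    # Table name is derived from the filename: scrapers/foo.rb -> da_foo
    [path, "da_#{name}"]
  end
end

# Illustrative input; a real run would use Dir.glob("scrapers/*.rb").
paths = ["scrapers/kent.rb", "scrapers/brighton.rb", "scrapers/meandervalley.rb"]

select_scrapers(paths, only: "meandervalley,kent").each do |path, table|
  puts "TABLE_NAME=#{table} ruby #{path}"
end
```

Note that `COUNCIL_WHITELIST` plays no part in this selection, which matches the gotcha: setting it in `docker-compose.yml` does not change which scrapers run.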