# CLAUDE.md — Project Guide for Claude Code

## What This Project Does

This is a scraping pipeline that collects Tasmanian planning development applications (DAs) from 29 council websites, geocodes them, and serves them via a PHP search portal. The code is a mix of Ruby (scrapers, enrichment), PHP (web portal), and shell (orchestration), all wired together with Docker Compose.

---

## Key Files

| File | Role |
|---|---|
| `lib/db.rb` | DB client, `ensure_table!`, `upsert` (dynamic columns, write-once semantics) |
| `lib/http.rb` | HTTP client — retries, cookie jar, browser-fingerprint headers, 403/406 warmup, curl fallback |
| `lib/geocode.rb` | Google Maps geocoding with SHA1 cache in `geo_cache` table |
| `lib/enrich.rb` | `enrich_after_upsert!` — geocoding + property lookup after each DB write |
| `lib/util.rb` | `parse_aus_date`, council-name/table-name mappings |
| `lib/scraper_helpers.rb` | Shared helpers: `abs_url`, `text_or`, `upsert_and_enrich!` |
| `run_all.sh` | Discovers `scrapers/*.rb`, filters by `ONLY`/`SKIP`, runs each with `TABLE_NAME` set |
| `entrypoint.sh` | Docker entry; waits for DB then runs `run_all.sh` (looping if `SCRAPE_EVERY_MINUTES` is set) |
| `scrapers/*.rb` | One scraper per council — parses HTML, upserts rows, calls `enrich_after_upsert!` |
| `web/index.php` | Search portal — dynamic UNION across all `da_*` tables |

---

## Running Things Locally

```bash
# Full stack
docker compose up -d

# Run all scrapers once
docker compose run --rm scraper /app/run_all.sh

# Run a single scraper
TABLE_NAME=da_brighton DEBUG=1 ruby scrapers/brighton.rb

# Run a subset
ONLY=meandervalley,kent docker compose run --rm scraper /app/run_all.sh

# Geocode backfill (batch, all tables)
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="..." \
  scraper ruby /app/tools/backfill_geocode.rb

# Geocode backfill (single table)
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="..." \
  -e ONLY_TABLE=da_brighton \
  scraper ruby /app/tools/backfill_geocode.rb
```

---

## Architecture Conventions

### Each scraper follows this pattern:
1. `TABLE = ENV.fetch("TABLE_NAME")` — set by `run_all.sh` from the filename
2. `DB.ensure_table!(TABLE)` — idempotent schema setup (all columns already included)
3. Fetch HTML via `Http.get(url)` (handles retries, cookies, WAF warmup, browser-fingerprint headers)
4. Parse with Nokogiri
5. `DB.upsert(TABLE, row)` — upserts on `(council_reference, address)`, write-once for `date_received`
6. `enrich_after_upsert!(table:, council_reference:, address:)` — geocodes and enriches

### WAF / Cloudflare handling:
- `lib/http.rb` sends a full browser fingerprint on every request: `User-Agent`, `sec-ch-ua*`, `Sec-Fetch-*`, `Upgrade-Insecure-Requests`. This satisfies most WAF header checks automatically.
- For sites that also need a **warm cookie state** (e.g. Burnie, King Island, Latrobe, Derwent Valley), the scraper implements a proactive homepage warmup before fetching the target page — see `burnie.rb` as the reference implementation.
- Some councils (Kentish, Derwent Valley via direct site) use a Cloudflare JS challenge, which cannot be solved without a real browser. Scrapers for these sites exit cleanly with a warning. Where a PlanBuild equivalent exists (council code in `COUNCIL_MAP`), data is still collected via `planbuild.rb`.
- The warmup pattern (custom `CookieJar` + `http_get` with redirect handling) is self-contained in scrapers that need it and does **not** depend on `lib/http.rb`.
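The cookie-state part of the warmup can be illustrated with a minimal jar. This is a sketch only: `SimpleCookieJar` is a hypothetical name, and the `CookieJar` in `burnie.rb` may differ. It keeps just name/value pairs and ignores attributes such as `Path`, `Domain` and expiry.

```ruby
# Hypothetical minimal cookie jar (not the project's CookieJar).
# Stores name=value pairs from Set-Cookie headers and replays them
# as a single Cookie request header.
class SimpleCookieJar
  def initialize
    @cookies = {}
  end

  # Accepts one Set-Cookie header string or an array of them, e.g. the
  # result of Net::HTTPResponse#get_fields("set-cookie") (nil is ignored).
  def store(set_cookie_headers)
    Array(set_cookie_headers).each do |header|
      pair = header.split(";", 2).first.to_s
      name, value = pair.split("=", 2)
      next if name.nil? || name.strip.empty? || value.nil?

      @cookies[name.strip] = value.strip
    end
  end

  # Value for the Cookie request header on the follow-up request.
  def header
    @cookies.map { |name, value| "#{name}=#{value}" }.join("; ")
  end

  def empty?
    @cookies.empty?
  end
end
```

In a warmup flow, the jar would be filled from the homepage response and its `header` value sent with the request for the target page.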

### Write-once fields (in `DB.upsert`):
- `date_received` — never overwritten once set
- `date_received_raw` — never overwritten once non-blank
- `document_url` / `local_document_url` — new value only replaces if existing is NULL
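The write-once rules above can be modelled in plain Ruby as follows. This is a sketch of the semantics only, assuming rows are hashes with string keys; `merge_write_once` is a hypothetical helper, and the real `DB.upsert` applies these rules in its own way inside `lib/db.rb`.

```ruby
# Hypothetical model of the write-once rules; not the code in lib/db.rb.
def blank?(value)
  value.nil? || value.to_s.strip.empty?
end

# existing: the row already stored (Hash), incoming: the freshly scraped row.
# Returns the row as it should look after the upsert.
def merge_write_once(existing, incoming)
  merged = existing.merge(incoming)

  # date_received: never overwritten once set.
  merged["date_received"] = existing["date_received"] unless existing["date_received"].nil?

  # date_received_raw: never overwritten once non-blank.
  unless blank?(existing["date_received_raw"])
    merged["date_received_raw"] = existing["date_received_raw"]
  end

  # Document URLs: a new value only fills a NULL.
  %w[document_url local_document_url].each do |column|
    merged[column] = existing[column] unless existing[column].nil?
  end

  merged
end
```

All other columns simply take the incoming value, which matches the "dynamic columns" behaviour described for `upsert`.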

### Table names:
- Always derived from the scraper filename: `scrapers/foo.rb` → `da_foo`
- `run_all.sh` sets `TABLE_NAME=da_<basename>` before invoking each scraper
- The `COUNCIL_MAP` in `lib/util.rb` maps internal council keys to table names (used by PlanBuild integration)
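For reference, the filename-to-table rule is small enough to express in one line of Ruby. `table_name_for` is a hypothetical helper shown for illustration; in the project the derivation happens in `run_all.sh`.

```ruby
# Hypothetical helper mirroring the naming rule: scrapers/foo.rb -> da_foo.
def table_name_for(scraper_path)
  "da_#{File.basename(scraper_path, '.rb')}"
end
```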

---

## Error Handling Conventions

The codebase was refactored to follow these rules; keep new code consistent with them:

- **URI building** (`URI.join`, `URI.parse`) → `rescue URI::InvalidURIError`
- **DB operations** (prepare/execute) → `rescue Mysql2::Error => e; warn "[scraper] ..."`
- **Zlib decompression** → `rescue Zlib::Error`
- **Date parsing** (`Date.strptime`, `Date.parse`) → `rescue ArgumentError, Date::Error`
- **JSON parsing** → `rescue JSON::ParserError`
- **Network/HTTP** → `rescue Net::HTTPError, Net::ReadTimeout, Net::OpenTimeout, OpenSSL::SSL::SSLError, Errno::ECONNRESET, EOFError`
- **Enrichment failures** always `warn` to stderr — do not gate them behind `ENRICH_DEBUG`
- **No bare `rescue`** — always specify the exception class(es)
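A few of these rules applied to small helpers, as a sketch. The helper names (`safe_join`, `safe_date`, `safe_json`) and the default `%d/%m/%Y` format are assumptions for illustration and are not taken from the project's `lib/` code.

```ruby
require "date"
require "json"
require "uri"

# URI building: rescue only URI::InvalidURIError.
def safe_join(base, href)
  URI.join(base, href).to_s
rescue URI::InvalidURIError => e
  warn "[scraper] bad href #{href.inspect}: #{e.message}"
  nil
end

# Date parsing: rescue ArgumentError and Date::Error, never a bare rescue.
def safe_date(text, format = "%d/%m/%Y")
  Date.strptime(text.to_s.strip, format)
rescue ArgumentError, Date::Error
  nil
end

# JSON parsing: rescue only JSON::ParserError.
def safe_json(text)
  JSON.parse(text.to_s)
rescue JSON::ParserError => e
  warn "[scraper] invalid JSON: #{e.message}"
  nil
end
```

Each helper returns `nil` on the expected failure and lets anything unexpected propagate, which is the point of naming the exception classes.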

---

## Adding or Modifying a Scraper

When a council changes its website markup, only that scraper needs updating. Typical failure modes are:
- `Found 0 rows` — CSS selector no longer matches; inspect the live page and update the selector
- HTTP 403/406 — Council site added a WAF; check `Http.get` options or add a proactive warmup step (see `burnie.rb`)
- Cloudflare JS challenge (`"Just a moment"` in body) — cannot be solved in Ruby; exit cleanly with a warning
- `date_received` all nil — Date format changed; update the format string passed to `Util.parse_aus_date` or `Date.strptime`
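The first three symptoms above can be told apart mechanically. The sketch below is illustrative only: `diagnose_fetch` and its return symbols are hypothetical and are not part of the shared helpers.

```ruby
# Hypothetical triage helper for a single fetch-and-parse attempt.
# status: HTTP status code, body: response body, rows: number of parsed rows.
def diagnose_fetch(status:, body:, rows:)
  # Check the body first: a challenge page is the more specific diagnosis,
  # whatever status code accompanies it.
  return :cloudflare_challenge if body.to_s.include?("Just a moment")
  return :waf_blocked if [403, 406].include?(status)
  return :selector_stale if rows.zero?

  :ok
end
```

A scraper could `warn` and exit cleanly on `:cloudflare_challenge`, try a warmup on `:waf_blocked`, and flag `:selector_stale` for a manual look at the live page.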

**Template choice:**
- Simple HTML list/table → copy `glamorgan.rb`
- Link/PDF listing → copy `centralhighlands.rb`
- WAF-protected site needing homepage warmup → copy `kingisland.rb` (minimal) or `burnie.rb` (full-featured with PDF download)
- Multi-hop redirect to detail pages → copy `derwentvalley.rb`

The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everything else.

---

## Database Notes

- MariaDB 10.11, `utf8mb4` encoding throughout
- Schema is created on-the-fly — `CREATE TABLE IF NOT EXISTS` + `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`
- Schema changes go in `lib/migrate.rb` (append a new migration to the end of the `MIGRATIONS` array) or, for columns that every new table should get, in `ensure_table!` in `lib/db.rb`
- The `geo_cache` table stores geocoding results keyed by SHA1 of the normalised query string — avoids redundant Google API calls
- The `UNIQUE KEY uniq_ref_addr (council_reference, address)` constraint drives the upsert behaviour
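The `geo_cache` idea can be sketched with an in-memory store. The normalisation shown here (trim, lowercase, collapse whitespace) is an assumption, as are the names `normalise_query`, `geo_cache_key` and `InMemoryGeoCache`; the actual rules live in `lib/geocode.rb` and the real cache is a database table.

```ruby
require "digest/sha1"

# Assumed normalisation: trim, lowercase, collapse runs of whitespace.
def normalise_query(address)
  address.to_s.strip.downcase.gsub(/\s+/, " ")
end

# Cache key: SHA1 hex digest of the normalised query string.
def geo_cache_key(address)
  Digest::SHA1.hexdigest(normalise_query(address))
end

# In-memory stand-in for the geo_cache table. The block passed to `new`
# plays the role of the Google Maps API call.
class InMemoryGeoCache
  def initialize(&lookup)
    @lookup = lookup
    @store = {}
  end

  # Returns the cached result, calling the lookup only on a cache miss.
  def fetch(address)
    key = geo_cache_key(address)
    @store.fetch(key) { @store[key] = @lookup.call(address) }
  end
end
```

Because the key is computed from the normalised string, addresses that differ only in case or spacing share one cache entry and trigger a single lookup.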

## Web Portal Notes

- `web/index.php` dynamically discovers all `da_*` tables and builds a UNION query
- It handles missing columns gracefully (not all tables have every column)
- `web/backfill_pid_title.php` is a legacy admin tool — it should not be publicly accessible; consider moving it out of the web root or placing it behind authentication

---

## Common Gotchas

- **`TABLE` constant conflicts**: Each scraper defines `TABLE = ENV.fetch("TABLE_NAME")` at the top level. If you `require` two scrapers in the same Ruby process you'll get a constant redefinition warning. Each scraper is designed to be run as a standalone script.
- **`COUNCIL_FILTER` / `COUNCIL_WHITELIST`**: `docker-compose.yml` passes a `COUNCIL_WHITELIST` env var to the scraper container, but `run_all.sh` does not read it, so it has no effect. Filter with the `ONLY` / `SKIP` variables that `run_all.sh` does read.
- **PlanBuild scrapers**: `planbuild.rb` handles councils on the state-run PlanBuild portal. It writes to per-council tables using `Util.ref_to_table`. These run alongside the council-specific scrapers.
- **PDF downloads**: Only happen when `DOWNLOAD_ATTACHMENTS=1`. Files land in `DOWNLOAD_DIR/<councilname>/`. The web portal serves them from `/downloads/` via an Apache alias.
- **Non-ASCII in PDF URLs**: Some council sites embed Unicode characters (e.g. en-dash `–`) directly in PDF filenames. Always percent-encode hrefs before passing to `URI.join` — see `burnie.rb` `first_pdf_on_detail` for the pattern.
- **Redirect loops in `Net::HTTP.start` blocks**: `next` inside a `Net::HTTP.start` block only returns from the block; it does not skip to the next iteration of the enclosing `while` loop. Set a `redirect_to` variable inside the block, then call `next` at loop level after the block returns (see `burnie.rb` `http_get_with_cookies`).
- **Cloudflare JS challenge vs IP block**: A JS challenge (`"Just a moment"`) may pass from a residential IP but is always blocked from a datacenter/Docker IP. Detect it and exit cleanly. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`.
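Two of the gotchas above, non-ASCII hrefs and `next` inside a connection block, can be sketched together. Everything here is illustrative: `encode_href`, `get_following_redirects` and `FakeHTTP` are hypothetical names, and `FakeHTTP.start` is only a stand-in for `Net::HTTP.start` so the example runs without network access. The reference implementations remain the ones in `burnie.rb`.

```ruby
require "uri"

# Percent-encode spaces and non-ASCII characters so URI.join accepts the href.
# Other ASCII characters, including existing %XX escapes, are left untouched.
def encode_href(href)
  href.to_s.gsub(/[^\x00-\x7F]| /) do |char|
    char.bytes.map { |byte| format("%%%02X", byte) }.join
  end
end

# Stand-in for Net::HTTP.start: yields [status, payload] to the block,
# where payload is a Location value for redirects and a body otherwise.
module FakeHTTP
  ROUTES = {
    "https://example.test/" => [302, "/home"],
    "https://example.test/home" => [200, "listing page"]
  }.freeze

  def self.start(url)
    yield ROUTES.fetch(url)
  end
end

def get_following_redirects(url, max_hops: 5)
  hops = 0
  while hops <= max_hops
    redirect_to = nil
    body = nil

    FakeHTTP.start(url) do |status, payload|
      if [301, 302, 303, 307, 308].include?(status)
        redirect_to = URI.join(url, encode_href(payload)).to_s
        next # returns from this block only; the while loop is unaffected
      end
      body = payload
    end

    if redirect_to
      url = redirect_to
      hops += 1
      next # loop level: start the next iteration with the new URL
    end
    return body
  end
  raise "too many redirects (last URL: #{url})"
end
```

Keeping the redirect decision in a local variable makes the control flow explicit: the block only records what happened, and the loop decides what to do next.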