# CLAUDE.md — Project Guide for Claude Code

## What This Project Does

This is a scraping pipeline that collects Tasmanian planning development applications (DAs) from 29 council websites, geocodes them, and serves them via a PHP search portal. The code is a mix of Ruby (scrapers, enrichment), PHP (web portal), and shell (orchestration), all wired together with Docker Compose.

---

## Key Files

| File | Role |
|---|---|
| `lib/db.rb` | DB client, `ensure_table!`, `upsert` (dynamic columns, write-once semantics) |
| `lib/http.rb` | HTTP client — retries, cookie jar, browser-fingerprint headers, 403/406 warmup, curl fallback |
| `lib/geocode.rb` | Google Maps geocoding with SHA1 cache in `geo_cache` table |
| `lib/enrich.rb` | `enrich_after_upsert!` — geocoding + property lookup after each DB write |
| `lib/util.rb` | `parse_aus_date`, council-name/table-name mappings |
| `lib/scraper_helpers.rb` | Shared helpers: `abs_url`, `text_or`, `upsert_and_enrich!` |
| `lib/migrate.rb` | Sequential schema migration runner — add new migrations at end of `MIGRATIONS` array |
| `lib/llm.php` | LLM inference helper for PHP — calls Ollama-compatible API (llama-swap primary, Ollama fallback) |
| `run_all.sh` | Discovers `scrapers/*.rb`, filters by `ONLY`/`SKIP`, runs each with `TABLE_NAME` set; prints summary table; emails on error |
| `entrypoint.sh` | Docker entry; waits for DB then runs `run_all.sh` (looping if `SCRAPE_EVERY_MINUTES` is set) |
| `scrapers/*.rb` | One scraper per council — parses HTML, upserts rows, calls `enrich_after_upsert!` |
| `web/index.php` | Search portal — dynamic UNION across all `da_*` tables |
| `tools/send_summary_email.rb` | Sends HTML error-summary email via SMTP (called by `run_all.sh` when any scraper ERRORs) |
| `tools/backfill_geocode.rb` | Batch geocode backfill for existing rows (supports `ONLY_TABLE`, `DRY_RUN`) |

---

## Running Things Locally

```bash
# Full stack
docker compose up -d

# Run all scrapers once
docker compose run --rm scraper /app/run_all.sh

# Run a single scraper
TABLE_NAME=da_brighton DEBUG=1 ruby scrapers/brighton.rb

# Run a subset
ONLY=meandervalley,westtamar docker compose run --rm scraper /app/run_all.sh

# Geocode backfill (batch, all tables)
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="..." \
  scraper ruby /app/tools/backfill_geocode.rb

# Geocode backfill (single table)
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="..." \
  -e ONLY_TABLE=da_brighton \
  scraper ruby /app/tools/backfill_geocode.rb
```

---

## Architecture Conventions

### Each scraper follows this pattern:
1. `TABLE = ENV.fetch("TABLE_NAME")` — set by `run_all.sh` from the filename
2. `DB.ensure_table!(TABLE)` — idempotent schema setup (all columns already included)
3. Fetch HTML via `Http.get(url)` (handles retries, cookies, WAF warmup, browser-fingerprint headers)
4. Parse with Nokogiri
5. `DB.upsert(TABLE, row)` — upserts on `(council_reference, address)`, write-once for `date_received`
6. `enrich_after_upsert!(table:, council_reference:, address:)` — geocodes and enriches

### WAF / Cloudflare handling:
- `lib/http.rb` sends a full browser fingerprint on every request: `User-Agent`, `sec-ch-ua*`, `Sec-Fetch-*`, `Upgrade-Insecure-Requests`. This satisfies most WAF header checks automatically.
- For sites that also need a **warm cookie state** (e.g. Burnie, King Island, Latrobe, Derwent Valley), the scraper implements a proactive homepage warmup before fetching the target page — see `burnie.rb` as the reference implementation.
- Some councils (Kentish, Derwent Valley via direct site) use Cloudflare JS challenge which cannot be solved without a real browser. These exit cleanly with a warning. Where a PlanBuild equivalent exists (council code in `COUNCIL_MAP`), data is still collected via `planbuild.rb`.
- The warmup pattern (custom `CookieJar` + `http_get` with redirect handling) is self-contained in scrapers that need it and does **not** depend on `lib/http.rb`.

### PDF Downloads

- Only happen when `DOWNLOAD_ATTACHMENTS=1` (set in `docker-compose.yml` or at runtime)
- Files land in `DOWNLOAD_DIR/<councilname>/<ref>/filename.pdf` inside the container
- The web container mounts the same folder at `/srv/files` and Apache serves it via `Alias /files /srv/files`
- **`local_document_url` must be stored as `/files/<councilname>/...`** — not `/downloads/...`. The Apache alias is `/files`, not `/downloads`.
- The web portal prefers `local_document_url` over `document_url` when rendering the document button
- For multi-document DAs (e.g. Launceston), all docs are stored as JSON in `documents_json` and rendered as a list of buttons in the portal

### Write-once fields (in `DB.upsert`)

- `date_received` — never overwritten once set
- `date_received_raw` — never overwritten once non-blank
- `document_url` / `local_document_url` — new value only replaces if existing is NULL

### Table names

- Always derived from the scraper filename: `scrapers/foo.rb` → `da_foo`
- `run_all.sh` sets `TABLE_NAME=da_<basename>` before invoking each scraper
- The `COUNCIL_MAP` in `lib/util.rb` maps internal council keys to table names (used by PlanBuild integration)

### run_all.sh summary table

- After all scrapers finish, prints a formatted table: Council | Saved | Warns | Status
- Status values: `ok`, `warn`, `blocked` (Cloudflare), `ERROR` (non-zero exit)
- Saved count: parsed from scraper stdout — looks for `"Saved N"` (case-insensitive) first, falls back to counting `"Upserted"` lines
- All scrapers should end with `puts "Done #{TABLE}. Saved #{n} item(s)."` for correct summary parsing
- If any scraper has ERROR status and `SMTP_HOST` is set, `tools/send_summary_email.rb` sends an HTML summary email

---

## Error Handling Conventions

- **URI building** (`URI.join`, `URI.parse`) → `rescue URI::InvalidURIError`
- **DB operations** (prepare/execute) → `rescue Mysql2::Error => e; Log.warn ...`
- **Zlib decompression** → `rescue Zlib::Error`
- **Date parsing** (`Date.strptime`, `Date.parse`) → `rescue ArgumentError, Date::Error`
- **JSON parsing** → `rescue JSON::ParserError`
- **Network/HTTP** → `rescue Net::HTTPError, Net::ReadTimeout, Net::OpenTimeout, OpenSSL::SSL::SSLError, Errno::ECONNRESET, EOFError`
- **Enrichment failures** always `warn` to stderr — do not gate them behind `ENRICH_DEBUG`
- **No bare `rescue`** — always specify the exception class(es)

---

## Adding or Modifying a Scraper

When a council changes its website markup, only that scraper needs updating. The typical failure mode is:
- `Found 0 rows` — CSS selector no longer matches; inspect the live page and update the selector
- HTTP 403/406 — Council site added WAF; check `Http.get` options or add a proactive warmup step (see `burnie.rb`)
- Cloudflare JS challenge (`"Just a moment"` in body) — cannot be solved in Ruby; exit cleanly with a warning
- `date_received` all nil — Date format changed; update the format string passed to `Util.parse_aus_date` or `Date.strptime`

**Template choice:**

- Simple HTML list/table with one entry per row → copy `glamorgan.rb`
- Single page, entries grouped under `<h2>` headings → copy `northernmidlands.rb`
- Single page, entries under `<h2>` with labeled `<strong>` fields + PDF in `<ul>` → copy `westtamar.rb`
- Link/PDF listing → copy `centralhighlands.rb`
- WAF-protected site needing homepage warmup → copy `kingisland.rb` (minimal) or `burnie.rb` (full-featured with PDF download)
- Multi-hop redirect to detail pages → copy `derwentvalley.rb`

The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everything else.

---

## Database Notes

- MariaDB 10.11, `utf8mb4` encoding throughout
- Schema is created on-the-fly — `CREATE TABLE IF NOT EXISTS` + `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`
- Schema changes go in `lib/migrate.rb` (new migration at end of `MIGRATIONS` array) or `lib/db.rb` (`ensure_table!`) for columns every new table gets
- The `geo_cache` table stores geocoding results keyed by SHA1 of the normalised query string — avoids redundant Google API calls
- The `UNIQUE KEY uniq_ref_addr (council_reference, address)` constraint drives the upsert behaviour
- Current migration versions: v1 (enrichment/geocode columns), v2 (geo_cache table), v3 (documents_json), v4 (Launceston detail columns), v5 (rewrite /downloads/ → /files/ in local_document_url), v6 (LLM classification columns)

### Schema — notable columns added beyond base

| Column | Type | Notes |
| --- | --- | --- |
| `documents_json` | MEDIUMTEXT | JSON array of `{name, url, local_url}` — used when a DA has multiple PDFs (e.g. Launceston) |
| `status` | VARCHAR(100) | Application status text (Launceston eProperty) |
| `assigned_officer` | VARCHAR(255) | Assigned planning officer (Launceston) |
| `group` | VARCHAR(100) | Application group (Launceston) — reserved SQL word, always quoted |
| `category` | VARCHAR(100) | Application category (Launceston) |
| `application_valid` | DATE | Date application deemed valid (Launceston) |
| `advertised_on` | DATE | Date first advertised (Launceston) |
| `property_legal_description` | TEXT | Certificate of Title / legal description (Launceston) |

---

## Web Portal Notes

- `web/index.php` dynamically discovers all `da_*` tables and builds a UNION query
- It handles missing columns gracefully (not all tables have every column)
- Document display: if `documents_json` is present → renders a button per document using the name from JSON; otherwise falls back to single "Open document" button using `local_document_url` → `document_url`
- `web/backfill_pid_title.php` is a legacy admin tool — it should not be publicly accessible

---

## Email Summary

`tools/send_summary_email.rb` is called by `run_all.sh` when any scraper exits with ERROR status. It:

- Reads SMTP config from env vars: `SMTP_HOST`, `SMTP_PORT`, `SMTP_USERNAME`, `SMTP_PASSWORD`, `SMTP_SMTPSecure` (`tls`/`ssl`), `SMTP_SENTFROM`, `SMTP_ADDADDRESS`
- Uses Ruby stdlib `net/smtp` — no gems required
- Sends multipart (plain + HTML) email with colour-coded summary table
- Silently skips if `SMTP_HOST` is not set

---

## Common Gotchas

- **`TABLE` constant conflicts**: Each scraper defines `TABLE = ENV.fetch("TABLE_NAME")` at the top level. If you `require` two scrapers in the same Ruby process you'll get a constant redefinition warning. Each scraper is designed to be run as a standalone script.
- **`COUNCIL_FILTER` env var**: Used only by `scrapers/hobartcity.rb` to filter which councils to scrape from the Hobart eProperty portal. It has no effect on any other scraper or on `run_all.sh`. To run a subset of scrapers, use `ONLY` / `SKIP` in `run_all.sh`.
- **PlanBuild scrapers**: `planbuild.rb` handles councils on the state-run PlanBuild portal. It writes to per-council tables using `Util.ref_to_table`. These run alongside the council-specific scrapers.
- **PDF download path**: `local_document_url` must begin with `/files/` (not `/downloads/`). The Apache alias in `web/000-files.conf` is `Alias /files /srv/files`. Using `/downloads/` results in 404 in the web portal.
- **Binary PDF downloads**: Pass `headers: { "Accept" => "application/pdf,*/*", "Referer" => URL }` to `Http.get` when downloading PDFs from CDN subdomains — some CDNs reject requests without a valid referrer.
- **Non-ASCII in PDF URLs**: Some council sites embed Unicode characters (e.g. en-dash `–`) directly in PDF filenames. Always percent-encode hrefs before passing to `URI.join` — see `burnie.rb` `first_pdf_on_detail` for the pattern.
- **Redirect loops in `Net::HTTP.start` blocks**: `next` inside a `Net::HTTP.start` block exits the block, not the enclosing `while` loop. Use a `redirect_to` variable set inside the block and call `next` on the `while` loop after the block returns — see `burnie.rb` `http_get_with_cookies`.
- **Cloudflare JS challenge vs IP block**: A JS challenge (`"Just a moment"`) may work from a residential IP but always block from a datacenter/Docker IP. Detect it and exit cleanly. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`, `kentish.tas.gov.au`, `centralhighlands.tas.gov.au`. Where a PlanBuild equivalent exists (council code in `COUNCIL_MAP`), data is still collected via `planbuild.rb`.
- **`group` column**: This is a reserved SQL word. In `DB.upsert` it is safe because all column names are backtick-quoted. In raw SQL always write `` `group` ``.

---

## Next Phase — LLM-Based PDF Classification

### Goal

Extract structured information from downloaded DA PDFs using a local LLaMA model — primarily **application type** (Residential, Commercial, Industrial, Subdivision, etc.) but potentially other fields not reliably scraped from HTML (e.g. lot size, number of dwellings, value of works).

### LLM Infrastructure

A local Ollama instance is running at `http://192.168.8.73:11434` (env var: `LLAMA_URL`).

`lib/llm.php` (already in the repo) shows the integration pattern for PHP:

- Primary backend: llama-swap via OpenAI-compatible `/v1/chat/completions`
- Fallback: Ollama `/api/generate`
- Config loaded from `config/ai.php` — `LLAMACPP_HOST`, `OLLAMA_HOST`, `LLAMACPP_MODEL`, `OLLAMA_MODEL`, etc.

For the Ruby scraper pipeline the equivalent is a direct Ollama HTTP call (no gems needed — stdlib `net/http`):

```ruby
# Minimal Ollama call — POST to /api/generate
require "net/http"
require "json"

def llm_classify(text, model: "llama3.2")
  uri  = URI("#{ENV.fetch('LLAMA_URL', 'http://192.168.8.73:11434')}/api/generate")
  body = JSON.generate(model: model, prompt: text, stream: false)
  res  = Net::HTTP.post(uri, body, "Content-Type" => "application/json")
  JSON.parse(res.body)["response"].to_s.strip
rescue StandardError => e
  warn "[llm] #{e.class}: #{e.message}"
  nil
end
```

### Proposed Pipeline

```text
Downloaded PDF (local_document_url)
    │
    ▼
Extract text (pdftotext CLI or pdf-reader gem)
    │
    ▼
Prompt LLM → application_type string
    │
    ▼
DB.upsert / UPDATE da_* SET application_type = ?
```

### Suggested Prompt

```text
You are classifying a Tasmanian planning development application.
Read the following text and return ONLY the single most appropriate
application type from this list:
  Residential, Commercial, Industrial, Subdivision, Rural/Agriculture,
  Tourism/Visitor Accommodation, Outbuilding/Shed, Change of Use,
  Demolition, Signage, Other

Text:
<first 1500 characters of PDF text>

Reply with the type only. No explanation.
```

### Schema Changes Needed

```sql
-- Add to ensure_table! and as a new migration:
application_type VARCHAR(60) NULL   -- e.g. "Residential", "Subdivision"
application_type_raw TEXT NULL      -- full LLM response for debugging
application_type_at DATETIME NULL   -- when classification was last run
```

### Implementation Options

**Option A — Inline during scrape** (simplest):

- Each scraper that downloads PDFs calls `llm_classify` immediately after download
- Adds latency to each scrape run (LLM inference per PDF)
- Suitable if the LLM is fast (< 5s per classification)

**Option B — Backfill tool** (recommended):

- New script `tools/classify_pdfs.rb` — iterates rows where `local_document_url IS NOT NULL AND application_type IS NULL`
- Run separately from `run_all.sh`, on demand or on a cron
- Supports `ONLY_TABLE` env var to process one council at a time
- Safer — scrape failures don't block classification; can re-run without re-scraping

**Option C — PHP tool in web container**:

- New `tools/classify_pdfs.php` using the existing `lib/llm.php`
- Reads PDFs from `/srv/files`, calls `llmGenerate`, updates DB
- Advantage: reuses the already-written PHP LLM helper
- Disadvantage: PDF text extraction harder in PHP (needs `pdftotext` shell call or a PHP PDF lib)

### PDF Text Extraction

`pdftotext` (part of `poppler-utils`) is the most reliable option:

```ruby
def extract_pdf_text(local_path, max_chars: 2000)
  # local_path is relative like "/files/northernmidlands/PLN-26-0030/doc.pdf"
  # Map to filesystem path inside container
  fs_path = local_path.sub(%r{\A/files/}, "#{ENV.fetch('DOWNLOAD_DIR', '/app/downloads')}/")
  return nil unless File.exist?(fs_path)

  text, = Open3.capture2("pdftotext", "-l", "3", fs_path, "-")
  text.to_s.gsub(/\s+/, " ").strip[0, max_chars]
rescue StandardError => e
  warn "[classify] pdftotext failed for #{fs_path}: #{e.message}"
  nil
end
```

`pdftotext` may need to be installed in the scraper Dockerfile:

```dockerfile
RUN apt-get install -y poppler-utils
```

### Key Decisions Before Implementation

1. **Option A vs B vs C** — inline vs backfill tool vs PHP
2. **Which model** — any Ollama model on the local server (check with `curl http://192.168.8.73:11434/api/tags`)
3. **Prompt language** — zero-shot classification vs few-shot examples; JSON output vs plain text
4. **Confidence threshold** — store raw LLM response for auditing? Flag low-confidence results?
5. **Re-classification** — should existing `application_type` values be overwritten on re-run, or treated as write-once?
6. **Dockerfile change** — confirm `poppler-utils` can be added to the scraper image