# CLAUDE.md — Project Guide for Claude Code ## What This Project Does This is a scraping pipeline that collects Tasmanian planning development applications (DAs) from 29 council websites, geocodes them, and serves them via a PHP search portal. The code is a mix of Ruby (scrapers, enrichment), PHP (web portal), and shell (orchestration), all wired together with Docker Compose. --- ## Key Files | File | Role | |---|---| | `lib/db.rb` | DB client, `ensure_table!`, `upsert` (dynamic columns, write-once semantics) | | `lib/http.rb` | HTTP client — retries, cookie jar, browser-fingerprint headers, 403/406 warmup, curl fallback | | `lib/geocode.rb` | Google Maps geocoding with SHA1 cache in `geo_cache` table | | `lib/enrich.rb` | `enrich_after_upsert!` — geocoding + property lookup after each DB write | | `lib/util.rb` | `parse_aus_date`, council-name/table-name mappings | | `lib/scraper_helpers.rb` | Shared helpers: `abs_url`, `text_or`, `upsert_and_enrich!` | | `lib/migrate.rb` | Sequential schema migration runner — add new migrations at end of `MIGRATIONS` array | | `lib/llm.php` | LLM inference helper for PHP — calls Ollama-compatible API (llama-swap primary, Ollama fallback) | | `run_all.sh` | Discovers `scrapers/*.rb`, filters by `ONLY`/`SKIP`, runs each with `TABLE_NAME` set; prints summary table; emails on error | | `entrypoint.sh` | Docker entry; waits for DB then runs `run_all.sh` (looping if `SCRAPE_EVERY_MINUTES` is set) | | `scrapers/*.rb` | One scraper per council — parses HTML, upserts rows, calls `enrich_after_upsert!` | | `web/index.php` | Search portal — dynamic UNION across all `da_*` tables | | `tools/send_summary_email.rb` | Sends HTML error-summary email via SMTP (called by `run_all.sh` when any scraper ERRORs) | | `tools/backfill_geocode.rb` | Batch geocode backfill for existing rows (supports `ONLY_TABLE`, `DRY_RUN`) | --- ## Running Things Locally ```bash # Full stack docker compose up -d # Run all scrapers once docker compose run --rm scraper /app/run_all.sh # Run a single scraper TABLE_NAME=da_brighton DEBUG=1 ruby scrapers/brighton.rb # Run a subset ONLY=meandervalley,westtamar docker compose run --rm scraper /app/run_all.sh # Geocode backfill (batch, all tables) docker compose run --rm \ -e GOOGLE_MAPS_API_KEY="..." \ scraper ruby /app/tools/backfill_geocode.rb # Geocode backfill (single table) docker compose run --rm \ -e GOOGLE_MAPS_API_KEY="..." \ -e ONLY_TABLE=da_brighton \ scraper ruby /app/tools/backfill_geocode.rb ``` --- ## Architecture Conventions ### Each scraper follows this pattern: 1. `TABLE = ENV.fetch("TABLE_NAME")` — set by `run_all.sh` from the filename 2. `DB.ensure_table!(TABLE)` — idempotent schema setup (all columns already included) 3. Fetch HTML via `Http.get(url)` (handles retries, cookies, WAF warmup, browser-fingerprint headers) 4. Parse with Nokogiri 5. `DB.upsert(TABLE, row)` — upserts on `(council_reference, address)`, write-once for `date_received` 6. `enrich_after_upsert!(table:, council_reference:, address:)` — geocodes and enriches ### WAF / Cloudflare handling: - `lib/http.rb` sends a full browser fingerprint on every request: `User-Agent`, `sec-ch-ua*`, `Sec-Fetch-*`, `Upgrade-Insecure-Requests`. This satisfies most WAF header checks automatically. - For sites that also need a **warm cookie state** (e.g. Burnie, King Island, Latrobe, Derwent Valley), the scraper implements a proactive homepage warmup before fetching the target page — see `burnie.rb` as the reference implementation. - Some councils (Kentish, Derwent Valley via direct site) use Cloudflare JS challenge which cannot be solved without a real browser. These exit cleanly with a warning. Where a PlanBuild equivalent exists (council code in `COUNCIL_MAP`), data is still collected via `planbuild.rb`. - The warmup pattern (custom `CookieJar` + `http_get` with redirect handling) is self-contained in scrapers that need it and does **not** depend on `lib/http.rb`. ### PDF Downloads - Only happen when `DOWNLOAD_ATTACHMENTS=1` (set in `docker-compose.yml` or at runtime) - Files land in `DOWNLOAD_DIR///filename.pdf` inside the container - The web container mounts the same folder at `/srv/files` and Apache serves it via `Alias /files /srv/files` - **`local_document_url` must be stored as `/files//...`** — not `/downloads/...`. The Apache alias is `/files`, not `/downloads`. - The web portal prefers `local_document_url` over `document_url` when rendering the document button - For multi-document DAs (e.g. Launceston), all docs are stored as JSON in `documents_json` and rendered as a list of buttons in the portal ### Write-once fields (in `DB.upsert`) - `date_received` — never overwritten once set - `date_received_raw` — never overwritten once non-blank - `document_url` / `local_document_url` — new value only replaces if existing is NULL ### Table names - Always derived from the scraper filename: `scrapers/foo.rb` → `da_foo` - `run_all.sh` sets `TABLE_NAME=da_` before invoking each scraper - The `COUNCIL_MAP` in `lib/util.rb` maps internal council keys to table names (used by PlanBuild integration) ### run_all.sh summary table - After all scrapers finish, prints a formatted table: Council | Saved | Warns | Status - Status values: `ok`, `warn`, `blocked` (Cloudflare), `ERROR` (non-zero exit) - Saved count: parsed from scraper stdout — looks for `"Saved N"` (case-insensitive) first, falls back to counting `"Upserted"` lines - All scrapers should end with `puts "Done #{TABLE}. Saved #{n} item(s)."` for correct summary parsing - If any scraper has ERROR status and `SMTP_HOST` is set, `tools/send_summary_email.rb` sends an HTML summary email --- ## Error Handling Conventions - **URI building** (`URI.join`, `URI.parse`) → `rescue URI::InvalidURIError` - **DB operations** (prepare/execute) → `rescue Mysql2::Error => e; Log.warn ...` - **Zlib decompression** → `rescue Zlib::Error` - **Date parsing** (`Date.strptime`, `Date.parse`) → `rescue ArgumentError, Date::Error` - **JSON parsing** → `rescue JSON::ParserError` - **Network/HTTP** → `rescue Net::HTTPError, Net::ReadTimeout, Net::OpenTimeout, OpenSSL::SSL::SSLError, Errno::ECONNRESET, EOFError` - **Enrichment failures** always `warn` to stderr — do not gate them behind `ENRICH_DEBUG` - **No bare `rescue`** — always specify the exception class(es) --- ## Adding or Modifying a Scraper When a council changes its website markup, only that scraper needs updating. The typical failure mode is: - `Found 0 rows` — CSS selector no longer matches; inspect the live page and update the selector - HTTP 403/406 — Council site added WAF; check `Http.get` options or add a proactive warmup step (see `burnie.rb`) - Cloudflare JS challenge (`"Just a moment"` in body) — cannot be solved in Ruby; exit cleanly with a warning - `date_received` all nil — Date format changed; update the format string passed to `Util.parse_aus_date` or `Date.strptime` **Template choice:** - Simple HTML list/table with one entry per row → copy `glamorgan.rb` - Single page, entries grouped under `

` headings → copy `northernmidlands.rb` - Single page, entries under `

` with labeled `` fields + PDF in `
    ` → copy `westtamar.rb` - Link/PDF listing → copy `centralhighlands.rb` - WAF-protected site needing homepage warmup → copy `kingisland.rb` (minimal) or `burnie.rb` (full-featured with PDF download) - Multi-hop redirect to detail pages → copy `derwentvalley.rb` The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everything else. --- ## Database Notes - MariaDB 10.11, `utf8mb4` encoding throughout - Schema is created on-the-fly — `CREATE TABLE IF NOT EXISTS` + `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` - Schema changes go in `lib/migrate.rb` (new migration at end of `MIGRATIONS` array) or `lib/db.rb` (`ensure_table!`) for columns every new table gets - The `geo_cache` table stores geocoding results keyed by SHA1 of the normalised query string — avoids redundant Google API calls - The `UNIQUE KEY uniq_ref_addr (council_reference, address)` constraint drives the upsert behaviour - Current migration versions: v1 (enrichment/geocode columns), v2 (geo_cache table), v3 (documents_json), v4 (Launceston detail columns), v5 (rewrite /downloads/ → /files/ in local_document_url), v6 (LLM classification columns) ### Schema — notable columns added beyond base | Column | Type | Notes | | --- | --- | --- | | `documents_json` | MEDIUMTEXT | JSON array of `{name, url, local_url}` — used when a DA has multiple PDFs (e.g. Launceston) | | `status` | VARCHAR(100) | Application status text (Launceston eProperty) | | `assigned_officer` | VARCHAR(255) | Assigned planning officer (Launceston) | | `group` | VARCHAR(100) | Application group (Launceston) — reserved SQL word, always quoted | | `category` | VARCHAR(100) | Application category (Launceston) | | `application_valid` | DATE | Date application deemed valid (Launceston) | | `advertised_on` | DATE | Date first advertised (Launceston) | | `property_legal_description` | TEXT | Certificate of Title / legal description (Launceston) | --- ## Web Portal Notes - `web/index.php` dynamically discovers all `da_*` tables and builds a UNION query - It handles missing columns gracefully (not all tables have every column) - Document display: if `documents_json` is present → renders a button per document using the name from JSON; otherwise falls back to single "Open document" button using `local_document_url` → `document_url` - `web/backfill_pid_title.php` is a legacy admin tool — it should not be publicly accessible --- ## Email Summary `tools/send_summary_email.rb` is called by `run_all.sh` when any scraper exits with ERROR status. It: - Reads SMTP config from env vars: `SMTP_HOST`, `SMTP_PORT`, `SMTP_USERNAME`, `SMTP_PASSWORD`, `SMTP_SMTPSecure` (`tls`/`ssl`), `SMTP_SENTFROM`, `SMTP_ADDADDRESS` - Uses Ruby stdlib `net/smtp` — no gems required - Sends multipart (plain + HTML) email with colour-coded summary table - Silently skips if `SMTP_HOST` is not set --- ## Common Gotchas - **`TABLE` constant conflicts**: Each scraper defines `TABLE = ENV.fetch("TABLE_NAME")` at the top level. If you `require` two scrapers in the same Ruby process you'll get a constant redefinition warning. Each scraper is designed to be run as a standalone script. - **`COUNCIL_FILTER` env var**: Used only by `scrapers/hobartcity.rb` to filter which councils to scrape from the Hobart eProperty portal. It has no effect on any other scraper or on `run_all.sh`. To run a subset of scrapers, use `ONLY` / `SKIP` in `run_all.sh`. - **PlanBuild scrapers**: `planbuild.rb` handles councils on the state-run PlanBuild portal. It writes to per-council tables using `Util.ref_to_table`. These run alongside the council-specific scrapers. - **PDF download path**: `local_document_url` must begin with `/files/` (not `/downloads/`). The Apache alias in `web/000-files.conf` is `Alias /files /srv/files`. Using `/downloads/` results in 404 in the web portal. - **Binary PDF downloads**: Pass `headers: { "Accept" => "application/pdf,*/*", "Referer" => URL }` to `Http.get` when downloading PDFs from CDN subdomains — some CDNs reject requests without a valid referrer. - **Non-ASCII in PDF URLs**: Some council sites embed Unicode characters (e.g. en-dash `–`) directly in PDF filenames. Always percent-encode hrefs before passing to `URI.join` — see `burnie.rb` `first_pdf_on_detail` for the pattern. - **Redirect loops in `Net::HTTP.start` blocks**: `next` inside a `Net::HTTP.start` block exits the block, not the enclosing `while` loop. Use a `redirect_to` variable set inside the block and call `next` on the `while` loop after the block returns — see `burnie.rb` `http_get_with_cookies`. - **Cloudflare JS challenge vs IP block**: A JS challenge (`"Just a moment"`) may work from a residential IP but always block from a datacenter/Docker IP. Detect it and exit cleanly. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`, `kentish.tas.gov.au`, `centralhighlands.tas.gov.au`. Where a PlanBuild equivalent exists (council code in `COUNCIL_MAP`), data is still collected via `planbuild.rb`. - **`group` column**: This is a reserved SQL word. In `DB.upsert` it is safe because all column names are backtick-quoted. In raw SQL always write `` `group` ``. --- ## Next Phase — LLM-Based PDF Classification ### Goal Extract structured information from downloaded DA PDFs using a local LLaMA model — primarily **application type** (Residential, Commercial, Industrial, Subdivision, etc.) but potentially other fields not reliably scraped from HTML (e.g. lot size, number of dwellings, value of works). ### LLM Infrastructure A local Ollama instance is running at `http://192.168.8.73:11434` (env var: `LLAMA_URL`). `lib/llm.php` (already in the repo) shows the integration pattern for PHP: - Primary backend: llama-swap via OpenAI-compatible `/v1/chat/completions` - Fallback: Ollama `/api/generate` - Config loaded from `config/ai.php` — `LLAMACPP_HOST`, `OLLAMA_HOST`, `LLAMACPP_MODEL`, `OLLAMA_MODEL`, etc. For the Ruby scraper pipeline the equivalent is a direct Ollama HTTP call (no gems needed — stdlib `net/http`): ```ruby # Minimal Ollama call — POST to /api/generate require "net/http" require "json" def llm_classify(text, model: "llama3.2") uri = URI("#{ENV.fetch('LLAMA_URL', 'http://192.168.8.73:11434')}/api/generate") body = JSON.generate(model: model, prompt: text, stream: false) res = Net::HTTP.post(uri, body, "Content-Type" => "application/json") JSON.parse(res.body)["response"].to_s.strip rescue StandardError => e warn "[llm] #{e.class}: #{e.message}" nil end ``` ### Proposed Pipeline ```text Downloaded PDF (local_document_url) │ ▼ Extract text (pdftotext CLI or pdf-reader gem) │ ▼ Prompt LLM → application_type string │ ▼ DB.upsert / UPDATE da_* SET application_type = ? ``` ### Suggested Prompt ```text You are classifying a Tasmanian planning development application. Read the following text and return ONLY the single most appropriate application type from this list: Residential, Commercial, Industrial, Subdivision, Rural/Agriculture, Tourism/Visitor Accommodation, Outbuilding/Shed, Change of Use, Demolition, Signage, Other Text: Reply with the type only. No explanation. ``` ### Schema Changes Needed ```sql -- Add to ensure_table! and as a new migration: application_type VARCHAR(60) NULL -- e.g. "Residential", "Subdivision" application_type_raw TEXT NULL -- full LLM response for debugging application_type_at DATETIME NULL -- when classification was last run ``` ### Implementation Options **Option A — Inline during scrape** (simplest): - Each scraper that downloads PDFs calls `llm_classify` immediately after download - Adds latency to each scrape run (LLM inference per PDF) - Suitable if the LLM is fast (< 5s per classification) **Option B — Backfill tool** (recommended): - New script `tools/classify_pdfs.rb` — iterates rows where `local_document_url IS NOT NULL AND application_type IS NULL` - Run separately from `run_all.sh`, on demand or on a cron - Supports `ONLY_TABLE` env var to process one council at a time - Safer — scrape failures don't block classification; can re-run without re-scraping **Option C — PHP tool in web container**: - New `tools/classify_pdfs.php` using the existing `lib/llm.php` - Reads PDFs from `/srv/files`, calls `llmGenerate`, updates DB - Advantage: reuses the already-written PHP LLM helper - Disadvantage: PDF text extraction harder in PHP (needs `pdftotext` shell call or a PHP PDF lib) ### PDF Text Extraction `pdftotext` (part of `poppler-utils`) is the most reliable option: ```ruby def extract_pdf_text(local_path, max_chars: 2000) # local_path is relative like "/files/northernmidlands/PLN-26-0030/doc.pdf" # Map to filesystem path inside container fs_path = local_path.sub(%r{\A/files/}, "#{ENV.fetch('DOWNLOAD_DIR', '/app/downloads')}/") return nil unless File.exist?(fs_path) text, = Open3.capture2("pdftotext", "-l", "3", fs_path, "-") text.to_s.gsub(/\s+/, " ").strip[0, max_chars] rescue StandardError => e warn "[classify] pdftotext failed for #{fs_path}: #{e.message}" nil end ``` `pdftotext` may need to be installed in the scraper Dockerfile: ```dockerfile RUN apt-get install -y poppler-utils ``` ### Key Decisions Before Implementation 1. **Option A vs B vs C** — inline vs backfill tool vs PHP 2. **Which model** — any Ollama model on the local server (check with `curl http://192.168.8.73:11434/api/tags`) 3. **Prompt language** — zero-shot classification vs few-shot examples; JSON output vs plain text 4. **Confidence threshold** — store raw LLM response for auditing? Flag low-confidence results? 5. **Re-classification** — should existing `application_type` values be overwritten on re-run, or treated as write-once? 6. **Dockerfile change** — confirm `poppler-utils` can be added to the scraper image