2 months ago · 3642c1be2a
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -9,7 +9,7 @@ This is a scraping pipeline that collects Tasmanian planning development applica
 ## Key Files
 | File | Role |
-|---|---|
+| --- | --- |
 | `lib/db.rb` | DB client, `ensure_table!`, `upsert` (dynamic columns, write-once semantics) |
 | `lib/http.rb` | HTTP client — retries, cookie jar, browser-fingerprint headers, 403/406 warmup, curl fallback |
 | `lib/geocode.rb` | Google Maps geocoding with SHA1 cache in `geo_cache` table |
@@ -23,6 +23,7 @@ This is a scraping pipeline that collects Tasmanian planning development applica
 | `scrapers/*.rb` | One scraper per council — parses HTML, upserts rows, calls `enrich_after_upsert!` |
 | `web/index.php` | Search portal — dynamic UNION across all `da_*` tables |
 | `tools/send_summary_email.rb` | Sends HTML error-summary email via SMTP (called by `run_all.sh` when any scraper ERRORs) |
+| `tools/classify_pdfs.rb` | LLM PDF classification backfill — sets `application_type` on rows with a downloaded PDF |
 | `tools/backfill_geocode.rb` | Batch geocode backfill for existing rows (supports `ONLY_TABLE`, `DRY_RUN`) |
 ---
@@ -58,7 +59,8 @@ docker compose run --rm \
 ## Architecture Conventions
-### Each scraper follows this pattern:
+### Each scraper follows this pattern
+
 1. `TABLE = ENV.fetch("TABLE_NAME")` — set by `run_all.sh` from the filename
 2. `DB.ensure_table!(TABLE)` — idempotent schema setup (all columns already included)
 3. Fetch HTML via `Http.get(url)` (handles retries, cookies, WAF warmup, browser-fingerprint headers)
@@ -66,7 +68,8 @@ docker compose run --rm \
 5. `DB.upsert(TABLE, row)` — upserts on `(council_reference, address)`, write-once for `date_received`
 6. `enrich_after_upsert!(table:, council_reference:, address:)` — geocodes and enriches
-### WAF / Cloudflare handling:
+### WAF / Cloudflare handling
+
 - `lib/http.rb` sends a full browser fingerprint on every request: `User-Agent`, `sec-ch-ua*`, `Sec-Fetch-*`, `Upgrade-Insecure-Requests`. This satisfies most WAF header checks automatically.
 - For sites that also need a **warm cookie state** (e.g. Burnie, King Island, Latrobe, Derwent Valley), the scraper implements a proactive homepage warmup before fetching the target page — see `burnie.rb` as the reference implementation.
 - Some councils (Kentish, Derwent Valley via direct site) use a Cloudflare JS challenge, which cannot be solved without a real browser. These exit cleanly with a warning. Where a PlanBuild equivalent exists (council code in `COUNCIL_MAP`), data is still collected via `planbuild.rb`.
@@ -119,6 +122,7 @@ docker compose run --rm \
 ## Adding or Modifying a Scraper
 When a council changes its website markup, only that scraper needs updating. The typical failure modes are:
+
 - `Found 0 rows` — CSS selector no longer matches; inspect the live page and update the selector
 - HTTP 403/406 — Council site added WAF; check `Http.get` options or add a proactive warmup step (see `burnie.rb`)
 - Cloudflare JS challenge (`"Just a moment"` in body) — cannot be solved in Ruby; exit cleanly with a warning
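The three failure modes listed above can be told apart with a small triage helper. The following is a minimal sketch, not code from the repository: `cloudflare_challenge?` and `classify_failure` are hypothetical names, and the only signals used are the ones the list itself mentions (the `"Just a moment"` marker, HTTP 403/406, and a zero row count).

```ruby
# Hypothetical triage helper; names are illustrative and not taken from lib/http.rb.

# True when the body looks like Cloudflare's JS-challenge interstitial.
def cloudflare_challenge?(body)
  body.to_s.include?("Just a moment")
end

# Maps one fetch/parse outcome to one of the failure modes in the list above.
def classify_failure(status:, body:, row_count:)
  return :cloudflare_challenge if cloudflare_challenge?(body) # exit cleanly with a warning
  return :waf_blocked if [403, 406].include?(status)          # try a warmup step
  return :selector_mismatch if row_count.zero?                # update the CSS selector
  :ok
end
```

The challenge check runs first because a challenge page would otherwise also look like a page with zero matching rows.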
														
@@ -316,12 +320,12 @@ end
 RUN apt-get install -y poppler-utils
 ```
-### Key Decisions Before Implementation
-
-1. **Option A vs B vs C** — inline vs backfill tool vs PHP
-2. **Which model** — any Ollama model on the local server (check with `curl http://192.168.8.73:11434/api/tags`)
-3. **Prompt language** — zero-shot classification vs few-shot examples; JSON output vs plain text
-4. **Confidence threshold** — store raw LLM response for auditing? Flag low-confidence results?
-5. **Re-classification** — should existing `application_type` values be overwritten on re-run, or treated as write-once?
-6. **Dockerfile change** — confirm `poppler-utils` can be added to the scraper image
+### Implementation
+- **Approach**: Option B (backfill tool) — `tools/classify_pdfs.rb` runs independently of `run_all.sh`
+- **Model**: `llama3.2` (3B, fast) by default; override with `LLM_MODEL` env var
+- **Prompt**: Zero-shot, plain-text response (no JSON overhead for a fixed classification list)
+- **Raw response**: Always stored in `application_type_raw` for auditing
+- **Re-classification**: Write-once by default; set `RECLASSIFY=1` to overwrite
+- **PDF extraction**: `pdftotext -l 3` (first 3 pages, from `poppler-utils` in Dockerfile)
+- **Qwen3 note**: For models that emit reasoning output (e.g. Qwen3), the tool strips `<think>...</think>` blocks before normalising the response
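The last bullet describes stripping reasoning output and then normalising a plain-text reply. A minimal sketch of that step follows. The helper names `strip_reasoning` and `normalise_type` are hypothetical, the type list is the one given in the README part of this commit, and falling back to `Other` for unrecognised replies is an assumption rather than confirmed behaviour of `tools/classify_pdfs.rb`.

```ruby
# Illustrative only: helper names and the "Other" fallback are assumptions.
APPLICATION_TYPES = [
  "Residential", "Commercial", "Industrial", "Subdivision",
  "Rural/Agriculture", "Tourism/Visitor Accommodation", "Outbuilding/Shed",
  "Change of Use", "Demolition", "Signage", "Other"
].freeze

# Remove <think>...</think> blocks; the m flag lets "." span newlines.
def strip_reasoning(raw)
  raw.gsub(%r{<think>.*?</think>}mi, "").strip
end

# Map a free-text reply onto one known type.
def normalise_type(raw)
  reply = strip_reasoning(raw).downcase
  # When several labels appear in the reply, prefer the longest one.
  match = APPLICATION_TYPES
          .select { |type| reply.include?(type.downcase) }
          .max_by(&:length)
  match || "Other"
end
```

Stripping before matching matters because a reasoning block may mention several candidate types that the model then rejects.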
														
--- a/README.md
+++ b/README.md
@@ -77,6 +77,15 @@ Copy `.env.example` to `.env` and fill in the values. **Never commit `.env`.**
 | `SCRAPE_EVERY_MINUTES` | No | If set, the scraper loops on this interval (default: run once) |
 | `DOWNLOAD_ATTACHMENTS` | No | Set to `1` to download PDF attachments |
 | `DOWNLOAD_DIR` | No | Host path for downloaded PDFs (default: `/app/downloads`) |
+| `LLAMA_URL` | No | Base URL of local Ollama instance for PDF classification (default: `http://192.168.8.73:11434`) |
+| `LLM_MODEL` | No | Ollama model name for PDF classification (default: `llama3.2`) |
+| `SMTP_HOST` | No | SMTP server for error summary emails |
+| `SMTP_PORT` | No | SMTP port (default: `587`) |
+| `SMTP_USERNAME` | No | SMTP username |
+| `SMTP_PASSWORD` | No | SMTP password |
+| `SMTP_SMTPSecure` | No | `tls` or `ssl` (default: `tls`) |
+| `SMTP_SENTFROM` | No | Sender email address |
+| `SMTP_ADDADDRESS` | No | Recipient email address |
 | `DEBUG` | No | Set to `1` for verbose scraper output |
 | `DRY_RUN` | No | Set to `1` to parse without writing to the DB |
 | `ENRICH_DEBUG` | No | Set to `1` for verbose geocode/lookup output |
@@ -150,14 +159,24 @@ Every `da_*` table shares the same base schema:
 | `on_notice_to` | `DATE` | Public comment close date |
 | `applicant` | `VARCHAR(255)` | |
 | `document_url` | `TEXT` | Remote PDF URL |
-| `local_document_url` | `TEXT` | Downloaded PDF path (relative to `/downloads`) |
+| `local_document_url` | `TEXT` | Downloaded PDF path (served via `/files/`) |
+| `documents_json` | `MEDIUMTEXT` | JSON array of `{name, url, local_url}` — multi-doc DAs (e.g. Launceston) |
 | `address_std` | `VARCHAR(255)` | Google-normalised address |
 | `lat` / `lng` | `DECIMAL(10,7)` | Geocoded coordinates |
 | `property_id` | `TEXT` | Land title PID |
 | `title_reference` | `TEXT` | Certificate of title reference |
+| `application_type` | `VARCHAR(60)` | LLM-classified type (e.g. `Residential`, `Subdivision`) |
+| `application_type_raw` | `TEXT` | Raw LLM response (for auditing) |
+| `application_type_at` | `DATETIME` | When classification was last run |
+| `status` | `VARCHAR(100)` | Application status (Launceston eProperty) |
+| `assigned_officer` | `VARCHAR(255)` | Assigned planning officer (Launceston) |
+| `category` | `VARCHAR(100)` | Application category (Launceston) |
+| `application_valid` | `DATE` | Date application was deemed valid (Launceston) |
+| `advertised_on` | `DATE` | Date first advertised (Launceston) |
+| `property_legal_description` | `TEXT` | Certificate of title / legal description (Launceston) |
 | `created_at` / `updated_at` | `DATETIME` | |
-Rows are upserted on `(council_reference, address)`. Some fields are **write-once** (e.g. `date_received`) — the first value is kept on subsequent scrapes.
+Rows are upserted on `(council_reference, address)`. Some fields are **write-once** (e.g. `date_received`, `document_url`) — the first value is kept on subsequent scrapes.
 ---
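The upsert rule stated in this hunk (key on `(council_reference, address)`, first value kept for write-once fields) can be modelled in memory as below. This is a sketch of the semantics only, not the SQL in `lib/db.rb`: `WRITE_ONCE`, `merge_row`, and `upsert!` are hypothetical names, and letting a later scrape fill a write-once field that was still `nil` is an assumption.

```ruby
# In-memory model of the documented upsert semantics; not the real DB client.
WRITE_ONCE = %i[date_received document_url].freeze

# Merge an incoming row into the stored one, keeping the first non-nil
# value of every write-once column.
def merge_row(existing, incoming)
  return incoming.dup if existing.nil?

  incoming.each_with_object(existing.dup) do |(column, value), merged|
    next if WRITE_ONCE.include?(column) && !merged[column].nil?

    merged[column] = value
  end
end

# Upsert keyed on (council_reference, address); store is a plain Hash.
def upsert!(store, row)
  key = [row.fetch(:council_reference), row.fetch(:address)]
  store[key] = merge_row(store[key], row)
end
```

For example, scraping the same application twice with a changed `date_received` leaves the original date in place while other columns take the newer values.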
														
@@ -199,7 +218,7 @@ docker compose run --rm \
 For sites that additionally require a **warm cookie state**, the scraper does a proactive homepage GET before fetching the target URL. See `burnie.rb` for the reference implementation of this pattern (custom `CookieJar` class + `http_get_with_cookies`). Scrapers using this pattern: `burnie.rb`, `kingisland.rb`, `latrobe.rb`, `derwentvalley.rb`.
-**Cloudflare JS challenge** (the `"Just a moment..."` interstitial) cannot be solved from a Docker/server environment regardless of headers, because Cloudflare's bot score is based on the client IP reputation. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`. These scrapers detect the challenge, log a warning, and exit cleanly.
+**Cloudflare JS challenge** (the `"Just a moment..."` interstitial) cannot be solved from a Docker/server environment regardless of headers, because Cloudflare's bot score is based on the client IP reputation. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`, `kentish.tas.gov.au`, `centralhighlands.tas.gov.au`. These scrapers detect the challenge, log a warning, and exit cleanly. Where a PlanBuild equivalent exists, data is still collected via `planbuild.rb`.
 ---
@@ -216,11 +235,46 @@ For sites that additionally require a **warm cookie state**, the scraper does a
 ---
+## PDF Classification (LLM)
+
+After PDFs are downloaded, `tools/classify_pdfs.rb` extracts text from each PDF using `pdftotext` and sends it to a local Ollama instance to classify the application type.
+
+**Application types:** Residential, Commercial, Industrial, Subdivision, Rural/Agriculture, Tourism/Visitor Accommodation, Outbuilding/Shed, Change of Use, Demolition, Signage, Other
+
+```bash
+# Classify all unclassified PDFs (dry run first)
+docker compose run --rm -e DRY_RUN=1 scraper ruby /app/tools/classify_pdfs.rb
+
+# Run for real
+docker compose run --rm scraper ruby /app/tools/classify_pdfs.rb
+
+# Single council
+docker compose run --rm -e ONLY_TABLE=da_northernmidlands scraper ruby /app/tools/classify_pdfs.rb
+
+# Re-classify existing (overwrite)
+docker compose run --rm -e RECLASSIFY=1 scraper ruby /app/tools/classify_pdfs.rb
+
+# Use a different model
+docker compose run --rm -e LLM_MODEL=gemma3 scraper ruby /app/tools/classify_pdfs.rb
+```
+
+Results are written to `application_type`, `application_type_raw` (full LLM response for auditing), and `application_type_at` (timestamp). The web portal displays the type as a badge and supports filtering by type.
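For readers who want to see what a classification request to Ollama might look like, here is a hedged sketch that only builds the HTTP request and does not send it. It assumes Ollama's `/api/generate` endpoint with `stream: false`. The prompt wording and the `build_classify_request` helper are illustrative and are not taken from `tools/classify_pdfs.rb`.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical helper: builds (but does not send) a zero-shot classification
# request. Defaults mirror the LLAMA_URL and LLM_MODEL settings in the README.
def build_classify_request(text, types:,
                           base_url: ENV.fetch("LLAMA_URL", "http://192.168.8.73:11434"),
                           model: ENV.fetch("LLM_MODEL", "llama3.2"))
  # Illustrative prompt; the real tool's wording is not shown in this commit.
  prompt = <<~PROMPT
    Classify this planning application into exactly one of: #{types.join(', ')}.
    Reply with the type only.

    #{text}
  PROMPT

  uri = URI.join(base_url, "/api/generate")
  request = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
  request.body = JSON.generate(model: model, prompt: prompt, stream: false)
  [uri, request]
end
```

Sending it would be a matter of `Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }` and reading the `response` field of the returned JSON, which is the value a tool like this would store as the raw reply.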
														
 
+
+---
+
+## Error Summary Emails
+
+When any scraper exits with an error, `run_all.sh` automatically calls `tools/send_summary_email.rb` to send an HTML summary email if `SMTP_HOST` is configured in `.env`. The email contains a colour-coded table of all scrapers with their saved counts and error status.
+
+---
+
 ## Tools
 | Script | Purpose |
 |---|---|
 | `tools/backfill_geocode.rb` | Batch geocode backfill for existing rows (supports `ONLY_TABLE`, `DRY_RUN`) |
+| `tools/classify_pdfs.rb` | LLM classification of downloaded PDFs — sets `application_type` on each row |
+| `tools/send_summary_email.rb` | Sends HTML error-summary email via SMTP — called by `run_all.sh` on ERROR |
 | `tools/backfill_dorset_docs.rb` | Backfill PDF links for Dorset rows |
 | `tools/import_sqlites.rb` | Import data from legacy SQLite exports |
 | `planbuild_fetch.js` | Playwright-based scraper for the PlanBuild TAS portal |
@@ -238,7 +292,8 @@ tas_councils/
 │   ├── enrich.rb         # Post-upsert enrichment pipeline
 │   ├── util.rb           # Date parsing, council/table name mappings
 │   ├── scraper_helpers.rb# Shared helpers: abs_url, text_or, upsert_and_enrich!
-│   └── migrate.rb        # Sequential schema migration runner
+│   ├── migrate.rb        # Sequential schema migration runner
+│   └── llm.php           # LLM inference helper for PHP (llama-swap + Ollama)
 ├── scrapers/             # One .rb file per council
 ├── web/                  # PHP search portal (Apache)
 ├── tools/                # Standalone backfill and migration scripts