2 months ago · 3642c1be2a
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -9,7 +9,7 @@ This is a scraping pipeline that collects Tasmanian planning development applica
 ## Key Files
 
 | File | Role |
-|---|---|
+| --- | --- |
 | `lib/db.rb` | DB client, `ensure_table!`, `upsert` (dynamic columns, write-once semantics) |
 | `lib/http.rb` | HTTP client — retries, cookie jar, browser-fingerprint headers, 403/406 warmup, curl fallback |
 | `lib/geocode.rb` | Google Maps geocoding with SHA1 cache in `geo_cache` table |
@@ -23,6 +23,7 @@ This is a scraping pipeline that collects Tasmanian planning development applica
 | `scrapers/*.rb` | One scraper per council — parses HTML, upserts rows, calls `enrich_after_upsert!` |
 | `web/index.php` | Search portal — dynamic UNION across all `da_*` tables |
 | `tools/send_summary_email.rb` | Sends HTML error-summary email via SMTP (called by `run_all.sh` when any scraper ERRORs) |
+| `tools/classify_pdfs.rb` | LLM PDF classification backfill — sets `application_type` on rows with a downloaded PDF |
 | `tools/backfill_geocode.rb` | Batch geocode backfill for existing rows (supports `ONLY_TABLE`, `DRY_RUN`) |
 
 ---
@@ -58,7 +59,8 @@ docker compose run --rm \
 
 ## Architecture Conventions
 
-### Each scraper follows this pattern:
+### Each scraper follows this pattern
+
 1. `TABLE = ENV.fetch("TABLE_NAME")` — set by `run_all.sh` from the filename
 2. `DB.ensure_table!(TABLE)` — idempotent schema setup (all columns already included)
 3. Fetch HTML via `Http.get(url)` (handles retries, cookies, WAF warmup, browser-fingerprint headers)
@@ -66,7 +68,8 @@ docker compose run --rm \
 5. `DB.upsert(TABLE, row)` — upserts on `(council_reference, address)`, write-once for `date_received`
 6. `enrich_after_upsert!(table:, council_reference:, address:)` — geocodes and enriches
 
-### WAF / Cloudflare handling:
+### WAF / Cloudflare handling
+
 - `lib/http.rb` sends a full browser fingerprint on every request: `User-Agent`, `sec-ch-ua*`, `Sec-Fetch-*`, `Upgrade-Insecure-Requests`. This satisfies most WAF header checks automatically.
 - For sites that also need a **warm cookie state** (e.g. Burnie, King Island, Latrobe, Derwent Valley), the scraper implements a proactive homepage warmup before fetching the target page — see `burnie.rb` as the reference implementation.
 - Some councils (Kentish, Derwent Valley via direct site) use Cloudflare JS challenge which cannot be solved without a real browser. These exit cleanly with a warning. Where a PlanBuild equivalent exists (council code in `COUNCIL_MAP`), data is still collected via `planbuild.rb`.
@@ -119,6 +122,7 @@ docker compose run --rm \
 ## Adding or Modifying a Scraper
 
 When a council changes its website markup, only that scraper needs updating. The typical failure mode is:
+
 - `Found 0 rows` — CSS selector no longer matches; inspect the live page and update the selector
 - HTTP 403/406 — Council site added WAF; check `Http.get` options or add a proactive warmup step (see `burnie.rb`)
 - Cloudflare JS challenge (`"Just a moment"` in body) — cannot be solved in Ruby; exit cleanly with a warning
@@ -316,12 +320,12 @@ end
 RUN apt-get install -y poppler-utils
 ```
 
-### Key Decisions Before Implementation
-
-1. **Option A vs B vs C** — inline vs backfill tool vs PHP
-2. **Which model** — any Ollama model on the local server (check with `curl http://192.168.8.73:11434/api/tags`)
-3. **Prompt language** — zero-shot classification vs few-shot examples; JSON output vs plain text
-4. **Confidence threshold** — store raw LLM response for auditing? Flag low-confidence results?
-5. **Re-classification** — should existing `application_type` values be overwritten on re-run, or treated as write-once?
-6. **Dockerfile change** — confirm `poppler-utils` can be added to the scraper image
+### Implementation
 
+- **Approach**: Option B (backfill tool) — `tools/classify_pdfs.rb` runs independently of `run_all.sh`
+- **Model**: `llama3.2` (3B, fast) by default; override with `LLM_MODEL` env var
+- **Prompt**: Zero-shot, plain-text response (no JSON overhead for a fixed classification list)
+- **Raw response**: Always stored in `application_type_raw` for auditing
+- **Re-classification**: Write-once by default; set `RECLASSIFY=1` to overwrite
+- **PDF extraction**: `pdftotext -l 3` (first 3 pages, from `poppler-utils` in Dockerfile)
+- **Qwen3 note**: The tool strips `<think>...</think>` tags from models that produce reasoning output before normalising the response
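Reviewer sketch for the Qwen3 note: the diff does not show the body of `tools/classify_pdfs.rb`, so the snippet below is only one plausible way to strip reasoning tags and then normalise a plain-text reply onto the fixed list. The helper names `strip_think` and `normalise_type`, the substring matching, and the fallback to `Other` are assumptions for illustration; the category list is copied from the README section added in this commit.

```ruby
# Hypothetical helpers; not taken from tools/classify_pdfs.rb.
# The category list mirrors the "Application types" line in the README.
CATEGORIES = [
  "Residential", "Commercial", "Industrial", "Subdivision",
  "Rural/Agriculture", "Tourism/Visitor Accommodation", "Outbuilding/Shed",
  "Change of Use", "Demolition", "Signage", "Other"
].freeze

# Remove any <think>...</think> block; the /m flag lets "." span newlines.
def strip_think(raw)
  raw.to_s.gsub(%r{<think>.*?</think>}mi, "").strip
end

# Map a free-text reply onto the fixed list (assumed rule: first category
# whose name appears in the reply; anything unrecognised becomes "Other").
def normalise_type(raw)
  text = strip_think(raw).downcase
  CATEGORIES.find { |c| c != "Other" && text.include?(c.downcase) } || "Other"
end
```

Under these assumptions, `normalise_type("<think>maybe a shed</think> Change of use.")` yields `Change of Use`, while the unmodified reply would still be what gets stored in `application_type_raw`.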
			
--- a/README.md
+++ b/README.md
@@ -77,6 +77,15 @@ Copy `.env.example` to `.env` and fill in the values. **Never commit `.env`.**
 | `SCRAPE_EVERY_MINUTES` | No | If set, the scraper loops on this interval (default: run once) |
 | `DOWNLOAD_ATTACHMENTS` | No | Set to `1` to download PDF attachments |
 | `DOWNLOAD_DIR` | No | Host path for downloaded PDFs (default: `/app/downloads`) |
+| `LLAMA_URL` | No | Base URL of local Ollama instance for PDF classification (default: `http://192.168.8.73:11434`) |
+| `LLM_MODEL` | No | Ollama model name for PDF classification (default: `llama3.2`) |
+| `SMTP_HOST` | No | SMTP server for error summary emails |
+| `SMTP_PORT` | No | SMTP port (default: `587`) |
+| `SMTP_USERNAME` | No | SMTP username |
+| `SMTP_PASSWORD` | No | SMTP password |
+| `SMTP_SMTPSecure` | No | `tls` or `ssl` (default: `tls`) |
+| `SMTP_SENTFROM` | No | Sender email address |
+| `SMTP_ADDADDRESS` | No | Recipient email address |
 | `DEBUG` | No | Set to `1` for verbose scraper output |
 | `DRY_RUN` | No | Set to `1` to parse without writing to the DB |
 | `ENRICH_DEBUG` | No | Set to `1` for verbose geocode/lookup output |
@@ -150,14 +159,24 @@ Every `da_*` table shares the same base schema:
 | `on_notice_to` | `DATE` | Public comment close date |
 | `applicant` | `VARCHAR(255)` | |
 | `document_url` | `TEXT` | Remote PDF URL |
-| `local_document_url` | `TEXT` | Downloaded PDF path (relative to `/downloads`) |
+| `local_document_url` | `TEXT` | Downloaded PDF path (served via `/files/`) |
+| `documents_json` | `MEDIUMTEXT` | JSON array of `{name, url, local_url}` — multi-doc DAs (e.g. Launceston) |
 | `address_std` | `VARCHAR(255)` | Google-normalised address |
 | `lat` / `lng` | `DECIMAL(10,7)` | Geocoded coordinates |
 | `property_id` | `TEXT` | Land title PID |
 | `title_reference` | `TEXT` | Certificate of title reference |
+| `application_type` | `VARCHAR(60)` | LLM-classified type (e.g. `Residential`, `Subdivision`) |
+| `application_type_raw` | `TEXT` | Raw LLM response (for auditing) |
+| `application_type_at` | `DATETIME` | When classification was last run |
+| `status` | `VARCHAR(100)` | Application status (Launceston eProperty) |
+| `assigned_officer` | `VARCHAR(255)` | Assigned planning officer (Launceston) |
+| `category` | `VARCHAR(100)` | Application category (Launceston) |
+| `application_valid` | `DATE` | Date application was deemed valid (Launceston) |
+| `advertised_on` | `DATE` | Date first advertised (Launceston) |
+| `property_legal_description` | `TEXT` | Certificate of title / legal description (Launceston) |
 | `created_at` / `updated_at` | `DATETIME` | |
 
-Rows are upserted on `(council_reference, address)`. Some fields are **write-once** (e.g. `date_received`) — the first value is kept on subsequent scrapes.
+Rows are upserted on `(council_reference, address)`. Some fields are **write-once** (e.g. `date_received`, `document_url`) — the first value is kept on subsequent scrapes.
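Reviewer sketch of the write-once rule described above, as an in-memory merge rather than SQL. The real logic lives in `lib/db.rb`, which this diff does not show; `WRITE_ONCE`, `merge_row`, and the choice to let a still-empty write-once column be filled later are all assumptions made for illustration.

```ruby
# Hypothetical illustration; lib/db.rb presumably does this in SQL.
# Columns named as write-once in the README sentence above.
WRITE_ONCE = %w[date_received document_url].freeze

# Merge a freshly scraped row into the stored row for the same
# (council_reference, address) key.
def merge_row(existing, incoming)
  return incoming.dup if existing.nil? # first sighting: plain insert

  incoming.each_with_object(existing.dup) do |(col, val), merged|
    # Assumption: a write-once column is only protected once it holds a value.
    keep_first = WRITE_ONCE.include?(col) && !existing[col].nil?
    merged[col] = val unless keep_first
  end
end
```

With this sketch, a second scrape that reports a different `date_received` leaves the stored date untouched while still updating ordinary columns such as `applicant`.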
			
 
				 
			
 
 ---
 
@@ -199,7 +218,7 @@ docker compose run --rm \
 
 For sites that additionally require a **warm cookie state**, the scraper does a proactive homepage GET before fetching the target URL. See `burnie.rb` for the reference implementation of this pattern (custom `CookieJar` class + `http_get_with_cookies`). Scrapers using this pattern: `burnie.rb`, `kingisland.rb`, `latrobe.rb`, `derwentvalley.rb`.
 
-**Cloudflare JS challenge** (the `"Just a moment..."` interstitial) cannot be solved from a Docker/server environment regardless of headers, because Cloudflare's bot score is based on the client IP reputation. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`. These scrapers detect the challenge, log a warning, and exit cleanly.
+**Cloudflare JS challenge** (the `"Just a moment..."` interstitial) cannot be solved from a Docker/server environment regardless of headers, because Cloudflare's bot score is based on the client IP reputation. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`, `kentish.tas.gov.au`, `centralhighlands.tas.gov.au`. These scrapers detect the challenge, log a warning, and exit cleanly. Where a PlanBuild equivalent exists, data is still collected via `planbuild.rb`.
 
 ---
 
@@ -216,11 +235,46 @@ For sites that additionally require a **warm cookie state**, the scraper does a
 
 ---
 
+## PDF Classification (LLM)
+
+After PDFs are downloaded, `tools/classify_pdfs.rb` extracts text from each PDF using `pdftotext` and sends it to a local Ollama instance to classify the application type.
+
+**Application types:** Residential, Commercial, Industrial, Subdivision, Rural/Agriculture, Tourism/Visitor Accommodation, Outbuilding/Shed, Change of Use, Demolition, Signage, Other
+
+```bash
+# Classify all unclassified PDFs (dry run first)
+docker compose run --rm -e DRY_RUN=1 scraper ruby /app/tools/classify_pdfs.rb
+
+# Run for real
+docker compose run --rm scraper ruby /app/tools/classify_pdfs.rb
+
+# Single council
+docker compose run --rm -e ONLY_TABLE=da_northernmidlands scraper ruby /app/tools/classify_pdfs.rb
+
+# Re-classify existing (overwrite)
+docker compose run --rm -e RECLASSIFY=1 scraper ruby /app/tools/classify_pdfs.rb
+
+# Use a different model
+docker compose run --rm -e LLM_MODEL=gemma3 scraper ruby /app/tools/classify_pdfs.rb
+```
+
+Results are written to `application_type`, `application_type_raw` (full LLM response for auditing), and `application_type_at` (timestamp). The web portal displays the type as a badge and supports filtering by type.
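Reviewer sketch of the two steps this section describes (text extraction, then a request to Ollama). It only builds the `pdftotext` argument list and the HTTP request, without running or sending anything. The helper names, the prompt wording, and the use of Ollama's `/api/generate` endpoint with `stream: false` are assumptions; the commit does not show which endpoint `tools/classify_pdfs.rb` actually calls.

```ruby
require "json"
require "net/http"
require "uri"

# Argument vector for `pdftotext -l 3 <file> -`:
# -l sets the last page to convert, and "-" sends the text to stdout.
def pdftotext_argv(pdf_path, last_page: 3)
  ["pdftotext", "-l", last_page.to_s, pdf_path, "-"]
end

# Build (but do not send) a non-streaming generate request.
# base_url and model would come from LLAMA_URL and LLM_MODEL.
def build_ollama_request(text, base_url:, model:, categories:)
  uri = URI.join(base_url, "/api/generate")
  req = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
  req.body = JSON.generate(
    model: model,
    stream: false,
    # Hypothetical zero-shot prompt asking for a plain-text category.
    prompt: "Classify this planning application as one of: " \
            "#{categories.join(', ')}.\nReply with the category only.\n\n#{text}"
  )
  req
end
```

The argument list could be passed to `Open3.capture2` to obtain the text, and the request sent with `Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }`; with `stream: false`, Ollama returns a single JSON object whose `response` field holds the model's reply.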
			
 
+
+---
+
+## Error Summary Emails
+
+When any scraper exits with an error, `run_all.sh` automatically calls `tools/send_summary_email.rb` to send an HTML summary email if `SMTP_HOST` is configured in `.env`. The email contains a colour-coded table of all scrapers with their saved counts and error status.
+
+---
+
 ## Tools
 
 | Script | Purpose |
 |---|---|
 | `tools/backfill_geocode.rb` | Batch geocode backfill for existing rows (supports `ONLY_TABLE`, `DRY_RUN`) |
+| `tools/classify_pdfs.rb` | LLM classification of downloaded PDFs — sets `application_type` on each row |
+| `tools/send_summary_email.rb` | Sends HTML error-summary email via SMTP — called by `run_all.sh` on ERROR |
 | `tools/backfill_dorset_docs.rb` | Backfill PDF links for Dorset rows |
 | `tools/import_sqlites.rb` | Import data from legacy SQLite exports |
 | `planbuild_fetch.js` | Playwright-based scraper for the PlanBuild TAS portal |
@@ -238,7 +292,8 @@ tas_councils/
 │   ├── enrich.rb         # Post-upsert enrichment pipeline
 │   ├── util.rb           # Date parsing, council/table name mappings
 │   ├── scraper_helpers.rb# Shared helpers: abs_url, text_or, upsert_and_enrich!
-│   └── migrate.rb        # Sequential schema migration runner
+│   ├── migrate.rb        # Sequential schema migration runner
+│   └── llm.php           # LLM inference helper for PHP (llama-swap + Ollama)
 ├── scrapers/             # One .rb file per council
 ├── web/                  # PHP search portal (Apache)
 ├── tools/                # Standalone backfill and migration scripts