@@ -77,6 +77,15 @@ Copy `.env.example` to `.env` and fill in the values. **Never commit `.env`.**
| `SCRAPE_EVERY_MINUTES` | No | If set, the scraper loops on this interval (default: run once) |
| `DOWNLOAD_ATTACHMENTS` | No | Set to `1` to download PDF attachments |
| `DOWNLOAD_DIR` | No | Host path for downloaded PDFs (default: `/app/downloads`) |
+| `LLAMA_URL` | No | Base URL of local Ollama instance for PDF classification (default: `http://192.168.8.73:11434`) |
+| `LLM_MODEL` | No | Ollama model name for PDF classification (default: `llama3.2`) |
+| `SMTP_HOST` | No | SMTP server for error summary emails |
+| `SMTP_PORT` | No | SMTP port (default: `587`) |
+| `SMTP_USERNAME` | No | SMTP username |
+| `SMTP_PASSWORD` | No | SMTP password |
+| `SMTP_SMTPSecure` | No | `tls` or `ssl` (default: `tls`) |
+| `SMTP_SENTFROM` | No | Sender email address |
+| `SMTP_ADDADDRESS` | No | Recipient email address |
| `DEBUG` | No | Set to `1` for verbose scraper output |
| `DRY_RUN` | No | Set to `1` to parse without writing to the DB |
| `ENRICH_DEBUG` | No | Set to `1` for verbose geocode/lookup output |
@@ -150,14 +159,24 @@ Every `da_*` table shares the same base schema:
| `on_notice_to` | `DATE` | Public comment close date |
| `applicant` | `VARCHAR(255)` | |
| `document_url` | `TEXT` | Remote PDF URL |
-| `local_document_url` | `TEXT` | Downloaded PDF path (relative to `/downloads`) |
+| `local_document_url` | `TEXT` | Downloaded PDF path (served via `/files/`) |
+| `documents_json` | `MEDIUMTEXT` | JSON array of `{name, url, local_url}` — multi-doc DAs (e.g. Launceston) |
| `address_std` | `VARCHAR(255)` | Google-normalised address |
| `lat` / `lng` | `DECIMAL(10,7)` | Geocoded coordinates |
| `property_id` | `TEXT` | Land title PID |
| `title_reference` | `TEXT` | Certificate of title reference |
+| `application_type` | `VARCHAR(60)` | LLM-classified type (e.g. `Residential`, `Subdivision`) |
+| `application_type_raw` | `TEXT` | Raw LLM response (for auditing) |
+| `application_type_at` | `DATETIME` | When classification was last run |
+| `status` | `VARCHAR(100)` | Application status (Launceston eProperty) |
+| `assigned_officer` | `VARCHAR(255)` | Assigned planning officer (Launceston) |
+| `category` | `VARCHAR(100)` | Application category (Launceston) |
+| `application_valid` | `DATE` | Date application was deemed valid (Launceston) |
+| `advertised_on` | `DATE` | Date first advertised (Launceston) |
+| `property_legal_description` | `TEXT` | Certificate of title / legal description (Launceston) |
| `created_at` / `updated_at` | `DATETIME` | |
-Rows are upserted on `(council_reference, address)`. Some fields are **write-once** (e.g. `date_received`) — the first value is kept on subsequent scrapes.
+Rows are upserted on `(council_reference, address)`. Some fields are **write-once** (e.g. `date_received`, `document_url`) — the first value is kept on subsequent scrapes.
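
The write-once behaviour described above can be sketched as a small SQL builder. This is a minimal illustration, not the project's actual upsert code: it assumes a MySQL or MariaDB backend with a unique key on `(council_reference, address)`, and it assumes "the first value is kept" means an existing non-NULL value is never overwritten. The helper name `build_upsert_sql` and the table name `da_example` are hypothetical.

```ruby
# Hypothetical helper, shown only to illustrate write-once columns.
WRITE_ONCE  = %w[date_received document_url].freeze
KEY_COLUMNS = %w[council_reference address].freeze

# Builds a parameterised INSERT ... ON DUPLICATE KEY UPDATE statement.
def build_upsert_sql(table, columns)
  placeholders = (['?'] * columns.size).join(', ')
  updates = (columns - KEY_COLUMNS).map do |col|
    if WRITE_ONCE.include?(col)
      # Keep the stored value; only fill the column while it is still NULL.
      "`#{col}` = COALESCE(`#{col}`, VALUES(`#{col}`))"
    else
      # Ordinary columns take the freshly scraped value.
      "`#{col}` = VALUES(`#{col}`)"
    end
  end
  "INSERT INTO `#{table}` (#{columns.map { |c| "`#{c}`" }.join(', ')}) " \
    "VALUES (#{placeholders}) " \
    "ON DUPLICATE KEY UPDATE #{updates.join(', ')}"
end

sql = build_upsert_sql('da_example',
                       %w[council_reference address date_received applicant])
puts sql
```

Key columns are left out of the `UPDATE` clause because they identify the row and cannot change during an upsert.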
---
@@ -199,7 +218,7 @@ docker compose run --rm \
For sites that additionally require a **warm cookie state**, the scraper does a proactive homepage GET before fetching the target URL. See `burnie.rb` for the reference implementation of this pattern (custom `CookieJar` class + `http_get_with_cookies`). Scrapers using this pattern: `burnie.rb`, `kingisland.rb`, `latrobe.rb`, `derwentvalley.rb`.
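
As a rough illustration of the warm-cookie idea, the sketch below shows a tiny cookie jar that stores `Set-Cookie` values from a first response and replays them on the next request. It is not the `CookieJar` from `burnie.rb`, whose real interface may differ; the class name `SimpleCookieJar` and its methods are hypothetical, and cookie attributes such as `Path` or `Expires` are deliberately ignored.

```ruby
# Hypothetical minimal jar for illustration only.
class SimpleCookieJar
  def initialize
    @cookies = {}
  end

  # Accepts raw Set-Cookie header values, e.g. the result of
  # Net::HTTPResponse#get_fields('Set-Cookie'), which may be nil.
  def store(set_cookie_headers)
    Array(set_cookie_headers).each do |header|
      pair = header.split(';', 2).first.to_s
      name, value = pair.split('=', 2)
      next if name.nil? || value.nil?

      @cookies[name.strip] = value.strip
    end
  end

  # Value for the Cookie request header on the follow-up request.
  def header
    @cookies.map { |k, v| "#{k}=#{v}" }.join('; ')
  end
end

jar = SimpleCookieJar.new
jar.store(['sid=abc123; Path=/; HttpOnly', 'theme=dark; Max-Age=3600'])
puts jar.header
```

In use, the homepage response would be passed to `store` first, and `header` would then be sent as the `Cookie` header when fetching the target URL.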
-**Cloudflare JS challenge** (the `"Just a moment..."` interstitial) cannot be solved from a Docker/server environment regardless of headers, because Cloudflare's bot score is based on the client IP reputation. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`. These scrapers detect the challenge, log a warning, and exit cleanly.
+**Cloudflare JS challenge** (the `"Just a moment..."` interstitial) cannot be solved from a Docker/server environment regardless of headers, because Cloudflare's bot score is based on client IP reputation. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`, `kentish.tas.gov.au`, `centralhighlands.tas.gov.au`. These scrapers detect the challenge, log a warning, and exit cleanly. Where a PlanBuild equivalent exists, data is still collected via `planbuild.rb`.
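
A detect-and-skip check along these lines might look like the sketch below. It assumes the challenge page can be recognised by its `Just a moment...` title alone; the scrapers' real detection logic is not shown in this README, and the names `cloudflare_challenge?` and `scrape_page` are hypothetical.

```ruby
# Hypothetical detector: matches the interstitial's <title> text.
def cloudflare_challenge?(body)
  !body.nil? && body.match?(%r{<title>\s*Just a moment\.\.\.\s*</title>}i)
end

# Yields the page body to the caller's parser unless the page is a challenge,
# in which case it logs a warning and returns no rows.
def scrape_page(body)
  if cloudflare_challenge?(body)
    warn 'Cloudflare challenge detected; skipping this council'
    return []
  end
  yield body
end

blocked = scrape_page('<html><title>Just a moment...</title></html>') { |_b| [:row] }
puts blocked.inspect
```

Returning an empty list rather than raising lets the run continue with the remaining councils.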
---
@@ -216,11 +235,46 @@ For sites that additionally require a **warm cookie state**, the scraper does a
---
+## PDF Classification (LLM)
+
+After PDFs are downloaded, `tools/classify_pdfs.rb` extracts text from each PDF using `pdftotext` and sends it to a local Ollama instance to classify the application type.
+
+**Application types:** Residential, Commercial, Industrial, Subdivision, Rural/Agriculture, Tourism/Visitor Accommodation, Outbuilding/Shed, Change of Use, Demolition, Signage, Other
+
+```bash
+# Classify all unclassified PDFs (dry run first)
+docker compose run --rm -e DRY_RUN=1 scraper ruby /app/tools/classify_pdfs.rb
+
+# Run for real
+docker compose run --rm scraper ruby /app/tools/classify_pdfs.rb
+
+# Single council
+docker compose run --rm -e ONLY_TABLE=da_northernmidlands scraper ruby /app/tools/classify_pdfs.rb
+
+# Re-classify existing (overwrite)
+docker compose run --rm -e RECLASSIFY=1 scraper ruby /app/tools/classify_pdfs.rb
+
+# Use a different model
+docker compose run --rm -e LLM_MODEL=gemma3 scraper ruby /app/tools/classify_pdfs.rb
+```
+
+Results are written to `application_type`, `application_type_raw` (full LLM response for auditing), and `application_type_at` (timestamp). The web portal displays the type as a badge and supports filtering by type.
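
The extract-then-classify flow can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `tools/classify_pdfs.rb`: the prompt wording, the 4000-character cut-off, the three-page limit, and the helper names (`pdf_text`, `build_prompt`, `normalise_type`, `classify`) are all hypothetical. It uses Ollama's `/api/generate` endpoint with `stream: false` and reads the `response` field of the returned JSON.

```ruby
require 'json'
require 'net/http'
require 'open3'
require 'uri'

APPLICATION_TYPES = [
  'Residential', 'Commercial', 'Industrial', 'Subdivision',
  'Rural/Agriculture', 'Tourism/Visitor Accommodation', 'Outbuilding/Shed',
  'Change of Use', 'Demolition', 'Signage', 'Other'
].freeze

# Extract text from the first few pages; "-" sends pdftotext output to stdout.
def pdf_text(path, max_pages: 3)
  out, status = Open3.capture2('pdftotext', '-l', max_pages.to_s, path, '-')
  status.success? ? out : ''
end

# Hypothetical prompt; the real tool's wording is not documented here.
def build_prompt(text)
  'Classify this development application as exactly one of: ' \
    "#{APPLICATION_TYPES.join(', ')}.\nReply with the type only.\n\n#{text[0, 4000]}"
end

# Map a free-form reply onto the known list, falling back to "Other".
def normalise_type(raw)
  reply = raw.to_s.strip.downcase
  APPLICATION_TYPES.find { |t| reply == t.downcase } ||
    APPLICATION_TYPES.find { |t| reply.include?(t.downcase) } ||
    'Other'
end

# Sends the prompt to Ollama and returns values for the two text columns.
def classify(text,
             base_url: ENV.fetch('LLAMA_URL', 'http://192.168.8.73:11434'),
             model: ENV.fetch('LLM_MODEL', 'llama3.2'))
  uri = URI.join(base_url, '/api/generate')
  payload = { model: model, prompt: build_prompt(text), stream: false }
  res = Net::HTTP.post(uri, payload.to_json, 'Content-Type' => 'application/json')
  raw = JSON.parse(res.body).fetch('response', '')
  { application_type: normalise_type(raw), application_type_raw: raw }
end

puts normalise_type("Subdivision\n")
```

Keeping the raw reply next to the normalised type mirrors the `application_type_raw` auditing column: a reply that does not match any known type still ends up as `Other`, but the original text remains available for review.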
+
+---
+
+## Error Summary Emails
+
+When any scraper exits with an error and `SMTP_HOST` is configured in `.env`, `run_all.sh` calls `tools/send_summary_email.rb` to send an HTML summary email. The email contains a colour-coded table of all scrapers with their saved counts and error status.
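
The summary table and the send condition could be sketched as below. The result shape (`name`, `saved`, `error`), the colours, and the helper names `summary_html` and `should_send?` are assumptions for illustration; the real `tools/send_summary_email.rb` may structure this differently, and the SMTP delivery step is omitted.

```ruby
require 'cgi'

# Hypothetical result shape: one hash per scraper with :name, :saved, :error.
def summary_html(results)
  rows = results.map do |r|
    colour = r[:error] ? '#f8d7da' : '#d4edda' # red-ish for errors, green-ish for OK
    status = r[:error] ? CGI.escapeHTML(r[:error]) : 'OK'
    "<tr style=\"background:#{colour}\"><td>#{CGI.escapeHTML(r[:name])}</td>" \
      "<td>#{r[:saved].to_i}</td><td>#{status}</td></tr>"
  end
  "<table><tr><th>Scraper</th><th>Saved</th><th>Status</th></tr>#{rows.join}</table>"
end

# Send only when SMTP_HOST is set and at least one scraper failed.
def should_send?(results, env = ENV)
  !env['SMTP_HOST'].to_s.empty? && results.any? { |r| r[:error] }
end

results = [
  { name: 'burnie',  saved: 12, error: nil },
  { name: 'latrobe', saved: 0,  error: 'Timeout <5s>' }
]
html = summary_html(results)
puts html
```

Escaping the error text matters here because scraper error messages can contain angle brackets or fragments of HTML from the failing page.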
+
+---
+
## Tools
| Script | Purpose |
|---|---|
| `tools/backfill_geocode.rb` | Batch geocode backfill for existing rows (supports `ONLY_TABLE`, `DRY_RUN`) |
+| `tools/classify_pdfs.rb` | LLM classification of downloaded PDFs — sets `application_type` on each row |
+| `tools/send_summary_email.rb` | Sends HTML error-summary email via SMTP — called by `run_all.sh` on ERROR |
| `tools/backfill_dorset_docs.rb` | Backfill PDF links for Dorset rows |
| `tools/import_sqlites.rb` | Import data from legacy SQLite exports |
| `planbuild_fetch.js` | Playwright-based scraper for the PlanBuild TAS portal |
@@ -238,7 +292,8 @@ tas_councils/
│ ├── enrich.rb # Post-upsert enrichment pipeline
│ ├── util.rb # Date parsing, council/table name mappings
│ ├── scraper_helpers.rb # Shared helpers: abs_url, text_or, upsert_and_enrich!
-│ └── migrate.rb # Sequential schema migration runner
+│ ├── migrate.rb # Sequential schema migration runner
+│ └── llm.php # LLM inference helper for PHP (llama-swap + Ollama)
├── scrapers/ # One .rb file per council
├── web/ # PHP search portal (Apache)
├── tools/ # Standalone backfill and migration scripts