# TAS Councils Planning Applications Scraper

A web scraping and data aggregation system for Tasmanian development applications (DAs). It collects planning application notices from all 29 Tasmanian council websites, normalises and geocodes the data, and exposes it via a PHP search portal.

See [VERSIONS.md](VERSIONS.md) for the changelog.

---

## Architecture

```
┌─────────────────────────────────────────────────────┐
│ 29 Ruby scrapers (scrapers/*.rb)                    │
│ Each polls one council website on a schedule        │
└──────────────────────┬──────────────────────────────┘
                       │ upserts rows
                       ▼
             MariaDB (da_* tables)
                       │
         ┌─────────────┴─────────────┐
         │                           │
  PHP web portal                Adminer UI
  (web/index.php)               port 9980
  port 9981
```

**Services (Docker Compose):**

| Service | Image | Port | Purpose |
|---|---|---|---|
| `db` | `mariadb:10.11` | 3306 | Database |
| `scraper` | Custom (Ruby 3.2) | n/a | Runs all scrapers on a schedule |
| `web` | Custom (PHP/Apache) | 9981 | Search portal |
| `adminer` | `adminer` | 9980 | Database admin UI |

---

## Quick Start

### 1. Copy and configure environment

```bash
cp .env.example .env
# Edit .env: set DB passwords and your Google Maps API key
```

### 2. Start all services

```bash
docker compose up -d
```

- Web portal: http://localhost:9981
- Adminer: http://localhost:9980

### 3. Run scrapers manually (once)

```bash
docker compose run --rm scraper /app/run_all.sh
```

---

## Environment Variables

Copy `.env.example` to `.env` and fill in the values.
**Never commit `.env`.**

| Variable | Required | Description |
|---|---|---|
| `MYSQL_DATABASE` | Yes | Database name (default: `planning_scrapes`) |
| `MYSQL_USER` | Yes | Database username |
| `MYSQL_PASSWORD` | Yes | Database password |
| `MYSQL_ROOT_PASSWORD` | Yes | MariaDB root password |
| `GOOGLE_MAPS_API_KEY` | Yes | Used to geocode DA addresses |
| `LOOKUP_URL` | No | URL of the property lookup service (PID/title enrichment) |
| `LOOKUP_THROTTLE_MS` | No | Milliseconds between lookup requests (default: 150) |
| `SCRAPE_EVERY_MINUTES` | No | If set, the scraper loops on this interval (default: run once) |
| `DOWNLOAD_ATTACHMENTS` | No | Set to `1` to download PDF attachments |
| `DOWNLOAD_DIR` | No | Host path for downloaded PDFs (default: `/app/downloads`) |
| `LLAMA_URL` | No | Base URL of local Ollama instance for PDF classification (default: `http://192.168.8.73:11434`) |
| `LLM_MODEL` | No | Ollama model name for PDF classification (default: `llama3.2`) |
| `SMTP_HOST` | No | SMTP server for error summary emails |
| `SMTP_PORT` | No | SMTP port (default: `587`) |
| `SMTP_USERNAME` | No | SMTP username |
| `SMTP_PASSWORD` | No | SMTP password |
| `SMTP_SMTPSecure` | No | `tls` or `ssl` (default: `tls`) |
| `SMTP_SENTFROM` | No | Sender email address |
| `SMTP_ADDADDRESS` | No | Recipient email address |
| `DEBUG` | No | Set to `1` for verbose scraper output |
| `DRY_RUN` | No | Set to `1` to parse without writing to the DB |
| `ENRICH_DEBUG` | No | Set to `1` for verbose geocode/lookup output |
| `ALLOW_INSECURE` | No | Set to `1` to skip SSL verification (use only for legacy council sites) |

---

## Running Scrapers Selectively

Use `ONLY` or `SKIP` environment variables with `run_all.sh`. Values are comma-separated scraper names (filename without `.rb`).
```bash
# Run only two councils
ONLY=meandervalley,kentish docker compose run --rm scraper /app/run_all.sh

# Run all except one
SKIP=hobartcity docker compose run --rm scraper /app/run_all.sh
```

---

## Council → Table Mapping

Each scraper writes to its own `da_*` table. The table name is derived from the scraper filename.

| Council | Scraper file | DB table |
|---|---|---|
| Break O'Day | `break_oday.rb` | `da_break_oday` |
| Brighton | `brighton.rb` | `da_brighton` |
| Burnie | `burnie.rb` | `da_burnie` |
| Central Coast | `centralcoast.rb` | `da_centralcoast` |
| Central Highlands | `centralhighlands.rb` | `da_centralhighlands` |
| Circular Head | `circularhead.rb` | `da_circularhead` |
| Clarence | `clarence.rb` | `da_clarence` |
| Derwent Valley | `derwentvalley.rb` | `da_derwentvalley` |
| Devonport | `devonportcity.rb` | `da_devonportcity` |
| Dorset | `dorset.rb` | `da_dorset` |
| Flinders | `flinders_council.rb` | `da_flinders_council` |
| George Town | `georgetown.rb` | `da_georgetown` |
| Glamorgan Spring Bay | `glamorgan.rb` | `da_glamorgan` |
| Glenorchy | `glenorchy.rb` | `da_glenorchy` |
| Hobart | `hobartcity.rb` | `da_hobartcity` |
| Huon Valley | `huonvalley.rb` | `da_huonvalley` |
| Kentish | `kentish.rb` | `da_kentish` |
| Kingborough | `kingborough.rb` | `da_kingborough` |
| Latrobe | `latrobe.rb` | `da_latrobe` |
| Launceston | `launcestoncity.rb` | `da_launcestoncity` |
| Meander Valley | `meandervalley.rb` | `da_meandervalley` |
| Northern Midlands | `northernmidlands.rb` | `da_northernmidlands` |
| Southern Midlands | `southernmidlands.rb` | `da_southernmidlands` |
| Sorell | *(PlanBuild)* | `da_sorell` |
| Tasman | `tasman.rb` | `da_tasman` |
| Waratah–Wynyard | `waratah_wynyard.rb` | `da_waratah_wynyard` |
| West Coast | `westcoast.rb` | `da_westcoast` |
| West Tamar | `westtamar.rb` | `da_westtamar` |
| Various (PlanBuild portal) | `planbuild.rb` | Per-council `da_*` tables |

---

## Database Schema

Every `da_*` table shares the same base
schema:

| Column | Type | Notes |
|---|---|---|
| `id` | `BIGINT` | Auto-increment PK |
| `council_reference` | `VARCHAR(100)` | DA reference number |
| `address` | `VARCHAR(255)` | Street address |
| `description` | `TEXT` | Proposal description |
| `date_received` | `DATE` | Application date |
| `on_notice_to` | `DATE` | Public comment close date |
| `applicant` | `VARCHAR(255)` | |
| `document_url` | `TEXT` | Remote PDF URL |
| `local_document_url` | `TEXT` | Downloaded PDF path (served via `/files/`) |
| `documents_json` | `MEDIUMTEXT` | JSON array of `{name, url, local_url}` for multi-doc DAs (e.g. Launceston) |
| `address_std` | `VARCHAR(255)` | Google-normalised address |
| `lat` / `lng` | `DECIMAL(10,7)` | Geocoded coordinates |
| `property_id` | `TEXT` | Land title PID |
| `title_reference` | `TEXT` | Certificate of title reference |
| `application_type` | `VARCHAR(60)` | LLM-classified type (e.g. `Residential`, `Subdivision`) |
| `application_type_raw` | `TEXT` | Raw LLM response (for auditing) |
| `application_type_at` | `DATETIME` | When classification was last run |
| `status` | `VARCHAR(100)` | Application status (Launceston eProperty) |
| `assigned_officer` | `VARCHAR(255)` | Assigned planning officer (Launceston) |
| `category` | `VARCHAR(100)` | Application category (Launceston) |
| `application_valid` | `DATE` | Date application was deemed valid (Launceston) |
| `advertised_on` | `DATE` | Date first advertised (Launceston) |
| `property_legal_description` | `TEXT` | Certificate of title / legal description (Launceston) |
| `created_at` / `updated_at` | `DATETIME` | |

Rows are upserted on `(council_reference, address)`. Some fields are **write-once** (e.g. `date_received`, `document_url`): the first value is kept on subsequent scrapes.

---

## Enrichment Pipeline

After each upsert, `enrich_after_upsert!` runs two optional enrichment steps:

1. **Geocoding** (requires `GOOGLE_MAPS_API_KEY`): calls the Google Maps Geocoding API, caches results in the `geo_cache` table, and populates `address_std`, `street`, `locality`, `state`, `postcode`, `lat`, `lng`.
2. **Property lookup** (requires `LOOKUP_URL`): POSTs `{lat, lng}` to a property data service and populates `property_id` and `title_reference`.

To run geocode backfill as a standalone pass over existing rows:

```bash
# All tables
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="$GOOGLE_MAPS_API_KEY" \
  scraper ruby /app/tools/backfill_geocode.rb

# Single table
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="$GOOGLE_MAPS_API_KEY" \
  -e ONLY_TABLE=da_dorset \
  scraper ruby /app/tools/backfill_geocode.rb

# Dry run
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="$GOOGLE_MAPS_API_KEY" \
  -e ONLY_TABLE=da_dorset \
  -e DRY_RUN=1 \
  scraper ruby /app/tools/backfill_geocode.rb
```

---

## WAF and Cloudflare Handling

`lib/http.rb` sends a full Chrome browser fingerprint on every request, including `sec-ch-ua`, `Sec-Fetch-*`, and `Upgrade-Insecure-Requests` headers. This satisfies most WAF checks without any extra scraper code.

For sites that additionally require a **warm cookie state**, the scraper does a proactive homepage GET before fetching the target URL. See `burnie.rb` for the reference implementation of this pattern (custom `CookieJar` class + `http_get_with_cookies`). Scrapers using this pattern: `burnie.rb`, `kingisland.rb`, `latrobe.rb`, `derwentvalley.rb`.

**Cloudflare JS challenge** (the `"Just a moment..."` interstitial) cannot be solved from a Docker/server environment regardless of headers, because Cloudflare's bot score is based on the client IP reputation. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`, `kentish.tas.gov.au`, `centralhighlands.tas.gov.au`.

These scrapers detect the challenge, log a warning, and exit cleanly. Where a PlanBuild equivalent exists, data is still collected via `planbuild.rb`.
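
The detect, warn, and exit-cleanly behaviour described above might look roughly like the following. This is a minimal sketch, not the code in the scrapers: `cloudflare_challenge?` and `check_response` are hypothetical names, and the only marker checked is the `"Just a moment..."` title quoted in this section.

```ruby
# Hypothetical helper names; the real detection logic in the scrapers may differ.
CHALLENGE_MARKER = "Just a moment..."

# True when the response body looks like the Cloudflare interstitial.
def cloudflare_challenge?(body)
  body.to_s.include?(CHALLENGE_MARKER)
end

# Logs a warning and returns :blocked when the challenge page is seen,
# otherwise returns :ok, so the caller can stop early without raising.
def check_response(council, body)
  if cloudflare_challenge?(body)
    warn "[#{council}] Cloudflare JS challenge detected; skipping this run"
    :blocked
  else
    :ok
  end
end
```

A scraper could call `check_response` right after its first fetch and `exit 0` on `:blocked`, so that a blocked site would not be counted as a scraper error.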
---

## Adding a New Scraper

1. Create `scrapers/.rb`, using an existing simple scraper (e.g. `glamorgan.rb`) as a template.
2. At minimum the scraper must:
   - Read `TABLE = ENV.fetch("TABLE_NAME")`
   - Call `DB.ensure_table!(TABLE)` (all schema columns are already included)
   - Call `DB.upsert(TABLE, row)` with at least `council_reference` and `address`
   - Call `enrich_after_upsert!` after each upsert
3. Add the council to `COUNCIL_MAP` in `lib/util.rb` if PlanBuild integration is needed.
4. Test locally: `TABLE_NAME=da_ ruby scrapers/.rb`

---

## PDF Classification (LLM)

After PDFs are downloaded, `tools/classify_pdfs.rb` extracts text from each PDF using `pdftotext` and sends it to a local Ollama instance to classify the application type.

**Application types:** Residential, Commercial, Industrial, Subdivision, Rural/Agriculture, Tourism/Visitor Accommodation, Outbuilding/Shed, Change of Use, Demolition, Signage, Other

```bash
# Classify all unclassified PDFs (dry run first)
docker compose run --rm -e DRY_RUN=1 scraper ruby /app/tools/classify_pdfs.rb

# Run for real
docker compose run --rm scraper ruby /app/tools/classify_pdfs.rb

# Single council
docker compose run --rm -e ONLY_TABLE=da_northernmidlands scraper ruby /app/tools/classify_pdfs.rb

# Re-classify existing (overwrite)
docker compose run --rm -e RECLASSIFY=1 scraper ruby /app/tools/classify_pdfs.rb

# Use a different model
docker compose run --rm -e LLM_MODEL=gemma3 scraper ruby /app/tools/classify_pdfs.rb
```

Results are written to `application_type`, `application_type_raw` (full LLM response for auditing), and `application_type_at` (timestamp). The web portal displays the type as a badge and supports filtering by type.

---

## Error Summary Emails

When any scraper exits with an error, `run_all.sh` automatically calls `tools/send_summary_email.rb` to send an HTML summary email if `SMTP_HOST` is configured in `.env`. The email contains a colour-coded table of all scrapers with their saved counts and error status.
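
To illustrate the send condition and the colour-coded rows described above, here is a rough sketch. It is not taken from `tools/send_summary_email.rb`: the result shape (`name`, `saved`, `error`), the helper names, and the colours are all assumptions made for the example.

```ruby
require "erb"

# Assumed result shape per scraper: { name: String, saved: Integer, error: String or nil }.

# Send only when SMTP_HOST is set and at least one scraper reported an error.
def should_send_summary?(results, env = ENV)
  !env["SMTP_HOST"].to_s.empty? && results.any? { |r| r[:error] }
end

# Builds one <tr> per scraper; background colours are illustrative only.
def summary_rows(results)
  results.map do |r|
    colour = r[:error] ? "#f8d7da" : "#d4edda"
    status = r[:error] ? ERB::Util.html_escape(r[:error].to_s) : "OK"
    name   = ERB::Util.html_escape(r[:name].to_s)
    %(<tr style="background:#{colour}"><td>#{name}</td><td>#{r[:saved].to_i}</td><td>#{status}</td></tr>)
  end.join("\n")
end
```

Escaping the error text matters here because scraper error messages can contain fragments of the remote page's HTML.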
---

## Tools

| Script | Purpose |
|---|---|
| `tools/backfill_geocode.rb` | Batch geocode backfill for existing rows (supports `ONLY_TABLE`, `DRY_RUN`) |
| `tools/classify_pdfs.rb` | LLM classification of downloaded PDFs; sets `application_type` on each row |
| `tools/send_summary_email.rb` | Sends HTML error-summary email via SMTP; called by `run_all.sh` on ERROR |
| `tools/backfill_dorset_docs.rb` | Backfill PDF links for Dorset rows |
| `tools/import_sqlites.rb` | Import data from legacy SQLite exports |
| `planbuild_fetch.js` | Playwright-based scraper for the PlanBuild TAS portal |

---

## Project Structure

```
tas_councils/
├── lib/
│   ├── db.rb               # DB connection, table creation, dynamic upsert logic
│   ├── http.rb             # HTTP client: browser-fingerprint headers, retries, WAF warmup, curl fallback
│   ├── geocode.rb          # Google Maps geocoding with SHA1 cache
│   ├── enrich.rb           # Post-upsert enrichment pipeline
│   ├── util.rb             # Date parsing, council/table name mappings
│   ├── scraper_helpers.rb  # Shared helpers: abs_url, text_or, upsert_and_enrich!
│   ├── migrate.rb          # Sequential schema migration runner
│   └── llm.php             # LLM inference helper for PHP (llama-swap + Ollama)
├── scrapers/               # One .rb file per council
├── web/                    # PHP search portal (Apache)
├── tools/                  # Standalone backfill and migration scripts
├── run_all.sh              # Discovers and runs scrapers (supports ONLY/SKIP)
├── entrypoint.sh           # Docker entrypoint; optionally loops on a schedule
├── Dockerfile              # Ruby 3.2 scraper image
├── docker-compose.yml      # Full stack: db, scraper, web, adminer
└── .env                    # Secrets: never commit this file
```
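
The `geocode.rb` entry above mentions a SHA1 cache, and the Enrichment Pipeline section mentions a `geo_cache` table. One way such a cache could be keyed is sketched below. The normalisation rule, `geo_cache_key`, and the in-memory `GeoCache` class are assumptions for illustration; the real implementation stores rows in MariaDB and may normalise addresses differently.

```ruby
require "digest/sha1"

# Assumed normalisation: trim, collapse whitespace, upcase.
def geo_cache_key(address)
  normalised = address.to_s.strip.gsub(/\s+/, " ").upcase
  Digest::SHA1.hexdigest(normalised)
end

# In-memory stand-in for the geo_cache table (hypothetical class).
class GeoCache
  def initialize
    @rows = {}
  end

  # Returns the cached result, or runs the block (standing in for the
  # geocoding API call) once and stores what it returns.
  def fetch(address)
    key = geo_cache_key(address)
    @rows.fetch(key) { @rows[key] = yield(address) }
  end
end
```

Keying on a digest of the normalised address means trivially different spellings of the same address share one cache row, which keeps API calls down when scrapers re-run.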