TAS Councils Planning Applications Scraper

A web scraping and data aggregation system for Tasmanian development applications (DAs). It collects planning application notices from all 29 Tasmanian council websites, normalises and geocodes the data, and exposes it via a PHP search portal.

See VERSIONS.md for the changelog.


Architecture

┌─────────────────────────────────────────────────────┐
│  29 Ruby scrapers  (scrapers/*.rb)                  │
│  Each polls one council website on a schedule       │
└──────────────────────┬──────────────────────────────┘
                       │ upserts rows
                       ▼
               MariaDB (da_* tables)
                       │
         ┌─────────────┴─────────────┐
         │                           │
   PHP web portal              Adminer UI
   (web/index.php)             port 9980
   port 9981

Services (Docker Compose):

| Service | Image | Port | Purpose |
|---|---|---|---|
| db | mariadb:10.11 | 3306 | Database |
| scraper | Custom (Ruby 3.2) | - | Runs all scrapers on a schedule |
| web | Custom (PHP/Apache) | 9981 | Search portal |
| adminer | adminer | 9980 | Database admin UI |

Quick Start

1. Copy and configure environment

cp .env.example .env
# Edit .env — set DB passwords and your Google Maps API key

2. Start all services

docker compose up -d

3. Run scrapers manually (once)

docker compose run --rm scraper /app/run_all.sh

Environment Variables

Copy .env.example to .env and fill in the values. Never commit .env.

| Variable | Required | Description |
|---|---|---|
| MYSQL_DATABASE | Yes | Database name (default: planning_scrapes) |
| MYSQL_USER | Yes | Database username |
| MYSQL_PASSWORD | Yes | Database password |
| MYSQL_ROOT_PASSWORD | Yes | MariaDB root password |
| GOOGLE_MAPS_API_KEY | Yes | Used to geocode DA addresses |
| LOOKUP_URL | No | URL of the property lookup service (PID/title enrichment) |
| LOOKUP_THROTTLE_MS | No | Milliseconds between lookup requests (default: 150) |
| SCRAPE_EVERY_MINUTES | No | If set, the scraper loops on this interval (default: run once) |
| DOWNLOAD_ATTACHMENTS | No | Set to 1 to download PDF attachments |
| DOWNLOAD_DIR | No | Host path for downloaded PDFs (default: /app/downloads) |
| LLAMA_URL | No | Base URL of local Ollama instance for PDF classification (default: http://192.168.8.73:11434) |
| LLM_MODEL | No | Ollama model name for PDF classification (default: llama3.2) |
| SMTP_HOST | No | SMTP server for error summary emails |
| SMTP_PORT | No | SMTP port (default: 587) |
| SMTP_USERNAME | No | SMTP username |
| SMTP_PASSWORD | No | SMTP password |
| SMTP_SMTPSecure | No | tls or ssl (default: tls) |
| SMTP_SENTFROM | No | Sender email address |
| SMTP_ADDADDRESS | No | Recipient email address |
| DEBUG | No | Set to 1 for verbose scraper output |
| DRY_RUN | No | Set to 1 to parse without writing to the DB |
| ENRICH_DEBUG | No | Set to 1 for verbose geocode/lookup output |
| ALLOW_INSECURE | No | Set to 1 to skip SSL verification (use only for legacy council sites) |

Running Scrapers Selectively

Set the ONLY or SKIP environment variable when calling run_all.sh. Each value is a comma-separated list of scraper names (the scraper filename without .rb).

# Run only two councils
ONLY=meandervalley,kentish docker compose run --rm scraper /app/run_all.sh

# Run all except one
SKIP=hobartcity docker compose run --rm scraper /app/run_all.sh
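
The selection logic can be sketched in Ruby as follows. run_all.sh itself is a shell script, so this is only an illustration of the filtering rule; the helper names select_scrapers and parse_list are hypothetical, and applying ONLY before SKIP when both are set is an assumption.

```ruby
# Hypothetical sketch of ONLY/SKIP filtering (not the actual run_all.sh code).

# Split a comma-separated value into clean names; nil or "" gives [].
def parse_list(value)
  value.to_s.split(",").map(&:strip).reject(&:empty?)
end

# Return the scraper names to run, given the discovered scraper files.
def select_scrapers(files, only: nil, skip: nil)
  names = files.map { |f| File.basename(f, ".rb") }
  only_list = parse_list(only)
  skip_list = parse_list(skip)
  # Assumption: ONLY narrows the set first, then SKIP removes from it.
  names = names.select { |n| only_list.include?(n) } unless only_list.empty?
  names.reject { |n| skip_list.include?(n) }
end

files = %w[scrapers/meandervalley.rb scrapers/kentish.rb scrapers/hobartcity.rb]
p select_scrapers(files, only: ENV["ONLY"], skip: ENV["SKIP"])
```

With neither variable set, every discovered scraper is selected.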

Council → Table Mapping

Each scraper writes to its own da_* table. The table name is derived from the scraper filename.

| Council | Scraper file | DB table |
|---|---|---|
| Break O'Day | break_oday.rb | da_break_oday |
| Brighton | brighton.rb | da_brighton |
| Burnie | burnie.rb | da_burnie |
| Central Coast | centralcoast.rb | da_centralcoast |
| Central Highlands | centralhighlands.rb | da_centralhighlands |
| Circular Head | circularhead.rb | da_circularhead |
| Clarence | clarence.rb | da_clarence |
| Derwent Valley | derwentvalley.rb | da_derwentvalley |
| Devonport | devonportcity.rb | da_devonportcity |
| Dorset | dorset.rb | da_dorset |
| Flinders | flinders_council.rb | da_flinders_council |
| George Town | georgetown.rb | da_georgetown |
| Glamorgan Spring Bay | glamorgan.rb | da_glamorgan |
| Glenorchy | glenorchy.rb | da_glenorchy |
| Hobart | hobartcity.rb | da_hobartcity |
| Huon Valley | huonvalley.rb | da_huonvalley |
| Kentish | kentish.rb | da_kentish |
| Kingborough | kingborough.rb | da_kingborough |
| Latrobe | latrobe.rb | da_latrobe |
| Launceston | launcestoncity.rb | da_launcestoncity |
| Meander Valley | meandervalley.rb | da_meandervalley |
| Northern Midlands | northernmidlands.rb | da_northernmidlands |
| Southern Midlands | southernmidlands.rb | da_southernmidlands |
| Sorell | (PlanBuild) | da_sorell |
| Tasman | tasman.rb | da_tasman |
| Waratah–Wynyard | waratah_wynyard.rb | da_waratah_wynyard |
| West Coast | westcoast.rb | da_westcoast |
| West Tamar | westtamar.rb | da_westtamar |
| Various (PlanBuild portal) | planbuild.rb | Per-council da_* tables |
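
The filename-to-table rule in the mapping above can be sketched as a one-line helper. The name table_for is hypothetical, and the rule does not cover planbuild.rb, which writes to per-council tables.

```ruby
# Hypothetical helper: derive the da_* table name from a scraper filename.
def table_for(scraper_path)
  "da_#{File.basename(scraper_path, '.rb')}"
end

puts table_for("scrapers/break_oday.rb")
puts table_for("scrapers/launcestoncity.rb")
```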

Database Schema

Every da_* table shares the same base schema:

| Column | Type | Notes |
|---|---|---|
| id | BIGINT | Auto-increment PK |
| council_reference | VARCHAR(100) | DA reference number |
| address | VARCHAR(255) | Street address |
| description | TEXT | Proposal description |
| date_received | DATE | Application date |
| on_notice_to | DATE | Public comment close date |
| applicant | VARCHAR(255) | |
| document_url | TEXT | Remote PDF URL |
| local_document_url | TEXT | Downloaded PDF path (served via /files/) |
| documents_json | MEDIUMTEXT | JSON array of {name, url, local_url}; used for multi-doc DAs (e.g. Launceston) |
| address_std | VARCHAR(255) | Google-normalised address |
| lat / lng | DECIMAL(10,7) | Geocoded coordinates |
| property_id | TEXT | Land title PID |
| title_reference | TEXT | Certificate of title reference |
| application_type | VARCHAR(60) | LLM-classified type (e.g. Residential, Subdivision) |
| application_type_raw | TEXT | Raw LLM response (for auditing) |
| application_type_at | DATETIME | When classification was last run |
| status | VARCHAR(100) | Application status (Launceston eProperty) |
| assigned_officer | VARCHAR(255) | Assigned planning officer (Launceston) |
| category | VARCHAR(100) | Application category (Launceston) |
| application_valid | DATE | Date application was deemed valid (Launceston) |
| advertised_on | DATE | Date first advertised (Launceston) |
| property_legal_description | TEXT | Certificate of title / legal description (Launceston) |
| created_at / updated_at | DATETIME | |

Rows are upserted on (council_reference, address). Some fields are write-once (e.g. date_received, document_url) — the first value is kept on subsequent scrapes.
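
One way to express the write-once rule in MariaDB is an INSERT ... ON DUPLICATE KEY UPDATE statement that wraps write-once columns in COALESCE. The sketch below only builds the SQL string. It is not the code in lib/db.rb: the upsert_sql name, the WRITE_ONCE list (taken from the two examples above), and the assumption of a UNIQUE key on (council_reference, address) are all illustrative.

```ruby
# Hypothetical sketch of a write-once upsert statement builder.
KEY_COLUMNS = %w[council_reference address].freeze
WRITE_ONCE  = %w[date_received document_url].freeze # assumed list

def upsert_sql(table, columns)
  # Table names cannot be bound as parameters, so validate them strictly.
  raise ArgumentError, "bad table name" unless table.match?(/\Ada_[a-z0-9_]+\z/)

  updates = (columns - KEY_COLUMNS).map do |col|
    if WRITE_ONCE.include?(col)
      # Keep the stored value if present; otherwise take the new one.
      "#{col} = COALESCE(#{col}, VALUES(#{col}))"
    else
      "#{col} = VALUES(#{col})"
    end
  end

  placeholders = (["?"] * columns.size).join(", ")
  "INSERT INTO #{table} (#{columns.join(', ')}) VALUES (#{placeholders}) " \
    "ON DUPLICATE KEY UPDATE #{updates.join(', ')}"
end

puts upsert_sql("da_dorset", %w[council_reference address description date_received])
```

The row values would then be bound to the ? placeholders by whichever database client is in use.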


Enrichment Pipeline

After each upsert, enrich_after_upsert! runs two optional enrichment steps:

  1. Geocoding (requires GOOGLE_MAPS_API_KEY) — calls the Google Maps Geocoding API, caches results in the geo_cache table, and populates address_std, street, locality, state, postcode, lat, lng.

  2. Property lookup (requires LOOKUP_URL) — POSTs {lat, lng} to a property data service and populates property_id and title_reference.
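
The caching idea in step 1 can be sketched as follows, with a plain Hash standing in for the geo_cache table. The Project Structure notes describe the cache as SHA1-based; the normalisation applied before hashing, and the names geo_cache_key and cached_geocode, are assumptions for illustration.

```ruby
require "digest/sha1"

# Hypothetical cache key: SHA1 of a lightly normalised address string.
def geo_cache_key(address)
  normalised = address.to_s.strip.downcase.gsub(/\s+/, " ")
  Digest::SHA1.hexdigest(normalised)
end

# Return the cached result for an address, calling the block (the real
# geocoder in practice) only on a cache miss.
def cached_geocode(address, cache)
  key = geo_cache_key(address)
  return cache[key] if cache.key?(key)

  cache[key] = yield(address)
end

cache = {}
2.times do
  cached_geocode("1 Example Street, Hobart TAS", cache) do |addr|
    puts "cache miss, would call the Geocoding API for: #{addr}"
    :placeholder_result
  end
end
```

Only the first iteration reaches the block; the second is served from the cache.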

To run geocode backfill as a standalone pass over existing rows:

# All tables
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="$GOOGLE_MAPS_API_KEY" \
  scraper ruby /app/tools/backfill_geocode.rb

# Single table
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="$GOOGLE_MAPS_API_KEY" \
  -e ONLY_TABLE=da_dorset \
  scraper ruby /app/tools/backfill_geocode.rb

# Dry run
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="$GOOGLE_MAPS_API_KEY" \
  -e ONLY_TABLE=da_dorset \
  -e DRY_RUN=1 \
  scraper ruby /app/tools/backfill_geocode.rb

WAF and Cloudflare Handling

lib/http.rb sends a full Chrome browser fingerprint on every request, including sec-ch-ua, Sec-Fetch-*, and Upgrade-Insecure-Requests headers. This satisfies most WAF checks without any extra scraper code.
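
A minimal sketch of attaching such headers with Ruby's standard Net::HTTP is shown below. The header values are illustrative placeholders, not the ones lib/http.rb actually sends, and browser_request is a hypothetical helper name.

```ruby
require "net/http"
require "uri"

# Illustrative browser-like headers (values are assumptions).
BROWSER_HEADERS = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " \
                  "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
  "Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language" => "en-AU,en;q=0.9",
  "sec-ch-ua" => '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
  "sec-ch-ua-mobile" => "?0",
  "sec-ch-ua-platform" => '"Windows"',
  "Sec-Fetch-Dest" => "document",
  "Sec-Fetch-Mode" => "navigate",
  "Sec-Fetch-Site" => "none",
  "Sec-Fetch-User" => "?1",
  "Upgrade-Insecure-Requests" => "1"
}.freeze

# Build (but do not send) a GET request carrying the headers above.
def browser_request(url)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  BROWSER_HEADERS.each { |name, value| request[name] = value }
  request
end

req = browser_request("https://example.com/")
req.each_header { |name, value| puts "#{name}: #{value}" }
# To send it:
#   Net::HTTP.start(req.uri.host, req.uri.port, use_ssl: true) { |http| http.request(req) }
```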

For sites that additionally require a warm cookie state, the scraper does a proactive homepage GET before fetching the target URL. See burnie.rb for the reference implementation of this pattern (custom CookieJar class + http_get_with_cookies). Scrapers using this pattern: burnie.rb, kingisland.rb, latrobe.rb, derwentvalley.rb.
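
The warm-up pattern amounts to collecting Set-Cookie values from the homepage response and replaying them on the next request. The class below is a simplified stand-in, not the CookieJar in burnie.rb: it is named SimpleCookieJar to make that clear, and it ignores cookie domain, path, and expiry.

```ruby
# Simplified, hypothetical cookie jar: keeps name=value pairs only.
class SimpleCookieJar
  def initialize
    @cookies = {}
  end

  # Accepts an array of raw Set-Cookie header values (or nil).
  # With Net::HTTP these come from response.get_fields("set-cookie").
  def store(set_cookie_headers)
    Array(set_cookie_headers).each do |header|
      pair = header.split(";", 2).first.to_s
      name, value = pair.split("=", 2)
      next if name.nil? || value.nil?

      @cookies[name.strip] = value.strip
    end
  end

  # Value for the Cookie request header on the follow-up request.
  def header
    @cookies.map { |name, value| "#{name}=#{value}" }.join("; ")
  end

  def empty?
    @cookies.empty?
  end
end

jar = SimpleCookieJar.new
jar.store(["session=abc123; Path=/; HttpOnly", "visited=1; Secure"])
puts jar.header
```

A scraper would call store with the homepage response's cookies, then set the Cookie header from jar.header before fetching the target URL.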

The Cloudflare JS challenge (the "Just a moment..." interstitial) cannot be solved from a Docker/server environment regardless of headers, because Cloudflare's bot score takes the client IP reputation into account. Sites confirmed blocked in Docker: derwentvalley.tas.gov.au, latrobe.tas.gov.au, kentish.tas.gov.au, centralhighlands.tas.gov.au. The scrapers for these sites detect the challenge, log a warning, and exit cleanly. Where a PlanBuild equivalent exists, data is still collected via planbuild.rb.
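
Detecting the interstitial can be as simple as looking for its title text in the response body. The sketch below assumes that marker alone is sufficient; the helper names are hypothetical and the real scrapers may check more than this.

```ruby
# Marker text taken from the interstitial described above.
CHALLENGE_MARKER = "Just a moment..."

def cloudflare_challenge?(body)
  body.to_s.include?(CHALLENGE_MARKER)
end

# Returns :blocked after logging a warning, or :ok if the page looks normal.
def handle_response(council, body)
  if cloudflare_challenge?(body)
    warn "[#{council}] Cloudflare challenge detected; skipping this run"
    return :blocked
  end
  :ok
end

puts handle_response("kentish", "<html><title>Just a moment...</title></html>")
puts handle_response("kentish", "<html><body>Advertised applications</body></html>")
```

A scraper that receives :blocked can then stop without raising, so the run is not counted as an error.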


Adding a New Scraper

  1. Create scrapers/<councilname>.rb — use an existing simple scraper (e.g. glamorgan.rb) as a template.
  2. At minimum the scraper must:
    • Set TABLE = ENV.fetch("TABLE_NAME")
    • Call DB.ensure_table!(TABLE) — all schema columns are already included
    • Call DB.upsert(TABLE, row) with at least council_reference and address
    • Call enrich_after_upsert! after each upsert
  3. Add the council to COUNCIL_MAP in lib/util.rb if PlanBuild integration is needed.
  4. Test locally: TABLE_NAME=da_<councilname> ruby scrapers/<councilname>.rb
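
The minimum contract in step 2 can be sketched as below. The DB module and enrich_after_upsert! defined here are stand-ins so the example runs outside the repo; in the project they come from lib/, and their real signatures (including whether rows use symbol or string keys, and which arguments enrich_after_upsert! takes) are assumptions. The fallback table name da_example is also hypothetical.

```ruby
# Stand-in for lib/db.rb so this sketch is self-contained (signatures assumed).
module DB
  ROWS = Hash.new { |hash, table| hash[table] = {} }

  def self.ensure_table!(table)
    ROWS[table]
  end

  # Upsert keyed on (council_reference, address), as described above.
  def self.upsert(table, row)
    key = [row.fetch(:council_reference), row.fetch(:address)]
    ROWS[table][key] = (ROWS[table][key] || {}).merge(row)
  end
end

# Stand-in for the enrichment hook; a no-op here.
def enrich_after_upsert!(_table, _row); end

TABLE = ENV.fetch("TABLE_NAME", "da_example") # default added for this sketch only
DB.ensure_table!(TABLE)

# A real scraper would fetch and parse the council page to build these rows.
parsed_rows = [
  { council_reference: "DA-0001", address: "1 Example Street", description: "Placeholder proposal" }
]

parsed_rows.each do |row|
  DB.upsert(TABLE, row)
  enrich_after_upsert!(TABLE, row)
end

puts "#{DB::ROWS[TABLE].size} row(s) upserted into #{TABLE}"
```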

PDF Classification (LLM)

After PDFs are downloaded, tools/classify_pdfs.rb extracts text from each PDF using pdftotext and sends it to a local Ollama instance to classify the application type.

Application types: Residential, Commercial, Industrial, Subdivision, Rural/Agriculture, Tourism/Visitor Accommodation, Outbuilding/Shed, Change of Use, Demolition, Signage, Other

# Classify all unclassified PDFs (dry run first)
docker compose run --rm -e DRY_RUN=1 scraper ruby /app/tools/classify_pdfs.rb

# Run for real
docker compose run --rm scraper ruby /app/tools/classify_pdfs.rb

# Single council
docker compose run --rm -e ONLY_TABLE=da_northernmidlands scraper ruby /app/tools/classify_pdfs.rb

# Re-classify existing (overwrite)
docker compose run --rm -e RECLASSIFY=1 scraper ruby /app/tools/classify_pdfs.rb

# Use a different model
docker compose run --rm -e LLM_MODEL=gemma3 scraper ruby /app/tools/classify_pdfs.rb

Results are written to application_type, application_type_raw (full LLM response for auditing), and application_type_at (timestamp). The web portal displays the type as a badge and supports filtering by type.
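
The classification step can be pictured as building a request for Ollama and mapping its free-text answer back onto the fixed list of types. The sketch below is not the code in tools/classify_pdfs.rb: the prompt wording, the 4000-character truncation, and the helper names are assumptions. The payload fields (model, prompt, stream) are those accepted by Ollama's /api/generate endpoint, but whether the tool uses that endpoint is not stated here.

```ruby
require "json"

APPLICATION_TYPES = [
  "Residential", "Commercial", "Industrial", "Subdivision", "Rural/Agriculture",
  "Tourism/Visitor Accommodation", "Outbuilding/Shed", "Change of Use",
  "Demolition", "Signage", "Other"
].freeze

# Hypothetical request body for POST <LLAMA_URL>/api/generate.
def ollama_payload(model, pdf_text)
  prompt = "Classify this development application as exactly one of: " \
           "#{APPLICATION_TYPES.join(', ')}.\n\n#{pdf_text.to_s[0, 4000]}"
  { model: model, prompt: prompt, stream: false }
end

# Map a raw LLM answer to a listed type: the earliest listed type mentioned
# in the text wins, and anything unrecognised falls back to "Other".
def normalise_type(raw)
  text = raw.to_s.downcase
  candidates = APPLICATION_TYPES - ["Other"]
  matches = candidates.select { |type| text.include?(type.downcase) }
  matches.min_by { |type| text.index(type.downcase) } || "Other"
end

puts JSON.generate(ollama_payload("llama3.2", "Proposed two-lot subdivision ..."))
puts normalise_type("This appears to be a Subdivision application.")
```

Keeping the raw answer alongside the normalised value, as application_type_raw does, makes it possible to audit cases where the mapping fell back to Other.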


Error Summary Emails

If any scraper exits with an error and SMTP_HOST is configured in .env, run_all.sh calls tools/send_summary_email.rb to send an HTML summary email. The email contains a colour-coded table listing every scraper with its saved count and error status.


Tools

| Script | Purpose |
|---|---|
| tools/backfill_geocode.rb | Batch geocode backfill for existing rows (supports ONLY_TABLE, DRY_RUN) |
| tools/classify_pdfs.rb | LLM classification of downloaded PDFs; sets application_type on each row |
| tools/send_summary_email.rb | Sends HTML error-summary email via SMTP; called by run_all.sh on ERROR |
| tools/backfill_dorset_docs.rb | Backfill PDF links for Dorset rows |
| tools/import_sqlites.rb | Import data from legacy SQLite exports |
| planbuild_fetch.js | Playwright-based scraper for the PlanBuild TAS portal |

Project Structure

tas_councils/
├── lib/
│   ├── db.rb             # DB connection, table creation, dynamic upsert logic
│   ├── http.rb           # HTTP client — browser-fingerprint headers, retries, WAF warmup, curl fallback
│   ├── geocode.rb        # Google Maps geocoding with SHA1 cache
│   ├── enrich.rb         # Post-upsert enrichment pipeline
│   ├── util.rb           # Date parsing, council/table name mappings
│   ├── scraper_helpers.rb # Shared helpers: abs_url, text_or, upsert_and_enrich!
│   ├── migrate.rb        # Sequential schema migration runner
│   └── llm.php           # LLM inference helper for PHP (llama-swap + Ollama)
├── scrapers/             # One .rb file per council
├── web/                  # PHP search portal (Apache)
├── tools/                # Standalone backfill and migration scripts
├── run_all.sh            # Discovers and runs scrapers (supports ONLY/SKIP)
├── entrypoint.sh         # Docker entrypoint; optionally loops on a schedule
├── Dockerfile            # Ruby 3.2 scraper image
├── docker-compose.yml    # Full stack: db, scraper, web, adminer
└── .env                  # Secrets — never commit this file