This is a scraping pipeline that collects Tasmanian planning development applications (DAs) from 29 council websites, geocodes them, and serves them via a PHP search portal. The code is a mix of Ruby (scrapers, enrichment), PHP (web portal), and shell (orchestration), all wired together with Docker Compose.
| File | Role |
|---|---|
| `lib/db.rb` | DB client, `ensure_table!`, `upsert` (dynamic columns, write-once semantics) |
| `lib/http.rb` | HTTP client: retries, cookie jar, browser-fingerprint headers, 403/406 warmup, curl fallback |
| `lib/geocode.rb` | Google Maps geocoding with SHA1 cache in `geo_cache` table |
| `lib/enrich.rb` | `enrich_after_upsert!`: geocoding + property lookup after each DB write |
| `lib/util.rb` | `parse_aus_date`, council-name/table-name mappings |
| `lib/scraper_helpers.rb` | Shared helpers: `abs_url`, `text_or`, `upsert_and_enrich!` |
| `lib/migrate.rb` | Sequential schema migration runner; add new migrations at end of `MIGRATIONS` array |
| `lib/llm.php` | LLM inference helper for PHP; calls Ollama-compatible API (llama-swap primary, Ollama fallback) |
| `run_all.sh` | Discovers `scrapers/*.rb`, filters by `ONLY`/`SKIP`, runs each with `TABLE_NAME` set; prints summary table; emails on error |
| `entrypoint.sh` | Docker entry; waits for DB then runs `run_all.sh` (looping if `SCRAPE_EVERY_MINUTES` is set) |
| `scrapers/*.rb` | One scraper per council: parses HTML, upserts rows, calls `enrich_after_upsert!` |
| `web/index.php` | Search portal: dynamic `UNION` across all `da_*` tables |
| `tools/send_summary_email.rb` | Sends HTML error-summary email via SMTP (called by `run_all.sh` when any scraper ERRORs) |
| `tools/backfill_geocode.rb` | Batch geocode backfill for existing rows (supports `ONLY_TABLE`, `DRY_RUN`) |

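The `geo_cache` lookup listed for `lib/geocode.rb` is keyed by a SHA1 digest of the normalised query string. The sketch below shows one way such a key could be derived. The function name `geo_cache_key` and the normalisation steps (trim, collapse whitespace, lowercase) are assumptions for illustration, not the repo's actual rules.

```ruby
require "digest"

# Hypothetical helper (not from lib/geocode.rb): derive a cache key for an address.
# ASSUMPTION: normalisation = trim, collapse runs of whitespace, lowercase.
def geo_cache_key(query)
  normalised = query.to_s.strip.gsub(/\s+/, " ").downcase
  Digest::SHA1.hexdigest(normalised)
end

# Two spellings of the same address map to the same 40-character hex key.
key_a = geo_cache_key("  12 Main St,  Hobart TAS ")
key_b = geo_cache_key("12 main st, hobart tas")
puts key_a == key_b
```

Whatever normalisation is used, it has to be identical at write and read time, otherwise cached rows are never hit and the Google API is called again.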

Common commands:

    # Full stack
    docker compose up -d

    # Run all scrapers once
    docker compose run --rm scraper /app/run_all.sh

    # Run a single scraper
    TABLE_NAME=da_brighton DEBUG=1 ruby scrapers/brighton.rb

    # Run a subset
    ONLY=meandervalley,westtamar docker compose run --rm scraper /app/run_all.sh

    # Geocode backfill (batch, all tables)
    docker compose run --rm \
      -e GOOGLE_MAPS_API_KEY="..." \
      scraper ruby /app/tools/backfill_geocode.rb

    # Geocode backfill (single table)
    docker compose run --rm \
      -e GOOGLE_MAPS_API_KEY="..." \
      -e ONLY_TABLE=da_brighton \
      scraper ruby /app/tools/backfill_geocode.rb

Each scraper follows the same sequence:

1. `TABLE = ENV.fetch("TABLE_NAME")`: set by `run_all.sh` from the filename
2. `DB.ensure_table!(TABLE)`: idempotent schema setup (all columns already included)
3. `Http.get(url)`: handles retries, cookies, WAF warmup, browser-fingerprint headers
4. `DB.upsert(TABLE, row)`: upserts on `(council_reference, address)`, write-once for `date_received`
5. `enrich_after_upsert!(table:, council_reference:, address:)`: geocodes and enriches

- `lib/http.rb` sends a full browser fingerprint on every request: `User-Agent`, `sec-ch-ua*`, `Sec-Fetch-*`, `Upgrade-Insecure-Requests`. This satisfies most WAF header checks automatically.
- ... `burnie.rb` as the reference implementation.
- ... `COUNCIL_MAP`), data is still collected via `planbuild.rb`.
- ... (`CookieJar` + `http_get` with redirect handling) is self-contained in scrapers that need it and does not depend on `lib/http.rb`.

PDF downloads:

- Enabled with `DOWNLOAD_ATTACHMENTS=1` (set in `docker-compose.yml` or at runtime)
- Files are saved to `DOWNLOAD_DIR/<councilname>/<ref>/filename.pdf` inside the container
- The download directory is mounted at `/srv/files` and Apache serves it via `Alias /files /srv/files`
- `local_document_url` must be stored as `/files/<councilname>/...`, not `/downloads/...`. The Apache alias is `/files`, not `/downloads`.
- The portal prefers `local_document_url` over `document_url` when rendering the document button
- Multiple documents are stored in `documents_json` and rendered as a list of buttons in the portal

Write-once semantics (`DB.upsert`):

- `date_received`: never overwritten once set
- `date_received_raw`: never overwritten once non-blank
- `document_url` / `local_document_url`: new value only replaces if existing is `NULL`

Table naming:

- `scrapers/foo.rb` → `da_foo`
- `run_all.sh` sets `TABLE_NAME=da_<basename>` before invoking each scraper
- `COUNCIL_MAP` in `lib/util.rb` maps internal council keys to table names (used by PlanBuild integration)

Run summary (`run_all.sh`):

- Statuses: `ok`, `warn`, `blocked` (Cloudflare), `ERROR` (non-zero exit)
- Item counts: matches `"Saved N"` (case-insensitive) first, falls back to counting `"Upserted"` lines
- End each scraper with `puts "Done #{TABLE}. Saved #{n} item(s)."` for correct summary parsing
- If `SMTP_HOST` is set, `tools/send_summary_email.rb` sends an HTML summary email

Exception handling:

- `URI.join`, `URI.parse` → `rescue URI::InvalidURIError`
- `rescue Mysql2::Error => e; Log.warn ...`
- `rescue Zlib::Error`
- `Date.strptime`, `Date.parse` → `rescue ArgumentError, Date::Error`
- `rescue JSON::ParserError`
- `rescue Net::HTTPError, Net::ReadTimeout, Net::OpenTimeout, OpenSSL::SSL::SSLError, Errno::ECONNRESET, EOFError`
- ... `warn` to stderr; do not gate them behind `ENRICH_DEBUG`
- No bare `rescue`: always specify the exception class(es)

When a council changes its website markup, only that scraper needs updating. Typical failure modes:

- Found 0 rows: CSS selector no longer matches; inspect the live page and update the selector
- ... `Http.get` options or add a proactive warmup step (see `burnie.rb`)
- Cloudflare challenge (`"Just a moment"` in body): cannot be solved in Ruby; exit cleanly with a warning
- `date_received` all nil: date format changed; update the format string passed to `Util.parse_aus_date` or `Date.strptime`

Template choice:

- ... → copy `glamorgan.rb`
- `<h2>` headings → copy `northernmidlands.rb`
- `<h2>` with labeled `<strong>` fields + PDF in `<ul>` → copy `westtamar.rb`
- ... → copy `centralhighlands.rb`
- ... → copy `kingisland.rb` (minimal) or `burnie.rb` (full-featured with PDF download)
- ... → copy `derwentvalley.rb`

The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everything else.

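For the "`date_received` all nil" failure mode, it can help to try several candidate formats before giving up, using the `rescue ArgumentError, Date::Error` convention listed earlier. The following is a minimal sketch only: `parse_date_with_formats` and its default format list are hypothetical and are not the implementation of `Util.parse_aus_date`.

```ruby
require "date"

# Hypothetical helper (not Util.parse_aus_date): try each strptime format in
# turn and return the first Date that parses, or nil if none match.
# ASSUMPTION: the default formats below are illustrative guesses.
def parse_date_with_formats(raw, formats = ["%d/%m/%Y", "%d %b %Y"])
  text = raw.to_s.strip
  return nil if text.empty?

  formats.each do |fmt|
    begin
      return Date.strptime(text, fmt)
    rescue ArgumentError, Date::Error
      next # format did not match; try the next one
    end
  end
  nil
end

p parse_date_with_formats("03/02/2026")
p parse_date_with_formats("3 Feb 2026")
p parse_date_with_formats("not a date")
```

Returning `nil` rather than raising keeps one unparseable row from aborting the whole scrape, and a run of `nil` values is easy to spot in the summary.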
- `utf8mb4` encoding throughout
- Idempotent schema setup: `CREATE TABLE IF NOT EXISTS` + `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`
- Add new columns in `lib/migrate.rb` (new migration at end of `MIGRATIONS` array) or `lib/db.rb` (`ensure_table!`) for columns every new table gets
- `geo_cache` table stores geocoding results keyed by SHA1 of the normalised query string, which avoids redundant Google API calls
- `UNIQUE KEY uniq_ref_addr (council_reference, address)` constraint drives the upsert behaviour

| Column | Type | Notes |
|---|---|---|
| `documents_json` | MEDIUMTEXT | JSON array of `{name, url, local_url}`, used when a DA has multiple PDFs (e.g. Launceston) |
| `status` | VARCHAR(100) | Application status text (Launceston eProperty) |
| `assigned_officer` | VARCHAR(255) | Assigned planning officer (Launceston) |
| `group` | VARCHAR(100) | Application group (Launceston); reserved SQL word, always quoted |
| `category` | VARCHAR(100) | Application category (Launceston) |
| `application_valid` | DATE | Date application deemed valid (Launceston) |
| `advertised_on` | DATE | Date first advertised (Launceston) |
| `property_legal_description` | TEXT | Certificate of Title / legal description (Launceston) |

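As an illustration of the `documents_json` shape and the `/files/` rule for local URLs, the sketch below builds the JSON array and picks the URL a document button would use. All three helper names are hypothetical. The preference order (local first, then remote) follows the portal behaviour described in this document, and the example URLs are placeholders.

```ruby
require "json"

# Hypothetical helper: portal-facing URL under the Apache /files alias.
# ASSUMPTION: council, ref and filename are already filesystem- and URL-safe.
def local_url_for(council, ref, filename)
  "/files/#{council}/#{ref}/#{filename}"
end

# Hypothetical helper: serialise documents to the {name, url, local_url} shape.
def build_documents_json(docs)
  JSON.generate(docs.map { |d| { name: d[:name], url: d[:url], local_url: d[:local_url] } })
end

# Prefer the local copy when present, otherwise fall back to the remote URL.
def preferred_url(doc)
  local = doc["local_url"].to_s
  local.empty? ? doc["url"] : local
end

docs = [
  { name: "Plans", url: "https://example.org/a.pdf",
    local_url: local_url_for("launceston", "DA0001", "a.pdf") },
  { name: "Report", url: "https://example.org/b.pdf", local_url: nil }
]
JSON.parse(build_documents_json(docs)).each { |d| puts "#{d['name']}: #{preferred_url(d)}" }
```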
- `web/index.php` dynamically discovers all `da_*` tables and builds a `UNION` query
- If `documents_json` is present, it renders a button per document using the name from JSON; otherwise it falls back to a single "Open document" button using `local_document_url` → `document_url`
- `web/backfill_pid_title.php` is a legacy admin tool; it should not be publicly accessible

`tools/send_summary_email.rb` is called by `run_all.sh` when any scraper exits with ERROR status. It:

- Is configured via `SMTP_HOST`, `SMTP_PORT`, `SMTP_USERNAME`, `SMTP_PASSWORD`, `SMTP_SMTPSecure` (tls/ssl), `SMTP_SENTFROM`, `SMTP_ADDADDRESS`
- Uses `net/smtp` (no gems required)
- Does nothing if `SMTP_HOST` is not set

Gotchas:

- `TABLE` constant conflicts: Each scraper defines `TABLE = ENV.fetch("TABLE_NAME")` at the top level. If you require two scrapers in the same Ruby process you'll get a constant redefinition warning. Each scraper is designed to be run as a standalone script.
- `COUNCIL_FILTER` / `COUNCIL_WHITELIST`: The `docker-compose.yml` has a `COUNCIL_WHITELIST` env var that is passed to the scraper container but is not wired into `run_all.sh`. Use `ONLY` / `SKIP` in `run_all.sh` instead.
- `planbuild.rb` handles councils on the state-run PlanBuild portal. It writes to per-council tables using `Util.ref_to_table`. These run alongside the council-specific scrapers.
- `local_document_url` must begin with `/files/` (not `/downloads/`). The Apache alias in `web/000-files.conf` is `Alias /files /srv/files`. Using `/downloads/` results in a 404 in the web portal.
- Pass `headers: { "Accept" => "application/pdf,*/*", "Referer" => URL }` to `Http.get` when downloading PDFs from CDN subdomains; some CDNs reject requests without a valid referrer.
- Some hrefs contain non-ASCII characters (e.g. `–`) directly in PDF filenames. Always percent-encode hrefs before passing to `URI.join`; see `burnie.rb` `first_pdf_on_detail` for the pattern.
- `Net::HTTP.start` blocks: `next` inside a `Net::HTTP.start` block exits the block, not the enclosing `while` loop. Use a `redirect_to` variable set inside the block and call `next` on the `while` loop after the block returns; see `burnie.rb` `http_get_with_cookies`.
- Cloudflare-protected sites (`"Just a moment"` challenge) may work from a residential IP but always block from a datacenter/Docker IP. Detect it and exit cleanly. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`.
- `group` column: This is a reserved SQL word. In `DB.upsert` it is safe because all column names are backtick-quoted. In raw SQL always write `` `group` ``.

Extract structured information from downloaded DA PDFs using a local LLaMA model: primarily application type (Residential, Commercial, Industrial, Subdivision, etc.) but potentially other fields not reliably scraped from HTML (e.g. lot size, number of dwellings, value of works).
A local Ollama instance is running at http://192.168.8.73:11434 (env var: LLAMA_URL).
lib/llm.php (already in the repo) shows the integration pattern for PHP:
- Primary: llama-swap via `/v1/chat/completions`
- Fallback: Ollama via `/api/generate`
- Configuration in `config/ai.php`: `LLAMACPP_HOST`, `OLLAMA_HOST`, `LLAMACPP_MODEL`, `OLLAMA_MODEL`, etc.

For the Ruby scraper pipeline the equivalent is a direct Ollama HTTP call (no gems needed, stdlib `net/http`):

    # Minimal Ollama call: POST to /api/generate
    require "net/http"
    require "json"

    def llm_classify(text, model: "llama3.2")
      uri = URI("#{ENV.fetch('LLAMA_URL', 'http://192.168.8.73:11434')}/api/generate")
      body = JSON.generate(model: model, prompt: text, stream: false)
      res = Net::HTTP.post(uri, body, "Content-Type" => "application/json")
      JSON.parse(res.body)["response"].to_s.strip
    rescue StandardError => e
      warn "[llm] #{e.class}: #{e.message}"
      nil
    end

Pipeline:

    Downloaded PDF (local_document_url)
            │
            ▼
    Extract text (pdftotext CLI or pdf-reader gem)
            │
            ▼
    Prompt LLM → application_type string
            │
            ▼
    DB.upsert / UPDATE da_* SET application_type = ?

Prompt:

    You are classifying a Tasmanian planning development application.
    Read the following text and return ONLY the single most appropriate
    application type from this list:
    Residential, Commercial, Industrial, Subdivision, Rural/Agriculture,
    Tourism/Visitor Accommodation, Outbuilding/Shed, Change of Use,
    Demolition, Signage, Other

    Text:
    <first 1500 characters of PDF text>

    Reply with the type only. No explanation.

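A model will not always reply with an exact list entry (trailing full stops, different casing, extra words), so it may be worth building the prompt and validating the reply in code. The sketch below assumes hypothetical names (`APPLICATION_TYPES`, `build_classification_prompt`, `normalize_application_type`). Returning `nil` for an unrecognised reply is a design choice made here, not something this document specifies.

```ruby
# Allowed labels, copied from the prompt above.
APPLICATION_TYPES = [
  "Residential", "Commercial", "Industrial", "Subdivision", "Rural/Agriculture",
  "Tourism/Visitor Accommodation", "Outbuilding/Shed", "Change of Use",
  "Demolition", "Signage", "Other"
].freeze

# Hypothetical helper: fill the prompt template with the first max_chars of text.
def build_classification_prompt(pdf_text, max_chars: 1500)
  excerpt = pdf_text.to_s[0, max_chars]
  <<~PROMPT
    You are classifying a Tasmanian planning development application.
    Read the following text and return ONLY the single most appropriate
    application type from this list:
    #{APPLICATION_TYPES.join(", ")}

    Text:
    #{excerpt}

    Reply with the type only. No explanation.
  PROMPT
end

# Hypothetical helper: map a raw reply onto the allowed list (case-insensitive,
# ignoring trailing dots/whitespace). Returns nil when nothing matches, so the
# caller can keep the raw reply for debugging and retry later.
def normalize_application_type(response)
  cleaned = response.to_s.strip.sub(/[.\s]+\z/, "")
  APPLICATION_TYPES.find { |type| type.casecmp?(cleaned) }
end

p normalize_application_type("subdivision.\n")
p normalize_application_type("It looks like a house")
```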
Add to `ensure_table!` and as a new migration:

    application_type      VARCHAR(60) NULL  -- e.g. "Residential", "Subdivision"
    application_type_raw  TEXT NULL         -- full LLM response for debugging
    application_type_at   DATETIME NULL     -- when classification was last run

Option A, inline during scrape (simplest):

- ... `llm_classify` immediately after download

Option B, backfill tool (recommended):

- `tools/classify_pdfs.rb` iterates rows where `local_document_url IS NOT NULL AND application_type IS NULL`
- ... `run_all.sh`, on demand or on a cron
- ... `ONLY_TABLE` env var to process one council at a time

Option C, PHP tool in web container:

- `tools/classify_pdfs.php` using the existing `lib/llm.php`
- ... `/srv/files`, calls `llmGenerate`, updates DB
- ... (`pdftotext` shell call or a PHP PDF lib)

`pdftotext` (part of `poppler-utils`) is the most reliable option:

    require "open3"

    def extract_pdf_text(local_path, max_chars: 2000)
      # local_path is a portal-relative URL like "/files/northernmidlands/PLN-26-0030/doc.pdf"
      # Map it to the filesystem path inside the container
      fs_path = local_path.sub(%r{\A/files/}, "#{ENV.fetch('DOWNLOAD_DIR', '/app/downloads')}/")
      return nil unless File.exist?(fs_path)

      # "-l 3" limits extraction to the first three pages; "-" writes to stdout
      text, = Open3.capture2("pdftotext", "-l", "3", fs_path, "-")
      text.to_s.gsub(/\s+/, " ").strip[0, max_chars]
    rescue StandardError => e
      warn "[classify] pdftotext failed for #{fs_path}: #{e.message}"
      nil
    end

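`pdftotext` may need to be installed in the scraper Dockerfile (on Debian-based images the package index has to be refreshed in the same layer):

    RUN apt-get update && apt-get install -y poppler-utils && rm -rf /var/lib/apt/lists/*
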
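To show how the backfill approach (Option B) might fit together, here is a sketch of the per-row loop with database access left out, since the query and update calls in `lib/db.rb` are not shown in this document. `classify_pending` and its `extract:` / `classify:` parameters are hypothetical; in practice they would wrap `extract_pdf_text` and `llm_classify`.

```ruby
# Hypothetical core of a backfill tool. rows are hashes with
# :council_reference, :address and :local_document_url; extract and classify
# are callables so the loop can be exercised without a DB, pdftotext or an LLM.
def classify_pending(rows, extract:, classify:, now: Time.now)
  rows.filter_map do |row|
    text = extract.call(row[:local_document_url])
    next if text.nil? || text.empty?   # PDF missing or no extractable text

    raw = classify.call(text)
    next if raw.nil?                   # LLM call failed; leave the row for a later run

    {
      council_reference: row[:council_reference],
      address: row[:address],
      application_type: raw.strip[0, 60],  # fits the VARCHAR(60) column
      application_type_raw: raw,
      application_type_at: now
    }
  end
end

rows = [
  { council_reference: "DA1", address: "1 A St", local_document_url: "/files/x/DA1/a.pdf" },
  { council_reference: "DA2", address: "2 B St", local_document_url: "/files/x/DA2/missing.pdf" }
]
extract = ->(path) { path.include?("missing") ? nil : "Proposed dwelling and garage" }
classify = ->(_text) { " Residential \n" }
classify_pending(rows, extract: extract, classify: classify).each { |update| p update }
```

Skipped rows keep `application_type` as `NULL`, so they are picked up again on the next run without any extra bookkeeping.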
Open questions:

- ... (`curl http://192.168.8.73:11434/api/tags`)
- Should `application_type` values be overwritten on re-run, or treated as write-once?
- ... `poppler-utils` can be added to the scraper image