CLAUDE.md 17 KB

CLAUDE.md — Project Guide for Claude Code

What This Project Does

This is a scraping pipeline that collects Tasmanian planning development applications (DAs) from 29 council websites, geocodes them, and serves them via a PHP search portal. The code is a mix of Ruby (scrapers, enrichment), PHP (web portal), and shell (orchestration), all wired together with Docker Compose.


Key Files

File Role
lib/db.rb DB client, ensure_table!, upsert (dynamic columns, write-once semantics)
lib/http.rb HTTP client — retries, cookie jar, browser-fingerprint headers, 403/406 warmup, curl fallback
lib/geocode.rb Google Maps geocoding with SHA1 cache in geo_cache table
lib/enrich.rb enrich_after_upsert! — geocoding + property lookup after each DB write
lib/util.rb parse_aus_date, council-name/table-name mappings
lib/scraper_helpers.rb Shared helpers: abs_url, text_or, upsert_and_enrich!
lib/migrate.rb Sequential schema migration runner — add new migrations at end of MIGRATIONS array
lib/llm.php LLM inference helper for PHP — calls Ollama-compatible API (llama-swap primary, Ollama fallback)
run_all.sh Discovers scrapers/*.rb, filters by ONLY/SKIP, runs each with TABLE_NAME set; prints summary table; emails on error
entrypoint.sh Docker entry; waits for DB then runs run_all.sh (looping if SCRAPE_EVERY_MINUTES is set)
scrapers/*.rb One scraper per council — parses HTML, upserts rows, calls enrich_after_upsert!
web/index.php Search portal — dynamic UNION across all da_* tables
tools/send_summary_email.rb Sends HTML error-summary email via SMTP (called by run_all.sh when any scraper ERRORs)
tools/classify_pdfs.rb LLM PDF classification backfill — sets application_type on rows with a downloaded PDF
tools/backfill_geocode.rb Batch geocode backfill for existing rows (supports ONLY_TABLE, DRY_RUN)

Running Things Locally

# Full stack
docker compose up -d

# Run all scrapers once
docker compose run --rm scraper /app/run_all.sh

# Run a single scraper
TABLE_NAME=da_brighton DEBUG=1 ruby scrapers/brighton.rb

# Run a subset
ONLY=meandervalley,westtamar docker compose run --rm scraper /app/run_all.sh

# Geocode backfill (batch, all tables)
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="..." \
  scraper ruby /app/tools/backfill_geocode.rb

# Geocode backfill (single table)
docker compose run --rm \
  -e GOOGLE_MAPS_API_KEY="..." \
  -e ONLY_TABLE=da_brighton \
  scraper ruby /app/tools/backfill_geocode.rb

Architecture Conventions

Each scraper follows this pattern

  1. TABLE = ENV.fetch("TABLE_NAME") — set by run_all.sh from the filename
  2. DB.ensure_table!(TABLE) — idempotent schema setup (all columns already included)
  3. Fetch HTML via Http.get(url) (handles retries, cookies, WAF warmup, browser-fingerprint headers)
  4. Parse with Nokogiri
  5. DB.upsert(TABLE, row) — upserts on (council_reference, address), write-once for date_received
  6. enrich_after_upsert!(table:, council_reference:, address:) — geocodes and enriches

WAF / Cloudflare handling

  • lib/http.rb sends a full browser fingerprint on every request: User-Agent, sec-ch-ua*, Sec-Fetch-*, Upgrade-Insecure-Requests. This satisfies most WAF header checks automatically.
  • For sites that also need a warm cookie state (e.g. Burnie, King Island, Latrobe, Derwent Valley), the scraper implements a proactive homepage warmup before fetching the target page — see burnie.rb as the reference implementation.
  • Some councils (Kentish, Derwent Valley via direct site) use Cloudflare JS challenge which cannot be solved without a real browser. These exit cleanly with a warning. Where a PlanBuild equivalent exists (council code in COUNCIL_MAP), data is still collected via planbuild.rb.
  • The warmup pattern (custom CookieJar + http_get with redirect handling) is self-contained in scrapers that need it and does not depend on lib/http.rb.

PDF Downloads

  • Only happen when DOWNLOAD_ATTACHMENTS=1 (set in docker-compose.yml or at runtime)
  • Files land in DOWNLOAD_DIR/<councilname>/<ref>/filename.pdf inside the container
  • The web container mounts the same folder at /srv/files and Apache serves it via Alias /files /srv/files
  • local_document_url must be stored as /files/<councilname>/... — not /downloads/.... The Apache alias is /files, not /downloads.
  • The web portal prefers local_document_url over document_url when rendering the document button
  • For multi-document DAs (e.g. Launceston), all docs are stored as JSON in documents_json and rendered as a list of buttons in the portal

Write-once fields (in DB.upsert)

  • date_received — never overwritten once set
  • date_received_raw — never overwritten once non-blank
  • document_url / local_document_url — new value only replaces if existing is NULL

Table names

  • Always derived from the scraper filename: scrapers/foo.rbda_foo
  • run_all.sh sets TABLE_NAME=da_<basename> before invoking each scraper
  • The COUNCIL_MAP in lib/util.rb maps internal council keys to table names (used by PlanBuild integration)

run_all.sh summary table

  • After all scrapers finish, prints a formatted table: Council | Saved | Warns | Status
  • Status values: ok, warn, blocked (Cloudflare), ERROR (non-zero exit)
  • Saved count: parsed from scraper stdout — looks for "Saved N" (case-insensitive) first, falls back to counting "Upserted" lines
  • All scrapers should end with puts "Done #{TABLE}. Saved #{n} item(s)." for correct summary parsing
  • If any scraper has ERROR status and SMTP_HOST is set, tools/send_summary_email.rb sends an HTML summary email

Error Handling Conventions

  • URI building (URI.join, URI.parse) → rescue URI::InvalidURIError
  • DB operations (prepare/execute) → rescue Mysql2::Error => e; Log.warn ...
  • Zlib decompressionrescue Zlib::Error
  • Date parsing (Date.strptime, Date.parse) → rescue ArgumentError, Date::Error
  • JSON parsingrescue JSON::ParserError
  • Network/HTTPrescue Net::HTTPError, Net::ReadTimeout, Net::OpenTimeout, OpenSSL::SSL::SSLError, Errno::ECONNRESET, EOFError
  • Enrichment failures always warn to stderr — do not gate them behind ENRICH_DEBUG
  • No bare rescue — always specify the exception class(es)

Adding or Modifying a Scraper

When a council changes its website markup, only that scraper needs updating. The typical failure mode is:

  • Found 0 rows — CSS selector no longer matches; inspect the live page and update the selector
  • HTTP 403/406 — Council site added WAF; check Http.get options or add a proactive warmup step (see burnie.rb)
  • Cloudflare JS challenge ("Just a moment" in body) — cannot be solved in Ruby; exit cleanly with a warning
  • date_received all nil — Date format changed; update the format string passed to Util.parse_aus_date or Date.strptime

Template choice:

  • Simple HTML list/table with one entry per row → copy glamorgan.rb
  • Single page, entries grouped under <h2> headings → copy northernmidlands.rb
  • Single page, entries under <h2> with labeled <strong> fields + PDF in <ul> → copy westtamar.rb
  • Link/PDF listing → copy centralhighlands.rb
  • WAF-protected site needing homepage warmup → copy kingisland.rb (minimal) or burnie.rb (full-featured with PDF download)
  • Multi-hop redirect to detail pages → copy derwentvalley.rb

The shared infrastructure (Http, DB, enrich_after_upsert!) handles everything else.


Database Notes

  • MariaDB 10.11, utf8mb4 encoding throughout
  • Schema is created on-the-fly — CREATE TABLE IF NOT EXISTS + ALTER TABLE ... ADD COLUMN IF NOT EXISTS
  • Schema changes go in lib/migrate.rb (new migration at end of MIGRATIONS array) or lib/db.rb (ensure_table!) for columns every new table gets
  • The geo_cache table stores geocoding results keyed by SHA1 of the normalised query string — avoids redundant Google API calls
  • The UNIQUE KEY uniq_ref_addr (council_reference, address) constraint drives the upsert behaviour
  • Current migration versions: v1 (enrichment/geocode columns), v2 (geo_cache table), v3 (documents_json), v4 (Launceston detail columns), v5 (rewrite /downloads/ → /files/ in local_document_url), v6 (LLM classification columns)

Schema — notable columns added beyond base

Column Type Notes
documents_json MEDIUMTEXT JSON array of {name, url, local_url} — used when a DA has multiple PDFs (e.g. Launceston)
status VARCHAR(100) Application status text (Launceston eProperty)
assigned_officer VARCHAR(255) Assigned planning officer (Launceston)
group VARCHAR(100) Application group (Launceston) — reserved SQL word, always quoted
category VARCHAR(100) Application category (Launceston)
application_valid DATE Date application deemed valid (Launceston)
advertised_on DATE Date first advertised (Launceston)
property_legal_description TEXT Certificate of Title / legal description (Launceston)

Web Portal Notes

  • web/index.php dynamically discovers all da_* tables and builds a UNION query
  • It handles missing columns gracefully (not all tables have every column)
  • Document display: if documents_json is present → renders a button per document using the name from JSON; otherwise falls back to single "Open document" button using local_document_urldocument_url
  • web/backfill_pid_title.php is a legacy admin tool — it should not be publicly accessible

Email Summary

tools/send_summary_email.rb is called by run_all.sh when any scraper exits with ERROR status. It:

  • Reads SMTP config from env vars: SMTP_HOST, SMTP_PORT, SMTP_USERNAME, SMTP_PASSWORD, SMTP_SMTPSecure (tls/ssl), SMTP_SENTFROM, SMTP_ADDADDRESS
  • Uses Ruby stdlib net/smtp — no gems required
  • Sends multipart (plain + HTML) email with colour-coded summary table
  • Silently skips if SMTP_HOST is not set

Common Gotchas

  • TABLE constant conflicts: Each scraper defines TABLE = ENV.fetch("TABLE_NAME") at the top level. If you require two scrapers in the same Ruby process you'll get a constant redefinition warning. Each scraper is designed to be run as a standalone script.
  • COUNCIL_FILTER env var: Used only by scrapers/hobartcity.rb to filter which councils to scrape from the Hobart eProperty portal. It has no effect on any other scraper or on run_all.sh. To run a subset of scrapers, use ONLY / SKIP in run_all.sh.
  • PlanBuild scrapers: planbuild.rb handles councils on the state-run PlanBuild portal. It writes to per-council tables using Util.ref_to_table. These run alongside the council-specific scrapers.
  • PDF download path: local_document_url must begin with /files/ (not /downloads/). The Apache alias in web/000-files.conf is Alias /files /srv/files. Using /downloads/ results in 404 in the web portal.
  • Binary PDF downloads: Pass headers: { "Accept" => "application/pdf,*/*", "Referer" => URL } to Http.get when downloading PDFs from CDN subdomains — some CDNs reject requests without a valid referrer.
  • Non-ASCII in PDF URLs: Some council sites embed Unicode characters (e.g. en-dash ) directly in PDF filenames. Always percent-encode hrefs before passing to URI.join — see burnie.rb first_pdf_on_detail for the pattern.
  • Redirect loops in Net::HTTP.start blocks: next inside a Net::HTTP.start block exits the block, not the enclosing while loop. Use a redirect_to variable set inside the block and call next on the while loop after the block returns — see burnie.rb http_get_with_cookies.
  • Cloudflare JS challenge vs IP block: A JS challenge ("Just a moment") may work from a residential IP but always block from a datacenter/Docker IP. Detect it and exit cleanly. Sites confirmed blocked in Docker: derwentvalley.tas.gov.au, latrobe.tas.gov.au, kentish.tas.gov.au, centralhighlands.tas.gov.au. Where a PlanBuild equivalent exists (council code in COUNCIL_MAP), data is still collected via planbuild.rb.
  • group column: This is a reserved SQL word. In DB.upsert it is safe because all column names are backtick-quoted. In raw SQL always write `group`.

Next Phase — LLM-Based PDF Classification

Goal

Extract structured information from downloaded DA PDFs using a local LLaMA model — primarily application type (Residential, Commercial, Industrial, Subdivision, etc.) but potentially other fields not reliably scraped from HTML (e.g. lot size, number of dwellings, value of works).

LLM Infrastructure

A local Ollama instance is running at http://192.168.8.73:11434 (env var: LLAMA_URL).

lib/llm.php (already in the repo) shows the integration pattern for PHP:

  • Primary backend: llama-swap via OpenAI-compatible /v1/chat/completions
  • Fallback: Ollama /api/generate
  • Config loaded from config/ai.phpLLAMACPP_HOST, OLLAMA_HOST, LLAMACPP_MODEL, OLLAMA_MODEL, etc.

For the Ruby scraper pipeline the equivalent is a direct Ollama HTTP call (no gems needed — stdlib net/http):

# Minimal Ollama call — POST to /api/generate
require "net/http"
require "json"

def llm_classify(text, model: "llama3.2")
  uri  = URI("#{ENV.fetch('LLAMA_URL', 'http://192.168.8.73:11434')}/api/generate")
  body = JSON.generate(model: model, prompt: text, stream: false)
  res  = Net::HTTP.post(uri, body, "Content-Type" => "application/json")
  JSON.parse(res.body)["response"].to_s.strip
rescue StandardError => e
  warn "[llm] #{e.class}: #{e.message}"
  nil
end

Proposed Pipeline

Downloaded PDF (local_document_url)
    │
    ▼
Extract text (pdftotext CLI or pdf-reader gem)
    │
    ▼
Prompt LLM → application_type string
    │
    ▼
DB.upsert / UPDATE da_* SET application_type = ?

Suggested Prompt

You are classifying a Tasmanian planning development application.
Read the following text and return ONLY the single most appropriate
application type from this list:
  Residential, Commercial, Industrial, Subdivision, Rural/Agriculture,
  Tourism/Visitor Accommodation, Outbuilding/Shed, Change of Use,
  Demolition, Signage, Other

Text:
<first 1500 characters of PDF text>

Reply with the type only. No explanation.

Schema Changes Needed

-- Add to ensure_table! and as a new migration:
application_type VARCHAR(60) NULL   -- e.g. "Residential", "Subdivision"
application_type_raw TEXT NULL      -- full LLM response for debugging
application_type_at DATETIME NULL   -- when classification was last run

Implementation Options

Option A — Inline during scrape (simplest):

  • Each scraper that downloads PDFs calls llm_classify immediately after download
  • Adds latency to each scrape run (LLM inference per PDF)
  • Suitable if the LLM is fast (< 5s per classification)

Option B — Backfill tool (recommended):

  • New script tools/classify_pdfs.rb — iterates rows where local_document_url IS NOT NULL AND application_type IS NULL
  • Run separately from run_all.sh, on demand or on a cron
  • Supports ONLY_TABLE env var to process one council at a time
  • Safer — scrape failures don't block classification; can re-run without re-scraping

Option C — PHP tool in web container:

  • New tools/classify_pdfs.php using the existing lib/llm.php
  • Reads PDFs from /srv/files, calls llmGenerate, updates DB
  • Advantage: reuses the already-written PHP LLM helper
  • Disadvantage: PDF text extraction harder in PHP (needs pdftotext shell call or a PHP PDF lib)

PDF Text Extraction

pdftotext (part of poppler-utils) is the most reliable option:

def extract_pdf_text(local_path, max_chars: 2000)
  # local_path is relative like "/files/northernmidlands/PLN-26-0030/doc.pdf"
  # Map to filesystem path inside container
  fs_path = local_path.sub(%r{\A/files/}, "#{ENV.fetch('DOWNLOAD_DIR', '/app/downloads')}/")
  return nil unless File.exist?(fs_path)

  text, = Open3.capture2("pdftotext", "-l", "3", fs_path, "-")
  text.to_s.gsub(/\s+/, " ").strip[0, max_chars]
rescue StandardError => e
  warn "[classify] pdftotext failed for #{fs_path}: #{e.message}"
  nil
end

pdftotext may need to be installed in the scraper Dockerfile:

RUN apt-get install -y poppler-utils

Implementation

  • Approach: Option B (backfill tool) — tools/classify_pdfs.rb runs independently of run_all.sh
  • Model: llama3.2 (3B, fast) by default; override with LLM_MODEL env var
  • Prompt: Zero-shot, plain-text response (no JSON overhead for a fixed classification list)
  • Raw response: Always stored in application_type_raw for auditing
  • Re-classification: Write-once by default; set RECLASSIFY=1 to overwrite
  • PDF extraction: pdftotext -l 3 (first 3 pages, from poppler-utils in Dockerfile)
  • Qwen3 note: The tool strips <think>...</think> tags from models that produce reasoning output before normalising the response