This is a scraping pipeline that collects Tasmanian planning development applications (DAs) from 29 council websites, geocodes them, and serves them via a PHP search portal. The code is a mix of Ruby (scrapers, enrichment), PHP (web portal), and shell (orchestration), all wired together with Docker Compose.
| File | Role |
|---|---|
| `lib/db.rb` | DB client, `ensure_table!`, `upsert` (with write-once semantics for some fields) |
| `lib/http.rb` | HTTP client: retries, cookie jar, 403/406 warmup, curl fallback |
| `lib/geocode.rb` | Google Maps geocoding with SHA1 cache in `geo_cache` table |
| `lib/enrich.rb` | `enrich_after_upsert!`: geocoding + property lookup after each DB write |
| `lib/util.rb` | `parse_aus_date`, council-name/table-name mappings |
| `run_all.sh` | Discovers `scrapers/*.rb`, filters by `ONLY`/`SKIP`, runs each with `TABLE_NAME` set |
| `entrypoint.sh` | Docker entry; waits for DB then runs `run_all.sh` (looping if `SCRAPE_EVERY_MINUTES` is set) |
| `scrapers/*.rb` | One scraper per council: parses HTML, upserts rows, calls `enrich_after_upsert!` |
| `web/index.php` | Search portal: dynamic UNION across all `da_*` tables |
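The geocoding cache row above can be illustrated with a small sketch. It assumes the cache key is a SHA1 hex digest of a normalised query string; the normalisation shown here (trim, collapse whitespace, lowercase) and the names `normalise_query`, `geo_cache_key` and `GeoCache` are hypothetical and may not match what `lib/geocode.rb` does. The `GeoCache` class is an in-memory stand-in for the `geo_cache` table.

```ruby
require "digest/sha1"

# Hypothetical normalisation: trim, collapse runs of whitespace, lowercase.
# The real rules in lib/geocode.rb may differ.
def normalise_query(address)
  address.to_s.strip.gsub(/\s+/, " ").downcase
end

# SHA1 hex digest of the normalised query, used as the cache key.
def geo_cache_key(address)
  Digest::SHA1.hexdigest(normalise_query(address))
end

# In-memory stand-in for the geo_cache table (hypothetical).
class GeoCache
  def initialize
    @rows = {}
  end

  # Returns the cached result, or calls the block once and stores its result.
  def fetch(address)
    key = geo_cache_key(address)
    return @rows[key] if @rows.key?(key)

    @rows[key] = yield(normalise_query(address))
  end
end
```

With a cache like this, two spellings of the same address that normalise to the same string trigger only one geocoding call.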
    # Full stack
    docker compose up -d

    # Run all scrapers once
    docker compose run --rm scraper /app/run_all.sh

    # Run a single scraper
    TABLE_NAME=da_brighton DEBUG=1 ruby scrapers/brighton.rb

    # Run a subset
    ONLY=meandervalley,kent docker compose run --rm scraper /app/run_all.sh

    # Geocode backfill (batch, all tables)
    docker compose run --rm \
      -e GOOGLE_MAPS_API_KEY="..." \
      scraper ruby /app/tools/backfill_geocode.rb

    # Geocode backfill (single table)
    docker compose run --rm \
      -e GOOGLE_MAPS_API_KEY="..." \
      -e ONLY_TABLE=da_brighton \
      scraper ruby /app/tools/backfill_geocode.rb
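For readers who want to see how the `ONLY` / `SKIP` filtering and the `TABLE_NAME` derivation fit together, here is a minimal Ruby sketch. The real logic lives in `run_all.sh` (a shell script); the helper names `table_name_for`, `parse_list` and `select_scrapers` are hypothetical, and exact matching on the scraper's basename is an assumption.

```ruby
# Derive the table name from a scraper path: scrapers/foo.rb -> da_foo.
def table_name_for(path)
  "da_#{File.basename(path, '.rb')}"
end

# Parse a comma-separated value such as "meandervalley,kent".
def parse_list(value)
  value.to_s.split(",").map(&:strip).reject(&:empty?)
end

# Keep scrapers named in ONLY (when it is non-empty) and drop those in SKIP.
# Matching on the exact basename is an assumption about run_all.sh.
def select_scrapers(paths, only: nil, skip: nil)
  only_names = parse_list(only)
  skip_names = parse_list(skip)
  paths.sort.select do |path|
    name = File.basename(path, ".rb")
    (only_names.empty? || only_names.include?(name)) && !skip_names.include?(name)
  end
end
```

Calling `select_scrapers(Dir.glob("scrapers/*.rb"), only: ENV["ONLY"], skip: ENV["SKIP"])` and passing each result to `table_name_for` would give the script/table pairs to run.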
Each scraper follows the same pattern:

- `TABLE = ENV.fetch("TABLE_NAME")`: set by `run_all.sh` from the filename
- `DB.ensure_table!(TABLE)`: idempotent schema setup (all columns already included)
- `Http.get(url)`: handles retries, cookies, WAF warmup
- `DB.upsert(TABLE, row)`: upserts on `(council_reference, address)`, write-once for `date_received`
- `enrich_after_upsert!(table:, council_reference:, address:)`: geocodes and enriches

Write-once fields (`DB.upsert`):

- `date_received`: never overwritten once set
- `date_received_raw`: never overwritten once non-blank
- `document_url` / `local_document_url`: new value only replaces if existing is NULL

Table naming:

- `scrapers/foo.rb` → `da_foo`
- `run_all.sh` sets `TABLE_NAME=da_<basename>` before invoking each scraper
- `COUNCIL_MAP` in `lib/util.rb` maps internal council keys to table names (used by PlanBuild integration)

After a refactor, the project follows these rules:
- URI parsing (`URI.join`, `URI.parse`) → `rescue URI::InvalidURIError`
- Database calls → `rescue Mysql2::Error => e; warn "[scraper] ..."`
- Decompression → `rescue Zlib::Error`
- Date parsing (`Date.strptime`, `Date.parse`) → `rescue ArgumentError, Date::Error`
- JSON parsing → `rescue JSON::ParserError`
- Network requests → `rescue Net::HTTPError, Net::ReadTimeout, Net::OpenTimeout, OpenSSL::SSL::SSLError, Errno::ECONNRESET, EOFError`
- Report errors with `warn` to stderr; do not gate them behind `ENRICH_DEBUG`
- No bare `rescue`: always specify the exception class(es)

When a council changes its website markup, only that scraper needs updating. Typical failure modes are:
- Found 0 rows: the CSS selector no longer matches; inspect the live page and update the selector
- Requests rejected (for example with 403/406): adjust `Http.get` options or add a warmup step
- `date_received` all nil: the date format changed; update the format string passed to `Util.parse_aus_date` or `Date.strptime`

To add a new scraper, copy a structurally similar one (e.g. `glamorgan.rb` for table-based sites, `centralhighlands.rb` for link/PDF-based sites) and adapt the parsing logic. The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everything else.
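When a date format changes, the fix usually comes down to which format strings are tried. The sketch below is a hypothetical stand-in for `Util.parse_aus_date` (its real signature and format list are not shown in this document); it tries a few day-first formats and rescues only the exception classes named in the rules above. It assumes Ruby 2.7 or later, where `Date::Error` is defined.

```ruby
require "date"

# Hypothetical list of day-first formats; adjust to match the council's markup.
AUS_DATE_FORMATS = ["%d/%m/%Y", "%d-%m-%Y", "%d %B %Y"].freeze

# Hypothetical stand-in for Util.parse_aus_date: returns a Date, or nil when
# no format matches. Only the date-parsing exceptions are rescued.
def parse_aus_date_sketch(text, formats: AUS_DATE_FORMATS)
  value = text.to_s.strip
  return nil if value.empty?

  formats.each do |fmt|
    begin
      return Date.strptime(value, fmt)
    rescue ArgumentError, Date::Error
      next # this format did not match; try the next one
    end
  end

  # Always report to stderr, not behind a debug flag.
  warn "[scraper] unparseable date: #{value.inspect}"
  nil
end
```

Adding a new format for a changed site is then a one-line change, for example `parse_aus_date_sketch(cell_text, formats: ["%d.%m.%Y"])`.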
Database:

- utf8mb4 encoding throughout
- Idempotent schema setup: `CREATE TABLE IF NOT EXISTS` + `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`
- New columns go in `lib/migrate.rb` (new migration at end of `MIGRATIONS` array) or `lib/db.rb` (`ensure_table!`) for columns every new table gets
- The `geo_cache` table stores geocoding results keyed by SHA1 of the normalised query string, which avoids redundant Google API calls
- The `UNIQUE KEY uniq_ref_addr (council_reference, address)` constraint drives the upsert behaviour

Web portal:

- `web/index.php` dynamically discovers all `da_*` tables and builds a UNION query
- `web/backfill_pid_title.php` is a legacy admin tool; it should not be publicly accessible. Consider moving it out of the web root or placing it behind authentication.

Other notes:

- `TABLE` constant conflicts: Each scraper defines `TABLE = ENV.fetch("TABLE_NAME")` at the top level. If you require two scrapers in the same Ruby process you'll get a constant redefinition warning. Each scraper is designed to be run as a standalone script.
- `COUNCIL_FILTER` / `COUNCIL_WHITELIST`: The `docker-compose.yml` has a `COUNCIL_WHITELIST` env var that is passed to the scraper container but is not wired into `run_all.sh`. Use `ONLY` / `SKIP` in `run_all.sh` instead.
- `planbuild.rb` and `planbuild_fetch.js` handle councils on the state-run PlanBuild portal. They write to per-council tables using `Util.ref_to_table`. These are separate from the council-specific scrapers.
- Attachments are downloaded when `DOWNLOAD_ATTACHMENTS=1` is set. Files land in `DOWNLOAD_DIR/<councilname>/`. The web portal serves them from `/downloads/` via an Apache alias.
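The upsert behaviour driven by `uniq_ref_addr`, combined with the write-once fields (`date_received`, `date_received_raw`, `document_url`, `local_document_url`), can be modelled in plain Ruby. This is only an in-memory illustration of the semantics: the real `DB.upsert` works against MySQL, and the helpers `blank?`, `merge_row` and `upsert!` are hypothetical names.

```ruby
# Plain-Ruby model of the write-once rules; not the real DB.upsert.
def blank?(value)
  value.nil? || value.to_s.strip.empty?
end

# Merge an incoming row into the stored row, protecting write-once fields.
def merge_row(existing, incoming)
  return incoming.dup if existing.nil?

  merged = existing.merge(incoming)
  # date_received: never overwritten once set
  merged[:date_received] = existing[:date_received] unless existing[:date_received].nil?
  # date_received_raw: never overwritten once non-blank
  merged[:date_received_raw] = existing[:date_received_raw] unless blank?(existing[:date_received_raw])
  # document_url / local_document_url: only filled while the stored value is NULL
  %i[document_url local_document_url].each do |col|
    merged[col] = existing[col] unless existing[col].nil?
  end
  merged
end

# In-memory table keyed like UNIQUE KEY uniq_ref_addr (council_reference, address).
def upsert!(table, row)
  key = [row[:council_reference], row[:address]]
  table[key] = merge_row(table[key], row)
end
```

Ordinary columns such as a description are overwritten on every run, while the protected columns keep their first usable value.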