A web scraping and data aggregation system for Tasmanian development applications (DAs). It collects planning application notices from all 29 Tasmanian council websites, normalises and geocodes the data, and exposes it via a PHP search portal.
See VERSIONS.md for the changelog.

    ┌─────────────────────────────────────────────────────┐
    │  29 Ruby scrapers (scrapers/*.rb)                   │
    │  Each polls one council website on a schedule       │
    └──────────────────────┬──────────────────────────────┘
                           │ upserts rows
                           ▼
                 MariaDB (da_* tables)
                           │
             ┌─────────────┴─────────────┐
             │                           │
      PHP web portal                Adminer UI
      (web/index.php)               port 9980
      port 9981

Services (Docker Compose):
| Service | Image | Port | Purpose |
|---|---|---|---|
| `db` | `mariadb:10.11` | 3306 | Database |
| `scraper` | Custom (Ruby 3.2) | - | Runs all scrapers on a schedule |
| `web` | Custom (PHP/Apache) | 9981 | Search portal |
| `adminer` | `adminer` | 9980 | Database admin UI |

    cp .env.example .env
    # Edit .env: set DB passwords and your Google Maps API key
    docker compose up -d
    docker compose run --rm scraper /app/run_all.sh

Copy .env.example to .env and fill in the values. Never commit .env.
| Variable | Required | Description |
|---|---|---|
| `MYSQL_DATABASE` | Yes | Database name (default: planning_scrapes) |
| `MYSQL_USER` | Yes | Database username |
| `MYSQL_PASSWORD` | Yes | Database password |
| `MYSQL_ROOT_PASSWORD` | Yes | MariaDB root password |
| `GOOGLE_MAPS_API_KEY` | Yes | Used to geocode DA addresses |
| `LOOKUP_URL` | No | URL of the property lookup service (PID/title enrichment) |
| `LOOKUP_THROTTLE_MS` | No | Milliseconds between lookup requests (default: 150) |
| `SCRAPE_EVERY_MINUTES` | No | If set, the scraper loops on this interval (default: run once) |
| `DOWNLOAD_ATTACHMENTS` | No | Set to 1 to download PDF attachments |
| `DOWNLOAD_DIR` | No | Host path for downloaded PDFs (default: /app/downloads) |
| `DEBUG` | No | Set to 1 for verbose scraper output |
| `DRY_RUN` | No | Set to 1 to parse without writing to the DB |
| `ENRICH_DEBUG` | No | Set to 1 for verbose geocode/lookup output |
| `ALLOW_INSECURE` | No | Set to 1 to skip SSL verification (use only for legacy council sites) |
Use ONLY or SKIP environment variables with run_all.sh. Values are comma-separated scraper names (filename without .rb).

    # Run only two councils
    ONLY=meandervalley,kentish docker compose run --rm scraper /app/run_all.sh

    # Run all except one
    SKIP=hobartcity docker compose run --rm scraper /app/run_all.sh

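
The selection rule above can be sketched in Ruby. This only illustrates the comma-separated ONLY/SKIP semantics; run_all.sh is a shell script and its real logic may differ. The `select_scrapers` helper and the order of evaluation (ONLY first, then SKIP) are assumptions made for this example.

```ruby
# Hypothetical helper, not taken from run_all.sh.
# files: scraper paths; only/skip: comma-separated scraper names or nil.
def select_scrapers(files, only: nil, skip: nil)
  names     = files.map { |f| File.basename(f, ".rb") }
  only_list = only.to_s.split(",").map(&:strip).reject(&:empty?)
  skip_list = skip.to_s.split(",").map(&:strip).reject(&:empty?)

  # An empty ONLY list means "no restriction".
  names = names.select { |n| only_list.include?(n) } unless only_list.empty?
  names.reject { |n| skip_list.include?(n) }
end

files = %w[scrapers/meandervalley.rb scrapers/kentish.rb scrapers/hobartcity.rb]
p select_scrapers(files, only: "meandervalley,kentish")
p select_scrapers(files, skip: "hobartcity")
```

Both calls return the same two names here, because excluding hobartcity from this three-file list leaves exactly the two councils named in ONLY.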
Each scraper writes to its own da_* table. The table name is derived from the scraper filename.
| Council | Scraper file | DB table |
|---|---|---|
| Break O'Day | `break_oday.rb` | `da_break_oday` |
| Brighton | `brighton.rb` | `da_brighton` |
| Burnie | `burnie.rb` | `da_burnie` |
| Central Coast | `centralcoast.rb` | `da_centralcoast` |
| Central Highlands | `centralhighlands.rb` | `da_centralhighlands` |
| Circular Head | `circularhead.rb` | `da_circularhead` |
| Clarence | `clarence.rb` | `da_clarence` |
| Derwent Valley | `derwentvalley.rb` | `da_derwentvalley` |
| Devonport | `devonportcity.rb` | `da_devonportcity` |
| Dorset | `dorset.rb` | `da_dorset` |
| Flinders | `flinders_council.rb` | `da_flinders_council` |
| George Town | `georgetown.rb` | `da_georgetown` |
| Glamorgan Spring Bay | `glamorgan.rb` | `da_glamorgan` |
| Glenorchy | `glenorchy.rb` | `da_glenorchy` |
| Hobart | `hobartcity.rb` | `da_hobartcity` |
| Huon Valley | `huonvalley.rb` | `da_huonvalley` |
| Kentish | `kentish.rb` | `da_kentish` |
| Kingborough | `kingborough.rb` | `da_kingborough` |
| Latrobe | `latrobe.rb` | `da_latrobe` |
| Launceston | `launcestoncity.rb` | `da_launcestoncity` |
| Meander Valley | `meandervalley.rb` | `da_meandervalley` |
| Northern Midlands | `northernmidlands.rb` | `da_northernmidlands` |
| Southern Midlands | `southernmidlands.rb` | `da_southernmidlands` |
| Sorell | (PlanBuild) | `da_sorell` |
| Tasman | `tasman.rb` | `da_tasman` |
| Waratah–Wynyard | `waratah_wynyard.rb` | `da_waratah_wynyard` |
| West Coast | `westcoast.rb` | `da_westcoast` |
| West Tamar | `westtamar.rb` | `da_westtamar` |
| Various (PlanBuild portal) | `planbuild.rb` | Per-council `da_*` tables |
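
The filename-to-table mapping in the table above follows one pattern: strip the .rb extension and prefix the result with da_. A minimal sketch of that rule follows; `table_for` is a hypothetical name, and where the project actually performs this derivation is not stated here.

```ruby
# Hypothetical helper: derives the da_* table name from a scraper path.
def table_for(scraper_path)
  "da_#{File.basename(scraper_path, '.rb')}"
end

puts table_for("scrapers/break_oday.rb")
puts table_for("scrapers/flinders_council.rb")
```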
Every da_* table shares the same base schema:
| Column | Type | Notes |
|---|---|---|
| `id` | `BIGINT` | Auto-increment PK |
| `council_reference` | `VARCHAR(100)` | DA reference number |
| `address` | `VARCHAR(255)` | Street address |
| `description` | `TEXT` | Proposal description |
| `date_received` | `DATE` | Application date |
| `on_notice_to` | `DATE` | Public comment close date |
| `applicant` | `VARCHAR(255)` | |
| `document_url` | `TEXT` | Remote PDF URL |
| `local_document_url` | `TEXT` | Downloaded PDF path (relative to /downloads) |
| `address_std` | `VARCHAR(255)` | Google-normalised address |
| `lat` / `lng` | `DECIMAL(10,7)` | Geocoded coordinates |
| `property_id` | `TEXT` | Land title PID |
| `title_reference` | `TEXT` | Certificate of title reference |
| `created_at` / `updated_at` | `DATETIME` | |

Rows are upserted on the key (council_reference, address). Some fields, such as date_received, are write-once: the first value stored is kept on subsequent scrapes.
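
One way to express this upsert behaviour in MariaDB is `INSERT ... ON DUPLICATE KEY UPDATE`, with `COALESCE` guarding the write-once columns. The sketch below only builds the SQL string. It is not the logic in lib/db.rb, and it assumes a UNIQUE index on (council_reference, address). The `build_upsert_sql` name, the `WRITE_ONCE` list, and the sample row values are all hypothetical.

```ruby
# Assumption: only date_received is write-once (the one example the text names).
WRITE_ONCE  = %w[date_received].freeze
KEY_COLUMNS = %w[council_reference address].freeze

# Builds a parameterised upsert statement for one row (Hash of column => value).
# The table name is interpolated, so it must come from trusted code, not user input.
def build_upsert_sql(table, row)
  cols = row.keys.map(&:to_s)
  updates = (cols - KEY_COLUMNS).map do |c|
    if WRITE_ONCE.include?(c)
      "#{c} = COALESCE(#{c}, VALUES(#{c}))" # keep the first non-NULL value
    else
      "#{c} = VALUES(#{c})"                 # overwrite with the newest value
    end
  end
  # With key columns only, fall back to a no-op update so the SQL stays valid.
  updates = ["#{KEY_COLUMNS.first} = #{KEY_COLUMNS.first}"] if updates.empty?

  "INSERT INTO #{table} (#{cols.join(', ')}) " \
    "VALUES (#{(['?'] * cols.size).join(', ')}) " \
    "ON DUPLICATE KEY UPDATE #{updates.join(', ')}"
end

row = { council_reference: "DA-1", address: "1 Example St",
        description: "Shed", date_received: "2024-01-01" }
puts build_upsert_sql("da_dorset", row)
```

The `COALESCE` form keeps an existing non-NULL date_received and only fills the column when it is still NULL.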
After each upsert, `enrich_after_upsert!` runs two optional enrichment steps:

1. Geocoding (requires `GOOGLE_MAPS_API_KEY`): calls the Google Maps Geocoding API, caches results in the `geo_cache` table, and populates `address_std`, `street`, `locality`, `state`, `postcode`, `lat`, `lng`.
2. Property lookup (requires `LOOKUP_URL`): POSTs `{lat, lng}` to a property data service and populates `property_id` and `title_reference`.

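
The caching step in geocoding can be illustrated with a small sketch. It assumes the cache key is a SHA1 digest of a normalised address string (the file layout describes lib/geocode.rb as using a SHA1 cache). The normalisation rule, the `GeoCache` class, and the stubbed coordinates are hypothetical, and an in-memory Hash stands in for the geo_cache table.

```ruby
require "digest/sha1"

# Hypothetical in-memory stand-in for the geo_cache table.
class GeoCache
  def initialize
    @rows = {}
  end

  # Assumed normalisation: trim, lowercase, collapse runs of whitespace.
  def key_for(address)
    Digest::SHA1.hexdigest(address.strip.downcase.gsub(/\s+/, " "))
  end

  # Returns the cached result; otherwise runs the block once and stores its value.
  def fetch(address)
    key = key_for(address)
    return @rows[key] if @rows.key?(key)

    @rows[key] = yield(address)
  end
end

cache = GeoCache.new
api_calls = 0
# Stub in place of the real Geocoding API request; coordinates are made up.
geocode = lambda do |_address|
  api_calls += 1
  { lat: -42.8821, lng: 147.3272 }
end

cache.fetch("1 Example St, Hobart") { |a| geocode.call(a) }
cache.fetch("  1 example st,  HOBART ") { |a| geocode.call(a) }
puts api_calls
```

Both addresses normalise to the same key, so the stub runs only once.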
To run geocode backfill as a standalone pass over existing rows:

    # All tables
    docker compose run --rm \
      -e GOOGLE_MAPS_API_KEY="$GOOGLE_MAPS_API_KEY" \
      scraper ruby /app/tools/backfill_geocode.rb

    # Single table
    docker compose run --rm \
      -e GOOGLE_MAPS_API_KEY="$GOOGLE_MAPS_API_KEY" \
      -e ONLY_TABLE=da_dorset \
      scraper ruby /app/tools/backfill_geocode.rb

    # Dry run
    docker compose run --rm \
      -e GOOGLE_MAPS_API_KEY="$GOOGLE_MAPS_API_KEY" \
      -e ONLY_TABLE=da_dorset \
      -e DRY_RUN=1 \
      scraper ruby /app/tools/backfill_geocode.rb

lib/http.rb sends a full Chrome browser fingerprint on every request, including sec-ch-ua, Sec-Fetch-*, and Upgrade-Insecure-Requests headers. This satisfies most WAF checks without any extra scraper code.
For sites that additionally require a warm cookie state, the scraper does a proactive homepage GET before fetching the target URL. See burnie.rb for the reference implementation of this pattern (custom CookieJar class + http_get_with_cookies). Scrapers using this pattern: burnie.rb, kingisland.rb, latrobe.rb, derwentvalley.rb.
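
The warm-up pattern can be sketched without any network access. This is not the code in burnie.rb: `SketchCookieJar` and `fetch_with_warmup` are hypothetical names, the jar ignores cookie attributes such as Path and Expires, and the HTTP call is injected as a callable so the flow can be followed with a stub.

```ruby
# Minimal cookie jar: remembers name=value pairs from Set-Cookie headers.
class SketchCookieJar
  def initialize
    @cookies = {}
  end

  # Accepts one Set-Cookie header string or an array of them.
  def store(set_cookie_headers)
    Array(set_cookie_headers).each do |header|
      pair = header.to_s.split(";", 2).first
      next if pair.nil?

      name, value = pair.split("=", 2)
      next if name.nil? || name.strip.empty?

      @cookies[name.strip] = value.to_s.strip
    end
  end

  # Value for the Cookie request header.
  def header
    @cookies.map { |k, v| "#{k}=#{v}" }.join("; ")
  end
end

# GET the homepage first, keep its cookies, then GET the target URL with them.
# `get` is any callable taking (url, headers) and returning [body, set_cookie_headers].
def fetch_with_warmup(home_url, target_url, get)
  jar = SketchCookieJar.new
  _home_body, set_cookies = get.call(home_url, {})
  jar.store(set_cookies)
  body, _ = get.call(target_url, { "Cookie" => jar.header })
  body
end

# Stub transport for illustration; a real scraper would perform HTTP requests here.
fake_get = lambda do |url, headers|
  if url.end_with?("/")
    ["home", ["session=abc123; Path=/; HttpOnly"]]
  else
    ["cookie sent: #{headers['Cookie']}", []]
  end
end

puts fetch_with_warmup("https://council.example/", "https://council.example/planning", fake_get)
```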
Cloudflare JS challenge (the "Just a moment..." interstitial) cannot be solved from a Docker/server environment regardless of headers, because Cloudflare's bot score is based on the client IP reputation. Sites confirmed blocked in Docker: derwentvalley.tas.gov.au, latrobe.tas.gov.au. These scrapers detect the challenge, log a warning, and exit cleanly.
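
A challenge check along these lines might look like the sketch below. It is a heuristic, not the project's detection logic: the status codes and the title match are assumptions, and both function names are hypothetical. A real scraper would typically log the warning and call `exit 0`; the sketch returns a symbol instead so it can be exercised directly.

```ruby
# Heuristic: Cloudflare interstitials are usually served with 403 or 503
# and carry a "Just a moment..." page title (assumed markers).
def cloudflare_challenge?(status, body)
  [403, 503].include?(status.to_i) &&
    body.to_s.match?(/<title>\s*Just a moment\.\.\./i)
end

# Returns :blocked when the challenge page is detected, :ok otherwise.
def check_response(status, body)
  if cloudflare_challenge?(status, body)
    warn "WARN: Cloudflare JS challenge detected; skipping this council"
    return :blocked
  end
  :ok
end

p check_response(403, "<html><head><title>Just a moment...</title></head></html>")
p check_response(200, "<html><head><title>Planning notices</title></head></html>")
```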
1. Create `scrapers/<councilname>.rb`, using an existing simple scraper (e.g. `glamorgan.rb`) as a template.
2. Set `TABLE = ENV.fetch("TABLE_NAME")`.
3. Call `DB.ensure_table!(TABLE)`; all schema columns are already included.
4. Call `DB.upsert(TABLE, row)` with at least `council_reference` and `address`.
5. Call `enrich_after_upsert!` after each upsert.
6. Add the council to `COUNCIL_MAP` in `lib/util.rb` if PlanBuild integration is needed.
7. Test with `TABLE_NAME=da_<name> ruby scrapers/<name>.rb`.

| Script | Purpose |
|---|---|
| `tools/backfill_geocode.rb` | Batch geocode backfill for existing rows (supports ONLY_TABLE, DRY_RUN) |
| `tools/backfill_dorset_docs.rb` | Backfill PDF links for Dorset rows |
| `tools/import_sqlites.rb` | Import data from legacy SQLite exports |
| `planbuild_fetch.js` | Playwright-based scraper for the PlanBuild TAS portal |


    tas_councils/
    ├── lib/
    │   ├── db.rb              # DB connection, table creation, dynamic upsert logic
    │   ├── http.rb            # HTTP client: browser-fingerprint headers, retries, WAF warmup, curl fallback
    │   ├── geocode.rb         # Google Maps geocoding with SHA1 cache
    │   ├── enrich.rb          # Post-upsert enrichment pipeline
    │   ├── util.rb            # Date parsing, council/table name mappings
    │   ├── scraper_helpers.rb # Shared helpers: abs_url, text_or, upsert_and_enrich!
    │   └── migrate.rb         # Sequential schema migration runner
    ├── scrapers/              # One .rb file per council
    ├── web/                   # PHP search portal (Apache)
    ├── tools/                 # Standalone backfill and migration scripts
    ├── run_all.sh             # Discovers and runs scrapers (supports ONLY/SKIP)
    ├── entrypoint.sh          # Docker entrypoint; optionally loops on a schedule
    ├── Dockerfile             # Ruby 3.2 scraper image
    ├── docker-compose.yml     # Full stack: db, scraper, web, adminer
    └── .env                   # Secrets: never commit this file