|
@@ -0,0 +1,234 @@
|
|
|
|
|
+# TAS Councils Planning Applications Scraper
|
|
|
|
|
+
|
|
|
|
|
+A web scraping and data aggregation system for Tasmanian development applications (DAs). It collects planning application notices from all 29 Tasmanian council websites, normalises and geocodes the data, and exposes it via a PHP search portal.
|
|
|
|
|
+
|
|
|
|
|
+---
|
|
|
|
|
+
|
|
|
|
|
+## Architecture
|
|
|
|
|
+
|
|
|
|
|
+```
|
|
|
|
|
+┌─────────────────────────────────────────────────────┐
+│  29 Ruby scrapers (scrapers/*.rb)                   │
+│  Each polls one council website on a schedule       │
+└──────────────────────┬──────────────────────────────┘
+                       │ upserts rows
+                       ▼
+             MariaDB (da_* tables)
+                       │
+         ┌─────────────┴─────────────┐
+         │                           │
+  PHP web portal                Adminer UI
+  (web/index.php)               port 9980
+  port 9981
|
|
|
|
|
+```
|
|
|
|
|
+
|
|
|
|
|
+**Services (Docker Compose):**
|
|
|
|
|
+
|
|
|
|
|
+| Service | Image | Port | Purpose |
|
|
|
|
|
+|---|---|---|---|
|
|
|
|
|
+| `db` | `mariadb:10.11` | 3306 | Database |
|
|
|
|
|
+| `scraper` | Custom (Ruby 3.2) | — | Runs all scrapers on a schedule |
|
|
|
|
|
+| `web` | Custom (PHP/Apache) | 9981 | Search portal |
|
|
|
|
|
+| `adminer` | `adminer` | 9980 | Database admin UI |
|
|
|
|
|
+
|
|
|
|
|
+---
|
|
|
|
|
+
|
|
|
|
|
+## Quick Start
|
|
|
|
|
+
|
|
|
|
|
+### 1. Copy and configure environment
|
|
|
|
|
+
|
|
|
|
|
+```bash
|
|
|
|
|
+cp .env.example .env
|
|
|
|
|
+# Edit .env — set DB passwords and your Google Maps API key
|
|
|
|
|
+```
|
|
|
|
|
+
|
|
|
|
|
+### 2. Start all services
|
|
|
|
|
+
|
|
|
|
|
+```bash
|
|
|
|
|
+docker compose up -d
|
|
|
|
|
+```
|
|
|
|
|
+
|
|
|
|
|
+- Web portal: http://localhost:9981
|
|
|
|
|
+- Adminer: http://localhost:9980
|
|
|
|
|
+
|
|
|
|
|
+### 3. Run scrapers manually (once)
|
|
|
|
|
+
|
|
|
|
|
+```bash
|
|
|
|
|
+docker compose run --rm scraper /app/run_all.sh
|
|
|
|
|
+```
|
|
|
|
|
+
|
|
|
|
|
+---
|
|
|
|
|
+
|
|
|
|
|
+## Environment Variables
|
|
|
|
|
+
|
|
|
|
|
+Copy `.env.example` to `.env` and fill in the values. **Never commit `.env`.**
|
|
|
|
|
+
|
|
|
|
|
+| Variable | Required | Description |
|
|
|
|
|
+|---|---|---|
|
|
|
|
|
+| `MYSQL_DATABASE` | Yes | Database name (default: `planning_scrapes`) |
|
|
|
|
|
+| `MYSQL_USER` | Yes | Database username |
|
|
|
|
|
+| `MYSQL_PASSWORD` | Yes | Database password |
|
|
|
|
|
+| `MYSQL_ROOT_PASSWORD` | Yes | MariaDB root password |
|
|
|
|
|
+| `GOOGLE_MAPS_API_KEY` | Yes | Used to geocode DA addresses |
|
|
|
|
|
+| `LOOKUP_URL` | No | URL of the property lookup service (PID/title enrichment) |
|
|
|
|
|
+| `LOOKUP_THROTTLE_MS` | No | Milliseconds between lookup requests (default: 150) |
|
|
|
|
|
+| `SCRAPE_EVERY_MINUTES` | No | If set, the scraper loops on this interval (default: run once) |
|
|
|
|
|
+| `DOWNLOAD_ATTACHMENTS` | No | Set to `1` to download PDF attachments |
|
|
|
|
|
+| `DOWNLOAD_DIR` | No | Host path for downloaded PDFs (default: `/app/downloads`) |
|
|
|
|
|
+| `DEBUG` | No | Set to `1` for verbose scraper output |
|
|
|
|
|
+| `DRY_RUN` | No | Set to `1` to parse without writing to the DB |
|
|
|
|
|
+| `ENRICH_DEBUG` | No | Set to `1` for verbose geocode/lookup output |
|
|
|
|
|
+| `ALLOW_INSECURE` | No | Set to `1` to skip SSL verification (use only for legacy council sites) |
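
As a rough illustration of how the optional variables and their documented defaults could be read, here is a minimal Ruby sketch. `ScraperConfig`, `flag?`, and `load_config` are hypothetical names introduced for this example; the project's own code may read these variables differently.

```ruby
# Sketch only: ScraperConfig, flag?, and load_config are hypothetical names,
# not part of the project's code.
ScraperConfig = Struct.new(
  :lookup_throttle_ms, :scrape_every_minutes, :download_attachments,
  :download_dir, :dry_run,
  keyword_init: true
)

# Treat "1" as true, following the "Set to `1`" convention in the table.
def flag?(env, name)
  env[name] == "1"
end

# Accepts ENV or a plain Hash, so the defaults can be checked without
# touching the real environment.
def load_config(env = ENV)
  every = env["SCRAPE_EVERY_MINUTES"]
  ScraperConfig.new(
    lookup_throttle_ms: Integer(env.fetch("LOOKUP_THROTTLE_MS", "150")),
    # nil means "run once", the documented default.
    scrape_every_minutes: (every.nil? || every.empty?) ? nil : Integer(every),
    download_attachments: flag?(env, "DOWNLOAD_ATTACHMENTS"),
    download_dir: env.fetch("DOWNLOAD_DIR", "/app/downloads"),
    dry_run: flag?(env, "DRY_RUN")
  )
end

config = load_config({ "DRY_RUN" => "1" })
puts config.lookup_throttle_ms # prints 150
puts config.dry_run            # prints true
```

Passing a Hash instead of `ENV` keeps the sketch easy to exercise in isolation.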
|
|
|
|
|
+
|
|
|
|
|
+---
|
|
|
|
|
+
|
|
|
|
|
+## Running Scrapers Selectively
|
|
|
|
|
+
|
|
|
|
|
+Set the `ONLY` or `SKIP` environment variable when invoking `run_all.sh`. Each takes a comma-separated list of scraper names (the filename without `.rb`).
|
|
|
|
|
+
|
|
|
|
|
+```bash
|
|
|
|
|
+# Run only two councils
|
|
|
|
|
+docker compose run --rm -e ONLY=meandervalley,kentish scraper /app/run_all.sh
|
|
|
|
|
+
|
|
|
|
|
+# Run all except one
|
|
|
|
|
+docker compose run --rm -e SKIP=hobartcity scraper /app/run_all.sh
|
|
|
|
|
+```
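
The selection semantics can be sketched in Ruby. This is an illustration only: `run_all.sh` is a shell script whose implementation is not shown here, `parse_names` and `select_scrapers` are hypothetical helpers, and applying `ONLY` before `SKIP` when both are set is an assumption.

```ruby
# Hypothetical helpers illustrating ONLY/SKIP semantics. They are not taken
# from run_all.sh.
def parse_names(value)
  value.to_s.split(",").map(&:strip).reject(&:empty?)
end

def select_scrapers(files, only: nil, skip: nil)
  # Scraper name = filename without the .rb extension.
  names = files.map { |f| File.basename(f, ".rb") }.sort
  only_names = parse_names(only)
  skip_names = parse_names(skip)
  # Assumption: ONLY narrows the list first, then SKIP removes entries.
  names = names.select { |n| only_names.include?(n) } unless only_names.empty?
  names.reject { |n| skip_names.include?(n) }
end

files = %w[scrapers/kentish.rb scrapers/hobartcity.rb scrapers/meandervalley.rb]
p select_scrapers(files, only: "meandervalley,kentish") # prints ["kentish", "meandervalley"]
p select_scrapers(files, skip: "hobartcity")            # prints ["kentish", "meandervalley"]
```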
|
|
|
|
|
+
|
|
|
|
|
+---
|
|
|
|
|
+
|
|
|
|
|
+## Council → Table Mapping
|
|
|
|
|
+
|
|
|
|
|
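+Each council has its own `da_*` table. For council-specific scrapers, the table name is `da_` followed by the scraper filename without `.rb`. The exception is `planbuild.rb`, which writes to per-council tables.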
|
|
|
|
|
+
|
|
|
|
|
+| Council | Scraper file | DB table |
|
|
|
|
|
+|---|---|---|
|
|
|
|
|
+| Break O'Day | `break_oday.rb` | `da_break_oday` |
|
|
|
|
|
+| Brighton | `brighton.rb` | `da_brighton` |
|
|
|
|
|
+| Burnie | `burnie.rb` | `da_burnie` |
|
|
|
|
|
+| Central Coast | `centralcoast.rb` | `da_centralcoast` |
|
|
|
|
|
+| Central Highlands | `centralhighlands.rb` | `da_centralhighlands` |
|
|
|
|
|
+| Circular Head | `circularhead.rb` | `da_circularhead` |
|
|
|
|
|
+| Clarence | `clarence.rb` | `da_clarence` |
|
|
|
|
|
+| Derwent Valley | `derwentvalley.rb` | `da_derwentvalley` |
|
|
|
|
|
+| Devonport | `devonportcity.rb` | `da_devonportcity` |
|
|
|
|
|
+| Dorset | `dorset.rb` | `da_dorset` |
|
|
|
|
|
+| Flinders | `flinders_council.rb` | `da_flinders_council` |
|
|
|
|
|
+| George Town | `georgetown.rb` | `da_georgetown` |
|
|
|
|
|
+| Glamorgan Spring Bay | `glamorgan.rb` | `da_glamorgan` |
|
|
|
|
|
+| Glenorchy | `glenorchy.rb` | `da_glenorchy` |
|
|
|
|
|
+| Hobart | `hobartcity.rb` | `da_hobartcity` |
|
|
|
|
|
+| Huon Valley | `huonvalley.rb` | `da_huonvalley` |
|
|
|
|
|
+| Kentish | `kentish.rb` | `da_kentish` |
|
|
|
|
|
+| Kingborough | `kingborough.rb` | `da_kingborough` |
|
|
|
|
|
+| Latrobe | `latrobe.rb` | `da_latrobe` |
|
|
|
|
|
+| Launceston | `launcestoncity.rb` | `da_launcestoncity` |
|
|
|
|
|
+| Meander Valley | `meandervalley.rb` | `da_meandervalley` |
|
|
|
|
|
+| Northern Midlands | `northernmidlands.rb` | `da_northernmidlands` |
|
|
|
|
|
+| Southern Midlands | `southernmidlands.rb` | `da_southernmidlands` |
|
|
|
|
|
+| Sorell | *(PlanBuild)* | `da_sorell` |
|
|
|
|
|
+| Tasman | `tasman.rb` | `da_tasman` |
|
|
|
|
|
+| Waratah–Wynyard | `waratah_wynyard.rb` | `da_waratah_wynyard` |
|
|
|
|
|
+| West Coast | `westcoast.rb` | `da_westcoast` |
|
|
|
|
|
+| West Tamar | `westtamar.rb` | `da_westtamar` |
|
|
|
|
|
+| Various (PlanBuild portal) | `planbuild.rb` | Per-council `da_*` tables |
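
The naming convention in the table can be expressed as a one-line helper. `table_for` is a hypothetical name used for illustration; the project may derive table names elsewhere (for example in `run_all.sh`), and the rule does not cover `planbuild.rb`.

```ruby
# Hypothetical helper: derives a table name from a scraper path using the
# "da_" + filename-without-.rb convention shown in the table.
def table_for(scraper_path)
  "da_#{File.basename(scraper_path, '.rb')}"
end

puts table_for("scrapers/break_oday.rb")       # prints da_break_oday
puts table_for("scrapers/flinders_council.rb") # prints da_flinders_council
```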
|
|
|
|
|
+
|
|
|
|
|
+---
|
|
|
|
|
+
|
|
|
|
|
+## Database Schema
|
|
|
|
|
+
|
|
|
|
|
+Every `da_*` table shares the same base schema:
|
|
|
|
|
+
|
|
|
|
|
+| Column | Type | Notes |
|
|
|
|
|
+|---|---|---|
|
|
|
|
|
+| `id` | `BIGINT` | Auto-increment PK |
|
|
|
|
|
+| `council_reference` | `VARCHAR(100)` | DA reference number |
|
|
|
|
|
+| `address` | `VARCHAR(255)` | Street address |
|
|
|
|
|
+| `description` | `TEXT` | Proposal description |
|
|
|
|
|
+| `date_received` | `DATE` | Application date |
|
|
|
|
|
+| `on_notice_to` | `DATE` | Public comment close date |
|
|
|
|
|
+| `applicant` | `VARCHAR(255)` | Applicant name |
|
|
|
|
|
+| `document_url` | `TEXT` | Remote PDF URL |
|
|
|
|
|
+| `local_document_url` | `TEXT` | Downloaded PDF path (relative to `/downloads`) |
|
|
|
|
|
+| `address_std` | `VARCHAR(255)` | Google-normalised address |
|
|
|
|
|
+| `lat` / `lng` | `DECIMAL(10,7)` | Geocoded coordinates |
|
|
|
|
|
+| `property_id` | `TEXT` | Property identifier (PID) |
|
|
|
|
|
+| `title_reference` | `TEXT` | Certificate of title reference |
|
|
|
|
|
+| `created_at` / `updated_at` | `DATETIME` | Row creation and last-update timestamps |
|
|
|
|
|
+
|
|
|
|
|
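+Rows are upserted on the key `(council_reference, address)`: a row with a matching pair is updated in place, otherwise a new row is inserted. Some fields are **write-once** (e.g. `date_received`), meaning the first value stored is kept on subsequent scrapes.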
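
To make the upsert and write-once behaviour concrete, here is a small in-memory Ruby sketch. It models the rule with a Hash rather than MariaDB; `WRITE_ONCE`, `upsert_key`, and `upsert!` are hypothetical names, and treating `date_received` as the only write-once column is an assumption based on the single example given.

```ruby
# In-memory model of the upsert rule. Not the project's DB.upsert.
# Assumption: date_received is the only write-once column.
WRITE_ONCE = %i[date_received].freeze

def upsert_key(row)
  [row.fetch(:council_reference), row.fetch(:address)]
end

def upsert!(store, row)
  key = upsert_key(row)
  existing = store[key]
  if existing.nil?
    store[key] = row.dup # first sighting: insert as-is
  else
    row.each do |col, value|
      # Keep the first stored value for write-once columns.
      next if WRITE_ONCE.include?(col) && !existing[col].nil?
      existing[col] = value
    end
  end
  store[key]
end

store = {}
upsert!(store, council_reference: "DA-1", address: "1 Example St",
               date_received: "2024-01-10", description: "Shed")
row = upsert!(store, council_reference: "DA-1", address: "1 Example St",
                     date_received: "2024-02-01", description: "Shed and carport")
puts row[:date_received] # prints 2024-01-10
puts row[:description]   # prints Shed and carport
```

The reference number, address, and dates are placeholder values, not real application data.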
|
|
|
|
|
+
|
|
|
|
|
+---
|
|
|
|
|
+
|
|
|
|
|
+## Enrichment Pipeline
|
|
|
|
|
+
|
|
|
|
|
+After each upsert, `enrich_after_upsert!` runs two optional enrichment steps:
|
|
|
|
|
+
|
|
|
|
|
+1. **Geocoding** (requires `GOOGLE_MAPS_API_KEY`) — calls the Google Maps Geocoding API, caches results in the `geo_cache` table, and populates `address_std`, `street`, `locality`, `state`, `postcode`, `lat`, `lng`.
|
|
|
|
|
+
|
|
|
|
|
+2. **Property lookup** (requires `LOOKUP_URL`) — POSTs `{lat, lng}` to a property data service and populates `property_id` and `title_reference`.
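
The caching idea behind the geocoding step can be sketched as follows. This is not the code in `lib/geocode.rb`: `GeoCache` is a hypothetical in-memory stand-in for the `geo_cache` table, the geocoder is injected as a lambda instead of calling the Google Maps API, and deriving the SHA1 key from a trimmed, lower-cased address is an assumption.

```ruby
require "digest/sha1"

# Hypothetical in-memory stand-in for the geo_cache table.
class GeoCache
  def initialize(geocoder)
    @geocoder = geocoder # any callable: address string -> result hash
    @rows = {}
  end

  # Assumption: the cache key is the SHA1 of a normalised address.
  def key_for(address)
    Digest::SHA1.hexdigest(address.strip.downcase.gsub(/\s+/, " "))
  end

  # Returns the cached result, calling the geocoder only on a miss.
  def lookup(address)
    key = key_for(address)
    return @rows[key] if @rows.key?(key)
    @rows[key] = @geocoder.call(address)
  end
end

api_calls = 0
# Placeholder geocoder returning dummy coordinates, not real API output.
fake_geocoder = lambda do |_address|
  api_calls += 1
  { lat: 0.0, lng: 0.0 }
end

cache = GeoCache.new(fake_geocoder)
cache.lookup("1 Example St, Hobart")
cache.lookup("  1 example st,  hobart ")
puts api_calls # prints 1
```

Normalising before hashing means cosmetic differences in scraped addresses do not trigger a second API call.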
|
|
|
|
|
+
|
|
|
|
|
+To run enrichment as a standalone backfill over existing rows:
|
|
|
|
|
+
|
|
|
|
|
+```bash
|
|
|
|
|
+docker compose run --rm \
|
|
|
|
|
+ -e GOOGLE_MAPS_API_KEY="$GOOGLE_MAPS_API_KEY" \
|
|
|
|
|
+ -e LOOKUP_URL="$LOOKUP_URL" \
|
|
|
|
|
+ scraper ruby /app/tools/enrich.rb
|
|
|
|
|
+```
|
|
|
|
|
+
|
|
|
|
|
+Run against a single table with a dry run:
|
|
|
|
|
+
|
|
|
|
|
+```bash
|
|
|
|
|
+docker compose run --rm \
|
|
|
|
|
+ -e GOOGLE_MAPS_API_KEY="$GOOGLE_MAPS_API_KEY" \
|
|
|
|
|
+ -e LOOKUP_URL="$LOOKUP_URL" \
|
|
|
|
|
+ -e DRY_RUN=1 \
|
|
|
|
|
+ scraper ruby /app/tools/enrich.rb --table=da_dorset
|
|
|
|
|
+```
|
|
|
|
|
+
|
|
|
|
|
+---
|
|
|
|
|
+
|
|
|
|
|
+## Adding a New Scraper
|
|
|
|
|
+
|
|
|
|
|
+1. Create `scrapers/<councilname>.rb` — use an existing simple scraper (e.g. `glamorgan.rb`) as a template.
|
|
|
|
|
+2. At minimum the scraper must:
|
|
|
|
|
+   - Read `TABLE = ENV.fetch("TABLE_NAME")`
+   - Call `DB.ensure_table!(TABLE)` and `ensure_extra_columns!(TABLE)`
+   - Call `DB.upsert(TABLE, row)` with at least `council_reference` and `address`
+   - Call `enrich_after_upsert!` after each upsert
|
|
|
|
|
+3. Add the council to `COUNCIL_MAP` in `lib/util.rb` if PlanBuild integration is needed.
|
|
|
|
|
+4. Test locally: `TABLE_NAME=da_<councilname> ruby scrapers/<councilname>.rb`
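
The minimum call sequence from step 2 might look like the skeleton below. It is a sketch under stated assumptions: the `DB` module and the two helper methods are stand-in stubs so the example runs on its own, the argument list of `enrich_after_upsert!` is assumed, and the row values are placeholders. A real scraper would load the project's `lib/` files instead and should follow an existing scraper such as `glamorgan.rb`.

```ruby
# --- Stand-in stubs (hypothetical; the real versions live in lib/) ---
module DB
  TABLES = Hash.new { |h, k| h[k] = {} }

  def self.ensure_table!(table)
    TABLES[table]
  end

  def self.upsert(table, row)
    key = [row.fetch(:council_reference), row.fetch(:address)]
    TABLES[table][key] = row
  end
end

def ensure_extra_columns!(_table); end

# Assumed signature; check lib/enrich.rb for the real one.
def enrich_after_upsert!(_table, _row); end

# --- Scraper body ---
# The default is only so this sketch runs standalone; the steps above use
# ENV.fetch("TABLE_NAME") with no default.
TABLE = ENV.fetch("TABLE_NAME", "da_example")

DB.ensure_table!(TABLE)
ensure_extra_columns!(TABLE)

# A real scraper would fetch and parse the council's page here.
rows = [
  { council_reference: "DA-0001", address: "1 Example St", description: "Placeholder" }
]

rows.each do |row|
  DB.upsert(TABLE, row)
  enrich_after_upsert!(TABLE, row)
end

puts "#{rows.size} row(s) upserted into #{TABLE}"
```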
|
|
|
|
|
+
|
|
|
|
|
+---
|
|
|
|
|
+
|
|
|
|
|
+## Tools
|
|
|
|
|
+
|
|
|
|
|
+| Script | Purpose |
|
|
|
|
|
+|---|---|
|
|
|
|
|
+| `tools/enrich.rb` | Batch geocode + property lookup for existing rows |
|
|
|
|
|
+| `tools/backfill_geocode.rb` | Geocode-only backfill |
|
|
|
|
|
+| `tools/backfill_dorset_docs.rb` | Backfill PDF links for Dorset rows |
|
|
|
|
|
+| `tools/import_sqlites.rb` | Import data from legacy SQLite exports |
|
|
|
|
|
+| `planbuild_fetch.js` | Playwright-based scraper for the PlanBuild TAS portal |
|
|
|
|
|
+
|
|
|
|
|
+---
|
|
|
|
|
+
|
|
|
|
|
+## Project Structure
|
|
|
|
|
+
|
|
|
|
|
+```
|
|
|
|
|
+tas_councils/
|
|
|
|
|
+├── lib/
|
|
|
|
|
+│ ├── db.rb # DB connection, table creation, upsert logic
|
|
|
|
|
+│ ├── http.rb # HTTP client with retries, cookie jar, WAF warmup
|
|
|
|
|
+│ ├── geocode.rb # Google Maps geocoding with SHA1 cache
|
|
|
|
|
+│ ├── enrich.rb # Post-upsert enrichment pipeline
|
|
|
|
|
+│ └── util.rb # Date parsing, council/table name mappings
|
|
|
|
|
+├── scrapers/ # One .rb file per council
|
|
|
|
|
+├── web/ # PHP search portal (Apache)
|
|
|
|
|
+├── tools/ # Standalone backfill and migration scripts
|
|
|
|
|
+├── run_all.sh # Discovers and runs scrapers (supports ONLY/SKIP)
|
|
|
|
|
+├── entrypoint.sh # Docker entrypoint; optionally loops on a schedule
|
|
|
|
|
+├── Dockerfile # Ruby 3.2 scraper image
|
|
|
|
|
+├── docker-compose.yml # Full stack: db, scraper, web, adminer
|
|
|
|
|
+└── .env # Secrets — never commit this file
|
|
|
|
|
+```
|