2 months ago · e23a80fe51
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -184,13 +184,13 @@ The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everyth
 ## Common Gotchas
 
 - **`TABLE` constant conflicts**: Each scraper defines `TABLE = ENV.fetch("TABLE_NAME")` at the top level. If you `require` two scrapers in the same Ruby process you'll get a constant redefinition warning. Each scraper is designed to be run as a standalone script.
-- **`COUNCIL_FILTER` / `COUNCIL_WHITELIST`**: The `docker-compose.yml` has a `COUNCIL_WHITELIST` env var that is passed to the scraper container but is not wired into `run_all.sh`. Use `ONLY` / `SKIP` in `run_all.sh` instead.
+- **`COUNCIL_FILTER` env var**: Used only by `scrapers/hobartcity.rb` to filter which councils to scrape from the Hobart eProperty portal. It has no effect on any other scraper or on `run_all.sh`. To run a subset of scrapers, use `ONLY` / `SKIP` in `run_all.sh`.
 - **PlanBuild scrapers**: `planbuild.rb` handles councils on the state-run PlanBuild portal. It writes to per-council tables using `Util.ref_to_table`. These run alongside the council-specific scrapers.
 - **PDF download path**: `local_document_url` must begin with `/files/` (not `/downloads/`). The Apache alias in `web/000-files.conf` is `Alias /files /srv/files`. Using `/downloads/` results in 404 in the web portal.
 - **Binary PDF downloads**: Pass `headers: { "Accept" => "application/pdf,*/*", "Referer" => URL }` to `Http.get` when downloading PDFs from CDN subdomains — some CDNs reject requests without a valid referrer.
 - **Non-ASCII in PDF URLs**: Some council sites embed Unicode characters (e.g. en-dash `–`) directly in PDF filenames. Always percent-encode hrefs before passing to `URI.join` — see `burnie.rb` `first_pdf_on_detail` for the pattern.
 - **Redirect loops in `Net::HTTP.start` blocks**: `next` inside a `Net::HTTP.start` block exits the block, not the enclosing `while` loop. Use a `redirect_to` variable set inside the block and call `next` on the `while` loop after the block returns — see `burnie.rb` `http_get_with_cookies`.
-- **Cloudflare JS challenge vs IP block**: A JS challenge (`"Just a moment"`) may work from a residential IP but always block from a datacenter/Docker IP. Detect it and exit cleanly. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`.
+- **Cloudflare JS challenge vs IP block**: A JS challenge (`"Just a moment"`) may work from a residential IP but always block from a datacenter/Docker IP. Detect it and exit cleanly. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`, `kentish.tas.gov.au`, `centralhighlands.tas.gov.au`. Where a PlanBuild equivalent exists (council code in `COUNCIL_MAP`), data is still collected via `planbuild.rb`.
 - **`group` column**: This is a reserved SQL word. In `DB.upsert` it is safe because all column names are backtick-quoted. In raw SQL always write `` `group` ``.
 
 ---
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -35,7 +35,6 @@ services:
       MYSQL_PASSWORD: ${MYSQL_PASSWORD}
       # If set, the runner loops. Minutes between runs:
       SCRAPE_EVERY_MINUTES: "720"
-      COUNCIL_WHITELIST: "Hobart City Council,Launceston City Council,Clarence City Council"
       GOOGLE_MAPS_API_KEY: ${GOOGLE_MAPS_API_KEY}
       DOWNLOAD_ATTACHMENTS: "1"
       DOWNLOAD_DIR: /downloads
--- a/scrapers/kingisland.rb
+++ b/scrapers/kingisland.rb
@@ -91,6 +91,8 @@ def http_get(url, jar:, referer: nil, fetch_site: "none")
     body  = ""
 
     while limit > 0
+        limit -= 1
+        redirect_to = nil
         req = Net::HTTP::Get.new(uri, hdrs)
         Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
             resp = http.request(req)
@@ -98,12 +100,14 @@ def http_get(url, jar:, referer: nil, fetch_site: "none")
             code = resp.code.to_i
 
             if [301, 302, 303, 307, 308].include?(code) && resp["location"]
-                uri = URI.join(uri, resp["location"])
-                limit -= 1
-                next
+                redirect_to = URI.join(uri, resp["location"])
+            else
+                body = resp.body.to_s
             end
-
-            body = resp.body.to_s
+        end
+        if redirect_to
+            uri = redirect_to
+            next
         end
         break
     end
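
The `kingisland.rb` change applies the `redirect_to` pattern described in the "Redirect loops" gotcha. In the removed lines, `next` sat inside the `Net::HTTP.start` block, so it only left the block and execution appears to have fallen through to `break` without ever requesting the redirect target. The sketch below isolates that control flow so it can be run without network access. `FakeResponse`, `FAKE_SITE`, `with_response`, and `fetch_following_redirects` are hypothetical stand-ins invented for illustration; they are not part of the repository, and `with_response` merely imitates the way `Net::HTTP.start` yields to a block.

```ruby
require "uri"

# Hypothetical response type and in-memory "site" (assumptions for this sketch).
FakeResponse = Struct.new(:code, :location, :body)

FAKE_SITE = {
  "https://example.test/a"    => FakeResponse.new(302, "/b", ""),
  "https://example.test/b"    => FakeResponse.new(301, "https://example.test/c", ""),
  "https://example.test/c"    => FakeResponse.new(200, nil, "final page"),
  "https://example.test/loop" => FakeResponse.new(302, "/loop", "")
}.freeze

# Stand-in for Net::HTTP.start: yields one response to the caller's block.
def with_response(uri)
  yield FAKE_SITE.fetch(uri.to_s)
end

def fetch_following_redirects(url, limit: 5)
  uri  = URI(url)
  body = nil

  while limit > 0
    limit -= 1          # one attempt is spent per request, redirect or not
    redirect_to = nil

    with_response(uri) do |resp|
      if [301, 302, 303, 307, 308].include?(resp.code) && resp.location
        # Only record the target here; a `next` at this point would
        # leave the block, not restart the while loop.
        redirect_to = URI.join(uri, resp.location)
      else
        body = resp.body.to_s
      end
    end

    if redirect_to
      uri = redirect_to
      next              # this `next` belongs to the while loop
    end
    break
  end

  body
end
```

Calling `fetch_following_redirects("https://example.test/a")` follows both hops and returns the body of the final page, while the self-redirecting `/loop` URL stops once `limit` is exhausted and returns `nil`. Decrementing `limit` at the top of the loop, as the diff does, bounds the total number of requests rather than only the number of redirects.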