Gotchas Updated

Benjamin Harris, 2 months ago
parent
commit
e23a80fe51
3 changed files with 11 additions and 8 deletions
  1. CLAUDE.md (+2 -2)
  2. docker-compose.yml (+0 -1)
  3. scrapers/kingisland.rb (+9 -5)

+ 2 - 2
CLAUDE.md

@@ -184,13 +184,13 @@ The shared infrastructure (`Http`, `DB`, `enrich_after_upsert!`) handles everyth
 ## Common Gotchas
 
 - **`TABLE` constant conflicts**: Each scraper defines `TABLE = ENV.fetch("TABLE_NAME")` at the top level. If you `require` two scrapers in the same Ruby process you'll get a constant redefinition warning. Each scraper is designed to be run as a standalone script.
-- **`COUNCIL_FILTER` / `COUNCIL_WHITELIST`**: The `docker-compose.yml` has a `COUNCIL_WHITELIST` env var that is passed to the scraper container but is not wired into `run_all.sh`. Use `ONLY` / `SKIP` in `run_all.sh` instead.
+- **`COUNCIL_FILTER` env var**: Used only by `scrapers/hobartcity.rb` to filter which councils to scrape from the Hobart eProperty portal. It has no effect on any other scraper or on `run_all.sh`. To run a subset of scrapers, use `ONLY` / `SKIP` in `run_all.sh`.
 - **PlanBuild scrapers**: `planbuild.rb` handles councils on the state-run PlanBuild portal. It writes to per-council tables using `Util.ref_to_table`. These run alongside the council-specific scrapers.
 - **PDF download path**: `local_document_url` must begin with `/files/` (not `/downloads/`). The Apache alias in `web/000-files.conf` is `Alias /files /srv/files`. Using `/downloads/` results in 404 in the web portal.
 - **Binary PDF downloads**: Pass `headers: { "Accept" => "application/pdf,*/*", "Referer" => URL }` to `Http.get` when downloading PDFs from CDN subdomains — some CDNs reject requests without a valid referrer.
 - **Non-ASCII in PDF URLs**: Some council sites embed Unicode characters (e.g. en-dash `–`) directly in PDF filenames. Always percent-encode hrefs before passing to `URI.join` — see `burnie.rb` `first_pdf_on_detail` for the pattern.
 - **Redirect loops in `Net::HTTP.start` blocks**: `next` inside a `Net::HTTP.start` block exits the block, not the enclosing `while` loop. Use a `redirect_to` variable set inside the block and call `next` on the `while` loop after the block returns — see `burnie.rb` `http_get_with_cookies`.
-- **Cloudflare JS challenge vs IP block**: A JS challenge (`"Just a moment"`) may work from a residential IP but always block from a datacenter/Docker IP. Detect it and exit cleanly. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`.
+- **Cloudflare JS challenge vs IP block**: A JS challenge (`"Just a moment"`) may work from a residential IP but always block from a datacenter/Docker IP. Detect it and exit cleanly. Sites confirmed blocked in Docker: `derwentvalley.tas.gov.au`, `latrobe.tas.gov.au`, `kentish.tas.gov.au`, `centralhighlands.tas.gov.au`. Where a PlanBuild equivalent exists (council code in `COUNCIL_MAP`), data is still collected via `planbuild.rb`.
 - **`group` column**: This is a reserved SQL word. In `DB.upsert` it is safe because all column names are backtick-quoted. In raw SQL always write `` `group` ``.
 
 ---
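
The "Non-ASCII in PDF URLs" gotcha above can be illustrated with a minimal sketch. This is not the code in `burnie.rb` (its `first_pdf_on_detail` is not shown in this diff); the helper name `encode_href`, the base URL, and the filename are all hypothetical. It percent-encodes spaces, control characters, and non-ASCII characters as UTF-8 bytes before the href reaches `URI.join`.

```ruby
require "uri"

# Hypothetical helper: percent-encode every character outside printable
# ASCII (0x21-0x7E), byte by byte, so URI.join receives an ASCII-only href.
# Existing "%XX" escapes are left untouched because "%" is printable ASCII.
def encode_href(href)
  href.gsub(/[^\x21-\x7E]/) do |ch|
    ch.bytes.map { |b| format("%%%02X", b) }.join
  end
end

# Made-up example values; "\u2013" is the en-dash mentioned in the gotcha.
base = "https://example.org/planning/"
href = "/docs/DA\u20132024 plans.pdf"

# Passing the raw href to URI.join would typically raise
# URI::InvalidURIError, so encode it first.
pdf_url = URI.join(base, encode_href(href)).to_s
puts pdf_url
```

Encoding only the offending characters, rather than the whole string, keeps `/`, `?`, and already-escaped sequences intact, which matters when the href comes straight from scraped HTML.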

+ 0 - 1
docker-compose.yml

@@ -35,7 +35,6 @@ services:
       MYSQL_PASSWORD: ${MYSQL_PASSWORD}
       # If set, the runner loops. Minutes between runs:
       SCRAPE_EVERY_MINUTES: "720"
-      COUNCIL_WHITELIST: "Hobart City Council,Launceston City Council,Clarence City Council"
       GOOGLE_MAPS_API_KEY: ${GOOGLE_MAPS_API_KEY}
       DOWNLOAD_ATTACHMENTS: "1"
       DOWNLOAD_DIR: /downloads

+ 9 - 5
scrapers/kingisland.rb

@@ -91,6 +91,8 @@ def http_get(url, jar:, referer: nil, fetch_site: "none")
     body  = ""
 
     while limit > 0
+        limit -= 1
+        redirect_to = nil
         req = Net::HTTP::Get.new(uri, hdrs)
         Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
             resp = http.request(req)
@@ -98,12 +100,14 @@ def http_get(url, jar:, referer: nil, fetch_site: "none")
             code = resp.code.to_i
 
             if [301, 302, 303, 307, 308].include?(code) && resp["location"]
-                uri = URI.join(uri, resp["location"])
-                limit -= 1
-                next
+                redirect_to = URI.join(uri, resp["location"])
+            else
+                body = resp.body.to_s
             end
-
-            body = resp.body.to_s
+        end
+        if redirect_to
+            uri = redirect_to
+            next
         end
         break
     end
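
The fix above follows the pattern described in the "Redirect loops in `Net::HTTP.start` blocks" gotcha. The sketch below reproduces that control flow without any network access, under stated assumptions: `with_connection` is a hypothetical stand-in for `Net::HTTP.start`, and `responses` is a made-up hash that plays the role of the server. It is meant only to show why the redirect target is recorded inside the block and acted on after the block returns.

```ruby
# Hypothetical stand-in for Net::HTTP.start: it simply yields once,
# the way a connection block would.
def with_connection
  yield
end

# Follow redirects using a `redirect_to` variable set inside the block.
# `responses` maps a URL to { location: ... } or { body: ... } (made-up data).
def follow_redirects(start_url, responses, limit: 5)
  url  = start_url
  body = nil

  while limit > 0
    limit -= 1          # decrement on every pass so the loop always terminates
    redirect_to = nil

    with_connection do
      resp = responses.fetch(url)
      if resp[:location]
        # A `next` here would only leave this block, not the while loop,
        # so just record the target and let the block finish.
        redirect_to = resp[:location]
      else
        body = resp[:body]
      end
    end

    if redirect_to
      url = redirect_to
      next              # this `next` belongs to the while loop
    end
    break
  end

  body
end

responses = {
  "/a" => { location: "/b" },
  "/b" => { body: "final page" },
}
puts follow_redirects("/a", responses)
```

Decrementing `limit` at the top of the loop, as the commit does, also bounds a self-redirecting URL: the sketch returns `nil` after five passes instead of spinning forever.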