5 patterns for resilient web scraping at scale
If your scraper runs once a day, almost anything works. Run it every minute against a target that fights back and the cracks show up fast.
Below are five patterns we keep coming back to when we help teams move scrapers from “works on my laptop” to “runs unattended for months.”
1. Treat the proxy as part of the request, not part of the network
The simplest mistake is configuring proxies at the OS or HTTP-client level once and forgetting about them. That works until you need to rotate per worker, hold a session for a login flow, or switch country mid-job. By then the proxy setting lives in global configuration that the code handling each request cannot easily change.
Make the proxy URL, session token, and target country part of the request object your scraper passes around. When a worker decides to retry, it can pick a different session. When a job needs five countries in parallel, it just sets a field.
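Here is a minimal sketch of such a request object in Python. The ScrapeRequest type, its field names, and the example proxy address are our own illustrative assumptions, not any provider's API.

```python
from dataclasses import dataclass, replace
from typing import Optional
import uuid


@dataclass(frozen=True)
class ScrapeRequest:
    """Everything a worker needs to fetch one URL, routing included."""
    url: str
    proxy_url: str                        # hypothetical address, e.g. "http://proxy.example:8000"
    country: Optional[str] = None         # target country code, None means "any"
    session_token: Optional[str] = None   # None means rotate on every request

    def with_new_session(self) -> "ScrapeRequest":
        """Return a copy that uses a fresh session token, e.g. for a retry."""
        return replace(self, session_token=uuid.uuid4().hex[:8])

    def in_country(self, country: str) -> "ScrapeRequest":
        """Return a copy routed through a different country."""
        return replace(self, country=country)


base = ScrapeRequest(url="https://example.com/listing",
                     proxy_url="http://proxy.example:8000")

# Fan one job out to several countries by setting a field,
# not by reconfiguring the HTTP client.
fan_out = [base.in_country(c) for c in ("us", "de", "jp")]

# A retry picks a different session without touching the original request.
retry = fan_out[0].with_new_session()
```

Making the object immutable is a design choice rather than a requirement: every retry becomes a new value, so the log of what was tried stays accurate.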
A proxy is a configuration value, not infrastructure. Treat it the same way you treat a per-tenant database connection string.
The benefit goes beyond cleaner code. Once the proxy is a per-request value, you can attach instrumentation to it. Every log line and every metric carries the route it used. When something starts failing, you can answer “is this a target change or a route change?” in one query.
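As a rough sketch of that instrumentation, the snippet below emits one structured log line per request with the route attached. The field names (proxy_host, country, mode) and the helper functions are assumptions chosen for illustration; use whatever labels your logging or metrics stack expects.

```python
import json
import logging

logger = logging.getLogger("scraper")


def route_fields(country, session_token, proxy_host):
    """Labels describing the route a request took (illustrative names)."""
    return {
        "proxy_host": proxy_host,
        "country": country or "any",
        "mode": "sticky" if session_token else "rotating",
    }


def log_outcome(url, status, route):
    """Emit one JSON line per request so failures can be grouped by route."""
    line = json.dumps({"url": url, "status": status, **route}, sort_keys=True)
    logger.info(line)
    return line


line = log_outcome(
    "https://example.com/a",
    403,
    route_fields(country="de", session_token=None, proxy_host="proxy.example"),
)
```

With lines like this, grouping failures by country or mode shows whether a spike follows one route or hits every route equally.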
2. Separate “rotate every request” from “hold a session”
Most scrapers need both behaviours, just on different routes. Rotate every request for breadth: search pages, listing crawls, sitemaps. Hold a session for continuity: login, multi-step forms, cart flows, anything where the target expects the same visitor.
The cleanest way to do this is one credential pattern with a session token appended when you need stickiness. Same account, same dashboard, same code path. A boolean on the job config decides which behaviour wins.
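A sketch of that single code path might look like the following. The "-session-<token>" username suffix is a made-up convention for illustration only; providers encode session tokens differently, so check your provider's documentation for the real format.

```python
import uuid
from typing import Optional
from urllib.parse import quote


def proxy_url(user: str, password: str, host: str, port: int,
              sticky: bool, session_token: Optional[str] = None) -> str:
    """Build a proxy URL. Same credentials either way; a session token
    is appended to the username only when stickiness is requested.
    The suffix format here is hypothetical."""
    username = user
    if sticky:
        token = session_token or uuid.uuid4().hex[:8]
        username = f"{user}-session-{token}"
    # Percent-encode credentials so special characters do not break the URL.
    return f"http://{quote(username, safe='')}:{quote(password, safe='')}@{host}:{port}"


# The boolean would normally come from the job config.
rotating = proxy_url("acct", "secret", "proxy.example", 8000, sticky=False)
sticky = proxy_url("acct", "secret", "proxy.example", 8000,
                   sticky=True, session_token="abc123")
```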
The trap to avoid is keeping a sticky session for longer than the workflow needs. Long-lived sessions burn trust budget on identities that no longer have a job to do. End the session at the same moment the workflow ends.
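One way to tie the session lifetime to the workflow is a context manager, sketched below. The active_sessions set is a stand-in for however your system tracks live sessions; it is an assumption for the example, not a real API.

```python
from contextlib import contextmanager
import uuid

# Stand-in for whatever tracks live sticky sessions in your system.
active_sessions = set()


@contextmanager
def sticky_session():
    """Create a session token that exists only for the duration of the block."""
    token = uuid.uuid4().hex[:8]
    active_sessions.add(token)
    try:
        yield token
    finally:
        # Runs on success and on error, so the session never outlives the workflow.
        active_sessions.discard(token)


with sticky_session() as token:
    inside = token in active_sessions
    # The login, multi-step form, or cart flow would run here using `token`.

after = token in active_sessions
```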
3. Make every retry cheaper than the last
A naive scraper retries the same URL with the same proxy and the same headers. By the third try you have spent three times the budget for the same answer.
Each retry should change one variable: a different session, a different country, a different request shape. Then track which variables actually helped. After a few runs you will know which targets are sensitive to which inputs, and your retry policy can stop guessing.
If you log the variable you changed alongside the outcome, you can build a small decision table that the worker queries before retrying. This is much cheaper than burning bandwidth on identical attempts, and it usually surfaces a configuration win you would not have spotted by looking at one failure at a time.
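The decision table can be very small. The sketch below is one possible shape, assuming three retry variables and a simple win-rate heuristic; the class, the variable names, and the 0.5 prior for unseen variables are all our own illustrative choices.

```python
from collections import defaultdict

# Variables a retry may change (illustrative names).
VARIABLES = ["session", "country", "request_shape"]


class RetryTable:
    """Counts, per target, how often changing each variable led to success."""

    def __init__(self):
        # target -> variable -> [wins, tries]
        self.stats = defaultdict(lambda: {v: [0, 0] for v in VARIABLES})

    def record(self, target, variable, succeeded):
        """Log the variable that was changed alongside the outcome."""
        entry = self.stats[target][variable]
        entry[1] += 1
        if succeeded:
            entry[0] += 1

    def next_variable(self, target, already_tried):
        """Pick the untried variable with the best observed win rate."""
        candidates = [v for v in VARIABLES if v not in already_tried]
        if not candidates:
            return None  # nothing left to change, so stop retrying

        def win_rate(variable):
            wins, tries = self.stats[target][variable]
            return wins / tries if tries else 0.5  # neutral prior when unseen

        return max(candidates, key=win_rate)


table = RetryTable()
table.record("shop.example", "session", succeeded=False)
table.record("shop.example", "country", succeeded=True)

first = table.next_variable("shop.example", already_tried=[])
second = table.next_variable("shop.example", already_tried=[first])
```

Because each retry changes exactly one variable, every recorded outcome can be attributed to that variable, which is what makes the table trustworthy.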
4. Decide your stop condition before you start
The hardest scraper bugs are the ones where the job “succeeds” but the data is wrong. The page rendered, the parser returned rows, the writer pushed them to a warehouse. Two days later someone notices the rows are empty placeholders.
Define a per-target stop condition that has nothing to do with HTTP status. Good candidates are a minimum row count, a hash of a known element, or the presence of a specific text fragment. If the stop condition is not met, the job goes back into the queue with a different proxy, not into your warehouse.
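A stop condition of this kind fits in a few lines. The sketch below assumes a hypothetical StopCondition type with the three checks mentioned above; the names and the "warehouse" and "requeue" labels are illustrative.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class StopCondition:
    """Per-target definition of a good result, independent of HTTP status."""
    min_rows: int = 1
    required_text: Optional[str] = None    # fragment that must appear in the page
    element_sha256: Optional[str] = None   # hash of a known, stable element

    def met(self, html: str, rows: list, known_element: str = "") -> bool:
        if len(rows) < self.min_rows:
            return False
        if self.required_text is not None and self.required_text not in html:
            return False
        if self.element_sha256 is not None:
            digest = hashlib.sha256(known_element.encode("utf-8")).hexdigest()
            if digest != self.element_sha256:
                return False
        return True


def route_result(condition, html, rows, known_element=""):
    """Send data downstream only when the condition holds, else requeue the job."""
    return "warehouse" if condition.met(html, rows, known_element) else "requeue"


cond = StopCondition(min_rows=2, required_text="Add to cart")
good = route_result(cond, "<button>Add to cart</button>", [{"sku": 1}, {"sku": 2}])
# A block page can still return HTTP 200 and parse into placeholder rows.
bad = route_result(cond, "<p>Please verify you are human</p>",
                   [{"sku": None}, {"sku": None}])
```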
Stop conditions also save you from silent target changes. When a layout shifts and your parser starts returning the wrong shape, the stop condition fires before the bad data lands anywhere downstream.
5. Watch consumption alongside success rate
Success rate alone hides regressions. A scraper that goes from 80 percent success to 78 percent success but uses three times the bandwidth is in trouble. The cost moved before the failure did.
Put two charts side by side: requests-per-row and proxy-bytes-per-row. When either climbs, you have a target that has changed under you. Catch the change before the target’s defences harden further and you usually get a cheap fix. Catch it after, and the fix is a rewrite.
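Both numbers are simple ratios, so a drift check can run at the end of every job. In the sketch below the function names, the sample figures, and the 1.5x alert threshold are assumptions for illustration; pick a threshold from your own baseline.

```python
def per_row_metrics(requests_sent: int, proxy_bytes: int, rows_written: int) -> dict:
    """Compute the two consumption ratios for one run."""
    rows = max(rows_written, 1)  # avoid dividing by zero on an empty run
    return {
        "requests_per_row": requests_sent / rows,
        "proxy_bytes_per_row": proxy_bytes / rows,
    }


def drifted(current: dict, baseline: dict, tolerance: float = 1.5) -> list:
    """Names of metrics that exceed the baseline by more than `tolerance` times."""
    return [name for name in current if current[name] > baseline[name] * tolerance]


# Made-up numbers: row count barely moves, but bandwidth triples.
baseline = per_row_metrics(requests_sent=1_000, proxy_bytes=50_000_000, rows_written=800)
today = per_row_metrics(requests_sent=1_100, proxy_bytes=150_000_000, rows_written=780)

alerts = drifted(today, baseline)
```

In this example the row count drops only slightly, so a success-rate chart would look almost flat, while the bytes-per-row ratio roughly triples and trips the check.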
Bringing it together
None of these patterns require exotic tools. They require a scraper that knows which proxy it is using, which session it is in, and what success looks like before it asks the network for anything.
That is the difference between a job you tune once a quarter and one you tune every other night.
If you are starting from scratch on a high-volume crawl, in most cases we recommend Platinum for ISP-quality routing and Premium Unlimited when you want bandwidth to disappear as a concern. Both surface the same credential pattern, so the patterns above port between them without changing your code.
ProxyOmega