Every developer who has built web scrapers knows the pain:
Fragile CSS selectors/XPaths: The target website updates its Tailwind classes or shifts its React component tree, and your data pipeline crashes.
Web Application Firewalls (WAFs): Cloudflare, DataDome, and Akamai block your requests at the edge, returning a 403 Forbidden or challenge page.
We wanted to build a scraping engine that bypasses selectors entirely and stays resilient against anti-bot systems.
Here is the exact architecture we used to build QueryScrape AI using FastAPI, Playwright, and Gemini 2.5 Flash.
🛠️ The Architecture Stack
Our scraper relies on a three-stage pipeline:
The Stealth Crawler (Playwright): Launches a headless Chromium instance with custom user-agents, screen sizes, and browser flags to bypass basic anti-bot blockers and execute client-side JavaScript.
The DOM Cleaner (BeautifulSoup & html2text): Strips noisy scripts, styles, headers, and footers, converting raw HTML into token-efficient Markdown.
The Dynamic Pydantic Schema Compiler & Gemini structured output: Compiles the fields submitted in the API request into a Pydantic model at runtime, then constrains the LLM to return JSON that validates against that model.
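[Diagram: API User → FastAPI Server (POST URL + Schema) → Playwright Stealth Crawler (Fetch Page) → DOM Cleaner (HTML Source) → Gemini 2.5 Flash (Clean Markdown). The FastAPI Server also compiles the requested fields into the Dynamic Pydantic Model, which supplies the output validation schema to Gemini 2.5 Flash; the validated JSON output is returned to the API User.]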
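The crawler stage is the one piece of the pipeline not shown in code in this post, so here is a minimal sketch of what it could look like with Playwright's async API. The user-agent string, viewport size, locale, launch flag, and the names `build_context_options` and `fetch_rendered_html` are illustrative assumptions, not the settings QueryScrape AI actually ships with.

```python
# Assumed values for illustration; the post does not list its real settings.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)
LAUNCH_ARGS = ["--disable-blink-features=AutomationControlled"]


def build_context_options(width: int = 1366, height: int = 768) -> dict:
    """Keyword arguments for browser.new_context(): a desktop-like profile."""
    return {
        "user_agent": DEFAULT_USER_AGENT,
        "viewport": {"width": width, "height": height},
        "locale": "en-US",
    }


async def fetch_rendered_html(url: str, timeout_ms: int = 30_000) -> str:
    """Load a page in headless Chromium and return the DOM after JS has run."""
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, args=LAUNCH_ARGS)
        try:
            context = await browser.new_context(**build_context_options())
            page = await context.new_page()
            # Wait for network activity to settle so client-side content exists.
            await page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return await page.content()
        finally:
            await browser.close()
```

Inside an async request handler this would be called as `html = await fetch_rendered_html(url)`. Settings like these only address basic bot checks; stricter WAF configurations may still serve a challenge page.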
💻 Under The Hood: The Core Code
- Cleaning raw HTML to Markdown
Passing a raw DOM tree with thousands of lines of HTML to an LLM wastes tokens and blows up latency. We strip non-content tags and convert the HTML structure to Markdown:
from bs4 import BeautifulSoup
import html2text


def clean_html(html_content: str) -> str:
    soup = BeautifulSoup(html_content, "html.parser")

    # Remove script/style boilerplate and noisy non-content elements
    for element in soup(["script", "style", "nav", "footer", "header", "svg", "noscript", "iframe"]):
        element.decompose()

    # Convert remaining DOM structure to clean Markdown
    h = html2text.HTML2Text()
    h.ignore_links = False
    h.ignore_images = True
    h.ignore_tables = False
    h.body_width = 0  # Do not wrap lines
    return h.handle(str(soup)).strip()
- Compiling Pydantic schemas dynamically at runtime
When a user calls our API, they pass an array of fields they want to extract, like:
[
    {"name": "title", "type": "string", "description": "The title of the product"},
    {"name": "price", "type": "float", "description": "The numerical price in USD"}
]
We use Pydantic's create_model function to compile these fields into a validated Pydantic class dynamically:
from typing import Any, Dict, List, Type

from pydantic import BaseModel, Field, create_model

# Map the API's type names to Python types; unknown names fall back to str.
TYPE_MAP = {
    "string": str, "integer": int, "float": float, "boolean": bool
}


def compile_schema(fields: List[Dict[str, Any]], is_list: bool = False) -> Type[BaseModel]:
    pydantic_fields = {}
    for f in fields:
        name = f["name"]  # a field without a name cannot be compiled
        type_str = f.get("type", "string").lower()
        desc = f.get("description", f"Extracted value for {name}")
        py_type = TYPE_MAP.get(type_str, str)
        # A (type, Field(...)) tuple with no default makes the field required
        pydantic_fields[name] = (py_type, Field(description=desc))

    ItemModel = create_model("ExtractedItem", **pydantic_fields)
    if is_list:
        # Wrap the item model so many records can be returned under "items"
        return create_model(
            "ExtractedList",
            items=(List[ItemModel], Field(description="A collection of extracted items."))
        )
    return ItemModel
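To make the effect of the compiled model concrete, here is a small sketch of how a model with the two example fields behaves when it receives a payload. It builds the model with `create_model` directly so the snippet runs on its own; the sample payloads are made up for illustration, and Pydantic v2 is assumed.

```python
from pydantic import Field, ValidationError, create_model

# The same shape compile_schema() would build for the two example fields.
Product = create_model(
    "ExtractedItem",
    title=(str, Field(description="The title of the product")),
    price=(float, Field(description="The numerical price in USD")),
)

# In Pydantic's default (lax) mode, a numeric string is coerced to float.
item = Product.model_validate({"title": "Widget", "price": "19.99"})
print(item.model_dump())

# A payload missing a required field is rejected instead of passing through.
try:
    Product.model_validate({"title": "Widget"})
    missing_field_rejected = False
except ValidationError:
    missing_field_rejected = True

# The JSON Schema view of the model lists both fields as required.
required_fields = sorted(Product.model_json_schema()["required"])
```

Because every field is declared without a default, a page that lacks one of the requested values fails validation. If optional fields are wanted, the type map would need to produce `Optional[...]` types with a `None` default.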
- Invoking Gemini with Structured Schema Enforcement
We use the google-genai SDK. Passing the compiled Pydantic class as response_schema in the generation config constrains Gemini's output to that schema, which saves us from writing fragile regex-based parsing and retry loops:
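import os

from google import genai
from google.genai import types


async def extract_structured_data(content: str, fields: list, is_list: bool = False):
    # Assumes the key is exported as GEMINI_API_KEY; adapt to your secrets setup.
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    target_model = compile_schema(fields, is_list)  # defined in the previous snippet
    prompt = f"Extract structured information according to the schema:\n\n{content}"

    response = await client.aio.models.generate_content(
        model="gemini-2.5-flash",
        contents=prompt,
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=target_model,
            temperature=0.1
        )
    )
    # response.parsed can be None if the reply could not be parsed into the model
    if response.parsed is None:
        raise ValueError("Gemini returned no output matching the schema")
    return response.parsed.model_dump()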
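To show how the three stages connect, here is a hedged sketch of an orchestration function. `run_pipeline` and its parameters are hypothetical names; in the real service the three callables would be the Playwright crawler, `clean_html`, and `extract_structured_data`, while the demo below uses stand-in stubs so it runs without a browser or an API key.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

FieldSpecs = List[Dict[str, Any]]


async def run_pipeline(
    url: str,
    fields: FieldSpecs,
    is_list: bool,
    *,
    fetch: Callable[[str], Awaitable[str]],
    clean: Callable[[str], str],
    extract: Callable[[str, FieldSpecs, bool], Awaitable[Dict[str, Any]]],
) -> Dict[str, Any]:
    """Crawl -> clean -> extract, with each stage passed in as a callable."""
    html = await fetch(url)
    markdown = clean(html)
    if not markdown:
        # Nothing survived cleaning, so skip the LLM call entirely.
        raise ValueError(f"No extractable content at {url}")
    return await extract(markdown, fields, is_list)


async def demo() -> Dict[str, Any]:
    # Stand-in stages (assumptions) that mimic the shape of the real ones.
    async def fake_fetch(url: str) -> str:
        return "<h1>Widget</h1>"

    def fake_clean(html: str) -> str:
        return "# Widget"

    async def fake_extract(content: str, fields: FieldSpecs, is_list: bool) -> Dict[str, Any]:
        return {"title": content.lstrip("# ")}

    fields = [{"name": "title", "type": "string"}]
    return await run_pipeline(
        "https://example.com", fields, False,
        fetch=fake_fetch, clean=fake_clean, extract=fake_extract,
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Passing the stages in as arguments keeps the flow testable without network access; a FastAPI handler could call `run_pipeline` with the real functions.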
🛡️ Bypassing Edge Firewalls (WAF)
To show this in action, we built a public WAF Shield Detector tool directly into our landing page.
It does two things:
Performs a naive static HTTP request (which readily exposes WAF response headers like cf-ray or x-datadome-cid and often comes back as a 403 Forbidden page).
Runs our dynamic stealth-Playwright crawler (which loads the target page, executes JS, and extracts clean markdown).
The side-by-side comparison on our playground shows how a browser-based crawler can get past edge bot-guards that stop a plain HTTP client, and then feed clean content to the LLM for structured extraction.
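The static check in the first step can be approximated with the standard library alone. The sketch below is an assumption about how such a detector might work, not the code behind the playground: `classify_response` and `probe` are hypothetical names, the header table only covers the two headers named above, and treating 403 and 429 as "blocked" is our own simplification.

```python
import urllib.error
import urllib.request

# Header names mentioned in this post, mapped to the vendor they hint at.
WAF_HEADER_HINTS = {
    "cf-ray": "Cloudflare",
    "x-datadome-cid": "DataDome",
}
BLOCK_STATUSES = {403, 429}  # assumption: statuses we count as "blocked"


def classify_response(status: int, headers: dict) -> dict:
    """Report which WAF vendors the headers hint at and whether we look blocked."""
    lowered = {name.lower() for name in headers}
    vendors = sorted({v for h, v in WAF_HEADER_HINTS.items() if h in lowered})
    return {"status": status, "vendors": vendors, "blocked": status in BLOCK_STATUSES}


def probe(url: str, timeout: float = 10.0) -> dict:
    """Send one plain GET (no browser, no JavaScript) and classify the reply."""
    request = urllib.request.Request(url, headers={"User-Agent": "python-urllib"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return classify_response(resp.status, dict(resp.headers.items()))
    except urllib.error.HTTPError as err:
        # 4xx/5xx replies are raised as exceptions but still carry headers.
        return classify_response(err.code, dict(err.headers.items()))
```

A header such as cf-ray only shows that a WAF vendor sits in front of the site; it is the status code (or a challenge page in the body) that indicates the request was actually blocked.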
📈 Learn More & Try it Out
We have deployed the fully operational API pipeline on AWS App Runner.
👉 Try the live playground, check your target URLs for WAF headers, and get 1,000 free API extractions per month: 🔗 Live Demo Playground
The full source code for the server and playground is open source. What are your thoughts on using generative models for resilient data pipelines? Let's discuss in the comments below!













