I spent two weeks fighting a website that refused to play nice. Every scrape came back with missing columns, merged cells, or data that looked right but was subtly wrong. I tried BeautifulSoup, regex, pandas read_html – all the usual suspects. Nothing worked reliably because the HTML was a nightmare of inline styles, rowspan attributes, and inconsistent class names.
Let me back up. I needed to pull apartment rental data from a dozen property management sites. Most were fine, but one particular site (let's call it ShadowListings) had a table that looked like this when rendered:
| Unit | Price | Beds | Available |
|---|---|---|---|
| 101 | $1200 | 1 | 2024-03-01 |
| 102 | $1500 | 2 | 2024-04-01 |
But the underlying HTML was a mess:
<table>
<tr>
<td rowspan="2">Unit</td>
<td class="price-col">$1200</td>
<td>1</td>
<td><span>2024-03-01</span></td>
</tr>
<tr>
<td>$1500</td>
<td>2</td>
<td>2024-04-01</td>
</tr>
</table>
The first row had a rowspan="2" that threw off column alignment. And the class names changed between pages. My standard approach broke instantly.
What I Tried That Didn't Work
BeautifulSoup + manual column detection: I tried walking the DOM, tracking rowspan and colspan offsets. It worked for one page but broke on the next because the structure changed slightly.
Regex on the text: I thought I could just match price patterns and guess the rest. But dates were in multiple formats, and some rows had no price at all.
Pandas pd.read_html: It actually parsed the first table correctly, but then failed on sub-tables and returned a DataFrame with NaN everywhere because of misaligned cells.
Headless browser + OCR: Overkill, slow, and still needed heuristics to extract the actual values.
After a week of frustration, I started thinking: the human eye can instantly understand this table. Why? Because we don't parse HTML structure — we read the visual layout and use context. That's when I turned to AI.
What Eventually Worked: LLM-Powered Table Extraction
Instead of trying to fix the parser, I gave the raw HTML (or even a screenshot) to an LLM and asked it to extract the data. The trick was to use few-shot prompting with an example of the desired output.
Here's the approach I settled on:
- Fetch the HTML as text (or use a screenshot if the site uses JavaScript rendering).
- Strip out irrelevant tags (scripts, styles) to reduce token count.
- Ask the LLM to output a JSON array of objects, with explicit column names.
- Validate the output with a schema (e.g., price must be a string, date must match ISO format).
My first attempt used GPT-4, but even 3.5 worked fine for simple tables. The key was the prompt:
import openai
def extract_table_from_html(html_text):
prompt = f"""
You are given raw HTML of a table. Extract the data into a JSON array of objects, where each object has keys: "unit", "price", "beds", "available_date".
Ignore any formatting issues like rowspan or colspan. Use the visual layout to infer columns.
HTML:
{html_text[:3000]} # Trim to avoid token limits
Output ONLY valid JSON. No explanations.
"""
response = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": prompt}],
temperature=0
)
return response.choices[0].message.content
Wait, that naive version had problems: the LLM sometimes invented data or hallucinated columns. I needed few-shot examples and a validation step. Here's the refined version:
import json
from pydantic import BaseModel
def extract_and_validate(html):
few_shot = """
Example 1:
Input: <table>...</table>
Output: [{"unit":"A1","price":"$1000","beds":"1","available_date":"2024-01-01"}]
Now do the same for this HTML:
"""
prompt = few_shot + html[:4000]
# Call LLM
raw = call_llm(prompt)
# Parse JSON safely
try:
data = json.loads(raw)
except json.JSONDecodeError:
# fallback: ask again with stricter prompt
return retry_with_stricter(html)
# Validate each row using Pydantic
class Row(BaseModel):
unit: str
price: str
beds: str
available_date: str
validated = [Row(**row).dict() for row in data]
return validated
I ended up using the OpenAI API with a small wrapper I built. The cost was negligible - about $0.02 per page. Much cheaper than manual data entry.
Lessons Learned / Trade-offs
- Latency: Each extraction takes 1-3 seconds. If you need to scrape 10,000 pages, that's hours and dollars. Not ideal for bulk.
- Cost: At scale, it adds up. I used it only for the difficult 5% of pages where traditional parsing failed.
- Accuracy: LLMs still make mistakes on ambiguous layouts. Adding few-shot examples helped, but sometimes it missed a column or merged two rows. I always run a sanity check (e.g., ensure price contains "$").
- Feedback loop: I saved failures and retrained my prompt. Over time, extraction became more reliable.
- Privacy: Sending HTML to a third-party API? Not always allowed. I used an internal LLM (via a service like the one at ai.interwestinfo.com) for sensitive pages.
What I'd Do Differently Next Time
- Try a smaller, fine-tuned model first. For a fixed table format, you could train a tiny model to output structured data directly, avoiding the overhead of a general LLM.
- Use vision models for rendered tables. If the HTML is too mangled, a screenshot can be passed to GPT-4 Vision, which often does better at understanding layout.
- Add human-in-the-loop for edge cases. I built a simple web UI that flagged uncertain extractions (e.g., confidence score < 0.8). Human review of those 10% kept quality high.
- Cache aggressively. I cached extracted data by page URL so I only paid once per page.
When NOT to Use This Approach
- If your HTML is well-structured and consistent, use pandas or BeautifulSoup. They're faster, free, and deterministic.
- If you need real-time extraction (e.g., live dashboard), LLM latency kills you.
- If your data contains PII or trade secrets, think twice about sending it to an external API.
This technique saved my project. It turned an impossible scraping task into a weekend project. Now I treat LLMs as my "parser of last resort" — only for the messy cases that resist all logic.
What's your approach when traditional parsing fails? I'd love to hear if you've experimented with LLMs for data extraction.












