This guide covers extracting publicly accessible data. Always review a site's robots.txt and Terms of Service before scraping.
## TL;DR
Use AlterLab's Extract API to get structured JSON from AliExpress. Define a JSON schema for the fields you need (title, price, currency, SKU, availability), POST the URL and schema to `https://api.alterlab.io/v1/extract`, and receive validated, typed JSON with no HTML parsing required.
## Why use AliExpress data?
AliExpress hosts millions of product listings that are useful for:
- Training machine learning models on e‑commerce product attributes
- Building price‑tracking analytics dashboards for competitive intelligence
- Enriching recommendation systems with real‑time catalog data
These use cases rely on fresh, structured data rather than raw HTML, which is why a data API approach saves engineering time.
## What data can you extract?
From any publicly accessible AliExpress product or search page you can pull:
- title – product name as displayed
- price – current sale price as a string
- currency – three‑letter ISO code (e.g., USD, EUR)
- sku – seller‑specific stock keeping unit
- availability – in‑stock status or shipping estimate
- rating – average star rating and review count
- image URLs – primary and gallery images

All fields are optional in your schema; AlterLab returns only what you request, typed according to the schema definition.

## The extraction approach
Traditional scraping with raw HTTP + HTML parsing is fragile because:
- AliExpress frequently updates its front‑end markup
- JavaScript‑rendered content requires a headless browser
- Anti‑bot mechanisms (CAPTCHAs, rate limits) block simple requests

AlterLab handles these challenges automatically: rotating proxies, automatic retries, JavaScript rendering, and AI‑driven data understanding. You interact with a clean data API instead of maintaining parsers.

## Quick start with AlterLab Extract API
First, install the Python SDK (or use cURL directly). See the Getting started guide for setup.
### Python example

```python title="extract_aliexpress-com.py" {5-12}
import alterlab  # AlterLab Python SDK (see the Getting started guide)

client = alterlab.Client("YOUR_API_KEY")

schema = {
    "type": "object",
    "properties": {
        "title": {
            "type": "string",
            "description": "The product title"
        },
        "price": {
            "type": "string",
            "description": "Current price"
        },
        "currency": {
            "type": "string",
            "description": "ISO currency code"
        },
        "sku": {
            "type": "string",
            "description": "Seller SKU"
        },
        "availability": {
            "type": "string",
            "description": "In stock status"
        },
        "rating": {
            "type": "string",
            "description": "Average rating"
        }
    }
}

# Send the product URL together with the schema describing the wanted fields
result = client.extract(
    url="https://www.aliexpress.com/item/1005005498765432.html",
    schema=schema,
)
print(result.data)
```

The highlighted lines show schema definition and the extract call. The SDK returns a `result.data` object that matches the schema exactly.
### cURL example
```bash title="Terminal"
curl -X POST https://api.alterlab.io/v1/extract \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.aliexpress.com/item/1005005498765432.html",
    "schema": {
      "properties": {
        "title": {"type": "string"},
        "price": {"type": "string"},
        "currency": {"type": "string"}
      }
    }
  }'
```

This produces a JSON response like:

```json
{
  "title": "Wireless Bluetooth Earphones",
  "price": "12.99",
  "currency": "USD"
}
```
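
Because `price` is returned as a string, downstream code usually has to convert it before comparing or aggregating values. The following is a minimal sketch of that step, not part of the AlterLab SDK: `parse_price` is a hypothetical helper, and it assumes the string holds a plain number with either a dot or a comma as the decimal separator and no thousands separators.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_price(raw: str) -> Optional[Decimal]:
    """Convert a price string such as "12.99" or "12,99" to a Decimal.

    Hypothetical helper (assumption): at most one decimal separator and
    no thousands separators. Returns None for anything else.
    """
    cleaned = raw.strip().replace(",", ".")
    if cleaned.count(".") > 1:
        return None  # e.g. "1.234,56" is ambiguous under this assumption
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    # Reject "NaN" and "Infinity", which Decimal would otherwise accept
    return value if value.is_finite() else None


record = {"title": "Wireless Bluetooth Earphones", "price": "12.99", "currency": "USD"}
price = parse_price(record["price"])
print(price)
```

`Decimal` is used rather than `float` so that prices keep their exact decimal representation when summed or compared.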
### Batch/async example (Python)

For high‑volume pipelines, use the async endpoint to process many URLs in parallel:

```python title="batch_aliexpress.py" {8-15}
import asyncio

import alterlab  # AlterLab Python SDK

client = alterlab.Client("YOUR_API_KEY")

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "string"},
        "availability": {"type": "string"}
    }
}


async def extract_one(url):
    # Extract a single product page against the shared schema
    return await client.extract_async(url=url, schema=schema)


async def main():
    urls = [
        "https://www.aliexpress.com/item/1005005498765432.html",
        "https://www.aliexpress.com/item/1005005601234567.html",
        "https://www.aliexpress.com/item/1005005709876543.html",
    ]
    # Run all extractions concurrently; results keep the order of `urls`
    results = await asyncio.gather(*[extract_one(u) for u in urls])
    for r in results:
        print(r.data)


asyncio.run(main())
```

The `extract_async` method returns a coroutine, and `asyncio.gather` runs the coroutines concurrently; AlterLab's rate limits are respected internally.
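
If you also want to cap concurrency on the client side, a semaphore around each call is one common pattern. The sketch below is illustrative only: `fake_extract` is a stand-in for `client.extract_async` so the example runs without an API key, and `MAX_IN_FLIGHT` is an arbitrary value, not an AlterLab recommendation.

```python
import asyncio

MAX_IN_FLIGHT = 2  # arbitrary cap chosen for illustration


async def fake_extract(url: str) -> dict:
    """Stand-in for client.extract_async (assumption); returns a dummy record."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"url": url, "title": "placeholder"}


async def extract_all(urls: list, limit: int = MAX_IN_FLIGHT) -> list:
    """Run fake_extract over urls with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str):
        async with semaphore:
            return await fake_extract(url)

    # return_exceptions=True returns failures as exception objects in the
    # result list instead of raising the first one.
    return await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)


urls = [
    "https://www.aliexpress.com/item/1005005498765432.html",
    "https://www.aliexpress.com/item/1005005601234567.html",
    "https://www.aliexpress.com/item/1005005709876543.html",
]
results = asyncio.run(extract_all(urls))
print(len(results))
```

In a real pipeline you would replace `fake_extract` with the SDK call and inspect the result list for exception objects before storing anything.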
## Define your schema
The schema parameter follows JSON Schema Draft‑07. AlterLab validates the extracted content against it and coerces types where possible (e.g., returning a price displayed as 12,99 as the string "12,99" when the field is declared as `string`). You can add nested objects or arrays if the page contains lists (e.g., multiple image URLs). A typical e‑commerce schema looks like:
```json
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "price": {"type": "string"},
    "currency": {"type": "string"},
    "sku": {"type": "string"},
    "availability": {"type": "string"},
    "rating": {"type": "string"},
    "images": {
      "type": "array",
      "items": {"type": "string"}
    }
  },
  "required": ["title", "price", "currency"]
}
```

Fields listed under `required` are the ones AlterLab must be able to extract with confidence. When it cannot, the API returns a partial result with the missing fields omitted.
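
Since a response may omit fields, it can help to check each record against the schema's `required` list before loading it into a pipeline. Here is a minimal sketch in plain Python: `missing_required` is a hypothetical helper, and treating empty strings as missing is an assumption rather than documented API behavior.

```python
def missing_required(record: dict, schema: dict) -> list:
    """Return the names of required fields that are absent or empty in record."""
    required = schema.get("required", [])
    # Assumption: None and "" both count as "not extracted"
    return [name for name in required if record.get(name) in (None, "")]


schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "string"},
        "currency": {"type": "string"},
    },
    "required": ["title", "price", "currency"],
}

# A partial record with the currency field omitted
partial = {"title": "Wireless Bluetooth Earphones", "price": "12.99"}
print(missing_required(partial, schema))  # prints ['currency']
```

Records with a non-empty result can then be routed to a retry queue or logged instead of silently entering the dataset.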
## Handle pagination and scale
AliExpress search results paginate via URL parameters like `page=2`. To scrape a category:
- Generate a list of page URLs.
- Fire extract requests in batches (e.g., 50 URLs per batch).
- Use AlterLab’s built‑in concurrency or your own async worker pool.
- Store each JSON line in a data lake (e.g., AWS S3) for downstream processing.
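
The steps above can be sketched with the standard library alone. Everything here is illustrative: the search URL is a placeholder rather than a real AliExpress endpoint, the helper names (`page_urls`, `batches`, `write_jsonl`) are hypothetical, and an in-memory buffer stands in for the data lake upload.

```python
import io
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(base_url: str, first: int, last: int) -> list:
    """Return one URL per page, setting page=first..last in the query string."""
    parts = urlsplit(base_url)
    # Drop any existing page parameter so it is not duplicated
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "page"]
    urls = []
    for page in range(first, last + 1):
        new_query = urlencode(query + [("page", str(page))])
        urls.append(urlunsplit(parts._replace(query=new_query)))
    return urls


def batches(items: list, size: int):
    """Yield consecutive slices of items with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_jsonl(records, handle) -> None:
    """Write one JSON object per line to an open text handle."""
    for record in records:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder search URL; substitute the category or search URL you target.
urls = page_urls("https://example.com/search?q=earphones", first=1, last=120)
groups = list(batches(urls, 50))
print([len(g) for g in groups])  # prints [50, 50, 20]

buffer = io.StringIO()  # stands in for a file or an object-store upload body
write_jsonl([{"title": "Wireless Bluetooth Earphones", "price": "12.99"}], buffer)
print(buffer.getvalue(), end="")
```

Each batch would be passed to the extract calls shown earlier, and the resulting records written out as JSON Lines so downstream jobs can stream them one record at a time.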
Cost scales linearly with successful requests. Check the AlterLab pricing page for per‑request rates; there are no hidden fees and unused credits roll over indefinitely.
## Key takeaways
- AlterLab’s Extract API turns any public AliExpress page into typed JSON with a single POST.
- Define a JSON schema to get exactly the fields you need, validated and ready for pipelines.
- The platform manages proxies, JavaScript rendering, and anti‑bot bypass so you focus on data usage, not maintenance.
- Start with the Python SDK or cURL, then scale with async batch jobs for production workloads.
- Always verify that your target pages are public and comply with AliExpress’s robots.txt and Terms of Service.













