BATCH INFERENCE

PROCESS MILLIONS.
PAY HALF.

Submit a file of requests, get results back when they're done. Async bulk inference at 50% off real-time pricing. Built for workloads that can trade latency for cost.

CONTACT SALES

JSONL IN. JSONL OUT.

One request per line. Each with a custom_id for result correlation. Upload the file, poll for completion, download the output. Same OpenAI-compatible format you already use.

GET STARTED
from openai import OpenAI

client = OpenAI(
    base_url="https://api.haimaker.ai/v1",
    api_key="your-api-key",
)

# Upload input file
batch_file = client.files.create(
    file=open("requests.jsonl", "rb"),
    purpose="batch",
)

# Create batch job
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
# Upload input file
curl https://api.haimaker.ai/v1/files \
  -H "Authorization: Bearer your-api-key" \
  -F purpose="batch" \
  -F file="@requests.jsonl"

# Create batch job
curl https://api.haimaker.ai/v1/batches \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "input_file_id": "file-abc123",
    "endpoint": "/v1/chat/completions",
    "completion_window": "24h"
  }'
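Once the job is created, it runs asynchronously: poll until it reaches a terminal state, then download the output file. A minimal sketch continuing the Python example above, assuming the standard OpenAI batch lifecycle statuses (`validating`, `in_progress`, `completed`, `failed`, `expired`, `cancelled`) and the SDK's `batches.retrieve` / `files.content` methods:

```python
import time

# Terminal states in the OpenAI batch lifecycle (assumed here)
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def is_terminal(status: str) -> bool:
    """True once a batch can no longer change state."""
    return status in TERMINAL_STATUSES

def wait_for_batch(client, batch_id: str, poll_seconds: int = 60):
    """Poll until the batch reaches a terminal state, then return it."""
    batch = client.batches.retrieve(batch_id)
    while not is_terminal(batch.status):
        time.sleep(poll_seconds)
        batch = client.batches.retrieve(batch_id)
    return batch

# Usage, continuing the example above:
# batch = wait_for_batch(client, batch.id)
# if batch.status == "completed":
#     client.files.content(batch.output_file_id).write_to_file("results.jsonl")
```

A 60-second poll interval is plenty for a job with a 24-hour completion window; there is no benefit to polling faster.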
# requests.jsonl — one request per line
{"custom_id": "req-1", "body": {"model": "meta-llama/Llama-3.1-8B", "messages": [{"role": "user", "content": "Summarize this document..."}]}}
{"custom_id": "req-2", "body": {"model": "meta-llama/Llama-3.1-8B", "messages": [{"role": "user", "content": "Classify this ticket..."}]}}
{"custom_id": "req-3", "body": {"model": "meta-llama/Llama-3.1-8B", "messages": [{"role": "user", "content": "Extract entities from..."}]}}
01

50% COST SAVINGS

Batch jobs fill idle GPU capacity during off-peak hours. You get the same models and the same output quality at half the per-token price. The discount comes from scheduling flexibility, not corners cut.

02

SIMPLE FORMAT

JSONL in, JSONL out. Every request gets a custom_id that maps directly to its result. No new SDKs, no new APIs to learn. If you can write a for loop, you can build a batch job.
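That for loop is literally all it takes. A sketch that turns a list of prompts into a batch input file (the model name matches the example above; the `method`/`url` fields follow the OpenAI batch input format):

```python
import json

def build_batch_file(prompts, path="requests.jsonl",
                     model="meta-llama/Llama-3.1-8B"):
    """Write one chat-completion request per prompt, each tagged
    with a custom_id so results can be matched back later."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts, start=1):
            request = {
                "custom_id": f"req-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            f.write(json.dumps(request) + "\n")
```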

03

BUILT FOR SCALE

50,000 requests per batch. 100MB file uploads. 24-hour best-effort SLA. Separate rate limits from real-time traffic so your batch jobs never starve your production endpoints.
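Workloads bigger than the 50,000-request ceiling split naturally into multiple jobs. A generic chunking helper (the limit is the one stated above):

```python
def chunk(requests, max_per_batch=50_000):
    """Yield successive slices of at most max_per_batch requests,
    one slice per batch job."""
    for start in range(0, len(requests), max_per_batch):
        yield requests[start:start + max_per_batch]
```

Submit one batch per chunk; the separate rate limits mean even a dozen concurrent jobs won't touch your real-time quota.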

BUILT FOR

Workloads that value cost over latency.

BULK EVALUATION

Run eval suites across model versions, prompt variations, and parameter sweeps. Compare thousands of outputs without burning through your real-time budget.
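A sweep like that is just a nested loop, with the custom_id encoding each combination so results sort themselves. A sketch sweeping prompts against temperature values (both are illustrative placeholders):

```python
import itertools

def sweep_requests(prompts, temperatures,
                   model="meta-llama/Llama-3.1-8B"):
    """One request per (prompt, temperature) pair; the custom_id
    records which combination produced each result."""
    requests = []
    for (p_i, prompt), temp in itertools.product(enumerate(prompts),
                                                 temperatures):
        requests.append({
            "custom_id": f"prompt{p_i}-temp{temp}",
            "body": {
                "model": model,
                "temperature": temp,
                "messages": [{"role": "user", "content": prompt}],
            },
        })
    return requests
```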

DOCUMENT PROCESSING

Summarize, classify, or extract from millions of documents overnight. Legal discovery, medical records, support tickets — anything that sits in a queue.

DATA ENRICHMENT

Add AI-generated annotations, embeddings, or metadata to your datasets. Enrich your data warehouse while your team sleeps.

READY TO PROCESS AT SCALE?

Tell us about your workload and we'll set up batch processing for your team.

CONTACT SALES