Batch Processing Guide¶
Understanding Batch Processing¶
Batch processing allows you to process multiple media files in a single operation, offering significant efficiency for large-scale ingestion and management of your content. Unlike individual processing, batch operations happen asynchronously, allowing you to monitor progress and retrieve results when processing completes.
Key Concepts¶
Asynchronous Processing¶
When you submit a batch job, your files enter a processing queue. The API returns immediately with a batch ID, allowing your application to continue other operations while processing happens in the background.
Batch Lifecycle¶
- Creation: Submit files with configuration options
- Processing: Files are processed asynchronously
- Monitoring: Check status to track progress
- Completion: Retrieve results when processing is finished
File Lifecycle¶
When batch processing files:
- Upload: Original files are temporarily stored for processing
- Processing: Files are analyzed and vector embeddings are generated
- Deletion: Original files are permanently deleted after processing
- Storage: The vectors and rich metadata are stored in your storage for retrieval and searching
Important: The Wickson API only stores the derived vector data and metadata - not your original files/content! Please plan accordingly and maintain your own copies of important files.
Batch Status States¶
- created: Batch has been created but processing hasn't started
- running: Batch is actively being processed
- completed: All items have been processed
- failed: Batch processing has encountered a critical error
- cancelled: Batch processing was cancelled by user
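The last three states are terminal. A small helper (a sketch, not part of any official SDK) makes polling loops clearer by distinguishing terminal from in-flight states:

```python
# Terminal states: the batch will not change further, so polling can stop.
TERMINAL_STATES = {"completed", "failed", "cancelled"}

def is_terminal(status: str) -> bool:
    """Return True once a batch has reached a final state."""
    return status in TERMINAL_STATES

print(is_terminal("running"))    # False
print(is_terminal("completed"))  # True
```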
Working with Batch Processing¶
Creating a Batch¶
```python
import requests

API_KEY = "YOUR_API_KEY"
API_URL = "https://api.wickson.ai/v1/batches"

# Prepare headers
headers = {
    "X-Api-Key": API_KEY
}

# Create form data
data = {
    "force_overwrite": "false",
    "collection_id": "quarterly_reports"
}

# Use context managers to handle file opening and closing
with open('report1.pdf', 'rb') as f1, open('report2.pdf', 'rb') as f2, open('report3.pdf', 'rb') as f3:
    # Prepare files
    files = [
        ('file', ('report1.pdf', f1)),
        ('file', ('report2.pdf', f2)),
        ('file', ('report3.pdf', f3))
    ]

    # Submit batch
    response = requests.post(API_URL, headers=headers, data=data, files=files)
    response_data = response.json()

# Get batch ID for tracking
batch_id = response_data["data"]["batch_id"]
print(f"Batch created: {batch_id}")
print(f"Estimated completion: {response_data['data']['estimated_completion']}")
```
Monitoring Batch Status¶
```python
import requests
import time

def check_batch_status(batch_id, api_key):
    status_url = f"https://api.wickson.ai/v1/batches/{batch_id}"
    headers = {"X-Api-Key": api_key}
    response = requests.get(
        status_url,
        headers=headers,
        params={"include_job_details": True}
    )
    status_data = response.json()["data"]
    return status_data

# Poll status until complete
while True:
    status_data = check_batch_status(batch_id, API_KEY)
    completion_percentage = status_data["progress"]["percentage"]
    status = status_data["status"]
    print(f"Status: {status} - {completion_percentage:.1f}% complete")
    if status in ["completed", "failed", "cancelled"]:
        break
    # Wait before checking again
    time.sleep(10)
```
Retrieving Batch Results¶
```python
import requests

def get_batch_results(batch_id, api_key):
    results_url = f"https://api.wickson.ai/v1/batches/{batch_id}/results"
    headers = {"X-Api-Key": api_key}
    response = requests.get(results_url, headers=headers)
    results = response.json()["data"]
    return results

# Get detailed results
results = get_batch_results(batch_id, API_KEY)

# Print summary
print(f"Batch: {results['batch_summary']['batch_id']}")
print(f"Total items: {results['batch_summary']['total_items']}")
print(f"Successful: {results['batch_summary']['successful']}")
print(f"Failed: {results['batch_summary']['failed']}")

# Process successful results
for item in results["results"]:
    if item["status"] == "completed":
        print(f"Processed: {item['file_info']['filename']}")
        print(f"Media ID: {item['media_id']}")
        print(f"Type: {item['media_type']}")
        print("-" * 40)

# Handle failures if any
if results["batch_summary"]["failed"] > 0:
    print("\nFailed items:")
    for item in results["results"]:
        if item["status"] == "failed":
            print(f"File: {item['file_path']}")
            print(f"Error: {item['error']['message'] if 'error' in item else 'Unknown error'}")
            print("-" * 40)
```
Cancelling a Batch¶
```python
import requests

def cancel_batch(batch_id, api_key):
    cancel_url = f"https://api.wickson.ai/v1/batches/{batch_id}/state"
    headers = {
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    }
    # Simple PUT request
    response = requests.put(cancel_url, headers=headers)
    return response.json()

# Cancel an in-progress batch
result = cancel_batch(batch_id, API_KEY)
print(f"Batch {result['data']['batch_id']} state changed:")
print(f"Previous: {result['data']['previous_state']}, Current: {result['data']['current_state']}")
```
Best Practices¶
Optimizing Batch Size¶
- Ideal Batch Size: 5-20 items per batch for the best balance between efficiency and manageability
- Maximum Size: 25 files or 100 MB of data per batch operation, whichever limit is reached first
- Similar Media Types: Group similar types together for better processing efficiency
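These limits can be enforced client-side before submission. The sketch below (the helper name and the use of `os.path.getsize` are ours, not part of the API) splits a list of file paths into batches that respect both the 25-file and 100 MB caps:

```python
import os

MAX_FILES = 25
MAX_BYTES = 100 * 1024 * 1024  # 100 MB

def chunk_files(paths):
    """Split paths into batches of at most MAX_FILES files and MAX_BYTES total size."""
    batches, current, current_size = [], [], 0
    for path in paths:
        size = os.path.getsize(path)
        # Start a new batch if adding this file would exceed either limit
        if current and (len(current) >= MAX_FILES or current_size + size > MAX_BYTES):
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each resulting batch can then be submitted separately, reusing the creation code shown earlier.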
Effective Organization¶
- Collection Strategy: Use logical collections to organize related content
- Batch Naming: The system assigns batch names, but you can trace batches by collection and timestamps
- Error Management: Plan for handling both complete batch failures and individual item failures
Resource Management¶
- Parallelism: System automatically manages optimal parallelism for your batch
- Preprocessing: Validate files before submission to avoid errors
- Balance Checking: Ensure sufficient account balance for all batch operations
Error Handling¶
Implementing Robust Error Handling¶
```python
import random
import time
from pathlib import Path

import requests

def process_batch_with_retry(files, collection_id, api_key, max_retries=3):
    # Submission with retry
    for attempt in range(max_retries):
        # Open file handles for this attempt and ensure they are closed afterwards
        handles = [Path(f).open('rb') for f in files]
        try:
            # Prepare submission
            headers = {"X-Api-Key": api_key}
            data = {"collection_id": collection_id}
            files_data = [('file', (Path(f).name, h)) for f, h in zip(files, handles)]

            # Submit batch
            response = requests.post(
                "https://api.wickson.ai/v1/batches",
                headers=headers,
                data=data,
                files=files_data,
                timeout=60  # Reasonable timeout
            )
            response.raise_for_status()
            return response.json()["data"]["batch_id"]
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter
            wait_time = (2 ** attempt) + (0.1 * random.random())
            print(f"Attempt {attempt+1} failed, retrying in {wait_time:.1f} seconds")
            time.sleep(wait_time)
        finally:
            for h in handles:
                h.close()
```
Common Error Scenarios¶
| Error | Description | Resolution |
|---|---|---|
| `validation_error` | Invalid input parameters | Check file formats, batch size limits |
| `insufficient_funds` | Account balance too low | Add funds to cover estimated processing cost |
| `batch_too_large` | Exceeds maximum batch size | Split into smaller batches (max 25 files per batch) |
| `format_error` | Unsupported file format | Check supported formats list and convert files |
| `corrupted_file` | File integrity issues | Verify files before submission |
| `processing_error` | Error during media processing | Check specific error messages for details |
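When inspecting failed items, a lookup table built from the codes above can turn raw error codes into actionable messages. The helper itself is illustrative; only the error code strings come from the table:

```python
RESOLUTIONS = {
    "validation_error": "Check file formats and batch size limits.",
    "insufficient_funds": "Add funds to cover the estimated processing cost.",
    "batch_too_large": "Split into smaller batches (max 25 files per batch).",
    "format_error": "Check the supported formats list and convert files.",
    "corrupted_file": "Verify file integrity before submission.",
    "processing_error": "Check the specific error message for details.",
}

def suggest_fix(error_code: str) -> str:
    """Map a batch error code to a suggested resolution."""
    return RESOLUTIONS.get(error_code, "Unrecognized error; check the batch's detailed results.")
```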
Performance Considerations¶
Processing Time Factors¶
- File Size: Larger files take longer to process
- File Complexity: Complex documents, high-resolution images, and longer videos take more time
- Media Type: Different media types have different processing characteristics:
- Documents: Generally fastest (depends on page count)
- Images: Fast to medium processing time
- Audio: Medium processing time (depends on length)
- Video: Longest processing time (depends on length and resolution)
Estimated Processing Times¶
| Media Type | Average Processing Time | Factors |
|---|---|---|
| Document (10 pages) | ~15 seconds | Page count, complexity, format |
| Image | ~15 seconds | Resolution, complexity |
| Audio (3 minutes) | ~30 seconds | Duration, audio quality |
| Video (2 minutes) | ~30 seconds | Duration, resolution, complexity |
| 3D Model | ~30 seconds | Complexity, vertex count |
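Using the averages from the table above, you can form a rough upper-bound estimate of how long a batch will take. This is a sketch under the assumption that items are processed serially; in practice the service processes items in parallel, so real wall-clock time is usually lower:

```python
# Rough per-item averages from the table above, in seconds
AVG_SECONDS = {"document": 15, "image": 15, "audio": 30, "video": 30, "3d_model": 30}

def estimate_batch_seconds(media_types):
    """media_types: list of media-type strings. Returns a rough serial-time estimate."""
    return sum(AVG_SECONDS[t] for t in media_types)

print(estimate_batch_seconds(["document", "document", "video"]))  # 60
```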
Cost Considerations¶
Batch processing uses the same pricing model as individual processing:
| Operation | Cost | Notes |
|---|---|---|
| Media Processing | $0.03 per file | Varies by media type |
| Database I/O | $0.01 per operation | One-time storage fee |
| Total per file | $0.04 | No recurring charges |
Batch operations have no additional costs beyond the per-file charges. Failed items are not charged.
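The flat per-file pricing makes the cost of a batch easy to estimate up front. A minimal sketch, using the rates from the table (actual charges vary by media type, as noted, and failed items are not billed, so this is an upper bound):

```python
PROCESSING_COST = 0.03  # per file
DB_IO_COST = 0.01       # per file, one-time storage fee

def estimate_batch_cost(num_files: int) -> float:
    """Upper-bound cost estimate in USD for a batch of num_files items."""
    return round(num_files * (PROCESSING_COST + DB_IO_COST), 2)

print(estimate_batch_cost(25))  # 1.0
```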