Batch Processing Guide

Understanding Batch Processing

Batch processing lets you process multiple media files in a single operation, offering significant efficiency gains for large-scale ingestion and management of your content. Unlike individual processing, batch operations run asynchronously, so your application can monitor progress and retrieve results when processing completes.

Key Concepts

Asynchronous Processing

When you submit a batch job, your files enter a processing queue. The API returns immediately with a batch ID, allowing your application to continue other operations while processing happens in the background.

Batch Lifecycle

  1. Creation: Submit files with configuration options
  2. Processing: Files are processed asynchronously
  3. Monitoring: Check status to track progress
  4. Completion: Retrieve results when processing is finished

File Lifecycle

When batch processing files:

  1. Upload: Original files are temporarily stored for processing
  2. Processing: Files are analyzed and vector embeddings are generated
  3. Deletion: Original files are permanently deleted after processing
  4. Storage: The vectors and rich metadata are stored in your storage for retrieval and searching

Important: The Wickson API stores only the derived vector data and metadata, not your original files or content. Plan accordingly and maintain your own copies of important files.
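
Because originals are permanently deleted after processing, it can help to copy each file into a local archive directory before submitting it. A minimal sketch (the archive directory name is an arbitrary example):

```python
import shutil
from pathlib import Path

def archive_before_upload(paths, archive_dir="processed_archive"):
    """Copy each file into a local archive directory and return the copies."""
    dest = Path(archive_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copies = []
    for p in map(Path, paths):
        target = dest / p.name
        shutil.copy2(p, target)  # copy2 preserves timestamps and metadata
        copies.append(target)
    return copies
```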

Batch Status States

  • created: Batch has been created but processing hasn't started
  • running: Batch is actively being processed
  • completed: All items have been processed
  • failed: Batch processing has encountered a critical error
  • cancelled: Batch processing was cancelled by user
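
A polling loop only needs to know which of these states are terminal. A small helper (the function name is illustrative):

```python
# completed, failed, and cancelled are final; created and running are not.
TERMINAL_STATES = {"completed", "failed", "cancelled"}

def is_terminal(status: str) -> bool:
    """Return True when a batch has reached a final state and polling can stop."""
    return status in TERMINAL_STATES
```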

Working with Batch Processing

Creating a Batch

import requests

API_KEY = "YOUR_API_KEY"
API_URL = "https://api.wickson.ai/v1/batches"

# Prepare headers
headers = {
    "X-Api-Key": API_KEY
}

# Create form data
data = {
    "force_overwrite": "false",
    "collection_id": "quarterly_reports"
}

# Use context managers to handle file opening and closing
with open('report1.pdf', 'rb') as f1, open('report2.pdf', 'rb') as f2, open('report3.pdf', 'rb') as f3:
    # Prepare files
    files = [
        ('file', ('report1.pdf', f1)),
        ('file', ('report2.pdf', f2)),
        ('file', ('report3.pdf', f3))
    ]

    # Submit batch
    response = requests.post(API_URL, headers=headers, data=data, files=files)
    response_data = response.json()

# Get batch ID for tracking
batch_id = response_data["data"]["batch_id"]
print(f"Batch created: {batch_id}")
print(f"Estimated completion: {response_data['data']['estimated_completion']}")

Monitoring Batch Status

import requests
import time

def check_batch_status(batch_id, api_key):
    status_url = f"https://api.wickson.ai/v1/batches/{batch_id}"
    headers = {"X-Api-Key": api_key}

    response = requests.get(
        status_url, 
        headers=headers,
        params={"include_job_details": "true"}
    )
    status_data = response.json()["data"]

    return status_data

# Poll status until complete
while True:
    status_data = check_batch_status(batch_id, API_KEY)

    completion_percentage = status_data["progress"]["percentage"]
    status = status_data["status"]

    print(f"Status: {status} - {completion_percentage:.1f}% complete")

    if status in ["completed", "failed", "cancelled"]:
        break

    # Wait before checking again
    time.sleep(10)
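
The loop above polls indefinitely at a fixed interval; in production you may want a deadline and capped backoff. A sketch that takes the status check as a callable (for example, `lambda: check_batch_status(batch_id, API_KEY)`), so it is easy to test and reuse:

```python
import time

def wait_for_batch(check_status, max_wait=600, initial_delay=5, max_delay=60):
    """Poll check_status() until a terminal state or the deadline, with capped backoff."""
    deadline = time.monotonic() + max_wait
    delay = initial_delay
    while True:
        status_data = check_status()
        if status_data["status"] in {"completed", "failed", "cancelled"}:
            return status_data
        if time.monotonic() >= deadline:
            raise TimeoutError("batch did not finish within max_wait seconds")
        time.sleep(delay)
        delay = min(delay * 2, max_delay)  # exponential backoff, capped at max_delay
```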

Retrieving Batch Results

import requests

def get_batch_results(batch_id, api_key):
    results_url = f"https://api.wickson.ai/v1/batches/{batch_id}/results"
    headers = {"X-Api-Key": api_key}

    response = requests.get(results_url, headers=headers)
    results = response.json()["data"]

    return results

# Get detailed results
results = get_batch_results(batch_id, API_KEY)

# Print summary
print(f"Batch: {results['batch_summary']['batch_id']}")
print(f"Total items: {results['batch_summary']['total_items']}")
print(f"Successful: {results['batch_summary']['successful']}")
print(f"Failed: {results['batch_summary']['failed']}")

# Process successful results
for item in results["results"]:
    if item["status"] == "completed":
        print(f"Processed: {item['file_info']['filename']}")
        print(f"Media ID: {item['media_id']}")
        print(f"Type: {item['media_type']}")
        print("-" * 40)

# Handle failures if any
if results["batch_summary"]["failed"] > 0:
    print("\nFailed items:")
    for item in results["results"]:
        if item["status"] == "failed":
            print(f"File: {item['file_path']}")
            print(f"Error: {item.get('error', {}).get('message', 'Unknown error')}")
            print("-" * 40)

Cancelling a Batch

import requests

def cancel_batch(batch_id, api_key):
    cancel_url = f"https://api.wickson.ai/v1/batches/{batch_id}/state"
    headers = {
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    }

    # A PUT to the state endpoint cancels an in-progress batch
    response = requests.put(cancel_url, headers=headers)
    return response.json()

# Cancel an in-progress batch
result = cancel_batch(batch_id, API_KEY)
print(f"Batch {result['data']['batch_id']} state changed:")
print(f"Previous: {result['data']['previous_state']}, Current: {result['data']['current_state']}")

Best Practices

Optimizing Batch Size

  • Ideal Batch Size: 5-20 items per batch for the best balance between efficiency and manageability
  • Maximum Size: 25 files or 100 MB of data per batch operation
  • Similar Media Types: Group similar types together for better processing efficiency
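
The size limits above can be enforced client-side by chunking a file list into batches of at most 25 files and 100 MB each. A sketch:

```python
import os

MAX_FILES = 25
MAX_BYTES = 100 * 1024 * 1024  # 100 MB per batch

def chunk_into_batches(paths, max_files=MAX_FILES, max_bytes=MAX_BYTES):
    """Split file paths into batches respecting both count and total-size limits."""
    batches, current, current_bytes = [], [], 0
    for path in paths:
        size = os.path.getsize(path)
        # Start a new batch when adding this file would break either limit
        if current and (len(current) >= max_files or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```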

Effective Organization

  • Collection Strategy: Use logical collections to organize related content
  • Batch Naming: The system assigns batch names, but you can trace batches by collection and timestamps
  • Error Management: Plan for handling both complete batch failures and individual item failures

Resource Management

  • Parallelism: The system automatically manages optimal parallelism for your batch
  • Preprocessing: Validate files before submission to avoid errors
  • Balance Checking: Ensure sufficient account balance for all batch operations
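
Preprocessing validation can be as simple as checking that each file exists, is non-empty, and has an expected extension. A sketch (the extension set here is an assumption; consult the supported-formats list for the real one):

```python
from pathlib import Path

# Illustrative set only; check the API's supported-formats list.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".mp3", ".mp4"}

def validate_files(paths):
    """Return (valid, problems) where problems maps each bad path to its first issue."""
    valid, problems = [], {}
    for p in map(Path, paths):
        if not p.is_file():
            problems[str(p)] = "file not found"
        elif p.stat().st_size == 0:
            problems[str(p)] = "file is empty"
        elif p.suffix.lower() not in ALLOWED_EXTENSIONS:
            problems[str(p)] = f"unsupported extension {p.suffix!r}"
        else:
            valid.append(str(p))
    return valid, problems
```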

Error Handling

Implementing Robust Error Handling

import random
import requests
import time
from contextlib import ExitStack
from pathlib import Path

def process_batch_with_retry(files, collection_id, api_key, max_retries=3):
    # Submission with retry
    for attempt in range(max_retries):
        try:
            # Prepare submission
            headers = {"X-Api-Key": api_key}
            data = {"collection_id": collection_id}

            # Open every file and guarantee each handle is closed again
            with ExitStack() as stack:
                files_data = [
                    ('file', (Path(f).name, stack.enter_context(open(f, 'rb'))))
                    for f in files
                ]

                # Submit batch
                response = requests.post(
                    "https://api.wickson.ai/v1/batches",
                    headers=headers,
                    data=data,
                    files=files_data,
                    timeout=60  # Reasonable timeout
                )

            response.raise_for_status()
            return response.json()["data"]["batch_id"]

        except requests.RequestException:
            if attempt == max_retries - 1:
                raise

            # Exponential backoff with jitter
            wait_time = (2 ** attempt) + (0.1 * random.random())
            print(f"Attempt {attempt+1} failed, retrying in {wait_time:.1f} seconds")
            time.sleep(wait_time)

Common Error Scenarios

| Error | Description | Resolution |
|---|---|---|
| validation_error | Invalid input parameters | Check file formats, batch size limits |
| insufficient_funds | Account balance too low | Add funds to cover estimated processing cost |
| batch_too_large | Exceeds maximum batch size | Split into smaller batches (max 25 files per batch) |
| format_error | Unsupported file format | Check supported formats list and convert files |
| corrupted_file | File integrity issues | Verify files before submission |
| processing_error | Error during media processing | Check specific error messages for details |
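
Since each error code implies a different remedy, failed items can be routed by code. A sketch whose suggestions mirror the table above (the shape of the failed-item dicts is an assumption based on the results examples earlier):

```python
# Suggested next step per error code, following the table above.
RESOLUTIONS = {
    "validation_error": "check file formats and batch size limits",
    "insufficient_funds": "add funds to cover the estimated processing cost",
    "batch_too_large": "split into smaller batches (max 25 files)",
    "format_error": "convert to a supported format",
    "corrupted_file": "verify file integrity before resubmitting",
    "processing_error": "inspect the error message for details",
}

def suggest_resolutions(failed_items):
    """Map each failed item to (file, code, suggested action)."""
    suggestions = []
    for item in failed_items:
        code = item.get("error", {}).get("code", "unknown")
        action = RESOLUTIONS.get(code, "no known resolution; contact support")
        suggestions.append((item.get("file_path", "?"), code, action))
    return suggestions
```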

Performance Considerations

Processing Time Factors

  • File Size: Larger files take longer to process
  • File Complexity: Complex documents, high-resolution images, and longer videos take more time
  • Media Type: Different media types have different processing characteristics:
      • Documents: Generally fastest (depends on page count)
      • Images: Fast to medium processing time
      • Audio: Medium processing time (depends on length)
      • Video: Longest processing time (depends on length and resolution)

Estimated Processing Times

| Media Type | Average Processing Time | Factors |
|---|---|---|
| Document (10 pages) | ~15 seconds | Page count, complexity, format |
| Image | ~15 seconds | Resolution, complexity |
| Audio (3 minutes) | ~30 seconds | Duration, audio quality |
| Video (2 minutes) | ~30 seconds | Duration, resolution, complexity |
| 3D Model | ~30 seconds | Complexity, vertex count |
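
The averages above allow a rough client-side estimate of total processing time if items ran sequentially; real batches run in parallel, so treat this as an upper bound. The media-type keys here are illustrative labels, not official API values:

```python
# Average seconds per item, taken from the averages above (illustrative keys).
AVG_SECONDS = {
    "document": 15,
    "image": 15,
    "audio": 30,
    "video": 30,
    "3d_model": 30,
}

def estimate_batch_seconds(media_types):
    """Sum average per-item times; unknown types fall back to 30 seconds."""
    return sum(AVG_SECONDS.get(t, 30) for t in media_types)
```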

Cost Considerations

Batch processing uses the same pricing model as individual processing:

| Operation | Cost | Notes |
|---|---|---|
| Media Processing | $0.03 per file | Varies by media type |
| Database I/O | $0.01 per operation | One-time storage fee |
| Total per file | $0.04 | No recurring charges |

Batch operations have no additional costs beyond the per-file charges. Failed items are not charged.
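
Under this pricing, estimating the cost of a batch is simple arithmetic: $0.04 per file that processes successfully, with no recurring charges. A quick sketch:

```python
PROCESSING_COST = 0.03  # per file, varies by media type
STORAGE_COST = 0.01     # per file, one-time database I/O

def estimate_batch_cost(num_files):
    """Estimated one-time cost for a batch; failed items are not charged."""
    return round(num_files * (PROCESSING_COST + STORAGE_COST), 2)
```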
