Batch Processing Guide¶
Understanding Batch Processing¶
Batch processing allows you to process multiple media files in a single operation, offering significant efficiency for large-scale ingestion and management of your content. Unlike individual processing, batch operations happen asynchronously, allowing you to monitor progress and retrieve results when processing completes.
Key Concepts¶
Asynchronous Processing¶
When you submit a batch job, your files enter a processing queue. The API returns immediately with a batch ID, allowing your application to continue other operations while processing happens in the background.
Batch Lifecycle¶
- Creation: Submit files with configuration options
- Processing: Files are processed asynchronously
- Monitoring: Check status to track progress
- Completion: Retrieve results when processing is finished
File Lifecycle¶
When batch processing files:
- Upload: Original files are temporarily stored for processing
- Processing: Files are analyzed and vector embeddings are generated
- Deletion: Original files are permanently deleted after processing
- Storage: The vectors and rich metadata are stored in your storage for retrieval and searching
Important: The Wickson API only stores the derived vector data and metadata - not your original files/content! Please plan accordingly and maintain your own copies of important files.
Batch Status States¶
- created: Batch has been created but processing hasn't started
- running: Batch is actively being processed
- completed: All items have been processed
- failed: Batch processing has encountered a critical error
- cancelled: Batch processing was cancelled by user
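The last three states are terminal. A small helper (a sketch, not part of any official SDK) makes polling loops clearer by distinguishing terminal from in-flight states:

```python
# Terminal states: the batch will not change further, so polling can stop.
TERMINAL_STATES = {"completed", "failed", "cancelled"}

def is_terminal(status: str) -> bool:
    """Return True once a batch has reached a final state."""
    return status in TERMINAL_STATES

print(is_terminal("running"))    # False
print(is_terminal("completed"))  # True
```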
Working with Batch Processing¶
Creating a Batch¶
```python
import requests

API_KEY = "YOUR_API_KEY"
API_URL = "https://api.wickson.ai/v1/batches"

# Prepare headers
headers = {
    "X-Api-Key": API_KEY
}

# Create form data
data = {
    "force_overwrite": "false",
    "collection_id": "quarterly_reports"
}

# Use context managers to handle file opening and closing
with open('report1.pdf', 'rb') as f1, open('report2.pdf', 'rb') as f2, open('report3.pdf', 'rb') as f3:
    # Prepare files
    files = [
        ('file', ('report1.pdf', f1)),
        ('file', ('report2.pdf', f2)),
        ('file', ('report3.pdf', f3))
    ]

    # Submit batch
    response = requests.post(API_URL, headers=headers, data=data, files=files)
    response_data = response.json()

# Get batch ID for tracking
batch_id = response_data["data"]["batch_id"]
print(f"Batch created: {batch_id}")
print(f"Estimated completion: {response_data['data']['estimated_completion']}")
```
Monitoring Batch Status¶
```python
import requests
import time

def check_batch_status(batch_id, api_key):
    status_url = f"https://api.wickson.ai/v1/batches/{batch_id}"
    headers = {"X-Api-Key": api_key}
    response = requests.get(
        status_url,
        headers=headers,
        params={"include_job_details": True}
    )
    status_data = response.json()["data"]
    return status_data

# Poll status until complete
while True:
    status_data = check_batch_status(batch_id, API_KEY)
    completion_percentage = status_data["progress"]["percentage"]
    status = status_data["status"]
    print(f"Status: {status} - {completion_percentage:.1f}% complete")
    if status in ["completed", "failed", "cancelled"]:
        break
    # Wait before checking again
    time.sleep(10)
```
Retrieving Batch Results¶
```python
import requests

def get_batch_results(batch_id, api_key):
    results_url = f"https://api.wickson.ai/v1/batches/{batch_id}/results"
    headers = {"X-Api-Key": api_key}
    response = requests.get(results_url, headers=headers)
    results = response.json()["data"]
    return results

# Get detailed results
results = get_batch_results(batch_id, API_KEY)

# Print summary
print(f"Batch: {results['batch_summary']['batch_id']}")
print(f"Total items: {results['batch_summary']['total_items']}")
print(f"Successful: {results['batch_summary']['successful']}")
print(f"Failed: {results['batch_summary']['failed']}")

# Process successful results
for item in results["results"]:
    if item["status"] == "completed":
        print(f"Processed: {item['file_info']['filename']}")
        print(f"Media ID: {item['media_id']}")
        print(f"Type: {item['media_type']}")
        print("-" * 40)

# Handle failures if any
if results["batch_summary"]["failed"] > 0:
    print("\nFailed items:")
    for item in results["results"]:
        if item["status"] == "failed":
            print(f"File: {item['file_path']}")
            print(f"Error: {item['error']['message'] if 'error' in item else 'Unknown error'}")
            print("-" * 40)
```
Cancelling a Batch¶
```python
import requests

def cancel_batch(batch_id, api_key):
    cancel_url = f"https://api.wickson.ai/v1/batches/{batch_id}/state"
    headers = {
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    }
    # Simple PUT request
    response = requests.put(cancel_url, headers=headers)
    return response.json()

# Cancel an in-progress batch
result = cancel_batch(batch_id, API_KEY)
print(f"Batch {result['data']['batch_id']} state changed:")
print(f"Previous: {result['data']['previous_state']}, Current: {result['data']['current_state']}")
```
Best Practices¶
Optimizing Batch Size¶
- Ideal Batch Size: 5-20 items per batch for the best balance between efficiency and manageability
- Maximum Size: 25 files or 100 MB of data per batch operation, whichever limit is reached first
- Similar Media Types: Group similar types together for better processing efficiency
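These limits can be enforced client-side before submission. The sketch below (the helper name and the use of `os.path.getsize` are ours, not part of the API) splits a list of file paths into batches that respect both the 25-file and 100 MB caps:

```python
import os

MAX_FILES = 25
MAX_BYTES = 100 * 1024 * 1024  # 100 MB

def chunk_files(paths):
    """Split paths into batches of at most MAX_FILES files and MAX_BYTES total size."""
    batches, current, current_size = [], [], 0
    for path in paths:
        size = os.path.getsize(path)
        # Start a new batch if adding this file would exceed either limit
        if current and (len(current) >= MAX_FILES or current_size + size > MAX_BYTES):
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each resulting batch can then be submitted separately, reusing the creation code shown earlier.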
Effective Organization¶
- Collection Strategy: Use logical collections to organize related content
- Batch Naming: The system assigns batch names, but you can trace batches by collection and timestamps
- Error Management: Plan for handling both complete batch failures and individual item failures
Resource Management¶
- Parallelism: System automatically manages optimal parallelism for your batch
- Preprocessing: Validate files before submission to avoid errors
- Balance Checking: Ensure sufficient account balance for all batch operations
Error Handling¶
Implementing Robust Error Handling¶
```python
import random
import time
from pathlib import Path

import requests

def process_batch_with_retry(files, collection_id, api_key, max_retries=3):
    # Submission with retry
    for attempt in range(max_retries):
        # Open file handles for this attempt and ensure they are closed afterwards
        handles = [Path(f).open('rb') for f in files]
        try:
            # Prepare submission
            headers = {"X-Api-Key": api_key}
            data = {"collection_id": collection_id}
            files_data = [('file', (Path(f).name, h)) for f, h in zip(files, handles)]

            # Submit batch
            response = requests.post(
                "https://api.wickson.ai/v1/batches",
                headers=headers,
                data=data,
                files=files_data,
                timeout=60  # Reasonable timeout
            )
            response.raise_for_status()
            return response.json()["data"]["batch_id"]
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter
            wait_time = (2 ** attempt) + (0.1 * random.random())
            print(f"Attempt {attempt+1} failed, retrying in {wait_time:.1f} seconds")
            time.sleep(wait_time)
        finally:
            for h in handles:
                h.close()
```
Common Error Scenarios¶
| Error | Description | Resolution |
|---|---|---|
| `validation_error` | Invalid input parameters | Check file formats, batch size limits |
| `insufficient_funds` | Account balance too low | Add funds to cover estimated processing cost |
| `batch_too_large` | Exceeds maximum batch size | Split into smaller batches (max 25 files per batch) |
| `format_error` | Unsupported file format | Check supported formats list and convert files |
| `corrupted_file` | File integrity issues | Verify files before submission |
| `processing_error` | Error during media processing | Check specific error messages for details |
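When inspecting failed items, a lookup table built from the codes above can turn raw error codes into actionable messages. The helper itself is illustrative; only the error code strings come from the table:

```python
RESOLUTIONS = {
    "validation_error": "Check file formats and batch size limits.",
    "insufficient_funds": "Add funds to cover the estimated processing cost.",
    "batch_too_large": "Split into smaller batches (max 25 files per batch).",
    "format_error": "Check the supported formats list and convert files.",
    "corrupted_file": "Verify file integrity before submission.",
    "processing_error": "Check the specific error message for details.",
}

def suggest_fix(error_code: str) -> str:
    """Map a batch error code to a suggested resolution."""
    return RESOLUTIONS.get(error_code, "Unrecognized error; check the batch's detailed results.")
```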
Performance Considerations¶
Processing Time Factors¶
- File Size: Larger files take longer to process
- File Complexity: Complex documents, high-resolution images, and longer videos take more time
- Media Type: Different media types have different processing characteristics:
- Documents: Generally fastest (depends on page count)
- Images: Fast to medium processing time
- Audio: Medium processing time (depends on length)
- Video: Longest processing time (depends on length and resolution)
Estimated Processing Times¶
| Media Type | Average Processing Time | Factors |
|---|---|---|
| Document (10 pages) | ~15 seconds | Page count, complexity, format |
| Image | ~15 seconds | Resolution, complexity |
| Audio (3 minutes) | ~30 seconds | Duration, audio quality |
| Video (2 minutes) | ~30 seconds | Duration, resolution, complexity |
| 3D Model | ~30 seconds | Complexity, vertex count |
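Using the averages from the table above, you can form a rough upper-bound estimate of how long a batch will take. This is a sketch under the assumption that items are processed serially; in practice the service processes items in parallel, so real wall-clock time is usually lower:

```python
# Rough per-item averages from the table above, in seconds
AVG_SECONDS = {"document": 15, "image": 15, "audio": 30, "video": 30, "3d_model": 30}

def estimate_batch_seconds(media_types):
    """media_types: list of media-type strings. Returns a rough serial-time estimate."""
    return sum(AVG_SECONDS[t] for t in media_types)

print(estimate_batch_seconds(["document", "document", "video"]))  # 60
```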
Cost Considerations¶
Batch processing uses the same pricing model as individual processing:
| Operation | Cost | Notes |
|---|---|---|
| Media Processing | $0.03 per file | Varies by media type |
| Database I/O | $0.01 per operation | One-time storage fee |
| Total per file | $0.04 | No recurring charges |
Batch operations have no additional costs beyond the per-file charges. Failed items are not charged.
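The flat per-file pricing makes the cost of a batch easy to estimate up front. A minimal sketch, using the rates from the table (actual charges vary by media type, as noted, and failed items are not billed, so this is an upper bound):

```python
PROCESSING_COST = 0.03  # per file
DB_IO_COST = 0.01       # per file, one-time storage fee

def estimate_batch_cost(num_files: int) -> float:
    """Upper-bound cost estimate in USD for a batch of num_files items."""
    return round(num_files * (PROCESSING_COST + DB_IO_COST), 2)

print(estimate_batch_cost(25))  # 1.0
```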