Similarity Search Guide¶
Introduction to Similarity Search¶
The Wickson API's similarity search finds content that is semantically similar to a specific media item in your collections. You can use it to build recommendation systems and "more like this" features, and to discover connections between stored media of different types.
Unlike traditional search, where you provide a text query, similarity search uses an existing media item as the "query" and finds related items based on semantic meaning, regardless of whether they share explicit keywords or tags.
Use similarity search when you already have an item and want media that is conceptually related to it, for example in recommendation engines, content discovery, or exploring relationships in your data.
Key Concepts¶
Vector-to-Vector Matching¶
Similarity search works by comparing the vector representation of your reference item with vectors of other items in your collections. Items positioned close to each other in the multidimensional vector space are considered semantically similar.
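To make the idea concrete, here is a minimal sketch that ranks items by cosine similarity. This guide does not state which distance metric the API uses, so cosine similarity, the function name, and the toy three-dimensional vectors are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy "embeddings" (made-up numbers, not real API output)
reference = [0.9, 0.1, 0.0]
candidates = {
    "glacier_photo": [0.8, 0.2, 0.1],
    "cooking_video": [0.0, 0.1, 0.9],
}

# Rank candidates from most to least similar to the reference
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(reference, candidates[name]),
    reverse=True,
)
print(ranked)
```

Vectors that point in nearly the same direction score close to 1.0, which is why the first candidate ranks above the second.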
Similarity Scores and Explanations¶
Each result includes a similarity score (0.0-1.0) that indicates how closely it relates to your reference item. The API also provides human-readable explanations of why items match, helping you understand the relationships between your content.
Cross-Modal Connections¶
Similarity search can discover connections between different media types. For example, a document about climate change might be connected to images of melting glaciers, videos of extreme weather, or audio recordings of expert interviews.
Clustering¶
You can optionally group results into semantic clusters that represent different themes or aspects of similarity. Clustering organizes results into meaningful categories and helps you surface diverse but related content.
Working with Similarity Search¶
Basic Similarity Search Example¶
Python¶
import requests

# Configuration
api_key = "YOUR_API_KEY"

# Create similarity search request
search_data = {
    "media_id": "vec-a1b2c3d4",  # ID of your reference item
    "collections": "research-papers",  # Collection to search within
    "max_results": 10,  # Number of similar items to return
    "min_score": 0.7  # Minimum similarity threshold (0.0-1.0)
}

# Execute search
response = requests.post(
    "https://api.wickson.ai/v1/search/similar",
    headers={
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    },
    json=search_data
)

# Process response
if response.status_code == 200:
    data = response.json()["data"]
    # Print search information
    print(f"Similar to: {data['meta']['reference_item']}")
    print(f"Found {data['meta']['stats']['total_results']} similar items")
    # Display results
    for i, result in enumerate(data["results"], 1):
        print(f"\n{i}. {result['metadata']['file_info']['filename']} (Similarity: {result['similarity_percentage']})")
        print(f"   {result['relevance_explanation']}")
        print(f"   Media type: {result['metadata']['media_type']}")
else:
    print(f"Error {response.status_code}: {response.text}")
cURL¶
curl -X POST https://api.wickson.ai/v1/search/similar \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "media_id": "vec-a1b2c3d4",
    "collections": "research-papers",
    "max_results": 10,
    "min_score": 0.7
  }'
Cross-Collection Similarity Search¶
To discover similar content across multiple collections:
import requests

# Configuration
api_key = "YOUR_API_KEY"

# Create search request for multiple collections
search_data = {
    "media_id": "vec-a1b2c3d4",  # ID of your reference item
    "collections": ["research", "reports", "news"],  # Multiple collections to search
    "max_results": 15,  # Number of similar items to return
    "min_score": 0.65,  # More permissive threshold for broader results
    "cluster": True  # Group results into semantic clusters (Python True, sent as JSON true)
}

# Execute search
response = requests.post(
    "https://api.wickson.ai/v1/search/similar",
    headers={
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    },
    json=search_data
)

# Process response with collection information
if response.status_code == 200:
    data = response.json()["data"]
    # Print distribution across collections
    collections = data["meta"]["collections"]["result_distribution"]
    print("Results by collection:")
    for collection, count in collections.items():
        print(f"- {collection}: {count} items")
    # Display results organized by cluster
    if "relationships" in data and "clusters" in data["relationships"]:
        clusters = data["relationships"]["clusters"]["clusters"]
        cluster_metadata = data["relationships"]["clusters"]["metadata"]
        print("\nResults by semantic cluster:")
        for cluster_name, item_ids in clusters.items():
            print(f"\n{cluster_name.upper()} CLUSTER ({len(item_ids)} items):")
            for item_id in item_ids:
                # Find this item in results
                item = next((r for r in data["results"] if r["media_id"] == item_id), None)
                if item:
                    print(f"- {item['metadata']['file_info']['filename']} ({item['similarity_percentage']})")
else:
    print(f"Error {response.status_code}: {response.text}")
Building a Recommendation System¶
Here's how to implement "more like this" functionality for content recommendations:
import requests


def get_similar_items(media_id, api_key, max_recommendations=6):
    """Get content recommendations similar to the specified item"""
    headers = {
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    }
    # Create recommendation request
    data = {
        "media_id": media_id,
        "collections": "all",  # Search across all collections
        "max_results": max_recommendations,
        "min_score": 0.75,  # High threshold for quality recommendations
        "include_reference": False,  # Don't include the reference item itself
        "modality_filter": None  # Allow any media type for diverse recommendations
    }
    # Execute search
    response = requests.post(
        "https://api.wickson.ai/v1/search/similar",
        headers=headers,
        json=data
    )
    if response.status_code != 200:
        print(f"Error {response.status_code}: {response.text}")
        return []
    # Process and format recommendations
    recommendations = []
    results = response.json()["data"]["results"]
    for result in results:
        recommendations.append({
            "id": result["media_id"],
            "title": result["metadata"]["file_info"]["filename"],
            "similarity": result["similarity_percentage"],
            "media_type": result["metadata"]["media_type"],
            "description": result["metadata"]["search_metadata"].get("summary", ""),
            "relevance": result["relevance_explanation"]
        })
    return recommendations


# Example usage
api_key = "YOUR_API_KEY"
recommendations = get_similar_items("vec-a1b2c3d4", api_key)

# Display recommendations
print("Recommended content:")
for i, rec in enumerate(recommendations, 1):
    print(f"{i}. {rec['title']} ({rec['similarity']})")
    print(f"   {rec['description']}")
    print(f"   Relevance: {rec['relevance']}")
Filtering Similar Results by Media Type¶
To find similar content of a specific media type:
import requests

# Configuration
api_key = "YOUR_API_KEY"

# Search for similar images only
search_data = {
    "media_id": "vec-a1b2c3d4",  # ID of your reference item (any media type)
    "collections": "all",  # Search all collections
    "modality_filter": "image",  # Only return image results
    "max_results": 12,
    "min_score": 0.65
}

# Execute search
response = requests.post(
    "https://api.wickson.ai/v1/search/similar",
    headers={
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    },
    json=search_data
)

# Process image results
if response.status_code == 200:
    data = response.json()["data"]
    print(f"Found {len(data['results'])} similar images")
    for i, result in enumerate(data["results"], 1):
        print(f"\n{i}. {result['metadata']['file_info']['filename']} ({result['similarity_percentage']})")
        # Display image-specific metadata
        if "modality_metadata" in result["metadata"] and "visual" in result["metadata"]["modality_metadata"]:
            visual = result["metadata"]["modality_metadata"]["visual"]
            if "scene_type" in visual:
                print(f"   Scene type: {visual['scene_type']}")
            if "visual_elements" in visual:
                print(f"   Contains: {', '.join(visual['visual_elements'])}")
else:
    print(f"Error {response.status_code}: {response.text}")
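If you would rather issue one unfiltered request and split the results yourself, the grouping can be done client-side. The sketch below assumes each result carries metadata.media_type as in the examples above; the helper name and the sample records are hypothetical and only illustrate the shape of the data.

```python
from collections import defaultdict


def group_by_media_type(results):
    """Group result dicts by their metadata.media_type value."""
    grouped = defaultdict(list)
    for result in results:
        # Fall back to "unknown" if a result lacks the field
        media_type = result.get("metadata", {}).get("media_type", "unknown")
        grouped[media_type].append(result)
    return dict(grouped)


# Made-up results in the shape used by the examples in this guide
sample_results = [
    {"media_id": "vec-1", "score": 0.91, "metadata": {"media_type": "image"}},
    {"media_id": "vec-2", "score": 0.84, "metadata": {"media_type": "document"}},
    {"media_id": "vec-3", "score": 0.79, "metadata": {"media_type": "image"}},
]

by_type = group_by_media_type(sample_results)
for media_type, items in by_type.items():
    print(f"{media_type}: {[item['media_id'] for item in items]}")
```

Using modality_filter in the request is still the better choice when you only need one media type, since the server then returns fewer results.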
Collection Targeting Strategies¶
The API offers three ways to target collections for similarity searches:
Single Collection Search¶
- Fastest performance
- Ideal when you know which collection contains relevant content
- Perfect for topic-specific recommendations
Multiple Collection Search¶
- Searches only specified collections
- Balances performance and coverage
- Great for related but distinct content groups
All-Collection Search¶
- Most comprehensive but potentially slower
- Discovers connections across your entire content library
- Ideal for global content exploration
Understanding Similarity Results¶
The similarity search response includes rich information about why content is similar:
Similarity Score and Explanation¶
Each result contains:
- score: Similarity score from 0.0 to 1.0
- similarity_percentage: Human-friendly percentage (e.g., "87.5%")
- relevance_explanation: Human-readable explanation of the match
The explanation helps you understand why items are connected, for example:
This tells you that:
- The match is very strong (91.0% similarity)
- Both items are documents
- They share the topics "climate change" and "research"
Semantic Clusters¶
When using "cluster": true, results are grouped into semantic themes:
"relationships": {
  "clusters": {
    "clusters": {
      "machine_learning": ["vec-e5f6g7h8", "vec-i9j0k1l2"],
      "applications": ["vec-m3n4o5p6", "vec-q7r8s9t0"]
    },
    "metadata": {
      "machine_learning": {
        "modality": "document",
        "avg_score": 0.88,
        "member_count": 2
      },
      "applications": {
        "modality": "mixed",
        "avg_score": 0.82,
        "member_count": 2
      }
    }
  }
}
This structure shows:
- Results are grouped into two clusters: "machine_learning" and "applications"
- The "machine_learning" cluster contains document media (average similarity 88%)
- The "applications" cluster contains mixed media types (average similarity 82%)
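One way to consume this structure is to join each cluster's member IDs with the matching result records and its metadata. The sketch below assumes the response layout shown above; the helper name is hypothetical, and the entries in the sample results list are trimmed-down placeholders rather than real API output.

```python
def summarize_clusters(data):
    """Join cluster membership, cluster metadata, and result records."""
    clusters_block = data.get("relationships", {}).get("clusters", {})
    members = clusters_block.get("clusters", {})
    metadata = clusters_block.get("metadata", {})
    # Index results once so each member lookup is a dict access
    by_id = {r["media_id"]: r for r in data.get("results", [])}

    summary = []
    for name, item_ids in members.items():
        info = metadata.get(name, {})
        summary.append({
            "cluster": name,
            "modality": info.get("modality"),
            "avg_score": info.get("avg_score"),
            "items": [by_id[i] for i in item_ids if i in by_id],
        })
    # Strongest clusters first; clusters without a score sort last
    summary.sort(key=lambda c: c["avg_score"] or 0.0, reverse=True)
    return summary


# Sample data: the cluster block from this guide plus placeholder results
sample = {
    "results": [
        {"media_id": "vec-e5f6g7h8", "score": 0.90},
        {"media_id": "vec-i9j0k1l2", "score": 0.86},
        {"media_id": "vec-m3n4o5p6", "score": 0.83},
        {"media_id": "vec-q7r8s9t0", "score": 0.81},
    ],
    "relationships": {
        "clusters": {
            "clusters": {
                "machine_learning": ["vec-e5f6g7h8", "vec-i9j0k1l2"],
                "applications": ["vec-m3n4o5p6", "vec-q7r8s9t0"],
            },
            "metadata": {
                "machine_learning": {"modality": "document", "avg_score": 0.88, "member_count": 2},
                "applications": {"modality": "mixed", "avg_score": 0.82, "member_count": 2},
            },
        }
    },
}

cluster_summary = summarize_clusters(sample)
for cluster in cluster_summary:
    print(cluster["cluster"], cluster["modality"], len(cluster["items"]))
```

Building the by_id index once avoids rescanning the results list for every cluster member.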
Fine-Tuning Similarity Search¶
Adjusting Similarity Thresholds¶
The min_score parameter controls how similar items must be to appear in results:
| Value | Behavior | Best for |
|---|---|---|
| 0.85+ | Only extremely similar items | Precise matches and near-duplicates |
| 0.75-0.85 | Very similar items | Strong recommendations |
| 0.65-0.75 | Moderately similar items | Balanced recommendations |
| 0.50-0.65 | Broadly similar items | Discovery of related concepts |
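The bands in the table can also be applied to individual result scores, for example to label recommendations in a UI. This is a sketch only: the labels come from the table above, but treating each lower bound as inclusive and the helper name describe_score are assumptions, since the table's ranges share their endpoints.

```python
# (lower bound, label) pairs from the table above, highest band first
THRESHOLD_BANDS = [
    (0.85, "extremely similar"),
    (0.75, "very similar"),
    (0.65, "moderately similar"),
    (0.50, "broadly similar"),
]


def describe_score(score):
    """Map a similarity score in [0.0, 1.0] to a band label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    for lower_bound, label in THRESHOLD_BANDS:
        if score >= lower_bound:  # lower bounds treated as inclusive (assumption)
            return label
    return "below the table's range"


for value in (0.91, 0.75, 0.70, 0.30):
    print(value, "->", describe_score(value))
```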
Controlling Result Diversity with Clustering¶
Enable clustering to organize results into semantic themes:
search_data = {
    "media_id": "vec-a1b2c3d4",
    "collections": "research",
    "cluster": True,  # Enable semantic clustering
    "max_results": 20  # Get more results for diverse clusters
}
This is especially useful for:
- Building diverse recommendation panels
- Understanding different aspects of similarity
- Discovering conceptual relationships
Best Practices¶
Use Cases for Similarity Search¶
The similarity search API is ideal for:
- Content Recommendations - "People who viewed this also viewed..."
- Related Content Discovery - Finding content similar to what a user is currently viewing
- Content Organization - Automatically grouping related content
- Duplicate Detection - Finding near-duplicate content across collections
- Cross-Modal Discovery - Finding connections between different media types
Effective Collection Strategy¶
- Logical Grouping - Organize collections based on content purpose
- Balanced Size - Aim for collections with 100-10,000 items for optimal performance
- Media Type Separation - Consider separate collections for different media types when appropriate
Performance Optimization¶
- Target Specific Collections - Search fewer collections for faster results
- Set Appropriate Score Thresholds - Use higher thresholds for precision, lower for recall
- Limit Response Size - Use an appropriate max_results value for your application's needs
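The three tips above all come down to how the request payload is built. The following sketch shows one possible builder that validates its inputs before a request is sent. The function name and its default values are hypothetical (they are not API defaults); the field names and the three collection forms (a single name, a list of names, or "all") are the ones used in the examples earlier in this guide.

```python
def build_similarity_payload(media_id, collections="all", max_results=10,
                             min_score=0.7, modality_filter=None):
    """Build a request body for /v1/search/similar with basic validation."""
    if not 0.0 <= min_score <= 1.0:
        raise ValueError("min_score must be between 0.0 and 1.0")
    if max_results < 1:
        raise ValueError("max_results must be at least 1")

    if isinstance(collections, str):
        target = collections  # a single collection name, or "all"
    else:
        target = list(collections)  # an explicit list of collection names
        if not target:
            raise ValueError("collections must not be empty")

    payload = {
        "media_id": media_id,
        "collections": target,
        "max_results": max_results,
        "min_score": min_score,
    }
    if modality_filter is not None:
        payload["modality_filter"] = modality_filter
    return payload


# Narrow, fast search: one collection, strict threshold, few results
narrow = build_similarity_payload("vec-a1b2c3d4", "research-papers",
                                  max_results=5, min_score=0.8)
# Broad search: every collection, permissive threshold
broad = build_similarity_payload("vec-a1b2c3d4", "all",
                                 max_results=20, min_score=0.65)
print(narrow)
print(broad)
```

The returned dictionary can be passed as the json argument of requests.post, as in the earlier examples.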
Building Effective Recommendation Systems¶
- Two-Tier Recommendations - Combine results from a strict threshold (0.8+) and a permissive threshold (0.65+)
- Diversify with Clusters - Show recommendations from different semantic clusters
- Cross-Modal Recommendations - Show related items across different media types
Example implementation:
import requests


def get_diverse_recommendations(media_id, api_key):
    """Get diverse recommendations across different semantic clusters"""
    headers = {
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    }
    # Request with clustering enabled
    data = {
        "media_id": media_id,
        "collections": "all",
        "max_results": 20,
        "min_score": 0.65,
        "cluster": True,
        "include_reference": False
    }
    response = requests.post(
        "https://api.wickson.ai/v1/search/similar",
        headers=headers,
        json=data
    )
    if response.status_code != 200:
        return []
    result_data = response.json()["data"]
    # Group by cluster for diverse recommendations
    if "relationships" not in result_data or "clusters" not in result_data["relationships"]:
        return result_data["results"][:10]  # Fallback to top 10 without clustering
    # Get cluster information
    clusters = result_data["relationships"]["clusters"]["clusters"]
    # Build diverse recommendations by taking top items from each cluster
    diverse_recommendations = []
    # For each cluster, add the highest-scoring item
    for cluster_name, item_ids in clusters.items():
        if len(item_ids) > 0:
            # Find items in this cluster
            cluster_items = [r for r in result_data["results"] if r["media_id"] in item_ids]
            # Sort by score
            cluster_items.sort(key=lambda x: x["score"], reverse=True)
            # Add top item from cluster
            if cluster_items:
                diverse_recommendations.append(cluster_items[0])
    # If we need more items to reach desired count
    if len(diverse_recommendations) < 10:
        # Add remaining top-scoring items not already included
        added_ids = {r["media_id"] for r in diverse_recommendations}
        for result in result_data["results"]:
            if result["media_id"] not in added_ids and len(diverse_recommendations) < 10:
                diverse_recommendations.append(result)
                added_ids.add(result["media_id"])
    return diverse_recommendations


# Example usage
api_key = "YOUR_API_KEY"
diverse_recommendations = get_diverse_recommendations("vec-a1b2c3d4", api_key)
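The two-tier idea from the list above can be sketched as a merge of two result lists: one fetched with a strict min_score (0.8 or higher) and one with a permissive min_score (0.65 or higher). The helper below is a hypothetical illustration that works on already-fetched results; it assumes each result has media_id and score fields, as in the examples in this guide, and the sample records are made up.

```python
def merge_two_tier(strict_results, permissive_results, limit=10):
    """Return strict-tier items first, then permissive-tier items, without duplicates."""
    if limit <= 0:
        return []
    merged = []
    seen_ids = set()
    tiers = (("strict", strict_results), ("permissive", permissive_results))
    for tier_name, results in tiers:
        # Within each tier, highest score first
        for result in sorted(results, key=lambda r: r["score"], reverse=True):
            if result["media_id"] in seen_ids:
                continue
            seen_ids.add(result["media_id"])
            merged.append({**result, "tier": tier_name})
            if len(merged) == limit:
                return merged
    return merged


# Made-up results: the permissive list repeats the strict items and adds more
strict = [
    {"media_id": "vec-1", "score": 0.92},
    {"media_id": "vec-2", "score": 0.83},
]
permissive = [
    {"media_id": "vec-1", "score": 0.92},
    {"media_id": "vec-2", "score": 0.83},
    {"media_id": "vec-3", "score": 0.71},
    {"media_id": "vec-4", "score": 0.66},
]

two_tier = merge_two_tier(strict, permissive, limit=3)
for item in two_tier:
    print(item["media_id"], item["score"], item["tier"])
```

Tagging each item with its tier lets a UI present the strict matches as primary recommendations and the permissive ones as a secondary "you might also like" row.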
Cost Considerations¶
| Operation | Cost | Notes |
|---|---|---|
| Similarity Search | $0.01 per request | Same cost regardless of number of collections searched |
The cost remains the same regardless of how many collections you search. This makes it cost-effective to search across your entire content library.
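Because pricing is per request, a rough budget only needs the expected number of similarity searches. The sketch below uses the $0.01 figure from the table above; the function name and the traffic numbers are made up for illustration, and actual billing may differ from this estimate.

```python
from decimal import Decimal

PRICE_PER_REQUEST = Decimal("0.01")  # USD, from the cost table above


def estimate_cost(num_requests):
    """Estimate the cost in USD of a number of similarity search requests."""
    if num_requests < 0:
        raise ValueError("num_requests must not be negative")
    return PRICE_PER_REQUEST * num_requests


# Hypothetical traffic: 2,500 "more like this" lookups per day for 30 days
daily = estimate_cost(2500)
monthly = estimate_cost(2500 * 30)
print(f"Daily: ${daily}, monthly: ${monthly}")
```

Decimal is used instead of float so that currency amounts do not pick up binary rounding errors.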
Troubleshooting¶
| Issue | Solutions |
|---|---|
| Reference media item not found | |
| No results returned | |
| Too many unrelated results | |
| Only finding same media type | |
| Slow response times | |
Comparison: Regular Search vs. Similarity Search¶
| Feature | Regular Search | Similarity Search |
|---|---|---|
| Input | Text query | Media item ID |
| Best for | Finding content matching specific criteria | Finding content similar to an existing item |
| Use case | "Find all documents about machine learning" | "Find content similar to this specific document" |
| Primary mechanism | Query-to-vector matching | Vector-to-vector comparison |
When building applications, you'll often use these search methods together:
- Use regular search for initial content discovery
- Then offer similarity search for "more like this" functionality
Similarity Search vs. Advanced R3F Search¶
Similarity search complements the Advanced R3F search in your application workflow:
| Similarity Search | Advanced R3F Search |
|---|---|
| Begins with a specific item | Begins with a text query |
| Focuses on finding similar items | Explores contextual connections and relationships |
| Perfect for recommendations | Perfect for research and exploration |
| Item-centric discovery | Query-centric discovery |