Similarity Search Guide¶

Introduction to Similarity Search¶

The Wickson API's similarity search lets you discover content that is semantically similar to a specific media item in your collections. You can use it to build recommendation systems and "more like this" features, and to surface connections between stored media of different types.

Unlike traditional search where you provide a text query, similarity search uses an existing media item as your "query" to find other related items based on semantic meaning, regardless of whether they share explicit keywords or tags.

Use similarity search whenever you already have an item and want media that is conceptually related to it. Typical cases are recommendation engines, content discovery, and exploring relationships in your data.

Key Concepts¶

Vector-to-Vector Matching¶

Similarity search works by comparing the vector representation of your reference item with vectors of other items in your collections. Items positioned close to each other in the multidimensional vector space are considered semantically similar.
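
To make "close in vector space" concrete, here is a minimal sketch that ranks two candidates against a reference vector. It assumes cosine similarity as the closeness measure; this guide does not state which metric the API uses, and the three-dimensional vectors and item names are invented for illustration (real embeddings have far more dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, not real API output
reference = [0.9, 0.1, 0.3]
candidates = {
    "glacier-photo": [0.8, 0.2, 0.4],
    "tax-form": [0.0, 1.0, 0.1],
}

# Rank candidates from most to least similar to the reference
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(reference, candidates[name]),
    reverse=True,
)
print(ranked)
```

The candidate whose vector points in nearly the same direction as the reference ranks first, which is the intuition behind the similarity scores the API returns.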

Similarity Scores and Explanations¶

Each result includes a similarity score (0.0-1.0) that indicates how closely it relates to your reference item. The API also provides human-readable explanations of why items match, helping you understand the relationships between your content.

Similarity search can discover connections between different media types. For example, a document about climate change might be connected to images of melting glaciers, videos of extreme weather, or audio recordings of expert interviews.
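
When results span several media types, it can help to group them client-side before display. The sketch below assumes each result carries `media_id` and `metadata.media_type`, as in the response examples later in this guide; `group_by_media_type` and the sample results are hypothetical.

```python
from collections import defaultdict

def group_by_media_type(results):
    """Group result IDs by their media type, preserving result order."""
    grouped = defaultdict(list)
    for result in results:
        media_type = result.get("metadata", {}).get("media_type", "unknown")
        grouped[media_type].append(result["media_id"])
    return dict(grouped)

# Hypothetical results for a climate-change document used as the reference
sample_results = [
    {"media_id": "vec-1", "metadata": {"media_type": "image"}},
    {"media_id": "vec-2", "metadata": {"media_type": "video"}},
    {"media_id": "vec-3", "metadata": {"media_type": "image"}},
]

by_type = group_by_media_type(sample_results)
print(by_type)
```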

Clustering¶

You can optionally group results into semantic clusters that represent different themes or aspects of similarity. Clustering organizes results into meaningful categories and helps surface diverse but related content.

Working with Similarity Search¶

Basic Similarity Search Example¶

Python¶

import requests

# Configuration
api_key = "YOUR_API_KEY"

# Create similarity search request
search_data = {
    "media_id": "vec-a1b2c3d4",      # ID of your reference item
    "collections": "research-papers", # Collection to search within
    "max_results": 10,                # Number of similar items to return
    "min_score": 0.7                  # Minimum similarity threshold (0.0-1.0)
}

# Execute search
response = requests.post(
    "https://api.wickson.ai/v1/search/similar",
    headers={
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    },
    json=search_data
)

# Process response
if response.status_code == 200:
    data = response.json()["data"]

    # Print search information
    print(f"Similar to: {data['meta']['reference_item']}")
    print(f"Found {data['meta']['stats']['total_results']} similar items")

    # Display results
    for i, result in enumerate(data["results"], 1):
        print(f"\n{i}. {result['metadata']['file_info']['filename']} (Similarity: {result['similarity_percentage']})")
        print(f"   {result['relevance_explanation']}")
        print(f"   Media type: {result['metadata']['media_type']}")
else:
    print(f"Error {response.status_code}: {response.text}")

cURL¶

curl -X POST https://api.wickson.ai/v1/search/similar \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "media_id": "vec-a1b2c3d4",
    "collections": "research-papers",
    "max_results": 10,
    "min_score": 0.7
  }'

Cross-Collection Similarity Search¶

To discover similar content across multiple collections:

import requests

# Configuration
api_key = "YOUR_API_KEY"

# Create search request for multiple collections
search_data = {
    "media_id": "vec-a1b2c3d4",                    # ID of your reference item
    "collections": ["research", "reports", "news"], # Multiple collections to search
    "max_results": 15,                             # Number of similar items to return
    "min_score": 0.65,                             # More permissive threshold for broader results
    "cluster": True                                # Group results into semantic clusters
}

# Execute search
response = requests.post(
    "https://api.wickson.ai/v1/search/similar",
    headers={
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    },
    json=search_data
)

# Process response with collection information
if response.status_code == 200:
    data = response.json()["data"]

    # Print distribution across collections
    collections = data["meta"]["collections"]["result_distribution"]
    print("Results by collection:")
    for collection, count in collections.items():
        print(f"- {collection}: {count} items")

    # Display results organized by cluster
    if "relationships" in data and "clusters" in data["relationships"]:
        clusters = data["relationships"]["clusters"]["clusters"]
        cluster_metadata = data["relationships"]["clusters"]["metadata"]

        print("\nResults by semantic cluster:")
        for cluster_name, item_ids in clusters.items():
            print(f"\n{cluster_name.upper()} CLUSTER ({len(item_ids)} items):")
            for item_id in item_ids:
                # Find this item in results
                item = next((r for r in data["results"] if r["media_id"] == item_id), None)
                if item:
                    print(f"- {item['metadata']['file_info']['filename']} ({item['similarity_percentage']})")
else:
    print(f"Error {response.status_code}: {response.text}")

Building a Recommendation System¶

Here's how to implement "more like this" functionality for content recommendations:

import requests

def get_similar_items(media_id, api_key, max_recommendations=6):
    """Get content recommendations similar to the specified item"""

    headers = {
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    }

    # Create recommendation request
    data = {
        "media_id": media_id,
        "collections": "all",           # Search across all collections
        "max_results": max_recommendations,
        "min_score": 0.75,              # High threshold for quality recommendations
        "include_reference": False,     # Don't include the reference item itself
        "modality_filter": None         # Allow any media type for diverse recommendations
    }

    # Execute search
    response = requests.post(
        "https://api.wickson.ai/v1/search/similar",
        headers=headers,
        json=data
    )

    if response.status_code != 200:
        print(f"Error {response.status_code}: {response.text}")
        return []

    # Process and format recommendations
    recommendations = []
    results = response.json()["data"]["results"]

    for result in results:
        recommendations.append({
            "id": result["media_id"],
            "title": result["metadata"]["file_info"]["filename"],
            "similarity": result["similarity_percentage"],
            "media_type": result["metadata"]["media_type"],
            "description": result["metadata"]["search_metadata"].get("summary", ""),
            "relevance": result["relevance_explanation"]
        })

    return recommendations

# Example usage
api_key = "YOUR_API_KEY"
recommendations = get_similar_items("vec-a1b2c3d4", api_key)

# Display recommendations
print("Recommended content:")
for i, rec in enumerate(recommendations, 1):
    print(f"{i}. {rec['title']} ({rec['similarity']})")
    print(f"   {rec['description']}")
    print(f"   Relevance: {rec['relevance']}")

Filtering Similar Results by Media Type¶

To find similar content of a specific media type:

import requests

# Configuration
api_key = "YOUR_API_KEY"

# Search for similar images only
search_data = {
    "media_id": "vec-a1b2c3d4",      # ID of your reference item (any media type)
    "collections": "all",            # Search all collections
    "modality_filter": "image",      # Only return image results
    "max_results": 12,
    "min_score": 0.65
}

# Execute search
response = requests.post(
    "https://api.wickson.ai/v1/search/similar",
    headers={
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    },
    json=search_data
)

# Process image results
if response.status_code == 200:
    data = response.json()["data"]

    print(f"Found {len(data['results'])} similar images")

    for i, result in enumerate(data["results"], 1):
        print(f"\n{i}. {result['metadata']['file_info']['filename']} ({result['similarity_percentage']})")

        # Display image-specific metadata
        if "modality_metadata" in result["metadata"] and "visual" in result["metadata"]["modality_metadata"]:
            visual = result["metadata"]["modality_metadata"]["visual"]
            if "scene_type" in visual:
                print(f"   Scene type: {visual['scene_type']}")
            if "visual_elements" in visual:
                print(f"   Contains: {', '.join(visual['visual_elements'])}")
else:
    print(f"Error {response.status_code}: {response.text}")

Collection Targeting Strategies¶

The API offers three ways to target collections for similarity searches:

Single Collection Search¶

"collections": "research-papers"  # Search within a single collection

- Fastest performance
- Ideal when you know which collection contains relevant content
- Perfect for topic-specific recommendations

Multiple Collection Search¶

"collections": ["research", "reports", "news"]  # Search specific collections

- Searches only specified collections
- Balances performance and coverage
- Great for related but distinct content groups

All-Collection Search¶

"collections": "all"  # Search across your entire content library

- Most comprehensive but potentially slower
- Discovers connections across your entire content library
- Ideal for global content exploration
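
If your application lets users pick collections, a small helper can normalize their selection into one of the three forms above. This is a sketch only: `collections_param` is a hypothetical name, and whether the API also accepts a one-element list is not stated here, so the helper collapses that case to a plain string.

```python
def collections_param(names=None):
    """Return a value for the "collections" field in one of the three documented forms."""
    if not names:
        return "all"                      # nothing selected: search everything
    if isinstance(names, str):
        return names                      # already a single collection name
    unique = list(dict.fromkeys(names))   # drop duplicates, keep original order
    if len(unique) == 1:
        return unique[0]                  # single collection as a string
    return unique                         # multiple collections as a list

print(collections_param())
print(collections_param(["research-papers"]))
print(collections_param(["research", "reports", "research"]))
```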

Understanding Similarity Results¶

The similarity search response includes rich information about why content is similar:

Similarity Score and Explanation¶

Each result contains:

score: Similarity score from 0.0 to 1.0
similarity_percentage: Human-friendly percentage (e.g., "87.5%")
relevance_explanation: Human-readable explanation of the match

The explanation helps you understand why items are connected, for example:

"Very similar (91.0%) | Same type: document | Shared topics: climate change, research"

This tells you that:

The match is very strong (91.0% similarity)
Both items are documents
They share the topics "climate change" and "research"
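
If you want to show these parts separately in a UI, the string can be split on its separators. The sketch below infers the format from the single example above (segments separated by `|`, details written as `label: value`); the API may emit other segments, so treat the parsing as best-effort. `parse_explanation` is a hypothetical helper.

```python
def parse_explanation(explanation):
    """Split an explanation string into a headline and labeled details."""
    parts = [part.strip() for part in explanation.split("|")]
    parsed = {"headline": parts[0], "details": {}}
    for part in parts[1:]:
        label, sep, value = part.partition(":")
        if sep:  # keep only segments that look like "label: value"
            parsed["details"][label.strip()] = value.strip()
    return parsed

example = "Very similar (91.0%) | Same type: document | Shared topics: climate change, research"
parsed = parse_explanation(example)

# Shared topics are comma-separated in the example above
topics = [t.strip() for t in parsed["details"].get("Shared topics", "").split(",") if t.strip()]
print(parsed["headline"], topics)
```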

Semantic Clusters¶

When you set "cluster": true in the request body (True in a Python dict), results are grouped into semantic themes:

"relationships": {
  "clusters": {
    "clusters": {
      "machine_learning": ["vec-e5f6g7h8", "vec-i9j0k1l2"],
      "applications": ["vec-m3n4o5p6", "vec-q7r8s9t0"]
    },
    "metadata": {
      "machine_learning": {
        "modality": "document",
        "avg_score": 0.88,
        "member_count": 2
      },
      "applications": {
        "modality": "mixed",
        "avg_score": 0.82,
        "member_count": 2
      }
    }
  }
}

This structure shows:

Results are grouped into two clusters: "machine_learning" and "applications"
The "machine_learning" cluster contains document media (average similarity 88%)
The "applications" cluster contains mixed media types (average similarity 82%)
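
A short sketch of reading this structure in code: it joins each cluster's member list with its metadata and orders clusters by average score. The field names follow the sample above; `summarize_clusters` is a hypothetical helper, and missing metadata is tolerated rather than assumed present.

```python
def summarize_clusters(relationships):
    """Combine cluster membership and metadata, strongest cluster first."""
    cluster_block = relationships.get("clusters", {})
    members = cluster_block.get("clusters", {})
    metadata = cluster_block.get("metadata", {})

    summary = []
    for name, item_ids in members.items():
        info = metadata.get(name, {})
        summary.append({
            "name": name,
            "size": len(item_ids),
            "modality": info.get("modality", "unknown"),
            "avg_score": info.get("avg_score"),
        })
    # Clusters without an average score sort last
    summary.sort(key=lambda c: c["avg_score"] or 0.0, reverse=True)
    return summary

# Same data as the sample response above
relationships = {
    "clusters": {
        "clusters": {
            "machine_learning": ["vec-e5f6g7h8", "vec-i9j0k1l2"],
            "applications": ["vec-m3n4o5p6", "vec-q7r8s9t0"],
        },
        "metadata": {
            "machine_learning": {"modality": "document", "avg_score": 0.88, "member_count": 2},
            "applications": {"modality": "mixed", "avg_score": 0.82, "member_count": 2},
        },
    }
}

summary = summarize_clusters(relationships)
for cluster in summary:
    print(f"{cluster['name']}: {cluster['size']} items, {cluster['modality']}, avg {cluster['avg_score']}")
```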

Fine-Tuning Similarity Search¶

Adjusting Similarity Thresholds¶

The min_score parameter controls how similar items must be to appear in results:

Value	Behavior	Best for
0.85+	Only extremely similar items	Precise matches and near-duplicates
0.75-0.85	Very similar items	Strong recommendations
0.65-0.75	Moderately similar items	Balanced recommendations
0.50-0.65	Broadly similar items	Discovery of related concepts
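
The bands in this table can be encoded as a lookup, for example to label a slider in a settings UI. This is a sketch under one assumption: the table's ranges share their boundary values, so each lower bound is treated as inclusive here (0.85 counts as "extremely similar"). `SCORE_BANDS` and `describe_min_score` are hypothetical names.

```python
# (inclusive lower bound, label), ordered from strictest to most permissive
SCORE_BANDS = [
    (0.85, "extremely similar"),
    (0.75, "very similar"),
    (0.65, "moderately similar"),
    (0.50, "broadly similar"),
]

def describe_min_score(min_score):
    """Return the table's description for a given min_score value."""
    if not 0.0 <= min_score <= 1.0:
        raise ValueError("min_score must be between 0.0 and 1.0")
    for lower_bound, label in SCORE_BANDS:
        if min_score >= lower_bound:
            return label
    return "below the documented bands"

for value in (0.9, 0.7, 0.4):
    print(value, "->", describe_min_score(value))
```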

Controlling Result Diversity with Clustering¶

Enable clustering to organize results into semantic themes:

search_data = {
    "media_id": "vec-a1b2c3d4",
    "collections": "research",
    "cluster": True,        # Enable semantic clustering
    "max_results": 20       # Get more results for diverse clusters
}

This is especially useful for:

Building diverse recommendation panels
Understanding different aspects of similarity
Discovering conceptual relationships

Best Practices¶

Use Cases for Similarity Search¶

The similarity search API is ideal for:

Content Recommendations - "People who viewed this also viewed..."
Related Content Discovery - Finding content similar to what a user is currently viewing
Content Organization - Automatically grouping related content
Duplicate Detection - Finding near-duplicate content across collections
Cross-Modal Discovery - Finding connections between different media types

Effective Collection Strategy¶

Logical Grouping - Organize collections based on content purpose
Balanced Size - Aim for collections with 100-10,000 items for optimal performance
Media Type Separation - Consider separate collections for different media types when appropriate

Performance Optimization¶

Target Specific Collections - Search fewer collections for faster results
Set Appropriate Score Thresholds - Use higher thresholds for precision, lower for recall
Limit Response Size - Use appropriate max_results for your application's needs

Building Effective Recommendation Systems¶

Two-Tier Recommendations - Combine results from a strict threshold (0.8+) and a permissive threshold (0.65+)
Diversify with Clusters - Show recommendations from different semantic clusters
Cross-Modal Recommendations - Show related items across different media types

Example implementation:

import requests

def get_diverse_recommendations(media_id, api_key):
    """Get diverse recommendations across different semantic clusters"""

    headers = {
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    }

    # Request with clustering enabled
    data = {
        "media_id": media_id,
        "collections": "all",
        "max_results": 20,
        "min_score": 0.65,
        "cluster": True,
        "include_reference": False
    }

    response = requests.post(
        "https://api.wickson.ai/v1/search/similar",
        headers=headers,
        json=data
    )

    if response.status_code != 200:
        return []

    result_data = response.json()["data"]

    # Group by cluster for diverse recommendations
    if "relationships" not in result_data or "clusters" not in result_data["relationships"]:
        return result_data["results"][:10]  # Fallback to top 10 without clustering

    # Get cluster information
    clusters = result_data["relationships"]["clusters"]["clusters"]

    # Build diverse recommendations by taking top items from each cluster
    diverse_recommendations = []

    # For each cluster, add the highest-scoring item
    for cluster_name, item_ids in clusters.items():
        if len(item_ids) > 0:
            # Find items in this cluster
            cluster_items = [r for r in result_data["results"] if r["media_id"] in item_ids]
            # Sort by score
            cluster_items.sort(key=lambda x: x["score"], reverse=True)
            # Add top item from cluster
            if cluster_items:
                diverse_recommendations.append(cluster_items[0])

    # If we need more items to reach desired count
    if len(diverse_recommendations) < 10:
        # Add remaining top-scoring items not already included
        added_ids = {r["media_id"] for r in diverse_recommendations}
        for result in result_data["results"]:
            if result["media_id"] not in added_ids and len(diverse_recommendations) < 10:
                diverse_recommendations.append(result)
                added_ids.add(result["media_id"])

    return diverse_recommendations

# Example usage
api_key = "YOUR_API_KEY"
diverse_recommendations = get_diverse_recommendations("vec-a1b2c3d4", api_key)
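
The two-tier approach from the list above can be sketched in a similar way. Rather than issuing two requests, this version assumes you have made one request with `min_score` set to the permissive threshold and then splits the returned results client-side by their `score` field. `two_tier` and the tier names are hypothetical.

```python
def two_tier(results, strict=0.8, permissive=0.65, limit=10):
    """Split results into a strict tier and a permissive tier by score."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    top_matches = [r for r in ranked if r["score"] >= strict][:limit]
    # Fill the remaining slots with items between the two thresholds
    remaining = limit - len(top_matches)
    also_related = [r for r in ranked if permissive <= r["score"] < strict][:remaining]
    return {"top_matches": top_matches, "also_related": also_related}

# Hypothetical results, shaped like entries in data["results"]
sample = [
    {"media_id": "a", "score": 0.91},
    {"media_id": "b", "score": 0.70},
    {"media_id": "c", "score": 0.82},
    {"media_id": "d", "score": 0.60},
]

tiers = two_tier(sample)
print([r["media_id"] for r in tiers["top_matches"]])
print([r["media_id"] for r in tiers["also_related"]])
```

Splitting one response keeps the request count, and therefore the cost, at a single search per recommendation panel.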

Cost Considerations¶

Operation	Cost	Notes
Similarity Search	$0.01 per request	Same cost regardless of number of collections searched

Because pricing is per request rather than per collection, searching your entire content library costs the same as searching a single collection.
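
For budgeting, the arithmetic is simply requests multiplied by the per-request price. The sketch below uses the $0.01 figure from the table; the traffic numbers are invented, and `estimate_cost` is a hypothetical helper. `Decimal` is used to avoid floating-point rounding in currency math.

```python
from decimal import Decimal

COST_PER_REQUEST = Decimal("0.01")  # USD, from the table above

def estimate_cost(requests_per_day, days=30):
    """Estimated spend in USD for a steady daily request volume."""
    return COST_PER_REQUEST * requests_per_day * days

# Hypothetical traffic: 5,000 similarity searches per day for 30 days
monthly = estimate_cost(5000)
print(f"Estimated monthly cost: ${monthly}")
```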

Troubleshooting¶

Issue	Solutions
`Reference media item not found`	Verify the `media_id` is correct; ensure you're searching the collection where the item exists; try using `"collections": "all"`
No results returned	Lower the `min_score` threshold; search more collections; remove any filters that might be too restrictive
Too many unrelated results	Increase the `min_score` threshold; add appropriate filters; search only in relevant collections
Only finding same media type	Remove any `modality_filter`; ensure you have diverse media types in your collections
Slow response times	Search fewer collections; reduce `max_results`; add specific filters to narrow the search space
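
The "no results returned" remedy can be automated by retrying with progressively lower thresholds. In this sketch, `search` is any callable you supply that takes `(media_id, min_score)` and returns a list of results, for example a thin wrapper around the `requests.post` calls shown earlier. The function name, the threshold sequence, and the stand-in search are all assumptions. Note that each attempt is a separate billed request.

```python
def search_with_fallback(search, media_id, thresholds=(0.75, 0.65, 0.5)):
    """Try each min_score in order and return the first non-empty result set."""
    for min_score in thresholds:
        results = search(media_id, min_score)
        if results:
            return min_score, results
    return None, []  # nothing found even at the lowest threshold

# Stand-in for a real API call, used here only to demonstrate the flow
def fake_search(media_id, min_score):
    catalog = [{"media_id": "vec-x", "score": 0.68}]
    return [r for r in catalog if r["score"] >= min_score]

used_threshold, found = search_with_fallback(fake_search, "vec-a1b2c3d4")
print(used_threshold, [r["media_id"] for r in found])
```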

Comparison: Regular Search vs. Similarity Search¶

Feature	Regular Search	Similarity Search
Input	Text query	Media item ID
Best for	Finding content matching specific criteria	Finding content similar to an existing item
Use case	"Find all documents about machine learning"	"Find content similar to this specific document"
Primary mechanism	Query-to-vector matching	Vector-to-vector comparison

When building applications, you'll often use these search methods together:

Use regular search for initial content discovery
Then offer similarity search for "more like this" functionality

Similarity Search vs. Advanced R3F Search¶

Similarity search complements the Advanced R3F search in your application workflow:

Similarity Search	Advanced R3F Search
Begins with a specific item	Begins with a text query
Focuses on finding similar items	Explores contextual connections and relationships
Perfect for recommendations	Perfect for research and exploration
Item-centric discovery	Query-centric discovery