Similarity Search Guide

The Wickson API's similarity search lets you discover content that is semantically similar to a specific media item in your collections. You can use it to build recommendation systems and "more like this" features, and to discover connections between stored media of different types.

Unlike traditional search where you provide a text query, similarity search uses an existing media item as your "query" to find other related items based on semantic meaning, regardless of whether they share explicit keywords or tags.

Use similarity search when you already have an item and want media that is conceptually related to it. Typical uses are recommendation engines, content discovery, and exploring relationships in your data.

Key Concepts

Vector-to-Vector Matching

Similarity search works by comparing the vector representation of your reference item with vectors of other items in your collections. Items positioned close to each other in the multidimensional vector space are considered semantically similar.
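
To make "close in vector space" concrete, here is a minimal sketch that compares toy vectors with cosine similarity. This guide does not state which distance metric the API uses, so cosine similarity is an assumption chosen for illustration, and the three-dimensional vectors are made up (real embeddings have far more dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)

# Toy vectors, invented for this example
reference = [1.0, 0.0, 1.0]
close_item = [0.9, 0.1, 1.0]   # Points in nearly the same direction
far_item = [0.0, 1.0, 0.0]     # Orthogonal to the reference

print(cosine_similarity(reference, close_item))
print(cosine_similarity(reference, far_item))
```

The first value is close to 1 and the second is 0, which mirrors how a nearby item would rank above a distant one.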

Similarity Scores and Explanations

Each result includes a similarity score (0.0-1.0) that indicates how closely it relates to your reference item. The API also provides human-readable explanations of why items match, helping you understand the relationships between your content.

Cross-Modal Connections

Similarity search can discover connections between different media types. For example, a document about climate change might be connected to images of melting glaciers, videos of extreme weather, or audio recordings of expert interviews.
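
One way to surface cross-modal connections in your own code is to group results by media type. The sketch below assumes each result carries metadata.media_type, as in the response-handling examples later in this guide; the sample data and the helper name group_by_media_type are hypothetical.

```python
from collections import defaultdict

def group_by_media_type(results):
    """Group similarity results by their metadata.media_type value."""
    grouped = defaultdict(list)
    for result in results:
        # Fall back to "unknown" if the field is missing
        media_type = result.get("metadata", {}).get("media_type", "unknown")
        grouped[media_type].append(result)
    return dict(grouped)

# Made-up results shaped like the response fields used in this guide
sample_results = [
    {"media_id": "vec-1", "metadata": {"media_type": "document"}},
    {"media_id": "vec-2", "metadata": {"media_type": "image"}},
    {"media_id": "vec-3", "metadata": {"media_type": "image"}},
]

for media_type, items in group_by_media_type(sample_results).items():
    print(f"{media_type}: {[item['media_id'] for item in items]}")
```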

Clustering

You can optionally group results into semantic clusters that represent different themes or aspects of similarity. This helps organize results into meaningful categories and surface diverse but related content.

Basic Similarity Search Example

Python

import requests

# Configuration
api_key = "YOUR_API_KEY"

# Create similarity search request
search_data = {
    "media_id": "vec-a1b2c3d4",      # ID of your reference item
    "collections": "research-papers", # Collection to search within
    "max_results": 10,                # Number of similar items to return
    "min_score": 0.7                  # Minimum similarity threshold (0.0-1.0)
}

# Execute search
response = requests.post(
    "https://api.wickson.ai/v1/search/similar",
    headers={
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    },
    json=search_data
)

# Process response
if response.status_code == 200:
    data = response.json()["data"]

    # Print search information
    print(f"Similar to: {data['meta']['reference_item']}")
    print(f"Found {data['meta']['stats']['total_results']} similar items")

    # Display results
    for i, result in enumerate(data["results"], 1):
        print(f"\n{i}. {result['metadata']['file_info']['filename']} (Similarity: {result['similarity_percentage']})")
        print(f"   {result['relevance_explanation']}")
        print(f"   Media type: {result['metadata']['media_type']}")
else:
    print(f"Error {response.status_code}: {response.text}")

cURL

curl -X POST https://api.wickson.ai/v1/search/similar \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "media_id": "vec-a1b2c3d4",
    "collections": "research-papers",
    "max_results": 10,
    "min_score": 0.7
  }'

To discover similar content across multiple collections:

import requests

# Configuration
api_key = "YOUR_API_KEY"

# Create search request for multiple collections
search_data = {
    "media_id": "vec-a1b2c3d4",                    # ID of your reference item
    "collections": ["research", "reports", "news"], # Multiple collections to search
    "max_results": 15,                             # Number of similar items to return
    "min_score": 0.65,                             # More permissive threshold for broader results
    "cluster": True                                # Group results into semantic clusters (Python bool, sent as JSON true)
}

# Execute search
response = requests.post(
    "https://api.wickson.ai/v1/search/similar",
    headers={
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    },
    json=search_data
)

# Process response with collection information
if response.status_code == 200:
    data = response.json()["data"]

    # Print distribution across collections
    collections = data["meta"]["collections"]["result_distribution"]
    print("Results by collection:")
    for collection, count in collections.items():
        print(f"- {collection}: {count} items")

    # Display results organized by cluster
    if "relationships" in data and "clusters" in data["relationships"]:
        clusters = data["relationships"]["clusters"]["clusters"]
        cluster_metadata = data["relationships"]["clusters"]["metadata"]

        print("\nResults by semantic cluster:")
        for cluster_name, item_ids in clusters.items():
            print(f"\n{cluster_name.upper()} CLUSTER ({len(item_ids)} items):")
            for item_id in item_ids:
                # Find this item in results
                item = next((r for r in data["results"] if r["media_id"] == item_id), None)
                if item:
                    print(f"- {item['metadata']['file_info']['filename']} ({item['similarity_percentage']})")
else:
    print(f"Error {response.status_code}: {response.text}")

Building a Recommendation System

Here's how to implement "more like this" functionality for content recommendations:

import requests

def get_similar_items(media_id, api_key, max_recommendations=6):
    """Get content recommendations similar to the specified item"""

    headers = {
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    }

    # Create recommendation request
    data = {
        "media_id": media_id,
        "collections": "all",           # Search across all collections
        "max_results": max_recommendations,
        "min_score": 0.75,              # High threshold for quality recommendations
        "include_reference": False,     # Don't include the reference item itself
        "modality_filter": None         # Allow any media type for diverse recommendations
    }

    # Execute search
    response = requests.post(
        "https://api.wickson.ai/v1/search/similar",
        headers=headers,
        json=data
    )

    if response.status_code != 200:
        print(f"Error {response.status_code}: {response.text}")
        return []

    # Process and format recommendations
    recommendations = []
    results = response.json()["data"]["results"]

    for result in results:
        recommendations.append({
            "id": result["media_id"],
            "title": result["metadata"]["file_info"]["filename"],
            "similarity": result["similarity_percentage"],
            "media_type": result["metadata"]["media_type"],
            "description": result["metadata"]["search_metadata"].get("summary", ""),
            "relevance": result["relevance_explanation"]
        })

    return recommendations

# Example usage
api_key = "YOUR_API_KEY"
recommendations = get_similar_items("vec-a1b2c3d4", api_key)

# Display recommendations
print("Recommended content:")
for i, rec in enumerate(recommendations, 1):
    print(f"{i}. {rec['title']} ({rec['similarity']})")
    print(f"   {rec['description']}")
    print(f"   Relevance: {rec['relevance']}")

Filtering Similar Results by Media Type

To find similar content of a specific media type:

import requests

# Configuration
api_key = "YOUR_API_KEY"

# Search for similar images only
search_data = {
    "media_id": "vec-a1b2c3d4",      # ID of your reference item (any media type)
    "collections": "all",            # Search all collections
    "modality_filter": "image",      # Only return image results
    "max_results": 12,
    "min_score": 0.65
}

# Execute search
response = requests.post(
    "https://api.wickson.ai/v1/search/similar",
    headers={
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    },
    json=search_data
)

# Process image results
if response.status_code == 200:
    data = response.json()["data"]

    print(f"Found {len(data['results'])} similar images")

    for i, result in enumerate(data["results"], 1):
        print(f"\n{i}. {result['metadata']['file_info']['filename']} ({result['similarity_percentage']})")

        # Display image-specific metadata
        if "modality_metadata" in result["metadata"] and "visual" in result["metadata"]["modality_metadata"]:
            visual = result["metadata"]["modality_metadata"]["visual"]
            if "scene_type" in visual:
                print(f"   Scene type: {visual['scene_type']}")
            if "visual_elements" in visual:
                print(f"   Contains: {', '.join(visual['visual_elements'])}")
else:
    print(f"Error {response.status_code}: {response.text}")

Collection Targeting Strategies

The API offers three ways to target collections for similarity searches:

"collections": "research-papers"  # Search within a single collection

  • Fastest performance
  • Ideal when you know which collection contains relevant content
  • Perfect for topic-specific recommendations

"collections": ["research", "reports", "news"]  # Search specific collections

  • Searches only specified collections
  • Balances performance and coverage
  • Great for related but distinct content groups

"collections": "all"  # Search across your entire content library

  • Most comprehensive but potentially slower
  • Discovers connections across your entire content library
  • Ideal for global content exploration
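
If your application accepts collection names from several sources, a small client-side helper can produce one of the three forms above. This is a sketch only: normalize_collections is a hypothetical name, and collapsing a one-element list to a plain string assumes the API treats both forms the same way, which this guide does not confirm.

```python
def normalize_collections(collections):
    """Return a value suitable for the "collections" request field.

    Accepts a single name, the string "all", or an iterable of names.
    """
    if isinstance(collections, str):
        name = collections.strip()
        if not name:
            raise ValueError("collection name must not be empty")
        return name

    names = [str(c).strip() for c in collections]
    names = [n for n in names if n]
    if not names:
        raise ValueError("at least one collection is required")

    # Drop duplicates while keeping the original order
    unique = list(dict.fromkeys(names))
    # Assumption: a one-element list is equivalent to the single-name form
    return unique[0] if len(unique) == 1 else unique

print(normalize_collections("all"))
print(normalize_collections(["research", "reports", "research"]))
print(normalize_collections(["news"]))
```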

Understanding Similarity Results

The similarity search response includes rich information about why content is similar:

Similarity Score and Explanation

Each result contains:

  • score: Similarity score from 0.0 to 1.0
  • similarity_percentage: Human-friendly percentage (e.g., "87.5%")
  • relevance_explanation: Human-readable explanation of the match

The explanation helps you understand why items are connected, for example:

"Very similar (91.0%) | Same type: document | Shared topics: climate change, research"

This tells you that:

  • The match is very strong (91.0% similarity)
  • Both items are documents
  • They share the topics "climate change" and "research"
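
If you want these parts as separate values, the explanation string can be split on its "|" separators. The sketch below infers the format from the single example above, so treat it as fragile: the real strings may vary, and the numeric score field is the safer source for the similarity value. The function name is hypothetical.

```python
import re

def parse_relevance_explanation(explanation):
    """Split an explanation string into a label, a percentage, and key/value fields."""
    parts = [p.strip() for p in explanation.split("|") if p.strip()]
    parsed = {"label": None, "percentage": None, "fields": {}}
    if not parts:
        return parsed

    # First segment is expected to look like "Very similar (91.0%)"
    match = re.match(r"^(.*?)\s*\(([\d.]+)%\)$", parts[0])
    if match:
        parsed["label"] = match.group(1)
        parsed["percentage"] = float(match.group(2))
    else:
        parsed["label"] = parts[0]

    # Remaining segments are expected to look like "Key: value"
    for part in parts[1:]:
        key, sep, value = part.partition(":")
        if sep:
            parsed["fields"][key.strip()] = value.strip()
    return parsed

example = "Very similar (91.0%) | Same type: document | Shared topics: climate change, research"
parsed = parse_relevance_explanation(example)
print(parsed["label"], parsed["percentage"])
print(parsed["fields"])
```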

Semantic Clusters

When the request sets "cluster": true (True in Python), results are grouped into semantic themes:

"relationships": {
  "clusters": {
    "clusters": {
      "machine_learning": ["vec-e5f6g7h8", "vec-i9j0k1l2"],
      "applications": ["vec-m3n4o5p6", "vec-q7r8s9t0"]
    },
    "metadata": {
      "machine_learning": {
        "modality": "document",
        "avg_score": 0.88,
        "member_count": 2
      },
      "applications": {
        "modality": "mixed",
        "avg_score": 0.82,
        "member_count": 2
      }
    }
  }
}

This structure shows:

  • Results are grouped into two clusters: "machine_learning" and "applications"
  • The "machine_learning" cluster contains document media (average similarity 88%)
  • The "applications" cluster contains mixed media types (average similarity 82%)
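
The same reading can be produced programmatically. This sketch walks the structure shown above and prints one summary line per cluster; the helper name summarize_clusters is hypothetical, and the field names are taken from the example response rather than from a full schema.

```python
def summarize_clusters(relationships):
    """Build one human-readable line per cluster from a "relationships" object."""
    cluster_block = relationships.get("clusters", {})
    members = cluster_block.get("clusters", {})
    metadata = cluster_block.get("metadata", {})

    lines = []
    for name, item_ids in members.items():
        info = metadata.get(name, {})
        avg = info.get("avg_score")
        avg_text = f"{avg:.0%}" if avg is not None else "n/a"
        modality = info.get("modality", "unknown")
        lines.append(
            f"{name}: {len(item_ids)} items, {modality} media, average similarity {avg_text}"
        )
    return lines

# The example structure from this section
relationships = {
    "clusters": {
        "clusters": {
            "machine_learning": ["vec-e5f6g7h8", "vec-i9j0k1l2"],
            "applications": ["vec-m3n4o5p6", "vec-q7r8s9t0"],
        },
        "metadata": {
            "machine_learning": {"modality": "document", "avg_score": 0.88, "member_count": 2},
            "applications": {"modality": "mixed", "avg_score": 0.82, "member_count": 2},
        },
    }
}

for line in summarize_clusters(relationships):
    print(line)
```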

Adjusting Similarity Thresholds

The min_score parameter controls how similar items must be to appear in results:

Value | Behavior | Best for
--- | --- | ---
0.85+ | Only extremely similar items | Precise matches and near-duplicates
0.75-0.85 | Very similar items | Strong recommendations
0.65-0.75 | Moderately similar items | Balanced recommendations
0.50-0.65 | Broadly similar items | Discovery of related concepts
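
The ranges above can be captured in a small lookup so that threshold choices are named rather than scattered as magic numbers. This is a sketch: the preset names and the describe_score helper are hypothetical, and because the listed ranges share their boundary values, the code assumes each range includes its lower bound.

```python
# Hypothetical preset names mapped to the lower bound of each documented range
MIN_SCORE_PRESETS = {
    "near_duplicates": 0.85,
    "strong": 0.75,
    "balanced": 0.65,
    "discovery": 0.50,
}

def describe_score(score):
    """Map a similarity score to the behavior label from the threshold ranges."""
    if score >= 0.85:
        return "extremely similar"
    if score >= 0.75:
        return "very similar"
    if score >= 0.65:
        return "moderately similar"
    if score >= 0.50:
        return "broadly similar"
    return "below the documented ranges"

search_data = {
    "media_id": "vec-a1b2c3d4",
    "collections": "research",
    "min_score": MIN_SCORE_PRESETS["balanced"],
}
print(search_data["min_score"], describe_score(0.7))
```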

Controlling Result Diversity with Clustering

Enable clustering to organize results into semantic themes:

search_data = {
    "media_id": "vec-a1b2c3d4",
    "collections": "research",
    "cluster": True,        # Enable semantic clustering
    "max_results": 20       # Get more results for diverse clusters
}

This is especially useful for:

  • Building diverse recommendation panels
  • Understanding different aspects of similarity
  • Discovering conceptual relationships

Best Practices

The similarity search API is ideal for:

  1. Content Recommendations - "People who viewed this also viewed..."
  2. Related Content Discovery - Finding content similar to what a user is currently viewing
  3. Content Organization - Automatically grouping related content
  4. Duplicate Detection - Finding near-duplicate content across collections
  5. Cross-Modal Discovery - Finding connections between different media types

Effective Collection Strategy

  1. Logical Grouping - Organize collections based on content purpose
  2. Balanced Size - Aim for collections with 100-10,000 items for optimal performance
  3. Media Type Separation - Consider separate collections for different media types when appropriate

Performance Optimization

  1. Target Specific Collections - Search fewer collections for faster results
  2. Set Appropriate Score Thresholds - Use higher thresholds for precision, lower for recall
  3. Limit Response Size - Use appropriate max_results for your application's needs

Building Effective Recommendation Systems

  1. Two-Tier Recommendations - Combine results from a strict threshold (0.8+) and a permissive threshold (0.65+)
  2. Diversify with Clusters - Show recommendations from different semantic clusters
  3. Cross-Modal Recommendations - Show related items across different media types

Example implementation:

import requests

def get_diverse_recommendations(media_id, api_key):
    """Get diverse recommendations across different semantic clusters"""

    headers = {
        "X-Api-Key": api_key,
        "Content-Type": "application/json"
    }

    # Request with clustering enabled
    data = {
        "media_id": media_id,
        "collections": "all",
        "max_results": 20,
        "min_score": 0.65,
        "cluster": True,
        "include_reference": False
    }

    response = requests.post(
        "https://api.wickson.ai/v1/search/similar",
        headers=headers,
        json=data
    )

    if response.status_code != 200:
        return []

    result_data = response.json()["data"]

    # Group by cluster for diverse recommendations
    if "relationships" not in result_data or "clusters" not in result_data["relationships"]:
        return result_data["results"][:10]  # Fallback to top 10 without clustering

    # Get cluster information
    clusters = result_data["relationships"]["clusters"]["clusters"]

    # Build diverse recommendations by taking top items from each cluster
    diverse_recommendations = []

    # For each cluster, add the highest-scoring item
    for cluster_name, item_ids in clusters.items():
        if len(item_ids) > 0:
            # Find items in this cluster
            cluster_items = [r for r in result_data["results"] if r["media_id"] in item_ids]
            # Sort by score
            cluster_items.sort(key=lambda x: x["score"], reverse=True)
            # Add top item from cluster
            if cluster_items:
                diverse_recommendations.append(cluster_items[0])

    # If we need more items to reach desired count
    if len(diverse_recommendations) < 10:
        # Add remaining top-scoring items not already included
        added_ids = {r["media_id"] for r in diverse_recommendations}
        for result in result_data["results"]:
            if result["media_id"] not in added_ids and len(diverse_recommendations) < 10:
                diverse_recommendations.append(result)
                added_ids.add(result["media_id"])

    return diverse_recommendations

# Example usage
api_key = "YOUR_API_KEY"
diverse_recommendations = get_diverse_recommendations("vec-a1b2c3d4", api_key)
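
The two-tier idea from the list above can be sketched in a similar way. The version below assumes you make a single request at the permissive threshold and split the results client-side by score, which is one possible reading of "combine"; issuing two separate requests would also work. The function name, tier names, and sample data are hypothetical.

```python
def two_tier_recommendations(results, strict=0.8, permissive=0.65, limit=10):
    """Split scored results into a strict tier and a permissive tier."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)

    # Tier 1: items at or above the strict threshold
    top_picks = [r for r in ranked if r["score"] >= strict][:limit]

    # Tier 2: fill the remaining slots from the permissive band
    remaining = limit - len(top_picks)
    also_related = [r for r in ranked if permissive <= r["score"] < strict][:remaining]

    return {"top_picks": top_picks, "also_related": also_related}

# Made-up results carrying only the fields this helper needs
sample = [
    {"media_id": "vec-1", "score": 0.72},
    {"media_id": "vec-2", "score": 0.91},
    {"media_id": "vec-3", "score": 0.60},
    {"media_id": "vec-4", "score": 0.83},
    {"media_id": "vec-5", "score": 0.66},
]

tiers = two_tier_recommendations(sample, limit=3)
print([r["media_id"] for r in tiers["top_picks"]])
print([r["media_id"] for r in tiers["also_related"]])
```

Showing the tiers separately (for example "Top picks" and "You might also like") keeps weaker matches from diluting the strongest ones.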

Cost Considerations

Operation | Cost | Notes
--- | --- | ---
Similarity Search | $0.01 per request | Same cost regardless of number of collections searched

Because pricing is per request rather than per collection, searching your entire content library costs the same as searching a single collection.
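
For budgeting, the arithmetic is a single multiplication. The sketch below assumes the $0.01 per-request rate listed above still applies; check current pricing before relying on it. Decimal is used to avoid floating-point rounding in currency values, and estimate_cost is a hypothetical helper name.

```python
from decimal import Decimal

# Rate taken from the cost table in this guide; verify against current pricing
COST_PER_REQUEST = Decimal("0.01")

def estimate_cost(request_count):
    """Estimate the cost in dollars of a number of similarity search requests."""
    if request_count < 0:
        raise ValueError("request_count must not be negative")
    return COST_PER_REQUEST * request_count

# Example: 2,500 "more like this" lookups
print(f"${estimate_cost(2500)}")
```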

Troubleshooting

Issue Solutions
Reference media item not found
  • Verify the media_id is correct
  • Ensure you're searching the collection where the item exists
  • Try using "collections": "all"
No results returned
  • Lower the min_score threshold
  • Search more collections
  • Remove any filters that might be too restrictive
Too many unrelated results
  • Increase the min_score threshold
  • Add appropriate filters
  • Search only in relevant collections
Only finding same media type
  • Remove any modality_filter
  • Ensure you have diverse media types in your collections
Slow response times
  • Search fewer collections
  • Reduce max_results
  • Add specific filters to narrow the search space
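
The "No results returned" advice can be automated by retrying with a progressively lower min_score. This is a sketch under stated assumptions: search_fn is a hypothetical callable that you supply (for example, a wrapper around the POST request shown earlier that returns the results list), and the default floor and step are illustrative values, not API recommendations.

```python
def search_with_relaxed_threshold(search_fn, payload, floor=0.5, step=0.1):
    """Retry a search with a lower min_score until results appear or the floor is hit.

    Returns a tuple of (results, min_score_used).
    """
    if step <= 0:
        raise ValueError("step must be positive")

    min_score = payload.get("min_score", 0.7)
    while True:
        attempt = dict(payload, min_score=min_score)
        results = search_fn(attempt)
        if results or min_score <= floor:
            return results, min_score
        # Round to avoid floating-point drift such as 0.6000000000000001
        min_score = max(round(min_score - step, 2), floor)

# Stand-in for a real API call: only "finds" items at 0.6 or below
def fake_search(request):
    return ["vec-x"] if request["min_score"] <= 0.6 else []

results, used = search_with_relaxed_threshold(
    fake_search, {"media_id": "vec-a1b2c3d4", "min_score": 0.8}
)
print(results, used)
```

Each retry is a separate request, so it is billed separately.
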

Feature | Regular Search | Similarity Search
--- | --- | ---
Input | Text query | Media item ID
Best for | Finding content matching specific criteria | Finding content similar to an existing item
Use case | "Find all documents about machine learning" | "Find content similar to this specific document"
Primary mechanism | Query-to-vector matching | Vector-to-vector comparison

When building applications, you'll often use these search methods together:

  1. Use regular search for initial content discovery
  2. Then offer similarity search for "more like this" functionality

Similarity search complements the Advanced R3F search in your application workflow:

Similarity Search | Advanced R3F Search
--- | ---
Begins with a specific item | Begins with a text query
Focuses on finding similar items | Explores contextual connections and relationships
Perfect for recommendations | Perfect for research and exploration
Item-centric discovery | Query-centric discovery