Search & Retrieval

Search is the core interaction in IPTO. Buyers submit queries, and the platform retrieves relevant content from across the datasets they have access to, ranking, deduplicating, and metering results along the way.

Retrieval modes

IPTO supports four retrieval modes that control how your query is matched against indexed content.

| Mode | Description |
| --- | --- |
| lexical | Keyword-based search using BM25 ranking. Best for precise term matching, boolean queries, and structured query syntax. |
| dense | Vector similarity search that matches on semantic meaning rather than exact keywords. Best for natural language questions. |
| hybrid | Combines lexical and dense retrieval, merging results using Reciprocal Rank Fusion (RRF). Balances precision and recall. |
| auto | The system selects the best mode based on query characteristics. This is the default. |
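To make the hybrid merge step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of object IDs. It is an illustration of the general technique, not IPTO's implementation: the function name, the sample IDs, and the constant k=60 (a commonly used value) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of IDs with Reciprocal Rank Fusion.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from the lexical and dense retrievers.
lexical_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([lexical_hits, dense_hits])
print(fused)
```

Items that rank well in both lists rise to the top, which is why fusion tends to balance the precision of lexical matching with the recall of dense matching.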

Tip

Use auto unless you have a specific reason to force a retrieval mode. The system analyzes your query structure: boolean operators and wildcards favor lexical, while natural language questions favor hybrid.
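The selection rule in the tip can be pictured with a small heuristic. This is a hypothetical sketch only: the actual signals and thresholds used by auto are not documented here, and the function name and regular expression are assumptions.

```python
import re

# Assumed signal: uppercase boolean operators or a * wildcard indicate
# structured syntax. The real analysis may use other signals.
STRUCTURED_SYNTAX = re.compile(r"\b(?:AND|OR|NOT)\b|\*")


def guess_mode(query: str) -> str:
    """Return "lexical" for structured queries, otherwise "hybrid"."""
    return "lexical" if STRUCTURED_SYNTAX.search(query) else "hybrid"


print(guess_mode("invest* AND forecast"))
print(guess_mode("what drove revenue growth last quarter?"))
```

The match is case-sensitive on purpose, so an ordinary lowercase "or" in a natural language question does not count as a boolean operator.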

Search flow

When you submit a search request, it passes through the following stages:

flowchart TD
    A[Query submitted] --> B[Authenticate & resolve tenant]
    B --> C[Resolve accessible datasets]
    C --> D[Select retrieval mode]
    D --> E[Execute search]
    E --> F[Rank & deduplicate results]
    F --> G[Build snippets & citations]
    G --> H[Meter billable results]
    H --> I[Return response]

Each stage enforces access controls, applies filters, and ensures that only authorized, active content is returned.

Query syntax

IPTO supports a structured query syntax that gives you fine-grained control over how your search is interpreted.

Boolean operators

Combine terms with boolean logic to narrow or broaden your results.

| Operator | Syntax | Example |
| --- | --- | --- |
| AND | A AND B or A B | revenue AND forecast |
| OR | A OR B | CEO OR "chief executive" |
| NOT | NOT A or -A | contract NOT termination |
| Grouping | (A OR B) AND C | (apple OR google) AND lawsuit |

Operator precedence is NOT > AND > OR, so a OR b AND c is interpreted as a OR (b AND c). Use parentheses to override precedence.
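The precedence rule is easiest to see in a parse tree. The following is a minimal recursive-descent sketch, assuming a simplified grammar (AND, OR, NOT, parentheses, quoted phrases, and implicit AND between adjacent terms). It is not IPTO's parser, it skips the -A shorthand, and it does no error handling for malformed queries.

```python
import re

# Quoted phrases, parentheses, or bare terms.
TOKEN = re.compile(r'"[^"]*"|[()]|[^\s()"]+')


def parse(query):
    """Parse a boolean query into nested tuples, honoring NOT > AND > OR."""
    tokens = TOKEN.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():  # lowest precedence
        node = parse_and()
        while peek() == "OR":
            take()
            node = ("OR", node, parse_and())
        return node

    def parse_and():
        node = parse_not()
        # Explicit AND, or implicit AND when another operand follows directly.
        while peek() not in (None, "OR", ")"):
            if peek() == "AND":
                take()
            node = ("AND", node, parse_not())
        return node

    def parse_not():  # highest precedence
        if peek() == "NOT":
            take()
            return ("NOT", parse_not())
        return parse_primary()

    def parse_primary():
        tok = take()
        if tok == "(":
            node = parse_or()
            take()  # consume the closing ")"
            return node
        return tok

    return parse_or()


print(parse("a OR b AND NOT c"))
print(parse("(apple OR google) AND lawsuit"))
```

The first query yields ("OR", "a", ("AND", "b", ("NOT", "c"))), while the parentheses in the second force the OR to bind first.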

Phrase search

Match an exact sequence of words by enclosing them in double quotes.

"chief executive officer"

Phrases are position-aware and require terms to appear adjacent and in order.

Proximity search

Find terms that appear near each other using the NEAR/n operator, where n is the maximum distance in word positions.

apple NEAR/5 lawsuit
"data breach" NEAR/10 notification

If no distance is specified, the default is NEAR/10.
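As an illustration of how a proximity match can be evaluated over word positions, here is a minimal sketch. It assumes that distance is the absolute difference between token positions and that order does not matter; neither detail is confirmed by this page, and the function name is hypothetical.

```python
def near(tokens, left, right, max_distance=10):
    """Return True if `left` and `right` occur within max_distance positions.

    The default of 10 mirrors the documented NEAR/10 default. Treating the
    match as unordered is an assumption of this sketch.
    """
    left_positions = [i for i, tok in enumerate(tokens) if tok == left]
    right_positions = [i for i, tok in enumerate(tokens) if tok == right]
    return any(
        abs(i - j) <= max_distance
        for i in left_positions
        for j in right_positions
    )


tokens = "apple faces a new antitrust lawsuit in europe".split()
print(near(tokens, "apple", "lawsuit", max_distance=5))
print(near(tokens, "apple", "lawsuit", max_distance=4))
```

In the sample sentence the two terms sit five positions apart, so NEAR/5 matches and NEAR/4 does not.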

Wildcards

Use * as a wildcard for prefix, suffix, or infix matching.

| Pattern | Matches |
| --- | --- |
| invest* | invest, investor, investment, investing |
| *ization | organization, optimization, monetization |
| c*o | CEO, CFO, CTO |

Note

Prefix wildcards require at least 3 characters before the *. Suffix and infix wildcards are more computationally expensive, so use them sparingly.
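To show what the patterns above expand to, here is a minimal sketch that translates a wildcard into a regular expression and enforces the three-character rule for prefix wildcards. The function names are hypothetical, and case-insensitive matching is an assumption inferred from c*o matching CEO.

```python
import re

MIN_PREFIX_CHARS = 3  # documented minimum before a trailing *


def wildcard_to_regex(pattern):
    """Translate a * wildcard pattern into a compiled regex (a sketch)."""
    is_prefix_only = pattern.endswith("*") and "*" not in pattern[:-1]
    if is_prefix_only and len(pattern) - 1 < MIN_PREFIX_CHARS:
        raise ValueError("prefix wildcards need at least 3 characters before *")
    # Escape the literal pieces and let each * match any run of word characters.
    parts = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile(r"\w*".join(parts), re.IGNORECASE)


def matches(pattern, term):
    """Return True if the whole term matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None


print(matches("invest*", "investor"))
print(matches("*ization", "organization"))
print(matches("c*o", "CEO"))
```

A leading * cannot use a sorted term index to narrow candidates the way a prefix can, which is one common reason suffix and infix patterns cost more.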

Field-scoped search

Target specific fields in the indexed content.

title:"quarterly report"
body:compliance AND tags:gdpr

Available fields: title, body, ocr, transcript, caption, metadata, tags.

Filters

Filters narrow results without affecting relevance scoring. Filters combine with logical AND; values within a single filter use logical OR.

| Filter | Type | Description |
| --- | --- | --- |
| mime_types | string array | Filter by file type (e.g., ["application/pdf", "image/png"]). |
| languages | string array | Filter by content language (e.g., ["en", "de"]). |
| created_at_gte | timestamp | Only include objects created on or after this date. |
| created_at_lte | timestamp | Only include objects created on or before this date. |
| tags_any | string array | Include objects matching any of the specified tags. |
| object_ids | string array | Restrict results to specific objects by ID. |
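The combination rule (AND across filters, OR within a filter) can be sketched as a predicate over a single object. The filter names come from the table above; the object field names (id, mime_type, language, created_at, tags) and the function name are assumptions made for illustration.

```python
from datetime import datetime, timezone


def passes_filters(obj, filters):
    """Return True when `obj` satisfies every filter that is present."""
    checks = []
    if "mime_types" in filters:  # OR within the list of values
        checks.append(obj["mime_type"] in filters["mime_types"])
    if "languages" in filters:
        checks.append(obj["language"] in filters["languages"])
    if "created_at_gte" in filters:
        checks.append(obj["created_at"] >= filters["created_at_gte"])
    if "created_at_lte" in filters:
        checks.append(obj["created_at"] <= filters["created_at_lte"])
    if "tags_any" in filters:
        checks.append(bool(set(obj["tags"]) & set(filters["tags_any"])))
    if "object_ids" in filters:
        checks.append(obj["id"] in filters["object_ids"])
    return all(checks)  # AND across filters


# Hypothetical indexed object and filter set.
obj = {
    "id": "obj_1",
    "mime_type": "application/pdf",
    "language": "en",
    "created_at": datetime(2024, 3, 1, tzinfo=timezone.utc),
    "tags": ["gdpr", "finance"],
}
filters = {
    "mime_types": ["application/pdf", "image/png"],
    "tags_any": ["gdpr", "hipaa"],
    "created_at_gte": datetime(2024, 1, 1, tzinfo=timezone.utc),
}
print(passes_filters(obj, filters))
```

Because the predicate only includes or excludes candidates, it leaves relevance scores untouched, consistent with the description above.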

Result structure

Each search result contains the information you need to use, cite, and attribute the retrieved content.

Snippets

A snippet is a highlighted text excerpt from the matched content. Snippets show the relevant portion of the indexed text with matched terms emphasized. They are generated from stored indexed content and do not require fetching the original file.

Citations

A citation provides a precise locator for the matched content within the original object. Citations include:

  • The object_id and dataset_id for attribution.
  • A locator that identifies the exact position, for example a page number in a PDF, a timestamp range in a transcript, or a chunk ordinal in a text document.
  • A display_text suitable for rendering in a citation list.

Scores

Each result includes a relevance score and a rank position. Scores are comparable within a single query but not across different queries. Results are ordered by descending score.

Billable results

Each result includes a billable flag indicating whether it counts toward metered usage. Only billable results generate retrieval charges. Duplicate results within the same query session are charged only once.
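The charging rule can be sketched as a count over results that skips non-billable items and duplicates. This is a rough illustration, not the metering logic itself: keying duplicates on object_id, the function name, and the shape of the result dictionaries are assumptions.

```python
def count_charges(results, already_charged=None):
    """Count newly chargeable results and return the updated charged set.

    `already_charged` holds IDs charged earlier in the same query session.
    Using object_id as the duplicate key is an assumption of this sketch.
    """
    charged = set(already_charged or ())
    count = 0
    for result in results:
        if not result["billable"]:
            continue  # non-billable results never generate charges
        if result["object_id"] in charged:
            continue  # duplicates are charged only once
        charged.add(result["object_id"])
        count += 1
    return count, charged


results = [
    {"object_id": "obj_1", "billable": True},
    {"object_id": "obj_2", "billable": False},
    {"object_id": "obj_1", "billable": True},
    {"object_id": "obj_3", "billable": True},
]
count, charged = count_charges(results)
print(count, sorted(charged))
```

Passing the returned set back in on a later call in the same session yields no additional charges for the same objects.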


FAQ

What is the maximum number of results I can request?

You can request up to 100 results per query using the top_k parameter. The default is 20. For larger result sets, use cursor-based pagination to page through additional results.
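A client-side loop for paging through larger result sets might look like the sketch below. Only top_k, its default of 20, and its limit of 100 come from this page. The cursor and next_cursor names, the response shape, and the search_page callable are hypothetical stand-ins for a real API client.

```python
def iter_results(search_page, query, top_k=20):
    """Yield results page by page until no cursor is returned."""
    if not 1 <= top_k <= 100:
        raise ValueError("top_k must be between 1 and 100")
    cursor = None
    while True:
        page = search_page(query=query, top_k=top_k, cursor=cursor)
        yield from page["results"]
        cursor = page.get("next_cursor")  # hypothetical field name
        if not cursor:
            break


def fake_search_page(query, top_k, cursor):
    """Stand-in for a real API call, serving 45 canned results."""
    data = [f"result-{i}" for i in range(45)]
    start = int(cursor or 0)
    end = start + top_k
    next_cursor = str(end) if end < len(data) else None
    return {"results": data[start:end], "next_cursor": next_cursor}


all_results = list(iter_results(fake_search_page, "revenue AND forecast"))
print(len(all_results))
```

With the default page size, the stand-in serves its 45 results in three pages of 20, 20, and 5.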

Can I search across datasets from multiple providers in a single query?

Yes. When you omit the dataset_ids parameter or pass an empty array, the search spans all datasets accessible to your tenant. Results from different providers are merged and ranked together.

Does the retrieval mode affect billing?

No. Billing is based on the number of billable results returned, not the retrieval mode used. Whether you use lexical, dense, hybrid, or auto, the metering is the same.

What happens if I search for a term that does not exist in any dataset?

The API returns a successful response with an empty results array. No retrieval events are created and nothing is billed.

Can I combine filters with boolean query syntax?

Yes. Filters and query syntax work together. The query syntax controls how terms are matched, while filters restrict the candidate set. For example, you can use a boolean query like revenue AND forecast with a filter for mime_types: ["application/pdf"] to search only PDF documents.