Search & Retrieval
Search is the core interaction in IPTO. Buyers submit queries, and the platform retrieves relevant content from across the datasets they have access to -- ranking, deduplicating, and metering results along the way.
Retrieval modes
IPTO supports four retrieval modes that control how your query is matched against indexed content.
| Mode | Description |
|---|---|
| `lexical` | Keyword-based search using BM25 ranking. Best for precise term matching, boolean queries, and structured query syntax. |
| `dense` | Vector similarity search that matches on semantic meaning rather than exact keywords. Best for natural language questions. |
| `hybrid` | Combines lexical and dense retrieval, merging results using Reciprocal Rank Fusion (RRF). Balances precision and recall. |
| `auto` | The system selects the best mode based on query characteristics. This is the default. |
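To make the hybrid merge step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. The function name `rrf_merge`, the document IDs, and the constant `k=60` (a commonly used default for RRF) are assumptions for illustration; the constant IPTO actually uses is not documented here.

```python
from collections import defaultdict

def rrf_merge(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores 1 / (k + rank) in every list it appears in,
    and the per-list scores are summed. k=60 is an assumed constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical result lists from the two retrievers.
lexical_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]

fused = rrf_merge([lexical_hits, dense_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that rank well in both lists (`doc_b`, `doc_a`) rise above documents that appear in only one, which is how fusion balances precision and recall.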
Tip
Use auto unless you have a specific reason to force a retrieval mode. The system analyzes your query structure -- boolean operators and wildcards favor lexical, while natural language questions favor hybrid.
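The heuristic described in this tip can be sketched roughly as follows. This is an illustrative approximation only: the function name `guess_mode` and the exact detection rules are assumptions, and the real `auto` selector may weigh other signals.

```python
import re

# Uppercase boolean operators as whole words, or any * wildcard.
STRUCTURED_QUERY = re.compile(r"\b(?:AND|OR|NOT)\b|\*")

def guess_mode(query: str) -> str:
    """Approximate the documented preference: structured syntax favors
    lexical retrieval, everything else is treated as natural language."""
    if STRUCTURED_QUERY.search(query):
        return "lexical"
    return "hybrid"

print(guess_mode("revenue AND forecast"))                 # lexical
print(guess_mode("invest*"))                              # lexical
print(guess_mode("what drove revenue growth last year"))  # hybrid
```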
Search flow
When you submit a search request, it passes through the following stages:
flowchart TD
A[Query submitted] --> B[Authenticate & resolve tenant]
B --> C[Resolve accessible datasets]
C --> D[Select retrieval mode]
D --> E[Execute search]
E --> F[Rank & deduplicate results]
F --> G[Build snippets & citations]
G --> H[Meter billable results]
H --> I[Return response]
Each stage enforces access controls, applies filters, and ensures that only authorized, active content is returned.
Query syntax
IPTO supports a structured query syntax that gives you fine-grained control over how your search is interpreted.
Boolean operators
Combine terms with boolean logic to narrow or broaden your results.
| Operator | Syntax | Example |
|---|---|---|
| AND | A AND B or A B | revenue AND forecast |
| OR | A OR B | CEO OR "chief executive" |
| NOT | NOT A or -A | contract NOT termination |
| Grouping | (A OR B) AND C | (apple OR google) AND lawsuit |
Operator precedence is NOT > AND > OR. Use parentheses to override precedence.
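To see how the precedence rule groups a query, the sketch below parses a query and prints it fully parenthesized. It is a small recursive-descent parser written for illustration, not IPTO's parser; the function name `parenthesize` is hypothetical, and it assumes well-formed input (no error handling for unbalanced parentheses).

```python
import re

# Quoted phrases, parentheses, or runs of non-space characters.
TOKEN = re.compile(r'"[^"]*"|\(|\)|[^\s()]+')

def parenthesize(query: str) -> str:
    """Return the query with explicit grouping, using NOT > AND > OR."""
    tokens = TOKEN.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():  # lowest precedence
        node = parse_and()
        while peek() == "OR":
            take()
            node = f"({node} OR {parse_and()})"
        return node

    def parse_and():  # explicit AND, or implicit AND between adjacent terms
        node = parse_not()
        while peek() not in (None, "OR", ")"):
            if peek() == "AND":
                take()
            node = f"({node} AND {parse_not()})"
        return node

    def parse_not():  # highest precedence; also handles the -A shorthand
        if peek() == "NOT":
            take()
            return f"(NOT {parse_not()})"
        tok = take()
        if tok == "(":
            node = parse_or()
            take()  # consume the closing ")"
            return node
        if tok.startswith("-") and len(tok) > 1:
            return f"(NOT {tok[1:]})"
        return tok

    return parse_or()

print(parenthesize("a OR b AND NOT c"))
# (a OR (b AND (NOT c)))
print(parenthesize("(apple OR google) AND lawsuit"))
# ((apple OR google) AND lawsuit)
```

Without the parentheses in the second query, `apple OR google AND lawsuit` would group as `(apple OR (google AND lawsuit))`, which is usually not what is intended.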
Phrase search
Match an exact sequence of words by enclosing them in double quotes.
Phrases are position-aware and require terms to appear adjacent and in order.
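The "adjacent and in order" requirement can be illustrated with a short sketch over a tokenized text. The function name `phrase_positions` and the whitespace tokenization are assumptions made for the example; the platform's own tokenizer is not described here.

```python
def phrase_positions(tokens, phrase):
    """Return the start index of every place where the phrase terms
    occur consecutively and in the same order."""
    n = len(phrase)
    if n == 0:
        return []
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

tokens = "the chief executive officer resigned".split()

print(phrase_positions(tokens, ["chief", "executive"]))  # [1]
print(phrase_positions(tokens, ["executive", "chief"]))  # []
print(phrase_positions(tokens, ["chief", "officer"]))    # []
```

The second call fails because the order is reversed, and the third because the terms are not adjacent.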
Proximity search
Find terms that appear near each other using the NEAR/n operator, where n is the maximum distance in word positions.
If no distance is specified, the default is NEAR/10.
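A rough sketch of the proximity check, assuming distance is measured as the difference between word positions and that term order does not matter. Both points are assumptions for illustration, as is the function name `near`; the documentation above does not specify them.

```python
def near(tokens, a, b, n=10):
    """True if some occurrence of a and some occurrence of b are at most
    n word positions apart. n defaults to 10, mirroring NEAR/10."""
    pos_a = [i for i, tok in enumerate(tokens) if tok == a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == b]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)

tokens = "the merger was announced after the acquisition closed".split()

# "merger" is at position 1 and "acquisition" at position 6: distance 5.
print(near(tokens, "merger", "acquisition", n=5))  # True
print(near(tokens, "merger", "acquisition", n=4))  # False
print(near(tokens, "merger", "acquisition"))       # True (default n=10)
```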
Wildcard search
Use * as a wildcard for prefix, suffix, or infix matching.
| Pattern | Matches |
|---|---|
| `invest*` | invest, investor, investment, investing |
| `*ization` | organization, optimization, monetization |
| `c*o` | CEO, CFO, CTO |
Note
Prefix wildcards require at least 3 characters before the *. Suffix and infix wildcards are more computationally expensive -- use them sparingly.
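The wildcard patterns and the three-character prefix rule can be approximated with regular expressions. In this sketch, `*` is assumed to stand for zero or more word characters and matching is assumed to be case-insensitive (suggested by `c*o` matching CEO in the table). The helper names are hypothetical.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    # Escape the literal pieces, then let each * match zero or more word characters.
    pieces = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile(r"\w*".join(pieces), re.IGNORECASE)

def wildcard_match(pattern: str, term: str) -> bool:
    """True if the whole term matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None

def prefix_long_enough(pattern: str, min_chars: int = 3) -> bool:
    """Check the documented minimum for prefix wildcards such as invest*.
    Patterns that are not a single trailing * are not restricted here."""
    if pattern.count("*") == 1 and pattern.endswith("*"):
        return len(pattern) - 1 >= min_chars
    return True

print(wildcard_match("invest*", "investment"))     # True
print(wildcard_match("*ization", "optimization"))  # True
print(wildcard_match("c*o", "CEO"))                # True
print(wildcard_match("invest*", "reinvest"))       # False
print(prefix_long_enough("in*"))                   # False
```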
Field-scoped search
Target specific fields in the indexed content.
Available fields: title, body, ocr, transcript, caption, metadata, tags.
Filters
Filters narrow results without affecting relevance scoring. Filters combine with logical AND; values within a single filter use logical OR.
| Filter | Type | Description |
|---|---|---|
| `mime_types` | string array | Filter by file type (e.g., `["application/pdf", "image/png"]`). |
| `languages` | string array | Filter by content language (e.g., `["en", "de"]`). |
| `created_at_gte` | timestamp | Only include objects created on or after this date. |
| `created_at_lte` | timestamp | Only include objects created on or before this date. |
| `tags_any` | string array | Include objects matching any of the specified tags. |
| `object_ids` | string array | Restrict results to specific objects by ID. |
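The combination rule (AND across filters, OR within a filter's values) can be illustrated with a local sketch that evaluates filters against a single object. The filter names come from the table above; the object field names (`mime_type`, `language`, `created_at`, `tags`) and the function `matches_filters` are assumptions for illustration and do not describe the actual index schema.

```python
from datetime import datetime, timezone

def matches_filters(obj, filters):
    """Every supplied filter must pass (AND); within a list-valued
    filter, matching any one value is enough (OR)."""
    checks = []
    if "mime_types" in filters:
        checks.append(obj["mime_type"] in filters["mime_types"])
    if "languages" in filters:
        checks.append(obj["language"] in filters["languages"])
    if "created_at_gte" in filters:
        checks.append(obj["created_at"] >= filters["created_at_gte"])
    if "created_at_lte" in filters:
        checks.append(obj["created_at"] <= filters["created_at_lte"])
    if "tags_any" in filters:
        checks.append(bool(set(obj["tags"]) & set(filters["tags_any"])))
    if "object_ids" in filters:
        checks.append(obj["object_id"] in filters["object_ids"])
    return all(checks)

# Hypothetical indexed object.
doc = {
    "object_id": "obj_1",
    "mime_type": "application/pdf",
    "language": "en",
    "created_at": datetime(2024, 3, 1, tzinfo=timezone.utc),
    "tags": ["finance", "q1"],
}

filters = {
    "mime_types": ["application/pdf", "image/png"],  # pdf OR png
    "tags_any": ["finance", "legal"],                # finance OR legal
    "created_at_gte": datetime(2024, 1, 1, tzinfo=timezone.utc),
}

print(matches_filters(doc, filters))                         # True
print(matches_filters(doc, {**filters, "languages": ["de"]}))  # False
```

Adding the `languages` filter makes the second call fail because all filters must pass, even though the other three still match.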
Result structure
Each search result contains the information you need to use, cite, and attribute the retrieved content.
Snippets
A snippet is a highlighted text excerpt from the matched content. Snippets show the relevant portion of the indexed text with matched terms emphasized. They are generated from stored indexed content and do not require fetching the original file.
Citations
A citation provides a precise locator for the matched content within the original object. Citations include:
- The `object_id` and `dataset_id` for attribution.
- A `locator` that identifies the exact position -- for example, a page number in a PDF, a timestamp range in a transcript, or a chunk ordinal in a text document.
- A `display_text` suitable for rendering in a citation list.
Scores
Each result includes a relevance score and a rank position. Scores are comparable within a single query but not across different queries. Results are ordered by descending score.
Billable results
Each result includes a billable flag indicating whether it counts toward metered usage. Only billable results generate retrieval charges. Duplicate results within the same query session are charged only once.
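The charging rule can be sketched as a small counter that tracks what has already been charged in a session. This is illustrative only: it assumes duplicates are identified by `object_id`, which the documentation does not confirm, and the function name `count_new_charges` is hypothetical.

```python
def count_new_charges(results, already_charged):
    """Count results that would generate a new retrieval charge.

    already_charged is a set of object IDs charged earlier in the same
    query session; it is updated in place.
    """
    charged = 0
    for result in results:
        if not result["billable"]:
            continue  # non-billable results never generate a charge
        if result["object_id"] in already_charged:
            continue  # duplicate within the session: charged only once
        already_charged.add(result["object_id"])
        charged += 1
    return charged

session = set()
first_query = [
    {"object_id": "obj_a", "billable": True},
    {"object_id": "obj_b", "billable": False},
    {"object_id": "obj_a", "billable": True},  # duplicate of the first result
]
second_query = [
    {"object_id": "obj_a", "billable": True},  # already charged this session
    {"object_id": "obj_c", "billable": True},
]

print(count_new_charges(first_query, session))   # 1
print(count_new_charges(second_query, session))  # 1
```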
FAQ
What is the maximum number of results I can request?
You can request up to 100 results per query using the top_k parameter. The default is 20. For larger result sets, use cursor-based pagination to page through additional results.
Can I search across datasets from multiple providers in a single query?
Yes. When you omit the dataset_ids parameter or pass an empty array, the search spans all datasets accessible to your tenant. Results from different providers are merged and ranked together.
Does the retrieval mode affect billing?
No. Billing is based on the number of billable results returned, not the retrieval mode used. Whether you use lexical, dense, hybrid, or auto, the metering is the same.
What happens if I search for a term that does not exist in any dataset?
The API returns a successful response with an empty results array. No retrieval events are created and nothing is billed.
Can I combine filters with boolean query syntax?
Yes. Filters and query syntax work together. The query syntax controls how terms are matched, while filters restrict the candidate set. For example, you can use a boolean query like revenue AND forecast with a filter for mime_types: ["application/pdf"] to search only PDF documents.