Datasets

A dataset is a collection of related data objects that you publish to the IPTO marketplace. Datasets are the primary unit of organization, discovery, and monetization.

What is a dataset?

A dataset groups related files -- documents, images, audio, video, or a mix -- into a single logical entity that buyers can discover and search. Each dataset belongs to exactly one tenant and has its own configuration for visibility, monetization, and pricing.

Examples of datasets:

  • A corpus of legal filings related to a specific jurisdiction.
  • A collection of product specification sheets.
  • An archive of scanned invoices and receipts.
  • A library of training videos with transcripts.

Source modality

Every dataset declares a source modality that describes the primary type of content it contains.

| Modality | Description |
| --- | --- |
| document | Text-based files such as PDFs, Word documents, and plain text. |
| image | Image files such as JPEG, PNG, and TIFF. Includes scanned documents with OCR content. |
| video | Video files with optional transcripts and shot-level metadata. |
| audio | Audio files with optional transcripts. |
| mixed | A combination of multiple modalities within a single dataset. |

The modality affects how objects in the dataset are processed, indexed, and presented in search results.
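
As a rough illustration of choosing a modality before creating a dataset, the sketch below suggests a value from a batch of file names. The helper `suggest_modality` and its extension map are hypothetical: the document and image extensions follow the file types named in the table above, while the video and audio extensions are assumptions. The platform's actual list of accepted formats may differ.

```python
from pathlib import PurePath

# Hypothetical extension map. Document and image entries follow the table above;
# the video and audio entries are assumptions added for illustration.
EXTENSION_TO_MODALITY = {
    ".pdf": "document", ".doc": "document", ".docx": "document", ".txt": "document",
    ".jpg": "image", ".jpeg": "image", ".png": "image", ".tif": "image", ".tiff": "image",
    ".mp4": "video",
    ".mp3": "audio", ".wav": "audio",
}


def suggest_modality(filenames):
    """Suggest a source modality for a batch of file names (illustrative only)."""
    seen = set()
    for name in filenames:
        suffix = PurePath(name).suffix.lower()
        modality = EXTENSION_TO_MODALITY.get(suffix)
        if modality is None:
            raise ValueError(f"unrecognized file type: {name}")
        seen.add(modality)
    if not seen:
        raise ValueError("no files given")
    # A single content type is declared as-is; several types fall back to "mixed".
    return seen.pop() if len(seen) == 1 else "mixed"
```

This rule is deliberately strict. Because the modality describes the primary content type, a provider with mostly documents and a handful of images might reasonably still declare document.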

Monetization modes

The monetization mode controls the commercial posture of your dataset -- how it is priced and distributed to buyers.

| Mode | Description |
| --- | --- |
| open | Lower retrieval pricing intended for broader access and higher volume. Suitable for datasets where wide distribution maximizes value. |
| premium | Higher retrieval pricing with tighter access expectations. Suitable for high-value or specialized datasets. |
| outcome_share | Lower upfront retrieval pricing paired with revenue sharing when buyer workflows create measurable downstream value. |

Note: Monetization mode and pricing model are separate concepts. The monetization mode sets the commercial posture; the pricing model determines how prices evolve over time.

Pricing models

The pricing model determines how retrieval pricing changes over the life of the dataset.

| Model | Description |
| --- | --- |
| fixed | Stable per-event pricing that remains constant until manually changed by the provider. |
| time_decay | Pricing declines as the dataset ages or reaches saturation. Useful for time-sensitive data. |
| demand_curve | Pricing adapts dynamically based on observed demand. This is the default for new datasets. |
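
Because the monetization mode and the pricing model are separate settings, it can help to model them as two independent enumerations. The sketch below is one way to do that on the client side. The class names (`MonetizationMode`, `PricingModel`, `CommercialConfig`) are hypothetical; only the string values and the demand_curve default come from this page.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class MonetizationMode(str, Enum):
    OPEN = "open"
    PREMIUM = "premium"
    OUTCOME_SHARE = "outcome_share"


class PricingModel(str, Enum):
    FIXED = "fixed"
    TIME_DECAY = "time_decay"
    DEMAND_CURVE = "demand_curve"


@dataclass(frozen=True)
class CommercialConfig:  # hypothetical container, not a platform type
    monetization_mode: MonetizationMode
    # demand_curve is described above as the default for new datasets.
    pricing_model: PricingModel = PricingModel.DEMAND_CURVE


# The two settings vary independently, so this enumerates all 3 x 3 pairings.
ALL_COMBINATIONS = [
    CommercialConfig(mode, model)
    for mode, model in product(MonetizationMode, PricingModel)
]
```

Nothing on this page rules out any particular pairing, though the platform may impose constraints that are not described here.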

Visibility

The visibility setting controls who can discover and access your dataset in the marketplace.

| Visibility | Description |
| --- | --- |
| private | Visible only within your own tenant. Not discoverable by buyers. |
| listed | Discoverable by all authorized buyers in the marketplace. |
| restricted | Discoverable in the catalog, but buyers must request access and be granted explicit approval before they can retrieve its contents. |

Dataset lifecycle

A dataset progresses through a series of statuses as it is created, populated, published, and eventually retired.

stateDiagram-v2
    [*] --> draft
    draft --> ingesting
    ingesting --> active
    active --> paused
    paused --> active
    active --> suspended
    active --> pending_deletion
    pending_deletion --> deleted
    deleted --> [*]

| Status | Description |
| --- | --- |
| draft | The dataset has been created but is not yet ingesting or published. |
| ingesting | Objects are being processed and indexed. The dataset is not yet searchable. |
| active | The dataset is live and searchable in the marketplace (subject to visibility settings). |
| paused | The dataset has been temporarily paused by the provider. It is not searchable while paused. |
| suspended | The dataset has been suspended by the platform (e.g., for policy reasons). |
| pending_deletion | A deletion request has been submitted. The dataset enters a grace period. |
| deleted | The dataset and its objects have been permanently removed. This state is terminal. |
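
The transitions in the state diagram can be captured in a small lookup table, which is handy for client-side checks before requesting a status change. This is a minimal sketch: the names `ALLOWED_TRANSITIONS` and `can_transition` are hypothetical, and the table mirrors only the arrows shown in the diagram above.

```python
# Allowed status transitions, copied from the arrows in the state diagram.
ALLOWED_TRANSITIONS = {
    "draft": {"ingesting"},
    "ingesting": {"active"},
    "active": {"paused", "suspended", "pending_deletion"},
    "paused": {"active"},
    # The diagram shows no arrow leaving "suspended"; the platform may
    # define transitions that this page does not list.
    "suspended": set(),
    "pending_deletion": {"deleted"},
    "deleted": set(),  # terminal state
}


def can_transition(current, target):
    """Return True if the diagram has an arrow from `current` to `target`."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown status: {current}")
    return target in ALLOWED_TRANSITIONS[current]
```

For example, `can_transition("paused", "active")` is true, while `can_transition("deleted", "active")` is false because deleted is terminal.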

Dataset fields reference

The table below summarizes the key fields on a dataset resource.

| Field | Type | Description |
| --- | --- | --- |
| dataset_id | string | Unique identifier for the dataset (e.g., dset_...). |
| tenant_id | string | The tenant that owns this dataset. |
| name | string | Human-readable name for the dataset. |
| description | string | A longer description of the dataset contents. |
| source_modality | string | The primary content type: document, image, video, audio, or mixed. |
| monetization_mode | string | Commercial posture: open, premium, or outcome_share. |
| pricing_model | string | Pricing evolution strategy: fixed, time_decay, or demand_curve. |
| visibility | string | Discovery scope: private, listed, or restricted. |
| status | string | Current lifecycle status of the dataset. |
| created_at | timestamp | When the dataset was created. |
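
Several of these fields accept only a fixed set of values, so a dataset record can be sanity-checked before it is sent anywhere. The sketch below assumes the record is a plain dictionary keyed by the field names above. The function `find_invalid_fields` and the example record (including its identifiers and timestamp format) are hypothetical.

```python
from typing import Dict, List

# Allowed values for the enumerated fields, taken from the tables on this page.
ALLOWED_VALUES = {
    "source_modality": {"document", "image", "video", "audio", "mixed"},
    "monetization_mode": {"open", "premium", "outcome_share"},
    "pricing_model": {"fixed", "time_decay", "demand_curve"},
    "visibility": {"private", "listed", "restricted"},
    "status": {
        "draft", "ingesting", "active", "paused",
        "suspended", "pending_deletion", "deleted",
    },
}


def find_invalid_fields(record: Dict[str, object]) -> List[str]:
    """Return one message per enumerated field that holds an unknown value.

    Absent fields are skipped, because this page does not say which
    fields are required.
    """
    problems = []
    for field, allowed in sorted(ALLOWED_VALUES.items()):
        if field not in record:
            continue
        value = record[field]
        if not isinstance(value, str) or value not in allowed:
            problems.append(f"{field}: unexpected value {value!r}")
    return problems


# Hypothetical record; the identifiers and timestamp format are made up.
example = {
    "dataset_id": "dset_example",
    "tenant_id": "tenant_example",
    "name": "Scanned invoices",
    "description": "An archive of scanned invoices and receipts.",
    "source_modality": "image",
    "monetization_mode": "open",
    "pricing_model": "demand_curve",
    "visibility": "listed",
    "status": "draft",
    "created_at": "2024-01-01T00:00:00Z",
}
```

Calling `find_invalid_fields(example)` returns an empty list, while a record with `"visibility": "public"` yields one message naming that field.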

FAQ

Can I change the monetization mode or pricing model after creating a dataset?

Yes. Both the monetization mode and pricing model can be updated on an active dataset. Changes take effect for future retrieval events; they do not retroactively affect already-metered usage.
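
One way to picture "future events only" is a history of settings keyed by the time each change took effect, where every retrieval event is matched against the settings in force at its own timestamp. The sketch below is a conceptual model, not a description of how the platform stores pricing. The class `PricingHistory` and its methods are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timezone


class PricingHistory:
    """Hypothetical record of (monetization_mode, pricing_model) over time."""

    def __init__(self, created_at, monetization_mode, pricing_model):
        self._effective_from = [created_at]
        self._settings = [(monetization_mode, pricing_model)]

    def update(self, effective_from, monetization_mode, pricing_model):
        # Changes are appended in chronological order; earlier entries are never edited.
        if effective_from <= self._effective_from[-1]:
            raise ValueError("updates must be later than the previous change")
        self._effective_from.append(effective_from)
        self._settings.append((monetization_mode, pricing_model))

    def settings_at(self, event_time):
        # Find the most recent change at or before the event.
        index = bisect_right(self._effective_from, event_time) - 1
        if index < 0:
            raise ValueError("event predates the dataset")
        return self._settings[index]


# Illustrative dates only.
history = PricingHistory(
    datetime(2024, 1, 1, tzinfo=timezone.utc), "open", "demand_curve"
)
history.update(datetime(2024, 6, 1, tzinfo=timezone.utc), "premium", "fixed")
```

In this model, an event metered in March 2024 still resolves to the open and demand_curve settings after the June update, which matches the non-retroactive behavior described above.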

What happens to buyer access when I pause a dataset?

A paused dataset is excluded from marketplace search results. Existing retrieval event references remain valid for billing purposes, but no new retrievals can occur until the dataset is resumed.

Can a dataset contain mixed file types even if the modality is set to document?

Yes. The modality declares the primary content type and influences processing behavior, so you can still upload other file types. Setting the modality that matches most of your content helps extraction and indexing work as intended. Use mixed if your dataset genuinely spans multiple content types.

What is the difference between private and restricted visibility?

A private dataset is completely invisible to buyers -- it does not appear in catalog searches at all. A restricted dataset is discoverable in the catalog, but buyers must request and receive explicit approval before they can retrieve its contents.
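
The distinction between discovery and retrieval can be sketched as two separate checks. This sketch makes several assumptions that this page does not confirm: buyers are identified by a tenant ID, the owning tenant can always see and retrieve its own dataset, approvals are tracked as a set of tenant IDs, and any buyer who can discover a listed dataset can also retrieve from it. The function names are hypothetical, and the check for whether a buyer is authorized in the marketplace is omitted.

```python
VISIBILITIES = {"private", "listed", "restricted"}


def can_discover(visibility, owner_tenant, viewer_tenant):
    """Return True if the viewer can see the dataset in the catalog."""
    if visibility not in VISIBILITIES:
        raise ValueError(f"unknown visibility: {visibility}")
    if viewer_tenant == owner_tenant:
        return True  # assumption: owners always see their own datasets
    # Private datasets never appear to buyers; listed and restricted ones do.
    return visibility in {"listed", "restricted"}


def can_retrieve(visibility, owner_tenant, viewer_tenant, approved_tenants=frozenset()):
    """Return True if the viewer can retrieve the dataset's contents."""
    if visibility not in VISIBILITIES:
        raise ValueError(f"unknown visibility: {visibility}")
    if viewer_tenant == owner_tenant:
        return True  # assumption: owners can always retrieve their own data
    if visibility == "listed":
        return True  # assumption: no per-dataset approval step for listed datasets
    if visibility == "restricted":
        return viewer_tenant in approved_tenants
    return False  # private
```

A restricted dataset is therefore the only case where the two checks disagree: a buyer can discover it but cannot retrieve from it until approved.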

Can I move a deleted dataset back to active?

No. Deletion is a terminal state. Once a dataset reaches deleted, it and all of its objects are permanently removed. If you need the data again, you must create a new dataset and re-upload the objects.