Datasets

A dataset is a collection of related data objects that you publish to the IPTO marketplace. Datasets are the primary unit of organization, discovery, and monetization.

What is a dataset?

A dataset groups related files -- documents, images, audio, video, or a mix -- into a single logical entity that buyers can discover and search. Each dataset belongs to exactly one tenant and has its own configuration for visibility, monetization, and pricing.

Examples of datasets:

A corpus of legal filings related to a specific jurisdiction.
A collection of product specification sheets.
An archive of scanned invoices and receipts.
A library of training videos with transcripts.

Source modality

Every dataset declares a source modality that describes the primary type of content it contains.

Modality	Description
`document`	Text-based files such as PDFs, Word documents, and plain text.
`image`	Image files such as JPEG, PNG, and TIFF. Includes scanned documents with OCR content.
`video`	Video files with optional transcripts and shot-level metadata.
`audio`	Audio files with optional transcripts.
`mixed`	A combination of multiple modalities within a single dataset.

The modality affects how objects in the dataset are processed, indexed, and presented in search results.
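As a rough illustration of choosing a modality before creating a dataset, the sketch below suggests a value from a list of file names. The extension mapping and the function name `suggest_source_modality` are assumptions made for this example, not part of the platform; the list of file types the platform actually supports is not specified on this page.

```python
from pathlib import PurePath

# Hypothetical extension-to-modality mapping, for illustration only.
EXTENSION_MODALITY = {
    ".pdf": "document", ".docx": "document", ".txt": "document",
    ".jpg": "image", ".jpeg": "image", ".png": "image", ".tiff": "image",
    ".mp4": "video", ".mov": "video",
    ".mp3": "audio", ".wav": "audio",
}


def suggest_source_modality(filenames):
    """Suggest a `source_modality` value for a list of file names."""
    modalities = set()
    for name in filenames:
        suffix = PurePath(name).suffix.lower()
        if suffix in EXTENSION_MODALITY:
            modalities.add(EXTENSION_MODALITY[suffix])
    if not modalities:
        raise ValueError("no recognized file types")
    # A single content type is declared as-is; several types become `mixed`.
    return modalities.pop() if len(modalities) == 1 else "mixed"
```

This is only a heuristic. A dataset of scanned documents, for example, may be better declared as `image` even though its content is textual.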

Monetization modes

The monetization mode controls the commercial posture of your dataset -- how it is priced and distributed to buyers.

Mode	Description
`open`	Lower retrieval pricing intended for broader access and higher volume. Suitable for datasets where wide distribution maximizes value.
`premium`	Higher retrieval pricing with tighter access expectations. Suitable for high-value or specialized datasets.
`outcome_share`	Lower upfront retrieval pricing paired with revenue sharing when buyer workflows create measurable downstream value.

Info: Monetization mode and pricing model are separate concepts. The monetization mode sets the commercial posture; the pricing model determines how prices evolve over time.

Pricing models

The pricing model determines how retrieval pricing changes over the life of the dataset.

Model	Description
`fixed`	Stable per-event pricing that remains constant until manually changed by the provider.
`time_decay`	Pricing declines as the dataset ages or reaches saturation. Useful for time-sensitive data.
`demand_curve`	Pricing adapts dynamically based on observed demand. This is the default for new datasets.

Visibility

The visibility setting controls who can discover and access your dataset in the marketplace.

Visibility	Description
`private`	Visible only within your own tenant. Not discoverable by buyers.
`listed`	Discoverable by all authorized buyers in the marketplace.
`restricted`	Discoverable in the catalog, but retrievable only after explicit approval. Buyers must request access and be granted permission.
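The three visibility values can be read as a small decision table. The sketch below is one interpretation, assuming that the viewer is already an authorized buyer, that the dataset is `active`, and that `restricted` datasets appear in the catalog but require approval before retrieval. The function name `catalog_access` and its parameters are hypothetical, not platform API.

```python
VISIBILITIES = {"private", "listed", "restricted"}


def catalog_access(visibility, same_tenant, approved=False):
    """Return (discoverable, retrievable) for a viewer of a dataset.

    Assumes the viewer is an authorized buyer and the dataset is active.
    """
    if visibility not in VISIBILITIES:
        raise ValueError(f"unknown visibility: {visibility!r}")
    if same_tenant:
        # The owning tenant can always see its own datasets.
        return (True, True)
    if visibility == "private":
        return (False, False)
    if visibility == "listed":
        return (True, True)
    # restricted: visible in the catalog, retrieval gated on approval.
    return (True, approved)
```

For example, `catalog_access("restricted", same_tenant=False)` reports the dataset as discoverable but not retrievable until `approved=True`.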

Dataset lifecycle

A dataset progresses through a series of statuses as it is created, populated, published, and eventually retired.

stateDiagram-v2
    [*] --> draft
    draft --> ingesting
    ingesting --> active
    active --> paused
    paused --> active
    active --> suspended
    active --> pending_deletion
    pending_deletion --> deleted
    deleted --> [*]

Status	Description
`draft`	The dataset has been created but is not yet ingesting or published.
`ingesting`	Objects are being processed and indexed. The dataset is not yet searchable.
`active`	The dataset is live and searchable in the marketplace (subject to visibility settings).
`paused`	The dataset has been temporarily paused by the provider. It is not searchable while paused.
`suspended`	The dataset has been suspended by the platform (e.g., for policy reasons).
`pending_deletion`	A deletion request has been submitted. The dataset enters a grace period.
`deleted`	The dataset and its objects have been permanently removed. This state is terminal.
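The diagram above can be expressed as a transition table, which is handy for client-side checks before calling the platform. This is a minimal sketch that mirrors only the transitions drawn in the diagram; the names `ALLOWED_TRANSITIONS`, `transition`, and `is_searchable` are hypothetical. The diagram shows no exit from `suspended`, so the sketch allows none, though the platform may support transitions that are not documented here.

```python
# Transitions exactly as drawn in the state diagram.
ALLOWED_TRANSITIONS = {
    "draft": {"ingesting"},
    "ingesting": {"active"},
    "active": {"paused", "suspended", "pending_deletion"},
    "paused": {"active"},
    "suspended": set(),   # no outgoing transition shown in the diagram
    "pending_deletion": {"deleted"},
    "deleted": set(),     # terminal state
}


def transition(current, target):
    """Return the new status, or raise ValueError if the move is not in the diagram."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown status: {current!r}")
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current!r} to {target!r}")
    return target


def is_searchable(status):
    """Only `active` is described as searchable in the status table."""
    return status == "active"
```

For example, `transition("paused", "active")` succeeds, while `transition("deleted", "active")` raises `ValueError`.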

Dataset fields reference

The table below summarizes the key fields on a dataset resource.

Field	Type	Description
`dataset_id`	string	Unique identifier for the dataset (e.g., `dset_...`).
`tenant_id`	string	The tenant that owns this dataset.
`name`	string	Human-readable name for the dataset.
`description`	string	A longer description of the dataset contents.
`source_modality`	string	The primary content type: `document`, `image`, `video`, `audio`, or `mixed`.
`monetization_mode`	string	Commercial posture: `open`, `premium`, or `outcome_share`.
`pricing_model`	string	Pricing evolution strategy: `fixed`, `time_decay`, or `demand_curve`.
`visibility`	string	Discovery scope: `private`, `listed`, or `restricted`.
`status`	string	Current lifecycle status of the dataset.
`created_at`	timestamp	When the dataset was created.
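To make the field table concrete, here is a sketch of a client-side model that validates the enumerated fields. The class name `Dataset` and the defaults for `visibility`, `status`, and `description` are assumptions for illustration; only the `demand_curve` default for `pricing_model` comes from this page.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

ALLOWED_VALUES = {
    "source_modality": {"document", "image", "video", "audio", "mixed"},
    "monetization_mode": {"open", "premium", "outcome_share"},
    "pricing_model": {"fixed", "time_decay", "demand_curve"},
    "visibility": {"private", "listed", "restricted"},
    "status": {"draft", "ingesting", "active", "paused",
               "suspended", "pending_deletion", "deleted"},
}


@dataclass
class Dataset:
    dataset_id: str
    tenant_id: str
    name: str
    source_modality: str
    monetization_mode: str
    description: str = ""
    pricing_model: str = "demand_curve"  # documented default for new datasets
    visibility: str = "private"          # assumed default, not documented
    status: str = "draft"                # assumed initial status
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self):
        # Reject values outside the enumerations listed in the field table.
        for field_name, allowed in ALLOWED_VALUES.items():
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(f"invalid {field_name}: {value!r}")
```

A usage example with placeholder identifiers:

```python
ds = Dataset(
    dataset_id="dset_example",   # placeholder, not a real identifier
    tenant_id="tenant_example",  # placeholder
    name="Legal filings",
    source_modality="document",
    monetization_mode="premium",
)
```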

FAQ

Can I change the monetization mode or pricing model after creating a dataset?

Yes. Both the monetization mode and pricing model can be updated on an active dataset. Changes take effect for future retrieval events; they do not retroactively affect already-metered usage.

What happens to buyer access when I pause a dataset?

A paused dataset is excluded from marketplace search results. Existing retrieval event references remain valid for billing purposes, but no new retrievals can occur until the dataset is resumed.

Can a dataset contain mixed file types even if the modality is set to `document`?

The modality is a declaration of the primary content type and influences processing behavior. While you can upload different file types, setting the correct modality ensures optimal extraction and indexing. Use `mixed` if your dataset genuinely spans multiple content types.

What is the difference between `private` and `restricted` visibility?

A `private` dataset is completely invisible to buyers -- it does not appear in catalog searches at all. A `restricted` dataset is discoverable in the catalog, but buyers must request and receive explicit approval before they can retrieve its contents.

Can I move a `deleted` dataset back to `active`?

No. Deletion is a terminal state. Once a dataset reaches `deleted`, it and all of its objects are permanently removed. If you need the data again, you must create a new dataset and re-upload the objects.