# How to use Datumaro
As a standalone tool or a Python module:
``` bash
datum --help
python -m datumaro --help
python datumaro/ --help
python datum.py --help
```
As a Python library:
``` python
from datumaro.components.project import Project
from datumaro.components.dataset import Dataset
from datumaro.components.extractor import Label, Bbox, DatasetItem
...
dataset = Dataset.import_from(path, format)
...
```
### Glossary
- Basic concepts:
  - Dataset - A collection of dataset items, which consist of media and
    associated annotations.
  - Dataset item - A basic single element of the dataset. Also known as
    "sample" or "entry". In different datasets it can be an image, a video
    frame, a whole video, a 3D point cloud etc. It typically has corresponding
    annotations.
  - (Datumaro) Project - A combination of multiple datasets, plugins,
    models and metadata.
- Project versioning concepts:
  - Data source - A link to a dataset or a copy of a dataset inside a project.
    Basically, a URL + dataset format name.
  - Project revision - A commit or a reference from Git (branch, tag,
    HEAD~3 etc.). A revision is referenced by data hash. The `HEAD`
    revision is the currently selected revision of the project.
  - Revision tree - A project build tree and plugins at
    a specified revision.
  - Working tree - The revision tree in the working directory of a project.
  - Data source revision - A state of a data source at a specific stage.
    A revision is referenced by the data hash.
  - Object - The data of a revision tree or a data source revision.
    An object is referenced by the data hash.
- Dataset path concepts:
  - Dataset revpath - A path to a dataset in a special format. Revpaths are
    intended to specify paths to files, directories or data source revisions
    in a uniform way in the CLI.
    - Dataset path - A path to a dataset in the following format:
      `<dataset path>:<format>`
      - `format` is optional. If not specified, Datumaro will try to detect
        it automatically.
    - **Rev**ision path - A path to a data source revision in a project.
      The syntax is:
      `<project path>@<revision>:<target name>`, any part can be omitted.
      - Default project is the current project (`-p`/`--project` CLI argument).
        Local revpaths imply that the current project is used, so this part
        should be omitted.
      - Default revision is the working tree of the project.
      - Default build target is `project`.

      If a path refers to `project` (i.e. the target name is not set, or
      this target is specified explicitly), the target dataset is the result of
      [joining](/docs/developer_manual/#merging) all the project data sources.
      Otherwise, if the path refers to a data source revision, the
      corresponding stage from the revision build tree will be used.
- Dataset building concepts:
  - Stage - A revision of a dataset - the original dataset or its modification
    after transformation, filtration or something else. A build tree node.
    A stage is referred to by a name.
  - Build tree - A directed graph (tree) with root nodes at data sources
    and a single top node called `project`, which represents
    a [joined](/docs/developer_manual/#merging) dataset.
    Each data source has a starting `root` node, which corresponds to the
    original dataset. The internal graph nodes are stages.
  - Build target - A data source or a stage name. Data source names correspond
    to the last stages of data sources.
  - Pipeline - A subgraph of a stage, which includes all the ancestors.
- Other:
  - Transform - A transformation operation over dataset elements. Examples
    are image renaming, image flipping, image and subset renaming,
    label remapping etc. Corresponds to the [`transform` command](/docs/user-manual/command-reference/transform).
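
To make the revision path defaults from the glossary concrete, here is a minimal sketch of how such a path could be split into its parts. It assumes the syntax `<project path>@<revision>:<target name>`. The names `RevPath` and `parse_revpath` are hypothetical and are not part of the Datumaro API. The rule for a bare token (treated as a target name) is a simplification chosen for this example; the real CLI has to resolve such cases against the project contents.

```python
from typing import NamedTuple, Optional


class RevPath(NamedTuple):
    # Hypothetical container, not a Datumaro class.
    project: Optional[str]   # None means "the current project"
    revision: Optional[str]  # None means "the working tree"
    target: str              # falls back to the "project" build target


def parse_revpath(text: str) -> RevPath:
    """Split '<project path>@<revision>:<target name>' (assumed syntax)."""
    # Everything before the last "@" is the project path, if present.
    project, at, rest = text.rpartition("@")

    if ":" in rest:
        revision, _, target = rest.partition(":")
    elif at:
        # "proj@rev": only a revision follows the project path.
        revision, target = rest, ""
    else:
        # Simplification: a bare token is treated as a target name.
        revision, target = "", rest

    # Omitted parts fall back to the defaults listed in the glossary.
    return RevPath(project or None, revision or None, target or "project")


print(parse_revpath("HEAD~1:source1"))
print(parse_revpath("/work/proj@v1"))
print(parse_revpath(""))
```

Every omitted part maps to its documented default: the current project, the working tree, and the `project` build target.
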
### Command-line workflow
In Datumaro, most command-line commands operate on projects, but there are
also a few commands that operate on datasets directly. There are two basic ways
to use Datumaro from the command line:
- Use the [`convert`](/docs/user-manual/command-reference/convert), [`diff`](/docs/user-manual/command-reference/diff), [`merge`](/docs/user-manual/command-reference/merge)
commands directly on existing datasets
- Create a Datumaro project and operate on it:
  - Create an empty project with [`create`](/docs/user-manual/command-reference/create)
  - Import existing datasets with [`import`](/docs/user-manual/command-reference/sources#source-import)
  - Modify the project with [`transform`](/docs/user-manual/command-reference/transform) and [`filter`](/docs/user-manual/command-reference/filter)
  - Create new revisions of the project with
    [`commit`](/docs/user-manual/command-reference/commit), navigate over
    them using [`checkout`](/docs/user-manual/command-reference/checkout),
    compare with [`diff`](/docs/user-manual/command-reference/diff), compute
    statistics with [`stats`](/docs/user-manual/command-reference/stats)
  - Export the resulting dataset with [`export`](/docs/user-manual/command-reference/export)
  - Check project config with [`project info`](/docs/user-manual/command-reference/projects/#print-project-info)
Basically, a project is a combination of datasets, models and environment.
A project can contain an arbitrary number of datasets ([data sources](/docs/user-manual/how_to_use_datumaro#data-sources)).
A project acts as a manager for them and allows manipulating them
separately or as a whole, in which case it combines dataset items
from all the sources into one composite dataset. You can manage separate
datasets in a project with the commands in the [`datum source`](/docs/user-manual/command-reference/sources)
command-line context.
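
As an illustration of what "one composite dataset" can mean, the sketch below joins items from several sources using plain Python dictionaries. It is a simplified model and not Datumaro's merging code: the item layout (`id`, `subset`, `annotations`), the `join_sources` helper, and the rule that items with the same subset and id are treated as one item are all assumptions made for this example.

```python
from copy import deepcopy


def join_sources(sources):
    """Combine the items of all sources, keyed by (subset, id)."""
    merged = {}
    for items in sources.values():
        for item in items:
            key = (item.get("subset", "default"), item["id"])
            if key not in merged:
                # First occurrence: copy so the source data stays untouched.
                merged[key] = deepcopy(item)
                continue
            # Item already seen in another source: add only new annotations.
            for ann in item["annotations"]:
                if ann not in merged[key]["annotations"]:
                    merged[key]["annotations"].append(ann)
    return list(merged.values())


# Hypothetical sources with string labels standing in for annotations.
sources = {
    "source1": [{"id": "img1", "subset": "train", "annotations": ["cat"]}],
    "source2": [
        {"id": "img1", "subset": "train", "annotations": ["cat", "dog"]},
        {"id": "img2", "subset": "train", "annotations": []},
    ],
}

composite = join_sources(sources)
for item in composite:
    print(item["id"], item["annotations"])
```

Working on a single source corresponds to reading one entry of `sources`; working on the project as a whole corresponds to the joined result.
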
Note that **modifying operations** (`transform`, `filter`, `patch`)
**are applied in-place** to the datasets by default.
If you want to interact with models, you need to add them to the project
first using the [`model add`](/docs/user-manual/command-reference/models/#register-model) command.
A typical way to obtain Datumaro projects is to export tasks from the
[CVAT](https://github.com/openvinotoolkit/cvat) UI.
### Project data model

Datumaro tries to combine a "Git for datasets" and a build system like
make or CMake for datasets in a single solution. Currently, `Project`
represents a Version Control System for datasets, which is based on Git and DVC
projects. Each project `Revision` describes a build tree of a dataset
with all the related metadata. A build tree consists of a number of data
sources and transformation stages. Each data source has its own set of build
steps (stages). By default, Datumaro assumes that datasets are copied into the
project and modified in-place. Modifying operations are recorded in the project, so any of the
dataset revisions can be reproduced when needed. Multiple dataset versions can
be stored in different branches with the common data shared.
Let's consider an example of a build tree:

There are two data sources in the example project. The resulting dataset
is obtained by simply merging (joining) the results of the input datasets.
"Source 1" and "Source 2" are the names of data sources in the project. Each
source has several stages with their own names. The first stage (called "root")
represents the original contents of a data source - the data at the
user-provided URL. The following stages represent operations that need to
be applied to the data source to prepare the resulting dataset.
Roughly, such a build tree can be created by the following commands (arguments
are omitted for simplicity):
``` bash
datum create
# describe the first source
datum import <...> -n source1
datum filter <...> source1
datum transform <...> source1
datum transform <...> source1
# describe the second source
datum import <...> -n source2
datum model add <...>
datum transform <...> source2
datum transform <...> source2
```
Now, the resulting dataset can be built with:
``` bash
datum export <...>
```
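
The build tree produced by the commands above can be modelled as a small graph to show what a pipeline is. The sketch below is only an illustration: the stage names (such as `source1.filter`) and the `pipeline` helper are hypothetical and do not reflect how Datumaro names or stores stages internally.

```python
# Each node maps to the list of its parent nodes (hypothetical stage names).
build_tree = {
    "source1.root": [],
    "source1.filter": ["source1.root"],
    "source1.transform1": ["source1.filter"],
    "source1.transform2": ["source1.transform1"],
    "source2.root": [],
    "source2.transform1": ["source2.root"],
    "source2.transform2": ["source2.transform1"],
    # The single top node joins the last stages of all data sources.
    "project": ["source1.transform2", "source2.transform2"],
}


def pipeline(tree, target):
    """Return the target and all of its ancestors, parents first."""
    ordered = []
    seen = set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for parent in tree[node]:
            visit(parent)
        ordered.append(node)

    visit(target)
    return ordered


# Stages needed to rebuild an intermediate stage of the first source.
print(pipeline(build_tree, "source1.transform1"))
# Stages needed to build the whole project.
print(pipeline(build_tree, "project"))
```

Building a single stage only requires its own chain of ancestors, while building `project` requires every stage of every source.
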
### Project layout
``` bash
project/
├── .dvc/
├── .dvcignore
├── .git/
├── .gitignore
├── .datumaro/
│ ├── cache/ # object cache
│ │ └── <2 leading symbols of obj hash>/
│ │ └── /
│ │ └──