# Transform Dataset

Often datasets need to be modified during preparation for model training and
experimenting. In trivial cases it can be done manually - e.g. image renaming
or label renaming. However, in more complex cases even simple modifications
can require too much effort, distracting the user from the real work.
Datumaro provides the `datum transform` command to help in such cases.

This command modifies dataset images or annotations all at once.

> This command is designed for batch dataset processing, so if you only
> need to modify a few elements of a dataset, you might want to use
> other approaches for better performance. A possible solution can be
> a simple script, which uses [Datumaro API](/docs/developer_manual/).

The command can be applied to a dataset or a project build target,
a stage or the combined `project` target, in which case all the project
targets will be affected. A build tree stage will be recorded
if `--stage` is enabled, and the resulting dataset(s) will be
saved if `--apply` is enabled.

By default, datasets are updated in-place. The `-o/--output-dir`
option can be used to specify another output directory. When
updating in-place, use the `--overwrite` parameter (in-place
updates fail by default to prevent data loss), unless a project
target is modified.

The current project (`-p/--project`) is also used as a context for plugins,
so it can be useful for dataset paths having custom formats. When not
specified, the current project's working tree is used.

Usage:

``` bash
datum transform [-h] -t TRANSFORM [-o DST_DIR] [--overwrite]
  [-p PROJECT_DIR] [--stage STAGE] [--apply APPLY] [target] [-- EXTRA_ARGS]
```

Parameters:
- `<target>` (string) - Target
  [dataset revpath](/docs/user-manual/how_to_use_datumaro/#revpath).
  By default, transforms all targets of the current project.
- `-t, --transform` (string) - Transform method name
- `--stage` (bool) - Include this action as a project build step.
  If true, this operation will be saved in the project
  build tree, so that the resulting dataset can be reproduced later.
  Applicable only to main project targets (i.e. data sources
  and the `project` target, but not intermediate stages). Enabled by default.
- `--apply` (bool) - Run this command immediately. If disabled, only the
  build tree stage will be written. Enabled by default.
- `-o, --output-dir` (string) - Output directory. Can be omitted for
  main project targets (i.e. data sources and the `project` target, but not
  intermediate stages) and dataset targets. If not specified, the results
  will be saved in-place.
- `--overwrite` - Allows overwriting existing files in the output directory,
  when it is specified and is not empty.
- `-p, --project` (string) - Directory of the project to operate on
  (default: current directory).
- `-h, --help` - Print the help message and exit.
- `<extra args>` - The list of extra transformation parameters. Should be
  passed after the `--` separator after the main command arguments.
  See transform descriptions for info about extra parameters.
  Use the `--help` option to print parameter info.
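
The difference between writing to a separate directory and updating a dataset in-place can be sketched with two small wrapper functions. This is a minimal sketch that uses only the options listed above; the function names and the `DATUM` variable are hypothetical helpers for illustration and are not part of Datumaro.

```bash
#!/usr/bin/env bash
# Hypothetical wrappers around `datum transform`; only options documented
# above are used. Set DATUM=echo to print the arguments instead of running.
DATUM="${DATUM:-datum}"

# Write the transformed dataset to a separate directory, leaving the
# original dataset untouched.
transform_to_dir() {
  local transform="$1" target="$2" out_dir="$3"
  "$DATUM" transform -t "$transform" -o "$out_dir" "$target"
}

# Update a dataset in-place; --overwrite is passed because in-place
# updates of non-project targets fail without it.
transform_in_place() {
  local transform="$1" target="$2"
  "$DATUM" transform -t "$transform" --overwrite "$target"
}
```

With `DATUM=echo` set, a call such as `transform_to_dir random_split path/to/dataset:voc out_dir` only prints the arguments, which can be a convenient way to check a command before it touches any data.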

Examples:

- Split a VOC-like dataset randomly:
  ``` bash
  datum transform -t random_split --overwrite path/to/dataset:voc
  ```

- Rename images in a project data source by a regex from `frame_XXX` to `XXX`:
  ``` bash
  datum create <...>
  datum import <...> -n source-1
  datum transform -t rename source-1 -- -e '|frame_(\d+)|\\1|'
  ```

#### Built-in transforms

Basic dataset item manipulations:
- `rename` - Renames dataset items by regular expression
- `id_from_image_name` - Renames dataset items to their image filenames
- `reindex` - Renames dataset items with numbers
- `ndr` - Removes duplicated images from dataset
- `sampler` - Runs inference and leaves only the most representative images
- `resize` - Resizes images and annotations in the dataset

Subset manipulations:
- `random_split` - Splits dataset into subsets randomly
- `split` - Splits dataset into subsets for classification, detection,
  segmentation or re-identification
- `map_subsets` - Renames and removes subsets

Annotation manipulations:
- `remap_labels` - Renames, adds or removes labels in dataset
- `project_labels` - Sets dataset labels to the requested sequence
- `shapes_to_boxes` - Replaces spatial annotations with bounding boxes
- `boxes_to_masks` - Converts bounding boxes to instance masks
- `polygons_to_masks` - Converts polygons to instance masks
- `masks_to_polygons` - Converts instance masks to polygons
- `anns_to_labels` - Replaces annotations having labels with label annotations
- `merge_instance_segments` - Merges grouped spatial annotations into a mask
- `crop_covered_segments` - Removes occluded segments of covered masks
- `bbox_value_decrement` - Subtracts 1 from bbox coordinates

Examples:

- Split a dataset randomly into `train` and `test` subsets with a 2:1 ratio:
  ``` bash
  datum transform -t random_split -- --subset train:.67 --subset test:.33
  ```

- Split a dataset for a specific task. The supported tasks are
  classification, detection, segmentation and re-identification.
  ``` bash
  datum transform -t split -- \
    -t classification --subset train:.5 --subset val:.2 --subset test:.3

  datum transform -t split -- \
    -t detection --subset train:.5 --subset val:.2 --subset test:.3

  datum transform -t split -- \
    -t segmentation --subset train:.5 --subset val:.2 --subset test:.3

  datum transform -t split -- \
    -t reid --subset train:.5 --subset val:.2 --subset test:.3 \
    --query .5
  ```

- Convert between spatial annotation types:
  ``` bash
  datum transform -t boxes_to_masks
  datum transform -t masks_to_polygons
  datum transform -t polygons_to_masks
  datum transform -t shapes_to_boxes
  ```

- Set dataset labels to {`person`, `cat`, `dog`}, remove others, add missing.

  Original labels (can be any): `cat`, `dog`, `elephant`, `human`

  New labels: `person` (added), `cat` (kept), `dog` (kept)
  ``` bash
  datum transform -t project_labels -- -l person -l cat -l dog
  ```

- Remap dataset labels, `person` to `car` and `cat` to `dog`,
  keep `bus`, remove others:
  ``` bash
  datum transform -t remap_labels -- \
    -l person:car -l bus:bus -l cat:dog \
    --default delete
  ```

- Rename dataset items by a regular expression:
  - Replace `pattern` with `replacement`
  - Remove `frame_` from item ids
  ``` bash
  datum transform -t rename -- -e '|pattern|replacement|'
  datum transform -t rename -- -e '|frame_(\d+)|\\1|'
  ```

- Create a dataset from the K hardest items for a model. The dataset will
  be split into the `sampled` and `unsampled` subsets, based on the model
  confidence, which is stored in the `scores` annotation attribute.

  There are five sampling methods (the `-m/--method` option):
  - `topk` - Return the k items with the highest uncertainty
  - `lowk` - Return the k items with the lowest uncertainty
  - `randk` - Return k random items
  - `mixk` - Return half of the items using the `topk` method and the rest
    using the `lowk` method
  - `randtopk` - First select 3 times k items randomly, then return
    the `topk` among them
  ``` bash
  datum transform -t sampler -- \
    -a entropy \
    -i train \
    -o sampled \
    -u unsampled \
    -m topk \
    -k 20
  ```

- Remove duplicated images from a dataset. Keep at most N resulting images.
  - Available sampling options (the `-e` parameter):
    - `random` - sample from removed data randomly
    - `similarity` - sample from removed data with ascending
  - Available sampling methods (the `-u` parameter):
    - `uniform` - sample data with uniform distribution
    - `inverse` - sample data with reciprocal of the number
  ``` bash
  datum transform -t ndr -- \
    -w train \
    -a gradient \
    -k 100 \
    -e random \
    -u uniform
  ```

- Resize dataset images and annotations. Supports upscaling, downscaling
  and mixed variants.
  ``` bash
  datum transform -t resize -- -dw 256 -dh 256
  ```
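
Before running a `rename` transform such as `-e '|frame_(\d+)|\\1|'` on a whole dataset, it can help to preview the substitution on a few item ids. The sketch below does this with `sed`, and it is only an approximation under stated assumptions: the `preview_rename` helper is hypothetical, `sed -E` uses POSIX extended regular expressions (so `[0-9]` stands in for `\d`), and Datumaro's own matching rules may differ in details.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of Datumaro: reads item ids from stdin,
# one per line, and prints them with the `frame_` prefix removed.
preview_rename() {
  # `|` is used as the delimiter, mirroring the '|pattern|replacement|' form.
  sed -E 's|frame_([0-9]+)|\1|'
}
```

For example, `printf 'frame_000001\nframe_000002\n' | preview_rename` prints the two ids without the `frame_` prefix.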