# Filter datasets This command allows to extract a sub-dataset from a dataset. The new dataset includes only items satisfying some condition. The XML [XPath](https://devhints.io/xpath) is used as a query format. The command can be applied to a dataset or a project build target, a stage or the combined `project` target, in which case all the project targets will be affected. A build tree stage will be recorded if `--stage` is enabled, and the resulting dataset(-s) will be saved if `--apply` is enabled. By default, datasets are updated in-place. The `-o/--output-dir` option can be used to specify another output directory. When updating in-place, use the `--overwrite` parameter (in-place updates fail by default to prevent data loss), unless a project target is modified. The current project (`-p/--project`) is also used as a context for plugins, so it can be useful for dataset paths having custom formats. When not specified, the current project's working tree is used. There are several filtering modes available (the `-m/--mode` parameter). Supported modes: - `i`, `items` - `a`, `annotations` - `i+a`, `a+i`, `items+annotations`, `annotations+items` When filtering annotations, use the `items+annotations` mode to point that annotation-less dataset items should be removed, otherwise they will be kept in the resulting dataset. To select an annotation, write an XPath that returns `annotation` elements (see examples). Item representations can be printed with the `--dry-run` parameter: ``` xml 290768 minival2014 612 612 3 80154 bbox 39 264.59 150.25 11.19 42.31 473.87 669839 bbox 41 163.58 191.75 76.98 73.63 5668.77 ... ``` The command can only be applied to a project build target, a stage or the combined `project` target, in which case all the targets will be affected. A build tree stage will be added if `--stage` is enabled, and the resulting dataset(-s) will be saved if `--apply` is enabled. Usage: ``` bash datum filter [-h] [-e FILTER] [-m MODE] [--dry-run] [--stage STAGE] [--apply APPLY] [-o DST_DIR] [--overwrite] [-p PROJECT_DIR] [target] ``` Parameters: - `` (string) - Target [dataset revpath](/docs/user-manual/how_to_use_datumaro/#revpath). By default, filters all targets of the current project. - `-e, --filter` (string) - XML XPath filter expression for dataset items - `-m, --mode` (string) - The filtering mode. Default is the `i` mode. - `--dry-run` - Print XML representations of the filtered dataset and exit. - `--stage` (bool) - Include this action as a project build step. If true, this operation will be saved in the project build tree, allowing to reproduce the resulting dataset later. Applicable only to main project targets (i.e. data sources and the `project` target, but not intermediate stages). Enabled by default. - `--apply` (bool) - Run this command immediately. If disabled, only the build tree stage will be written. Enabled by default. - `-o, --output-dir` (string) - Output directory. Can be omitted for main project targets (i.e. data sources and the `project` target, but not intermediate stages) and dataset targets. If not specified, the results will be saved inplace. - `--overwrite` - Allows to overwrite existing files in the output directory, when it is specified and is not empty. - `-p, --project` (string) - Directory of the project to operate on (default: current directory). - `-h, --help` - Print the help message and exit. Example: extract a dataset with images with `width` < `height` ``` bash datum filter \ -p test_project \ -e '/item[image/width < image/height]' ``` Example: extract a dataset with images of the `train` subset ``` bash datum filter \ -p test_project \ -e '/item[subset="train"]' ``` Example: extract a dataset with only large annotations of the `cat` class and any non-`persons` ``` bash datum filter \ -p test_project \ --mode annotations \ -e '/item/annotation[(label="cat" and area > 99.5) or label!="person"]' ``` Example: extract a dataset with non-occluded annotations, remove empty images. Use data only from the "s1" source of the project. ``` bash datum create datum import --format voc -i --name s1 datum import --format voc -i --name s2 datum filter s1 \ -m i+a -e '/item/annotation[occluded="False"]' ```