Detect dataset format

This command attempts to detect the format of a dataset in a directory. Currently, only local directories are supported.

The detection result may be one of:

  • a single format being detected;

  • no formats being detected (if the dataset doesn’t match any known format);

  • multiple formats being detected (if the dataset is ambiguous).

The command outputs this result in a human-readable form and optionally as a machine-readable JSON report (see --json-report).

The format of the machine-readable report is as follows:

{
    "detected_formats": [
        "detected-format-name-1", "detected-format-name-2", ...
    ],
    "rejected_formats": {
        "rejected-format-name-1": {
            "reason": <reason-code>,
            "message": ["message line 1", "message line 2", ...]
        },
        "rejected-format-name-2": ...,
        ...
    }
}

The <reason-code> can be one of:

  • "insufficient_confidence": the dataset matched the corresponding format, but it matched at least one other format better.

  • "unmet_requirements": the dataset didn’t meet at least one requirement of the corresponding format.

Other reason codes may be defined in the future.

Usage:

datum detect-format [-h] [-p PROJECT_DIR] [--show-rejections]
                    [--json-report JSON_REPORT]
                    url

Parameters:

  • <url> - Path to the dataset to analyse.

  • -h, --help - Print the help message and exit.

  • -p, --project (string) - Directory of the project to use as the context (default: current directory). The project might contain local plugins with custom formats, which will be used for detection.

  • --show-rejections - Describe why each supported format that wasn’t detected was rejected. This only affects the human-readable output; the machine-readable report always includes rejection information.

  • --json-report (string) - Path to which to save a JSON report describing detected and rejected formats. By default, no report is saved.

Example: detect the format of a dataset in a given directory, showing rejection information:

datum detect-format --show-rejections path/to/dataset