HED remodeling tools

Remodeling refers to the process of transforming a tabular file into a different form in order to disambiguate the information or to facilitate a particular analysis. The remodeling operations are specified in a JSON (.json) file, giving a record of the transformations performed.

There are two types of remodeling operations: transformation and summarization. The transformation operations modify the tabular files, while summarization produces an auxiliary information file but leaves the tabular files unchanged.

The file remodeling tools can be applied to any tab-separated value (.tsv) file but are particularly useful for restructuring files representing experimental events. Please read the HED remodeling quickstart tutorials for an introduction and basic use of the tools.

The file remodeling tools can be applied to individual files using the HED online tools or to entire datasets using the remodel command-line interface either by calling Python scripts directly from the command line or by embedding calls in a Jupyter notebook. The tools are also available as HED RESTful services. The online tools are particularly useful for debugging.

Overview of remodeling

Remodeling consists of restructuring and/or extracting information from tab-separated value files based on a specified list of operations contained in a JSON file.

Internally, the remodeling operations represent the tabular file using a Pandas DataFrame.

Transformation operations

Transformation operations, shown schematically in the following figure, are designed to transform an incoming tabular file into a new DataFrame without modifying the incoming data.

Transformation operations

Transformation operations are stateless and do not save any context information or affect future applications of the transformation.

Transformations, themselves, do not have any output and just return a new, transformed DataFrame. In other words, transformations do not operate in place on the incoming DataFrame, but rather, they create a new DataFrame containing the result.

Typically, the calling program is responsible for reading and saving the tabular file, so the user can choose whether to overwrite or create a new file.
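
For orientation, the following minimal sketch (not the remodeler's internal code) shows this calling pattern using pandas; the file path and the specific operation are hypothetical placeholders.

import pandas as pd

# Minimal sketch of the calling pattern (not the remodeler's internal code):
# read a tab-separated events file, apply a transformation that returns a new
# DataFrame, and decide explicitly whether to overwrite the original file.
events = pd.read_csv("sub-01_task-stop_events.tsv", sep="\t")   # hypothetical path
transformed = events.drop(columns=["sample"], errors="ignore")  # stands in for a remodel operation
transformed.to_csv("sub-01_task-stop_events.tsv", sep="\t", index=False, na_rep="n/a")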

See the remodeling tool program interface section for information on how to call the operations.

Summarization operations

Summarization operations do not modify the input DataFrame but rather extract and save information in an internally stored summary dictionary as shown schematically in the following figure.

Summary operations

The dispatcher that executes remodeling operations can be interrogated at any time for the state information contained in the global summary dictionary and can save additional summary information at any time during execution. Usually summaries are dumped at the end of processing to the derivatives/remodel/summaries subdirectory under the dataset root.

Summarization operations may appear anywhere in the operation list, and the same type of summary may appear multiple times under different names in order to track progress.

The dispatcher stores information from each uniquely named summarization operation as a separate summary dictionary entry. Within its summary information, most summarization operations keep a separate summary for each individual file and have methods to create an overall summary of the information for all the files that have been processed by the summarization.

Summarization results are available in JSON (.json) and text (.txt) formats.

Available operations

The following table lists the available remodeling operations with brief example use cases and links to further documentation. Operations not listed in the summarize section are transformations.

Summary of the HED remodeling operations for tabular files.

| Category | Operation | Example use case |
| --- | --- | --- |
| clean-up | remove_columns | Remove temporary columns created during restructuring. |
| clean-up | remove_rows | Remove rows with n/a values in a specified column. |
| clean-up | rename_columns | Make column names consistent across a dataset. |
| clean-up | reorder_columns | Make column order consistent across a dataset. |
| factor | factor_column | Extract factor vectors from a column of condition variables. |
| factor | factor_hed_tags | Extract factor vectors from search queries of HED annotations. |
| factor | factor_hed_type | Extract design matrices and/or condition variables. |
| restructure | merge_consecutive | Replace multiple consecutive events of the same type with one event of longer duration. |
| restructure | remap_columns | Create n columns from values in m columns (for recoding). |
| restructure | split_rows | Split trial-encoded rows into multiple events. |
| summarize | summarize_column_names | Summarize column names and order in the files. |
| summarize | summarize_column_values | Count the occurrences of the unique column values. |
| summarize | summarize_hed_tags | Summarize the HED tags present in the HED annotations for the dataset. |
| summarize | summarize_hed_type | Summarize the detailed usage of a particular type tag such as Condition-variable or Task (used to automatically extract experimental designs). |
| summarize | summarize_hed_validation | Validate the data files and report any errors. |
| summarize | summarize_sidecar_from_events | Generate a sidecar template from an event file. |

The clean-up operations are used at various phases of restructuring to assure consistency across dataset files.

The factor operations produce column vectors with the same number of rows as the data file from which they were calculated. They encode condition variables, design matrices, or other search criteria. See the HED conditions and design matrices for more information on factoring and analysis.

The restructure operations modify the way in which a data file represents its information.

The summarize operations produce dataset-wide summaries of various aspects of the data files as well as summaries of the individual files.

Installing the remodel tools

The remodeling tools are available in the GitHub hed-python repository along with other tools for data cleaning and curation. Although version 0.1.0 of this repository is available on PyPI as hedtools, the version containing the restructuring tools (Version 0.2.0) is still under development and has not been officially released. However, the code is publicly available on the develop branch of the hed-python repository and can be directly installed from GitHub using pip:

pip install git+https://github.com/hed-standard/hed-python/@develop

The web services and online tools supporting remodeling are available on the HED online tools dev server. When version 0.2.0 of hedtools is officially released on PyPI, restructuring will become available on the released HED online tools. A docker version is also under development.

The following diagram shows a schematic of the remodeling process.

Event remodeling process

Initially, the user creates a backup of the specified tabular files (usually events.tsv files). This backup is a mirror of the data files in the dataset, but is located in the derivatives/remodel/backups directory and is never modified once the backup is created.

Remodeling applies a sequence of operations specified in a JSON remodel file to the backup versions of the data files. The JSON remodel file provides a record of the operations performed on the files. If the user detects a mistake in the transformations, the transformation file can be corrected and the transformations rerun.

Remodeling always runs on the original backup version of each file rather than the transformed version, so the transformations can always be corrected and rerun. It is possible to bypass the backup, particularly if only summarization operations are used, but this is not recommended and should be done with care.

Remodel command-line interface

The remodeling toolbox provides Python scripts with command-line interfaces to create or restore backups and to apply operations to the files in a dataset. The file remodeling tools may be applied to datasets that are in free form under a directory root or that are in BIDS-format. Some operations use HED (Hierarchical Event Descriptors) annotations. See the Remodel with HED section for a discussion of these operations and how to use them.

The remodeling command-line interface can be used from the command line, called from another Python program, or used in a Jupyter notebook. Example Jupyter notebooks using the remodeling commands can be found here.

Calling remodel tools

The remodeling tools provide three Python programs for backup (run_remodel_backup), remodeling (run_remodel) and restoring (run_remodel_restore) event files. These programs can be called from the command line or from another Python program.

The programs use a standard command-line argument list for specifying input as summarized in the following table.

Summary of command-line arguments for the remodeling programs.

| Script name | Arguments | Purpose |
| --- | --- | --- |
| run_remodel_backup | data_dir, -bd --backup-dir, -bn --backup-name, -e --extensions, -f --file-suffix, -t --task-names, -v --verbose, -x --exclude-dirs | Create a backup of the event files. |
| run_remodel | data_dir, model_path, -b --bids-format, -bd --backup-dir, -bn --backup-name, -e --extensions, -f --file-suffix, -i --individual-summaries, -j --json-sidecar, -ld --log-dir, -nb --no-backup, -ns --no-summaries, -nu --no-update, -r --hed-versions, -s --save-formats, -t --task-names, -v --verbose, -w --work-dir, -x --exclude-dirs | Restructure or summarize the event files. |
| run_remodel_restore | data_dir, -bd --backup-dir, -bn --backup-name, -t --task-names, -v --verbose | Restore a backup of the event files. |

All the scripts have a required argument, data_dir, which is the full path of the dataset root. The run_remodel program has a second required argument, model_path, which is the full path of a JSON file specifying the remodeling operations to be run.

Remodel command-line arguments

This section describes the arguments that are used for the remodeling command-line interface with examples and more details.

Positional arguments

Positional arguments are required and must be given in the order specified.

data_dir

The full path of dataset root directory.

model_path

The full path of the JSON remodel file (for run_remodel only).

Named arguments

Named arguments consist of a key starting with a hyphen, possibly followed by a value. Named arguments can be given in any order or omitted. If omitted, a specified default is used. Argument keys and values are separated by spaces.

For argument values that are lists, the key is given followed by the items in the list, all separated by spaces.

Each command has two different forms of the key name: a short form (a single hyphen followed by a single character) and a longer form (two hyphens followed by a more self-explanatory name). Users are free to use either form.

-b, --bids-format

If this flag is present, the dataset is in BIDS format with sidecars. Tabular files and their associated sidecars are located using BIDS naming.

-bd, --backup-dir

The path to the directory holding the backups (default: [data_root]/derivatives/remodel/backups). Use the -nb option if you wish to omit the backup (in run_remodel).

-bn, --backup-name

The name of the backup used for the remodeling (default: default_back).

-e, --extensions

This option is followed by a list of file extension(s) of the data files to process. The default is .tsv. Comma separated tabular files are not permitted.

-f, --file-suffix

This option is followed by the suffix names of the files to be processed. For example events (the default) captures files named events.tsv if the default extension is used. The filename without the extension must end in one of the specified suffixes in order to be backed up or transformed.

-i, --individual-summaries

This option offers a choice among three options:

  • separate: Individual summaries for each file in separate files in addition to the overall summary.

  • consolidated: Individual summaries written in the same file as the overall summary.

  • none: Only an overall summary.

-j, --json-sidecar

This option is followed by the full path of the JSON sidecar with HED annotations to be applied during the processing of HED-related remodeling operations.

-ld, --log-dir

This option is followed by the full path of a directory for writing log files. A log file is written if the remodeling tools raise an exception and the program terminates. Note that a log file is not written for issues gathered during operations such as summarize_hed_validation because reporting HED validation errors is a normal part of that operation. On the other hand, errors in the JSON remodeling file do raise an exception and are reported in the log.

-nb, --no-backup

If present, no backup is used. Rather, operations are performed directly on the files.

-ns, --no-summaries

If present, no summary files are output.

-nu, --no-update

If present, the modified files are not output.

-r, --hed-versions

This option is followed by one or more HED versions. Versions of the standard schema are specified by their semantic versions (e.g., 8.1.0), while library schema versions are prefixed by their library name (e.g., score_1.0.0).

If more than one HED schema version is given, all but one of the versions must start with an additional namespace designator (e.g., sc:). At most one version can omit the namespace designator when multiple schema are being used. In annotations, tags must start with the namespace designator of the corresponding schema from which they were selected (e.g. sc:Sleep-modulator if the SCORE library was designated by sc:score_1.0.0).

-s, --save-formats

This option is followed by the extensions (including .) of the formats in which to save summaries (default: .txt .json).

-t, --task-names

The name(s) of the tasks to be included (for BIDS-formatted files only). When a dataset includes multiple tasks, the event files are often structured differently for each task and thus require different transformation files. This option allows the backups and operations to be restricted to an individual task.

If this option is omitted, all tasks are used. This means that all events.tsv files are restored from a backup if the backup is used, the operations are performed on all events.tsv files, and summaries are combined over all tasks.

If a list of specific task names follows this option, only datafiles corresponding to the listed tasks are processed giving separate summaries for each listed task.

If a “*” follows this option, all event files are processed and separate summaries are created for each task.

Task detection follows the BIDS convention. Tasks are detected by finding “task-x” in the file names of events.tsv files, where x is the name of the task. The task name is followed by an underbar or a period, or else ends the filename.
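
The following minimal sketch illustrates this naming rule; the regular expression is an illustration of the convention described above, not necessarily the exact pattern used by the tools.

import re

# Illustrative sketch of BIDS-style task detection (an assumed pattern, not
# necessarily the toolbox's exact code): find "task-x", where the task name
# ends at an underbar, a period, or the end of the filename.
def get_task(filename):
    match = re.search(r"task-([^_.]+)", filename)
    return match.group(1) if match else None

print(get_task("sub-0013_task-stopsignal_acq-seq_events.tsv"))  # stopsignal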

-v, --verbose

If present, more comprehensive messages documenting transformation progress are printed to standard output.

-w, --work-dir

The path to the remodeling work root directory, which holds the backups and summaries (default: [data_root]/derivatives/remodel). Use the -nb option if you wish to omit the backup (in run_remodel).

-x, --exclude-dirs

The directories to exclude when gathering the data files to process. For BIDS datasets, these are typically derivatives, stimuli, and sourcecode. Any subdirectory with a path component named remodel is automatically excluded from remodeling, as these directories are reserved for storing backup, state, and result information for the remodeling process itself.

Remodel scripts

This section discusses the three main remodeling scripts with command-line interfaces to support backup, remodeling, and restoring the tabular files used in the remodeling process. These scripts can be run from the command line or from another Python program using a function call.

Backing up files

The run_remodel_backup Python program creates a backup of the specified files. The backup is always created in the derivatives/remodel/backups subdirectory under the dataset root as shown in the following example for the sample dataset eeg_ds003645s_hed_remodel, which can be found in the datasets subdirectory of the hed-examples GitHub repository.

Remodeling backup structure

The backup process creates a mirror of the directory structure of the source files to be backed up in the directory derivatives/remodel/backups/backup_name/backup_root as shown in the figure above. The default backup name is default_back.

In the above example, the backup has subdirectories sub-002 and sub-003 just like the main directory of the dataset. These subdirectories only contain backups of the files to be transformed (by default files with names ending in events.tsv).

In addition to the backup_root, the backup directory also contains a dictionary of backup files in the backup_lock.json file. This dictionary is used internally by the remodeling tools. The backup should be created once and not modified by the user.

The following example shows how to run the run_remodel_backup program from the command line to back up the dataset located at /datasets/eeg_ds003645s_hed_remodel.

Example of calling run_remodel_backup from the command line.

python run_remodel_backup /datasets/eeg_ds003645s_hed_remodel -x derivatives stimuli

Since the -f and -e arguments are not given, the default file suffix and extension values apply, so only files of the form events.tsv are backed up. The -x option excludes any source files from the derivatives and stimuli subdirectories. These choices can be overridden using additional command-line arguments.

The following shows how the run_remodel_backup program can be called from a Python program or a Jupyter notebook. The command-line arguments are given in a list instead of on the command line.

Example of Python code to call run_remodel_backup using a function call.


import hed.tools.remodeling.cli.run_remodel_backup as cli_backup

data_root = '/datasets/eeg_ds003645s_hed_remodel'
arg_list = [data_root, '-x', 'derivatives', 'stimuli']
cli_backup.main(arg_list)

During remodeling, each file in the source is associated with a backup file using its relative path from the dataset root. Remodeling is performed by reading the backup file, performing the operations specified in the JSON remodel file, and overwriting the source file as needed.

Users can also create alternatively named backups by providing the -bn argument with a backup name to the run_remodel_backup program. To use backup files from another named backup, call the remodeling program with the -bn argument and the correct backup name. Named backups can provide checkpoints to allow the execution of transformations to start from intermediate points.

NOTE: You should not delete backups, even if you have created multiple named backups. The backups provide useful state and provenance information about the data.

Remodeling files

Remodeling consists of applying a sequence of operations from the remodel operation summary to successively transform each backup file according to the instructions and to overwrite the actual files with the final result.

If the dataset has no backups, the actual data files rather than the backups are transformed. You are expected to create the backup (just once) before running the remodeling operations. Going without backup is not recommended unless you are only doing summarization operations.

The operations are specified as a list of dictionaries in a JSON file in the remodel sample files as discussed below.

Before running remodeling transformations on an entire dataset, consider using the HED online tools to debug your remodeling operation file on a single file. The remodeling process always starts with the original backup files, so the usual development path is to incrementally add operations to the end of your transformation JSON file as you develop and test on a single file until you have the desired end result.

The following example shows how to run a remodeling script from the command line. The example assumes that the backup has already been created for the dataset.

Example of calling run_remodel from the command line.

python run_remodel /datasets/eeg_ds003645s_hed_remodel /datasets/remove_extra_rmdl.json -x derivatives stimuli

The script has two required arguments: the dataset root and the path to the JSON remodel file. Usually, the JSON remodel files are stored with the dataset itself in the derivatives/remodel/remodeling_files subdirectory, but common scripts can be stored in a central place elsewhere.

The additional keyword option, -x, in the example indicates that directory paths containing the component derivatives or the component stimuli should be excluded. Excluded directories need not have their excluded path component at the top level of the dataset. Subdirectory paths containing the remodel path component are automatically excluded.

The command-line interface can also be used in a Jupyter notebook or as part of a larger Python program by calling the main function with the equivalent command-line arguments provided in a list with the positional arguments appearing first.

The following example shows Python code to remodel a dataset using the command-line interface. This code can be used in a Jupyter notebook or in another Python program.

Example Python code to call run_remodel using a function call.

import hed.tools.remodeling.cli.run_remodel as cli_remodel

data_root = '/datasets/eeg_ds003645s_hed_remodel'
model_path = '/datasets/remove_extra_rmdl.json'
arg_list = [data_root, model_path, '-x', 'derivatives', 'stimuli']
cli_remodel.main(arg_list)

Restoring files

Since remodeling always uses the backed up version of each data file, there is no need to restore these files to their original state between remodeling runs. However, when finished with an analysis, you may want to restore the data files to their original state.

The following example shows how to call run_remodel_restore to restore the data files from the default backup. The restore operation restores all the files in the specified backup.

Example of calling run_remodel_restore from the command line.

python run_remodel_restore /datasets/eeg_ds003645s_hed_remodel

As with the other command-line programs, run_remodel_restore can also be called using a function call.

Example Python code to call run_remodel_restore using a function call.

import hed.tools.remodeling.cli.run_remodel_restore as cli_restore

data_root = '/datasets/eeg_ds003645s_hed_remodel'
cli_restore.main([data_root])

Remodel with HED

HED (Hierarchical Event Descriptors) is a system for annotating data in a manner that is both human-understandable and machine-actionable. HED provides much more detail about the events and their meanings. If you are new to HED, see the HED annotation quickstart. For information about HED’s integration into BIDS (Brain Imaging Data Structure), see the BIDS annotation quickstart.

Currently, five remodeling operations rely on HED annotations: factor_hed_tags, factor_hed_type, summarize_hed_tags, summarize_hed_type, and summarize_hed_validation.

HED tags provide a mechanism for advanced data analysis and for extracting experiment-specific information from the data files. However, since HED information is not always stored in the data files themselves, you may need to provide a HED schema and a JSON sidecar.

The HED schema defines the allowed HED tag vocabulary, and the JSON sidecar associates HED annotations with the information in the columns of the event files. If you are not using any of the HED operations in your remodeling, you do not have to provide this information.

Extracting HED information from BIDS

The simplest way to use HED with run_remodel is to use the -b option, which indicates that the dataset is in BIDS (Brain Imaging Data Structure) format.

BIDS is a standardized way of organizing neuroimaging data. HED and BIDS are well integrated. If you are new to BIDS, see the BIDS annotation quickstart.

A HED-annotated BIDS dataset provides the HED schema version in the dataset_description.json file located directly under the BIDS dataset root.

BIDS datasets must have filenames in a specific format, and the HED tools can locate the relevant JSON sidecars for each data file based on this information.
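
For reference, a minimal dataset_description.json might look like the following (illustrative values only); with the -b option the HED schema version is read from its HEDVersion field.

import json

# Minimal example of a BIDS dataset_description.json (illustrative values):
# with the -b option, the remodeling tools obtain the HED schema version from
# the HEDVersion field instead of the -r command-line argument.
dataset_description = {
    "Name": "Example dataset",
    "BIDSVersion": "1.8.0",
    "HEDVersion": "8.1.0"
}
print(json.dumps(dataset_description, indent=4))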

Directly specifying HED information

If your data is already in BIDS format, using the -b option is ideal since the needed information can be located automatically. However, early in the experimental process, your datafiles are not likely to be organized in BIDS format, and this option will not be available if you want to use HED.

Without the -b option, the remodeling tools locate the appropriate files based on specified filename suffixes and extensions. In order to use HED operations, you must explicitly specify the HED versions using the -r option. The -r option supports a list of HED versions if multiple HED schemas are used. For example: -r 8.1.0 sc:score_1.0.0 specifies that vocabulary will be drawn from standard HED Version 8.1.0 and from HED SCORE library version 1.0.0. Annotations containing tags from SCORE should be prefixed with sc:. Note: both of the schemas can be viewed by the HED Schema Viewer.

Usually, annotators will consolidate HED annotations in a single JSON sidecar file located at the top-level of the dataset. The path of this sidecar can be passed as a command-line argument using the -j option. If more than one JSON sidecar file contains HED annotations, users will need to call the lower-level remodeling functions to perform these operations.

The following example illustrates a command-line call that passes both a HED schema version and the path to the JSON file with the HED annotations.

Remodeling a non-BIDS dataset using HED.

python run_remodel /datasets/eeg_ds003645s_hed_remodel /datasets/summarize_conditions_rmdl.json \
-x derivatives stimuli -r 8.1.0 -j /datasets/eeg_ds003645s_hed_remodel/task-FacePerception_events.json

Example Python code to use run_remodel on a non-BIDS dataset.

import hed.tools.remodeling.cli.run_remodel as cli_remodel

data_root = '/datasets/eeg_ds003645s_hed_remodel'
model_path = '/datasets/summarize_conditions_rmdl.json'
json_path = '/datasets/eeg_ds003645s_hed_remodel/task-FacePerception_events.json'
arg_list = [data_root, model_path, '-x', 'derivatives', 'stimuli', '-r', '8.1.0', '-j', json_path]
cli_remodel.main(arg_list)

Remodel error handling

Errors can occur at several stages during remodeling, and how they are handled depends on the type of error and where it occurs. Except for the validation summary, the underlying remodeling code raises exceptions for most errors.

Errors in the remodel file

Each operation requires specific parameters to execute properly. The underlying implementation for each operation defines these parameters using a JSON schema as the PARAMS property of the operation’s class definition. The use of the JSON schema allows the remodeler to specify and validate requirements on most of an operation’s parameters using standardized methods.

The remodeler_validator compiles a JSON schema for the remodeler from the individual operations and validates the remodel file against the compiled schema. The validator should always be run before executing any remodel operations.

For example, the command line run_remodel program calls the validator before executing any operations. If there are errors, run_remodel reports the errors for all operations and exits. This allows users to correct errors in all operations in one pass without any data modification. The HED online tools are particularly useful for debugging the syntax and other issues in the remodeling process.
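
As a rough pre-check of a remodel file’s structure (a conceptual sketch, not the remodeler_validator itself), one can verify that each operation dictionary has the three required top-level keys described in the Remodel sample files section.

import json

# Conceptual pre-check of a remodel file (not the remodeler_validator itself):
# confirm every entry has the three required top-level keys before running.
with open("remove_extra_rmdl.json") as fp:   # hypothetical path
    operations = json.load(fp)

for index, operation in enumerate(operations):
    missing = {"operation", "description", "parameters"} - set(operation)
    if missing:
        print(f"Operation {index} is missing keys: {sorted(missing)}")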

Execution-time remodel errors

When an error occurs during execution, an exception is raised. Exceptions are raised for invalid or missing files or if a transformed file is unable to be rewritten due to improper file permissions. Each individual operation may also raise an exception if the data file being processed does not have the expected information, such as a column with a particular name.

Exceptions raised during execution cause the process to be terminated and no further files are processed.

Remodel sample files

All remodeling operations are specified in a standardized JSON remodel input file. The following shows the contents of the JSON remodeling file remove_extra_rmdl.json, which contains a single operation with instructions to remove the value and sample columns from the data file if the columns exist.

Sample remodel file

A sample JSON remodeling file with a single remove_columns transformation operation.

[
    {
        "operation": "remove_columns",
        "description": "Remove unwanted columns prior to analysis",
        "parameters": {
            "remove_names": ["value", "sample"]
        }
    }
]

Each operation is specified in a dictionary with three top-level keys: “operation”, “description”, and “parameters”. The value of the “operation” is the name of the operation. The “description” value should include the reason this operation was needed, not just a description of the operation itself. Finally, the “parameters” value is a dictionary mapping parameter name to parameter value.

The parameters for each operation are listed in Remodel transformations and Remodel summarizations sections. An operation may have both required and optional parameters. Optional parameters may be omitted if unneeded, but all parameters are specified in the “parameters” section of the dictionary. The full specification of the remodel file is also provided as a JSON schema.

The remodeling JSON files should have names ending in _rmdl.json to more easily distinguish them from other JSON files. Although these files can be stored anywhere, their preferred location is in the derivatives/remodel/models subdirectory under the dataset root so that they can provide provenance for the dataset.

Sample remodel event file

Several examples illustrating the remodeling operations use the following excerpt of the stop-go task from sub-0013 of the AOMIC-PIOP2 dataset available on OpenNeuro as ds002790. The full event file is sub-0013_task-stopsignal_acq-seq_events.tsv.

Excerpt from an event file from the stop-go task of AOMIC-PIOP2 (ds002790).

| onset | duration | trial_type | stop_signal_delay | response_time | response_accuracy | response_hand | sex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0776 | 0.5083 | go | n/a | 0.565 | correct | right | female |
| 5.5774 | 0.5083 | unsuccesful_stop | 0.2 | 0.49 | correct | right | female |
| 9.5856 | 0.5084 | go | n/a | 0.45 | correct | right | female |
| 13.5939 | 0.5083 | succesful_stop | 0.2 | n/a | n/a | n/a | female |
| 17.1021 | 0.5083 | unsuccesful_stop | 0.25 | 0.633 | correct | left | male |
| 21.6103 | 0.5083 | go | n/a | 0.443 | correct | left | male |

Sample remodel sidecar file

For remodeling operations that use HED, a JSON sidecar is usually required to provide the necessary HED annotations. The following JSON sidecar excerpt is used in several examples to illustrate some of these operations. The full JSON file can be found at task-stopsiqnal_acq-seq_events.json.

Excerpt of JSON sidecar with HED annotations for the stop-go task of AOMIC-PIOP2.

{
    "trial_type": {
        "HED": {
            "succesful_stop": "Sensory-presentation, Visual-presentation, Correct-action, Image, Label/succesful_stop",
            "unsuccesful_stop": "Sensory-presentation, Visual-presentation, Incorrect-action, Image, Label/unsuccesful_stop",
            "go": "Sensory-presentation, Visual-presentation, Image, Label/go"
        }
    },
    "stop_signal_delay": {
        "HED": "(Auditory-presentation, Delay/# s)"
        },
    "sex": {
        "HED": {
            "male": "Def/Male-image-cond",
            "female": "Def/Female-image-cond"
        }
    },
    "hed_defs": {
        "HED": {
            "def_male": "(Definition/Male-image-cond, (Condition-variable/Image-sex, (Male, (Image, Face))))",
            "def_female": "(Definition/Female-image-cond, (Condition-variable/Image-sex, (Female, (Image, Face))))"
        }
    }
}

Notice that the JSON file has some keys (e.g., “trial_type”, “stop_signal_delay”, and “sex”) which also correspond to columns in the events file. The “hed_defs” key corresponds to an extra entry in the JSON file that, in this case, provides the definitions needed in the HED annotation.

HED operations also require the HED schema. Most of the examples use HED standard schema version 8.1.0.

Remodel transformations

Factor column

The factor_column operation appends factor vectors to tabular files based on the values in a specified column. Each factor vector contains a 1 if the corresponding row had that column value and a 0 otherwise. The factor_column operation can be used to reformat event files for analyses such as linear regression based on column values.

Factor column parameters

Parameters for the factor_column operation.

| Parameter | Type | Description |
| --- | --- | --- |
| column_name | str | The name of the column to be factored. |
| factor_values | list | Column values to be included as factors. |
| factor_names | list | (Optional) Column names for created factors. |

If column_name is not a column in the data file, a ValueError is raised.

If factor_values is empty, factors are created for each unique value in column_name. Otherwise, only factors for the specified column values are generated. If a specified value is missing in a particular file, the corresponding factor column contains all zeros.

If factor_names is empty, the newly created columns are of the form column_name.factor_value. Otherwise, the newly created columns are given the names specified in factor_names. If factor_names is not empty, then factor_values must also be specified and both lists must be of the same length.
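
Conceptually, the factoring amounts to one-hot encoding of the selected column values. The following pandas sketch (not the operation’s own code) shows the idea using the column and factor names from the example below.

import pandas as pd

# Conceptual sketch of factor_column (not the operation's own code): create one
# 0/1 column per requested trial_type value, using the factor_names labels.
df = pd.DataFrame({"trial_type": ["go", "unsuccesful_stop", "go", "succesful_stop"]})
for value, name in [("succesful_stop", "stopped"), ("unsuccesful_stop", "stop_failed")]:
    df[name] = (df["trial_type"] == value).astype(int)
print(df)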

Factor column example

The factor_column operation in the following example specifies that factor columns should be created for succesful_stop and unsuccesful_stop of the trial_type column. The resulting columns are called stopped and stop_failed, respectively.

A sample JSON file with a single factor_column transformation operation.

[{ 
    "operation": "factor_column",
    "description": "Create factors for the succesful_stop and unsuccesful_stop values.",
    "parameters": {
        "column_name": "trial_type",
        "factor_values": ["succesful_stop", "unsuccesful_stop"],
        "factor_names": ["stopped", "stop_failed"]
    }
}]

The results of executing this factor_column operation on the sample remodel event file are:

Results of the factor_column operation on the sample remodel event file.

| onset | duration | trial_type | stop_signal_delay | response_time | response_accuracy | response_hand | sex | stopped | stop_failed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0776 | 0.5083 | go | n/a | 0.565 | correct | right | female | 0 | 0 |
| 5.5774 | 0.5083 | unsuccesful_stop | 0.2 | 0.49 | correct | right | female | 0 | 1 |
| 9.5856 | 0.5084 | go | n/a | 0.45 | correct | right | female | 0 | 0 |
| 13.5939 | 0.5083 | succesful_stop | 0.2 | n/a | n/a | n/a | female | 1 | 0 |
| 17.1021 | 0.5083 | unsuccesful_stop | 0.25 | 0.633 | correct | left | male | 0 | 1 |
| 21.6103 | 0.5083 | go | n/a | 0.443 | correct | left | male | 0 | 0 |

Factor HED tags

The factor_hed_tags operation is similar to the factor_column operation in that it produces factor vectors containing 0’s and 1’s, which are appended to the returned DataFrame. However, rather than basing these vectors on values in a specified column, the factors are computed by determining whether the assembled HED annotation for each row satisfies a specified search query.

An example search query is whether the assembled HED annotation contains a particular HED tag. The HED search guide tutorial discusses the HED search facility in more detail.

Factor HED tags parameters

Parameters for the factor_hed_tags operation.

| Parameter | Type | Description |
| --- | --- | --- |
| queries | list | A list of HED query strings. |
| query_names | list | (Optional) A list of names for the factor columns generated by the queries. |
| remove_types | list | (Optional) Structural HED tags to be removed (usually Condition-variable and Task). |
| expand_context | bool | (Optional: default true) Expand the context and remove Onset and Offset tags before the query. |

The query_names list, which must be empty or the same length as queries, contains the names of the factor columns produced by the search. If the query_names list is empty, the result columns are titled “query_1”, “query_2”, etc.

Most of the time remove_types should be set to ["Condition-variable", "Task"], with the effects of the experimental design captured using the factor_hed_type operation. If expand_context is set to false, the additional context provided by Onset, Offset, and Duration is ignored.

Factor HED tags example

The factor_hed_tags operation in the following example produces two factor columns with 1’s where the assembled HED string for a row contains the Correct-action or the Incorrect-action tag. The resulting factor columns are named correct and incorrect, respectively.

A sample JSON file with a single factor_hed_tags transformation operation.

[{ 
    "operation": "factor_hed_tags",
    "description": "Create factors based on whether the event represented a correct or incorrect action.",,
    "parameters": {
        "queries": ["correct-action", "incorrect-action"],
        "query_names": ["correct", "incorrect"],
        "remove_types": ["Condition-variable", "Task"],
        "expand_context": false
    }
}]

The results of executing this factor_hed_tags operation on the sample remodel event file using the sample remodel sidecar file for HED annotations are:

Results of factor_hed_tags.

| onset | duration | trial_type | stop_signal_delay | response_time | response_accuracy | response_hand | sex | correct | incorrect |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0776 | 0.5083 | go | n/a | 0.565 | correct | right | female | 0 | 0 |
| 5.5774 | 0.5083 | unsuccesful_stop | 0.2 | 0.49 | correct | right | female | 0 | 1 |
| 9.5856 | 0.5084 | go | n/a | 0.45 | correct | right | female | 0 | 0 |
| 13.5939 | 0.5083 | succesful_stop | 0.2 | n/a | n/a | n/a | female | 1 | 0 |
| 17.1021 | 0.5083 | unsuccesful_stop | 0.25 | 0.633 | correct | left | male | 0 | 1 |
| 21.6103 | 0.5083 | go | n/a | 0.443 | correct | left | male | 0 | 0 |

Factor HED type

The factor_hed_type operation produces factor columns based on values of the specified HED type tag. The most common type is the HED Condition-variable tag, which corresponds to factor vectors based on the experimental design. Other commonly used type tags include Task, Control-variable, and Time-block.

We assume that the dataset has been annotated using HED tags to properly document information such as experimental conditions, and focus on how such an annotated dataset can be used with remodeling to produce factor columns corresponding to these type variables.

For additional information on how to encode experimental designs using HED, see HED conditions and design matrices.

Factor HED type parameters

Parameters for the factor_hed_type operation.

| Parameter | Type | Description |
| --- | --- | --- |
| type_tag | str | HED tag used to find the factors (most commonly Condition-variable). |
| type_values | list | (Optional) Values to factor for the type_tag. If omitted, all values of that type_tag are used. |

The event context (as defined by onsets, offsets and durations) is always expanded and one-hot (0’s and 1’s) encoding is used for the factors.

Factor HED type example

The factor_hed_type operation in the following example appends additional columns to each data file corresponding to each possible value of each Condition-variable tag. The columns contain 1’s for rows (e.g., events) for which that condition applies and 0’s otherwise.

A JSON file with a single factor_hed_type transformation operation.

[{ 
    "operation": "factor_hed_type",
    "description": "Factor based on the sex of the images being presented.",
    "parameters": {
        "type_tag": "Condition-variable"
    }
}]

The results of executing this factor_hed_type operation on the sample remodel event file using the sample remodel sidecar file for HED annotations are:

Results of factor_hed_type.

| onset | duration | trial_type | stop_signal_delay | response_time | response_accuracy | response_hand | sex | Image-sex.Female-image-cond | Image-sex.Male-image-cond |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0776 | 0.5083 | go | n/a | 0.565 | correct | right | female | 1 | 0 |
| 5.5774 | 0.5083 | unsuccesful_stop | 0.2 | 0.49 | correct | right | female | 1 | 0 |
| 9.5856 | 0.5084 | go | n/a | 0.45 | correct | right | female | 1 | 0 |
| 13.5939 | 0.5083 | succesful_stop | 0.2 | n/a | n/a | n/a | female | 1 | 0 |
| 17.1021 | 0.5083 | unsuccesful_stop | 0.25 | 0.633 | correct | left | male | 0 | 1 |
| 21.6103 | 0.5083 | go | n/a | 0.443 | correct | left | male | 0 | 1 |

Merge consecutive

Sometimes a single long event in experimental logs is represented by multiple repeat events. The merge_consecutive operation collapses these consecutive repeat events into one event with duration updated to encompass the temporal extent of the merged events.

Merge consecutive parameters

Parameters for the merge_consecutive operation.

| Parameter | Type | Description |
| --- | --- | --- |
| column_name | str | The name of the column which is the basis of the merge. |
| event_code | str, int, float | The value in column_name that triggers the merge. |
| set_durations | bool | If true, set durations based on merged events. |
| ignore_missing | bool | If true, missing column_name or match_columns do not raise an error. |
| match_columns | list | (Optional) Columns whose values must match to collapse events. |

The first of the group of rows (each representing an event) to be merged is called the anchor for the merge. After the merge, it is the only row in the group that remains in the data file. The result is identical to its original version, except for the value in the duration column.

If the set_durations parameter is true, the new duration is calculated as though the event began with the onset of the first event (the anchor row) in the group and ended at the point where all the events in the group have ended. This method allows for small gaps between events and for events in which an intermediate event in the group ends after later events. If the set_durations parameter is false, the duration of the merged row is set to n/a.

If the data file has other columns besides onset, duration, and column_name, the values in the other columns must be considered during the merging process. The match_columns parameter is a list of the other columns whose values must agree with those of the anchor row in order for a merge to occur. If match_columns is empty, the other columns in each row are not taken into account during the merge.
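
The duration rule for set_durations can be illustrated with a short pandas sketch (illustrative only, not the operation’s code); the numbers are taken from the example that follows.

import pandas as pd

# Illustrative sketch of the set_durations rule (not the operation's code): the
# merged duration runs from the anchor onset to the latest offset in the group.
group = pd.DataFrame({"onset": [13.5939, 14.2, 15.3],
                      "duration": [0.5083, 0.5083, 0.7083]})
anchor_onset = group["onset"].iloc[0]
merged_duration = (group["onset"] + group["duration"]).max() - anchor_onset
print(round(merged_duration, 4))  # 2.4144, matching the example below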

Merge consecutive example

The merge_consecutive operation in the following example causes consecutive succesful_stop events whose stop_signal_delay, response_hand, and sex columns have the same values to be merged into a single event.

A JSON file with a single merge_consecutive transformation operation.

[{ 
    "operation": "merge_consecutive",
    "description": "Merge consecutive *succesful_stop* events that match the *match_columns.",
    "parameters": {
        "column_name": "trial_type",
        "event_code": "succesful_stop",
        "set_durations": true,
        "ignore_missing": true,
        "match_columns": ["stop_signal_delay", "response_hand", "sex"]
    }
}]

When this operation is applied to the following input file, the three events with a value of succesful_stop in the trial_type column starting at onset value 13.5939 are merged into a single event.

Input file for a merge_consecutive operation.

| onset | duration | trial_type | stop_signal_delay | response_hand | sex |
| --- | --- | --- | --- | --- | --- |
| 0.0776 | 0.5083 | go | n/a | right | female |
| 5.5774 | 0.5083 | unsuccesful_stop | 0.2 | right | female |
| 9.5856 | 0.5084 | go | n/a | right | female |
| 13.5939 | 0.5083 | succesful_stop | 0.2 | n/a | female |
| 14.2 | 0.5083 | succesful_stop | 0.2 | n/a | female |
| 15.3 | 0.7083 | succesful_stop | 0.2 | n/a | female |
| 17.3 | 0.5083 | unsuccesful_stop | 0.25 | n/a | female |
| 19.0 | 0.5083 | unsuccesful_stop | 0.25 | n/a | female |
| 21.1021 | 0.5083 | unsuccesful_stop | 0.25 | left | male |
| 22.6103 | 0.5083 | go | n/a | left | male |

Notice that the unsuccesful_stop event at onset value 17.3 is not merged into the preceding group because its trial_type and stop_signal_delay values do not match those of the previous event. The final result has duration computed as 2.4144 = 15.3 + 0.7083 - 13.5939.

The results of the merge_consecutive operation.

| onset | duration | trial_type | stop_signal_delay | response_hand | sex |
| --- | --- | --- | --- | --- | --- |
| 0.0776 | 0.5083 | go | n/a | right | female |
| 5.5774 | 0.5083 | unsuccesful_stop | 0.2 | right | female |
| 9.5856 | 0.5084 | go | n/a | right | female |
| 13.5939 | 2.4144 | succesful_stop | 0.2 | n/a | female |
| 17.3 | 2.2083 | unsuccesful_stop | 0.25 | n/a | female |
| 21.1021 | 0.5083 | unsuccesful_stop | 0.25 | left | male |
| 22.6103 | 0.5083 | go | n/a | left | male |

The events that had onsets at 17.3 and 19.0 are also merged in this example.

Remap columns

The remap_columns operation maps combinations of values in m specified columns of a data file into values in n columns using a defined mapping. Remapping is useful during analysis to create columns in event files that are more directly useful or informative for a particular analysis.

Remapping is also important during the initial generation of event files from experimental logs. The log files generated by experimental control software often contain a code for each type of log entry. Remapping can be used to convert the column containing these codes into one or more columns with more meaningful values.

Remap columns parameters

Parameters for the remap_columns operation.

| Parameter | Type | Description |
| --- | --- | --- |
| source_columns | list | A list of m names of the source columns for the map. |
| destination_columns | list | A list of n names of the destination columns for the map. |
| map_list | list | A list of mappings. Each element is a list of m source column values followed by n destination values. Mapping source values are treated as strings. |
| ignore_missing | bool | If true, source column values not in the map generate “n/a” destination values instead of errors. |
| integer_sources | list | (Optional) A list of source columns that are integers. The integer_sources must be a subset of source_columns. |

A column cannot be both a source and a destination, and all source columns must be present in the data files. New columns are created for destination columns that are missing from a data file.

The remap_columns operation only works for columns containing strings or integers, as it is meant for remapping categorical codes. You must specify which source columns contain integers so that n/a values can be handled appropriately.

The map_list parameter specifies how each unique combination of values from the source columns will be mapped into the destination columns. If there are m source columns and n destination columns, then each entry in map_list must be a list with m + n elements. The first m elements are the key values from the source columns. The map_list should have targets for all combinations of values that appear in the m source columns unless ignore_missing is true.

After remapping, the tabular file will contain both source and destination columns. If you wish to replace the source columns with the destination columns, use a remove_columns transformation after the remap_columns.
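
Conceptually, the mapping treats each combination of source values as a key. The following pandas sketch (not the operation’s implementation) shows the idea for m = 2 source columns and n = 1 destination column.

import pandas as pd

# Conceptual sketch of remap_columns (not the operation's implementation): each
# map_list row holds m source values (the key) followed by n destination values.
map_list = [["correct", "left", "correct_left"],
            ["correct", "right", "correct_right"]]
mapping = {tuple(row[:2]): row[2:] for row in map_list}          # m = 2, n = 1
df = pd.DataFrame({"response_accuracy": ["correct", "correct"],
                   "response_hand": ["left", "right"]})
keys = zip(df["response_accuracy"], df["response_hand"])
df["response_type"] = [mapping.get(key, ["n/a"])[0] for key in keys]  # unmapped keys become n/a
print(df)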

Remap columns example

The remap_columns operation in the following example creates a new column called response_type based on the unique values in the combination of columns response_accuracy and response_hand.

A JSON file with a single remap_columns transformation operation.

[{ 
    "operation": "remap_columns",
    "description": "Map response_accuracy and response hand into a single column.",
    "parameters": {
        "source_columns": ["response_accuracy", "response_hand"],
        "destination_columns": ["response_type"],
        "map_list": [["correct", "left", "correct_left"],
                     ["correct", "right", "correct_right"],
                     ["incorrect", "left", "incorrect_left"],
                     ["incorrect", "right", "incorrect_left"],
                     ["n/a", "n/a", "n/a"]],
        "ignore_missing": true
    }
}]

In this example there are two source columns and one destination column, so each entry in map_list must be a list with three elements: two source values and one destination value. Since all the values in map_list are strings, the optional integer_sources list is not needed.

The results of executing the previous remap_column command on the sample remodel event file are:

Mapping columns response_accuracy and response_hand into a response_type column.

| onset | duration | trial_type | stop_signal_delay | response_time | response_accuracy | response_hand | sex | response_type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0776 | 0.5083 | go | n/a | 0.565 | correct | right | female | correct_right |
| 5.5774 | 0.5083 | unsuccesful_stop | 0.2 | 0.49 | correct | right | female | correct_right |
| 9.5856 | 0.5084 | go | n/a | 0.45 | correct | right | female | correct_right |
| 13.5939 | 0.5083 | succesful_stop | 0.2 | n/a | n/a | n/a | female | n/a |
| 17.1021 | 0.5083 | unsuccesful_stop | 0.25 | 0.633 | correct | left | male | correct_left |
| 21.6103 | 0.5083 | go | n/a | 0.443 | correct | left | male | correct_left |

In this example, remap_columns combines the values from columns response_accuracy and response_hand to produce a new column called response_type that specifies both response hand and correctness information using a single code.

Remove columns

Sometimes columns are added during intermediate processing steps. The remove_columns operation is useful for cleaning up unnecessary columns after these processing steps complete.

Remove columns parameters

Parameters for the remove_columns operation.

| Parameter | Type | Description |
| --- | --- | --- |
| column_names | list of str | A list of columns to remove. |
| ignore_missing | bool | If true, missing columns are ignored, otherwise raise KeyError. |

If one of the specified columns is not in the file and the ignore_missing parameter is false, a KeyError is raised for the missing column.
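
A conceptual pandas equivalent of this behavior (not the operation’s code) is shown below; errors="ignore" plays the role of ignore_missing=true.

import pandas as pd

# Conceptual pandas equivalent of remove_columns (not the operation's code):
# errors="ignore" corresponds to ignore_missing=true, errors="raise" to false.
df = pd.DataFrame({"onset": [0.0776], "stop_signal_delay": ["n/a"]})
trimmed = df.drop(columns=["stop_signal_delay", "face"], errors="ignore")
print(trimmed.columns.tolist())  # ['onset'] -- no error although 'face' is absent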

Remove columns example

The following example specifies that the remove_columns operation should remove the stop_signal_delay, response_accuracy, and face columns from the tabular data.

A JSON file with a single remove_columns transformation operation.

[{   
    "operation": "remove_columns",
    "description": "Remove extra columns before the next step.",
    "parameters": {
        "column_names": ["stop_signal_delay", "response_accuracy", "face"],
        "ignore_missing": true
    }
}]

The results of executing this operation on the sample remodel event file are shown below. The face column is not in the data, but it is ignored, since ignore_missing is true. If ignore_missing had been false, a KeyError would have been raised.

Results of executing the remove_columns operation.

| onset | duration | trial_type | response_time | response_hand | sex |
| --- | --- | --- | --- | --- | --- |
| 0.0776 | 0.5083 | go | 0.565 | right | female |
| 5.5774 | 0.5083 | unsuccesful_stop | 0.49 | right | female |
| 9.5856 | 0.5084 | go | 0.45 | right | female |
| 13.5939 | 0.5083 | succesful_stop | n/a | n/a | female |
| 17.1021 | 0.5083 | unsuccesful_stop | 0.633 | left | male |
| 21.6103 | 0.5083 | go | 0.443 | left | male |

Remove rows

The remove_rows operation eliminates rows in which the named column has one of the specified values. This operation is useful for removing event markers corresponding to particular types of events or, for example, rows that have n/a in a particular column.

Remove rows parameters

Parameters for remove_rows.

| Parameter | Type | Description |
| --- | --- | --- |
| column_name | str | The name of the column to be tested. |
| remove_values | list | A list of values to be tested for removal. |

The operation does not raise an error if a data file does not have a column named column_name or if some of the remove_values do not appear in that column.
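
Conceptually, the operation filters rows much like the following pandas sketch (not the operation’s code).

import pandas as pd

# Conceptual pandas equivalent of remove_rows (not the operation's code): keep
# only the rows whose column value is not in remove_values.
df = pd.DataFrame({"trial_type": ["go", "succesful_stop", "unsuccesful_stop", "go"]})
kept = df[~df["trial_type"].isin(["succesful_stop", "unsuccesful_stop"])]
print(kept)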

Remove rows example

The following remove_rows operation removes the rows whose trial_type column contains either succesful_stop or unsuccesful_stop.

A JSON file with a single remove_rows transformation operation.

[{   
    "operation": "remove_rows",
    "description": "Remove rows where trial_type is either succesful_stop or unsuccesful_stop.",
    "parameters": {
        "column_name": "trial_type",
        "remove_values": ["succesful_stop", "unsuccesful_stop"]
    }
}]

The results of executing the previous remove_rows operation on the sample remodel event file are:

The results of executing the previous remove_rows operation.

| onset | duration | trial_type | stop_signal_delay | response_time | response_accuracy | response_hand | sex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0776 | 0.5083 | go | n/a | 0.565 | correct | right | female |
| 9.5856 | 0.5084 | go | n/a | 0.45 | correct | right | female |
| 21.6103 | 0.5083 | go | n/a | 0.443 | correct | left | male |

After removing rows with trial_type equal to succesful_stop or unsuccesful_stop only the three go trials remain.

Rename columns

The rename_columns operation uses a dictionary to map old column names into new ones.

Rename columns parameters

Parameters for rename_columns.

| Parameter | Type | Description |
| --- | --- | --- |
| column_mapping | dict | The keys are the old column names and the values are the new names. |
| ignore_missing | bool | If false, a KeyError is raised if a dictionary key is not a column name. |

If ignore_missing is false, a KeyError is raised if a column specified in the mapping does not correspond to a column name in the data file.
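
A conceptual pandas equivalent (not the operation’s code) is shown below; the errors argument plays the role of ignore_missing.

import pandas as pd

# Conceptual pandas equivalent of rename_columns (not the operation's code):
# errors="raise" corresponds to ignore_missing=false, errors="ignore" to true.
df = pd.DataFrame({"stop_signal_delay": [0.2], "response_hand": ["right"]})
renamed = df.rename(columns={"stop_signal_delay": "stop_delay",
                             "response_hand": "hand_used"}, errors="raise")
print(renamed.columns.tolist())  # ['stop_delay', 'hand_used']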

Rename columns example

The following example renames the stop_signal_delay column to be stop_delay and the response_hand to be hand_used.

A JSON file with a single rename_columns transformation operation.

[{   
    "operation": "rename_columns",
    "description": "Rename columns to be more descriptive.",
    "parameters": {
        "column_mapping": {
            "stop_signal_delay": "stop_delay",
            "response_hand": "hand_used"
        },
        "ignore_missing": true
    }
}]

The results of executing the previous rename_columns operation on the sample remodel event file are:

Results of the rename_columns operation.

| onset | duration | trial_type | stop_delay | response_time | response_accuracy | hand_used | sex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0776 | 0.5083 | go | n/a | 0.565 | correct | right | female |
| 5.5774 | 0.5083 | unsuccesful_stop | 0.2 | 0.49 | correct | right | female |
| 9.5856 | 0.5084 | go | n/a | 0.45 | correct | right | female |
| 13.5939 | 0.5083 | succesful_stop | 0.2 | n/a | n/a | n/a | female |
| 17.1021 | 0.5083 | unsuccesful_stop | 0.25 | 0.633 | correct | left | male |
| 21.6103 | 0.5083 | go | n/a | 0.443 | correct | left | male |

Reorder columns

The reorder_columns operation reorders the indicated columns in the specified order. This operation is often used to place the most important columns near the beginning of the file for readability or to assure that all the data files in a dataset have the same column order. Additional parameters control how non-specified columns are treated.

Reorder columns parameters

Parameters for the reorder_columns operation.

| Parameter | Type | Description |
| --- | --- | --- |
| column_order | list | A list of columns in the order they should appear in the data. |
| ignore_missing | bool | Controls handling of column names in the reorder list that aren’t in the data. |
| keep_others | bool | Controls handling of columns not in the reorder list. |

If ignore_missing is true and items in the reorder list do not exist in the file, the missing columns are ignored. On the other hand, if ignore_missing is false, a column name in the reorder list that is missing from the data raises a ValueError.

The keep_others parameter controls whether columns in the data that do not appear in the column_order list are dropped (keep_others is false) or put at the end in the relative order that they appear in the file (keep_others is true).

BIDS event files are required to have onset and duration as the first and second columns, respectively.
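
The effect of keep_others can be sketched as follows (conceptual pandas code, not the operation’s implementation).

import pandas as pd

# Conceptual sketch of reorder_columns (not the operation's code): requested
# columns come first; with keep_others=true the rest follow in original order.
df = pd.DataFrame(columns=["trial_type", "onset", "duration", "response_time", "sex"])
column_order = ["onset", "duration", "response_time", "trial_type"]
keep_others = True
if keep_others:
    ordered = column_order + [c for c in df.columns if c not in column_order]
else:
    ordered = column_order
print(df[ordered].columns.tolist())  # ['onset', 'duration', 'response_time', 'trial_type', 'sex']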

Reorder columns example

The reorder_columns operation in the following example specifies that the first four columns of the dataset should be: onset, duration, response_time, and trial_type. Since keep_others is false, these will be the only columns retained.

A JSON file with a single reorder_columns transformation operation.

[{   
    "operation": "reorder_columns",
    "description": "Reorder columns.",
    "parameters": {
        "column_order": ["onset", "duration", "response_time",  "trial_type"],
        "ignore_missing": true,
        "keep_others": false
    }
}]

The results of executing the previous reorder_columns transformation on the sample remodel event file are:

Results of reorder_columns.

| onset | duration | response_time | trial_type |
| --- | --- | --- | --- |
| 0.0776 | 0.5083 | 0.565 | go |
| 5.5774 | 0.5083 | 0.49 | unsuccesful_stop |
| 9.5856 | 0.5084 | 0.45 | go |
| 13.5939 | 0.5083 | n/a | succesful_stop |
| 17.1021 | 0.5083 | 0.633 | unsuccesful_stop |
| 21.6103 | 0.5083 | 0.443 | go |

Split rows

The split_rows operation is often used to convert event files from trial-level encoding to event-level encoding. This operation is meant only for tabular files that have onset and duration columns.

In trial-level encoding, all the events in a single trial (usually some variation of the cue-stimulus-response-feedback-ready sequence) are represented by a single row in the data file. Often, the onset corresponds to the presentation of the stimulus, and the other events are not reported or are implicitly reported.

In event-level encoding, each row represents the temporal marker for a single event. In this case a trial consists of a sequence of multiple events.

Split rows parameters

Parameters for the split_rows operation.

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| anchor_column | str | The name of the column that will be used for split_rows codes. |
| new_events | dict | Dictionary whose keys are the codes to be inserted as new events in the anchor_column and whose values are dictionaries with keys onset_source, duration, and copy_columns (optional). |
| remove_parent_event | bool | If true, remove parent event. |

The split_rows operation requires an anchor_column, which could be an existing column or a new column to be appended to the data. The purpose of the anchor_column is to hold the codes for the new events.

The new_events dictionary has the new events to be created. The keys are the new event codes to be inserted into the anchor_column. The values in new_events are themselves dictionaries. Each of these dictionaries has three keys:

  • onset_source is a list of items to be added to the onset of the event row being split to produce the onset column value for the new event. These items can be any combination of numerical values and column names.

  • duration is a list of numerical values and/or column names whose values are to be added to compute the duration column value for the new event.

  • copy_columns is a list of column names whose values should be copied into each new event. Unlisted columns are filled with n/a.

The split_rows operation sorts the split rows by the onset column and raises a TypeError if the onset and duration are improperly defined. The onset column is converted to numeric values as part of the splitting process.
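The following sketch illustrates how the onset and duration of an inserted event are computed under the behavior described above (an illustration only, not the library code): column names in onset_source or duration are resolved against the parent row, numbers are used as-is, and the results are summed. Parent rows whose onset_source column is n/a, such as the response for a successful stop, produce no new event in the example results below.

import pandas as pd

def resolve(items, parent_row):
    # Sum numeric items and values drawn from named columns of the parent row.
    return sum(float(parent_row[item]) if isinstance(item, str) else float(item)
               for item in items)

# First parent row of the sample remodel event file (values taken from the example).
parent = pd.Series({"onset": 0.0776, "duration": 0.5083, "response_time": 0.565})

new_onset = float(parent["onset"]) + resolve(["response_time"], parent)   # 0.6426
new_duration = resolve([0], parent)                                       # 0.0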

Split rows example

The split_rows operation in the following example specifies that new rows should be added to encode the response and stop signal. The anchor column is trial_type.

A JSON file with a single split_rows transformation operation.

[{
  "operation": "split_rows",
  "description": "add response events to the trials.",
        "parameters": {
            "anchor_column": "trial_type",
            "new_events": {
                "response": {
                    "onset_source": ["response_time"],
                    "duration": [0],
                    "copy_columns": ["response_accuracy", "response_hand", "sex", "trial_number"]
                },
                "stop_signal": {
                    "onset_source": ["stop_signal_delay"],
                    "duration": [0.5],
                    "copy_columns": ["trial_number"]
                }
            },	
            "remove_parent_event": false
        }
    }]

The results of executing this split_rows operation on the sample remodel event file are:

Results of the previous split_rows operation.

| onset | duration | trial_type | stop_signal_delay | response_time | response_accuracy | response_hand | sex |
| ----- | -------- | ---------- | ----------------- | ------------- | ----------------- | ------------- | --- |
| 0.0776 | 0.5083 | go | n/a | 0.565 | correct | right | female |
| 0.6426 | 0 | response | n/a | n/a | correct | right | female |
| 5.5774 | 0.5083 | unsuccesful_stop | 0.2 | 0.49 | correct | right | female |
| 5.7774 | 0.5 | stop_signal | n/a | n/a | n/a | n/a | n/a |
| 6.0674 | 0 | response | n/a | n/a | correct | right | female |
| 9.5856 | 0.5084 | go | n/a | 0.45 | correct | right | female |
| 10.0356 | 0 | response | n/a | n/a | correct | right | female |
| 13.5939 | 0.5083 | succesful_stop | 0.2 | n/a | n/a | n/a | female |
| 13.7939 | 0.5 | stop_signal | n/a | n/a | n/a | n/a | n/a |
| 17.1021 | 0.5083 | unsuccesful_stop | 0.25 | 0.633 | correct | left | male |
| 17.3521 | 0.5 | stop_signal | n/a | n/a | n/a | n/a | n/a |
| 17.7351 | 0 | response | n/a | n/a | correct | left | male |
| 21.6103 | 0.5083 | go | n/a | 0.443 | correct | left | male |
| 22.0533 | 0 | response | n/a | n/a | correct | left | male |

In a full processing example, it might make sense to rename trial_type to be event_type and to delete the response_time and the stop_signal_delay columns, since these items have been unfolded into separate events. This could be accomplished in subsequent clean-up operations.

Remodel summarizations

Summarizations differ from transformations in two respects: they do not modify the input data file, and they keep information about the results from each file that has been processed. Summarization operations may be used at several points in the operation list as checkpoints during debugging as well as for their more typical informational uses.

All summary operations have two required parameters: summary_name and summary_filename.

The summary_name is the unique key used to identify the particular incarnation of this summary in the dispatcher. Care should be taken to make sure that the summary_name is unique within a given JSON remodeling file if the same summary operation is used more than once within the file (e.g. for before and after summary information).

The summary_filename should also be unique and is used for saving the summary upon request. When the remodeler is applied to full datasets rather than single files, the summaries are saved in the derivatives/remodel/summaries directory under the dataset root. A time stamp and file extension are appended to the summary_filename when the summary is saved.
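For concreteness, the following sketch shows roughly where a saved summary lands when remodeling a full dataset. The dataset root and the exact time stamp format are assumptions chosen only for illustration.

import os
from datetime import datetime

data_root = "/data/ds002790"                      # hypothetical dataset root
summary_filename = "AOMIC_column_names"           # value from the remodel file
time_stamp = datetime.now().strftime("_%Y_%m_%d_T_%H_%M_%S")   # format is an assumption
summary_path = os.path.join(data_root, "derivatives", "remodel", "summaries",
                            summary_filename + time_stamp + ".txt")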

Summarize column names

The summarize_column_names operation tracks the unique column name patterns found in data files across the dataset and which files have these column names. This summary is useful for determining whether there are any non-conforming data files.

Often event files associated with different tasks have different column names, and this summary can be used to verify that the files corresponding to the same task have the same column names.

A more problematic issue is when some event files for the same task have reordered column names or use different column names.

Summarize column names parameters

The summarize_column_names operation has no operation-specific parameters; only the summary_name and summary_filename required of all summaries (plus the optional append_timecode) are needed to specify the operation.

Parameters for the summarize_column_names operation.

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| summary_name | str | A unique name used to identify this summary. |
| summary_filename | str | A unique file basename to use for saving this summary. |
| append_timecode | bool | (Optional: Default false) If true, append a time code to filename. |

Summarize column names example

The following example remodeling file produces a summary which, when saved, will appear with the file name AOMIC_column_names_xxx.txt or AOMIC_column_names_xxx.json, where xxx is a timestamp.

A JSON file with a single summarize_column_names summarization operation.

[{
    "operation": "summarize_column_names",
    "description": "Summarize column names.",
    "parameters": {
        "summary_name": "AOMIC_column_names",
        "summary_filename": "AOMIC_column_names"
    }    
}]

When this operation is applied to the sample remodel event file, the following text summary is produced.

Result of applying summarize_column_names to the sample remodel file.

Summary name: AOMIC_column_names
Summary type: column_names
Summary filename: AOMIC_column_names

Summary details:

Dataset: Number of files=1
    Columns: ['onset', 'duration', 'trial_type', 'stop_signal_delay', 'response_time', 'response_accuracy', 'response_hand', 'sex']
        sub-0013_task-stopsignal_acq-seq_events.tsv

Individual files:

sub-0013_task-stopsignal_acq-seq_events.tsv: 
   ['onset', 'duration', 'trial_type', 'stop_signal_delay', 'response_time', 'response_accuracy', 'response_hand', 'sex']
		

Since we are only summarizing one event file, there is only one unique pattern, corresponding to the columns: onset, duration, trial_type, stop_signal_delay, response_time, response_accuracy, response_hand, and sex.

When the dataset has multiple column name patterns, the summary lists each unique pattern separately along with the names of the data files that have that pattern.

The JSON version of the summary is useful for programmatic manipulation, while the text version shown above is more readable.

Summarize column values

The summarize column values operation provides a summary of the number of times various column values appear in event files across the dataset.

Summarize column values parameters

The following table lists the parameters required for using the summary.

Parameters for the summarize_column_values operation.

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| summary_name | str | A unique name used to identify this summary. |
| summary_filename | str | A unique file basename to use for saving this summary. |
| append_timecode | bool | (Optional: Default false) If true, append a time code to filename. |
| max_categorical | int | (Optional: Default 50) If given, the text summary shows only the top max_categorical values; otherwise all categorical values are displayed. |
| skip_columns | list | (Optional) A list of column names to omit from the summary. |
| value_columns | list | (Optional) A list of columns for which individual unique values are not listed. |
| values_per_line | int | (Optional: Default 5) Number of values (with counts) displayed per line in the text summary. |

In addition to the standard parameters, summary_name and summary_filename required of all summaries, the summarize_column_values operation requires two additional lists to be supplied. The skip_columns list specifies the names of columns to skip entirely in the summary. Typically, the onset, duration, and sample columns are skipped, since they have unique values for each row and their values have limited information.

The summarize_column_values is mainly meant for creating summary information about columns containing a finite number of distinct values. Columns that contain numeric information will usually have distinct entries for each row in a tabular file and are not amenable to such summarization. These columns could be specified as skip_columns, but another option is to designate them as value_columns. The value_columns are reported in the summary, but their distinct values are not reported individually.
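The distinction can be illustrated with plain pandas (a conceptual sketch, not the summary implementation): categorical columns get per-value event counts, while value columns are only counted.

import pandas as pd

df = pd.read_csv("aomic_sub-0013_excerpt_events.tsv", sep="\t")

# Categorical column: count the events for each distinct value.
print(df["trial_type"].value_counts(dropna=False))

# Value column: report only the number of events (rows) rather than
# listing each distinct numeric value.
print(len(df["response_time"]))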

For datasets that include multiple tasks, the event values for each task may be distinct. The summarize_column_values operation does not separate by task, but expects the calling program to filter the files by task as desired. The run_remodel program supports selecting files corresponding to a particular task.

Two additional optional parameters are available for specifying aspects of the text summary output. The max_categorical optional parameter specifies how many unique values should be displayed for each column. The values_per_line controls how many categorical column values (with counts) are displayed on each line of the output. By default, 5 values are displayed.

Summarize column values example

The following example shows the JSON for including this operation in a remodeling file.

A JSON file with a single summarize_column_values summarization operation.

[{
   "operation": "summarize_column_values",
   "description": "Summarize the column values in an excerpt.",
   "parameters": {
       "summary_name": "AOMIC_column_values",
       "summary_filename": "AOMIC_column_values",
       "skip_columns": ["onset", "duration"],
       "value_columns": ["response_time", "stop_signal_delay"]
   }
}]

A text format summary of the results of executing this operation on the sample remodel event file is shown in the following example.

Sample summarize_column_values operation results in text format.

Summary name: AOMIC_column_values
Summary type: column_values
Summary filename: AOMIC_column_values

Overall summary:
Dataset: Total events=6 Total files=1
   Categorical column values[Events, Files]:
      response_accuracy:
         correct[5, 1] n/a[1, 1]
      response_hand:
         left[2, 1] n/a[1, 1] right[3, 1]
      sex:
         female[4, 1] male[2, 1]
      trial_type:
         go[3, 1] succesful_stop[1, 1] unsuccesful_stop[2, 1]
   Value columns[Events, Files]:
      response_time[6, 1]
      stop_signal_delay[6, 1]

Individual files:

sub-0013_task-stopsignal_acq-seq_events.tsv:
Total events=6
      Categorical column values[Events, Files]:
         response_accuracy:
            correct[5, 1] n/a[1, 1]
         response_hand:
            left[2, 1] n/a[1, 1] right[3, 1]
         sex:
            female[4, 1] male[2, 1]
         trial_type:
            go[3, 1] succesful_stop[1, 1] unsuccesful_stop[2, 1]
      Value columns[Events, Files]:
         response_time[6, 1]
         stop_signal_delay[6, 1]

Because the sample remodel event file only has 6 events, we expect that no value will be represented in more than 6 events. The column names corresponding to value columns just have the event counts in them.

Because this command was executed with the -i option of run_remodel, results from the individual data files are shown after the overall summary. The individual results are similar to the overall summary because only one data file was processed.

For a more extensive example see the text and JSON format summaries of the sample dataset ds003645s_hed using the summarize_columns_rmdl.json remodeling file.

Summarize definitions

The summarize definitions operation provides a summary of the Def-expand tags found across the dataset, noting any ambiguous or erroneous ones. If working on a BIDS dataset, it initializes with the known definitions from the sidecar and reports any deviations from the known definitions as errors.

Summarize definitions parameters

NOTE: This summary is still under development. The following table lists the parameters required for using the summary.

Parameters for the summarize_definitions operation.

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| summary_name | str | A unique name used to identify this summary. |
| summary_filename | str | A unique file basename to use for saving this summary. |
| append_timecode | bool | (Optional: Default false) If true, append a time code to filename. |

The summarize_definitions operation is mainly meant for verifying consistency among unknown Def-expand tags. This situation arises when you have an assembled dataset but no longer have the definitions stored (or never created them in the first place).

Summarize definitions example

The following example shows the JSON for including this operation in a remodeling file.

A JSON file with a single summarize_definitions summarization operation.

[{
   "operation": "summarize_definitions",
   "description": "Summarize the definitions used in this dataset.",
   "parameters": {
       "summary_name": "HED_column_definition_summary",
       "summary_filename": "HED_column_definition_summary"
   }
}]

A text format summary of the results of executing this operation on the sub-003_task-FacePerception_run-3_events.tsv file of the eeg_ds_003645s_hed_column dataset is shown in the following example.

Sample summarize_definitions operation results in text format.

Summary name: HED_column_definition_summary
Summary type: definitions
Summary filename: HED_column_definition_summary

Overall summary:
   Known Definitions: 17 items
      cross-only: 2 items
         description: A white fixation cross on a black background in the center of the screen.
         contents: (Visual-presentation,(Background-view,Black),(Foreground-view,(Center-of,Computer-screen),(Cross,White)))
      face-image: 2 items
         description: A happy or neutral face in frontal or three-quarters frontal pose with long hair cropped presented as an achromatic foreground image on a black background with a white fixation cross superposed.
         contents: (Visual-presentation,(Background-view,Black),(Foreground-view,((Center-of,Computer-screen),(Cross,White)),(Grayscale,(Face,Hair,Image))))
      circle-only: 2 items
         description: A white circle on a black background in the center of the screen.
         contents: (Visual-presentation,(Background-view,Black),(Foreground-view,((Center-of,Computer-screen),(Circle,White))))
      press-left-finger: 2 items
         description: The participant presses a key with the left index finger to indicate a face symmetry judgment.
         contents: ((Index-finger,(Experiment-participant,Left-side-of)),(Keyboard-key,Press))
      press-right-finger: 2 items
         description: The participant presses a key with the right index finger to indicate a face symmetry evaluation.
         contents: ((Index-finger,(Experiment-participant,Right-side-of)),(Keyboard-key,Press))
      famous-face-cond: 2 items
         description: A face that should be recognized by the participants
         contents: (Condition-variable/Face-type,(Image,(Face,Famous)))
      unfamiliar-face-cond: 2 items
         description: A face that should not be recognized by the participants.
         contents: (Condition-variable/Face-type,(Image,(Face,Unfamiliar)))
      scrambled-face-cond: 2 items
         description: A scrambled face image generated by taking face 2D FFT.
         contents: (Condition-variable/Face-type,(Image,(Disordered,Face)))
      first-show-cond: 2 items
         description: Factor level indicating the first display of this face.
         contents: ((Condition-variable/Repetition-type,Item-interval/0,(Face,Item-count/1)))
      immediate-repeat-cond: 2 items
         description: Factor level indicating this face was the same as previous one.
         contents: ((Condition-variable/Repetition-type,Item-interval/1,(Face,Item-count/2)))
      delayed-repeat-cond: 2 items
         description: Factor level indicating face was seen 5 to 15 trials ago.
         contents: (Condition-variable/Repetition-type,(Face,Item-count/2),(Item-interval,(Greater-than-or-equal-to,Item-interval/5)))
      left-sym-cond: 2 items
         description: Left index finger key press indicates a face with above average symmetry.
         contents: (Condition-variable/Key-assignment,((Asymmetrical,Behavioral-evidence),(Index-finger,(Experiment-participant,Right-side-of))),((Behavioral-evidence,Symmetrical),(Index-finger,(Experiment-participant,Left-side-of))))
      right-sym-cond: 2 items
         description: Right index finger key press indicates a face with above average symmetry.
         contents: (Condition-variable/Key-assignment,((Asymmetrical,Behavioral-evidence),(Index-finger,(Experiment-participant,Left-side-of))),((Behavioral-evidence,Symmetrical),(Index-finger,(Experiment-participant,Right-side-of))))
      face-symmetry-evaluation-task: 2 items
         description: Evaluate degree of image symmetry and respond with key press evaluation.
         contents: (Experiment-participant,Task,(Discriminate,(Face,Symmetrical)),(Face,See),(Keyboard-key,Press))
      blink-inhibition-task: 2 items
         description: Do not blink while the face image is displayed.
         contents: (Experiment-participant,Inhibit-blinks,Task)
      fixation-task: 2 items
         description: Fixate on the cross at the screen center.
         contents: (Experiment-participant,Task,(Cross,Fixate))
      initialize-recording: 2 items
         description: 
         contents: (Recording)
   Ambiguous Definitions: 0 items

   Errors: 0 items

Since this file didn’t have any ambiguous or incorrect Def-expand groups, those sections are empty. Ambiguous definitions are those that take a placeholder but do not contain enough information to determine which tag the placeholder applies to. Erroneous definitions are those with conflicting expanded forms.

Currently, summaries are not generated for individual files, but this is likely to change in the future.

Below is a simple example showing the format when erroneous or ambiguous definitions are found.

Sample input for summarize_definitions operation documenting ambiguous/erroneous definitions.

((Def-expand/Initialize-recording,(Recording)),Onset)
((Def-expand/Initialize-recording,(Recording, Event)),Onset)
(Def-expand/Specify-age/1,(Age/1, Item-count/1))

Sample summarize_definitions operation error results in text format.

Summary name: HED_column_definition_summary
Summary type: definitions
Summary filename: HED_column_definition_summary

Overall summary:
   Known Definitions: 1 items
      initialize-recording: 2 items
         description: 
         contents: (Recording)
   Ambiguous Definitions: 1 items
      specify-age/#: (Age/#,Item-count/#)
   Errors: 1 items
      initialize-recording:
         (Event,Recording)

It is assumed that the first definition encountered is the correct one, unless that first one is ambiguous. Thus, the summary finds (Def-expand/Initialize-recording,(Recording)) and considers it valid before encountering (Def-expand/Initialize-recording,(Recording, Event)), which is then deemed an error.

Summarize HED tags

The summarize_hed_tags operation extracts a summary of the HED tags present in the annotations of a dataset. This summary operation assumes that the structure in question is suitably annotated with HED (Hierarchical Event Descriptors). You must provide a HED schema version. If the data has annotations in a JSON sidecar, you must also provide its path.

Summarize HED tags parameters

The summarize_hed_tags operation has one required operation-specific parameter (tags) in addition to the standard summary_name and summary_filename parameters.

Parameters for the summarize_hed_tags operation.

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| summary_name | str | A unique name used to identify this summary. |
| summary_filename | str | A unique file basename to use for saving this summary. |
| tags | dict | Dictionary with category title keys and tags in that category as values. |
| append_timecode | bool | (Optional: Default false) If true, append a time code to filename. |
| include_context | bool | (Optional: Default true) If true, expand the event context to account for onsets and offsets. |
| remove_types | list | (Optional) A list of types such as Condition-variable and Task to remove. |
| replace_defs | bool | (Optional: Default true) If true, the Def tags are replaced with the contents of the definition (no Def or Def-expand). |
| word_cloud | dict | (Optional) If present, the operation produces a word cloud image in addition to the summaries. |

The tags dictionary has keys that specify how the user wishes the tags to be categorized for display. Note that these keys are titles designating display categories, not HED tags.

The tags dictionary values are lists of actual HED tags (or their children) that should be listed under the respective display categories.

If the optional parameter include_context is true, the counts include tags contributing to the event context in events intermediate between onsets and offsets.

If the optional parameter replace_defs is true, the tag counts include tags contributed by contents of the definitions.

If the word_cloud parameter is provided but its value is empty, the default word cloud settings are used. The following table lists the optional parameters used to control the appearance of the word cloud image.

Optional keys in the word cloud dictionary value.

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| background_color | str | The matplotlib name of the background color (default “black”). |
| contour_color | str | The matplotlib name of the contour color if a mask is provided. |
| contour_width | float | Width of the contour if a mask is provided (default 3). |
| font_path | str | The path of the system font to use in place of the default font. |
| height | int | Height in pixels of the image (default 300). |
| mask_path | str | The path of the mask image to use if use_mask is true and an image other than the brain is needed. |
| max_font_size | float | The maximum font size to use in the image (default 15). |
| min_font_size | float | The minimum font size to use in the image (default 8). |
| prefer_horizontal | float | Fraction of horizontal words in the image (default 0.75). |
| scale_adjustment | float | Constant to add to the log10 count transformation (default 7). |
| use_mask | bool | If true, a mask image is used to provide a contour around the words. |
| width | int | Width in pixels of the image (default 400). |
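As an illustration, a word_cloud value in the remodel file might look like the following. It is shown here as a Python dictionary mirroring the JSON, and the specific values are chosen only for illustration.

word_cloud_settings = {
    "height": 300,                 # image height in pixels
    "width": 400,                  # image width in pixels
    "prefer_horizontal": 0.75,     # fraction of words laid out horizontally
    "background_color": "black",
    "use_mask": False              # no contour mask around the words
}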

Summarize HED tags example

The following remodeling command specifies that the tag counts should be grouped under the titles: Sensory events, Agent actions, and Objects. Any leftover tags will appear under the title “Other tags”.

A JSON file with a single summarize_hed_tags summarization operation.

[{
   "operation": "summarize_hed_tags",
   "description": "Summarize the HED tags in the dataset.",
   "parameters": {
       "summary_name": "summarize_hed_tags",
       "summary_filename": "summarize_hed_tags",
       "tags": {
           "Sensory events": ["Sensory-event", "Sensory-presentation",
                              "Task-stimulus-role", "Experimental-stimulus"],
           "Agent actions": ["Agent-action", "Agent", "Action", "Agent-task-role",
                             "Task-action-type", "Participant-response"],
           "Objects": ["Item"]
         }
     }
}]

The results of executing this operation on the sample remodel event file are shown below.

Text summary of summarize_hed_tags operation on the sample remodel file.

Summary name: summarize_hed_tags
Summary type: hed_tag_summary
Summary filename: summarize_hed_tags

Overall summary:
Dataset: Total events=6 Total files=1
	Main tags[events,files]:
		Sensory events:
			Sensory-presentation[6,1] Visual-presentation[6,1] Auditory-presentation[3,1]
		Agent actions:
			Incorrect-action[2,1] Correct-action[1,1]
		Objects:
			Image[6,1]
	Other tags[events,files]:
		Label[6,1] Def[6,1] Delay[3,1]

Individual files:

aomic_sub-0013_excerpt_events.tsv:    
Total events=6 
   Main tags[events,files]:
       Sensory events:
          Sensory-presentation[6,1] Visual-presentation[6,1] Auditory-presentation[3,1]
       Agent actions:
          Incorrect-action[2,1] Correct-action[1,1]
       Objects:
          Image[6,1]
   Other tags[events,files]:
       Label[6,1] Def[6,1] Delay[3,1]

Because the HED tag Task-action-type was specified in the “Agent actions” category, Incorrect-action and Correct-action, which are children of Task-action-type in the HED schema, appear with counts in the list under this category.

The sample events file had 6 events, including 1 correct action and 2 incorrect actions. Since only one file was processed, the information for Dataset was similar to that presented under Individual files.

For a more extensive example, see the text and JSON format summaries of the sample dataset ds003645s_hed using the summarize_hed_tags_rmdl.json remodeling file.

Summarize HED type

The summarize_hed_type operation is designed to extract experimental design matrices or other experimental structure. This summary operation assumes that the structure in question is suitably annotated with HED (Hierarchical Event Descriptors). The HED conditions and design matrices guide explains how this works.

Summarize HED type parameters

The summarize_hed_type operation provides detailed information about a specified tag, usually Condition-variable or Task. This summary provides useful information about experimental design.

Parameters for the summarize_hed_type operation.

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| summary_name | str | A unique name used to identify this summary. |
| summary_filename | str | A unique file basename to use for saving this summary. |
| type_tag | str | Tag to produce a summary for (most often condition-variable). |
| append_timecode | bool | (Optional: Default false) If true, append a time code to filename. |

In addition to the two standard parameters (summary_name and summary_filename), the type_tag parameter is required. Only one tag can be given, so you must provide a separate operation in the remodel file for each type tag.

Summarize HED type example

A JSON file with a single summarize_hed_type summarization operation.

[{
   "operation": "summarize_hed_type",
   "description": "Summarize column names.",
   "parameters": {
       "summary_name": "AOMIC_condition_variables",
       "summary_filename": "AOMIC_condition_variables",
       "type_tag": "condition-variable"
   }
}]

The results of executing this operation on the sample remodel event file are shown below.

Text summary of summarize_hed_types operation on the sample remodel file.

Summary name: AOMIC_condition_variables
Summary type: hed_type_summary
Summary filename: AOMIC_condition_variables

Overall summary:

Dataset: Type=condition-variable Type values=1 Total events=6 Total files=1
   image-sex: 2 levels in 6 event(s) out of 6 total events in 1 file(s)
       female-image-cond [4,1]: ['Female', 'Image', 'Face']
       male-image-cond [2,1]: ['Male', 'Image', 'Face']

Individual files:

aomic_sub-0013_excerpt_events.tsv:
Type=condition-variable Total events=6 
      image-sex: 2 levels in 6 events
         female-image-cond [4 events, 1 files]: 
            Tags: ['Female', 'Image', 'Face']
         male-image-cond [2 events, 1 files]: 
            Tags: ['Male', 'Image', 'Face']

Because summarize_hed_type is a HED operation, a HED schema version is required and a JSON sidecar is also usually needed. This summary was produced by using hed_version="8.1.0" when creating the dispatcher and using the sample remodel sidecar file in the do_op. The sidecar provides the annotations that use the condition-variable tag in the summary.
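A rough Python sketch of this setup follows. The module path and exact keyword names are assumptions that may differ across hedtools versions, and the sidecar file name is hypothetical.

import json
from hed.tools.remodeling.dispatcher import Dispatcher   # module path is an assumption

# Load the remodel operations and create a dispatcher with a HED schema version,
# since HED operations such as summarize_hed_type need a schema.
with open("summarize_hed_types_rmdl.json", "r") as fp:
    operations = json.load(fp)

dispatcher = Dispatcher(operations, hed_versions=["8.1.0"])    # keyword name is an assumption
df = dispatcher.run_operations("aomic_sub-0013_excerpt_events.tsv",
                               sidecar="aomic_sub-0013_excerpt.json")   # hypothetical sidecar name
summary = dispatcher.summary_dict.get("AOMIC_condition_variables")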

For a more extensive example, see the text and JSON format summaries of the sample dataset ds003645s_hed using the summarize_hed_types_rmdl.json remodeling file.

Summarize HED validation

The summarize_hed_validation operation runs the HED validator on the requested data and produces a summary of the errors. See the HED validation guide for available methods of running the HED validator.

Summarize HED validation parameters

In addition to the required summary_name and summary_filename parameters, the summarize_hed_validation operation has a boolean parameter check_for_warnings (default false). If check_for_warnings is false, the summary will not report warnings.

Parameters for the summarize_hed_validation operation.

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| summary_name | str | A unique name used to identify this summary. |
| summary_filename | str | A unique file basename to use for saving this summary. |
| append_timecode | bool | (Optional: Default false) If true, append a time code to filename. |
| check_for_warnings | bool | (Optional: Default false) If true, warnings are reported in addition to errors. |

The summarize_hed_validation operation is a HED operation, so the calling program must provide a HED schema version and usually a JSON sidecar containing the HED annotations.

The validation process takes place in two stages: the JSON sidecar is validated first, because a single error in the JSON sidecar can generate an error message for every line in the corresponding data file.

If the JSON sidecar has errors (warnings don’t count), the validation process is terminated without validation of the data file and assembled HED annotations.

If the JSON sidecar does not have errors, the validator assembles the annotations for each line in the data files and validates the assembled HED annotation. Data file-wide consistency, such as matched onsets and offsets, is also checked.

Summarize HED validation example

A JSON file with a single summarize_hed_validation summarization operation.

[{
   "operation": "summarize_hed_validation",
   "description": "Summarize validation errors in the sample dataset.",
   "parameters": {
       "summary_name": "AOMIC_sample_validation",
       "summary_filename": "AOMIC_sample_validation",
       "check_for_warnings": true
   }
}]

To demonstrate the output of the validation operation, we modified the first row of the sample remodel event file so that the trial_type column contained the value baloney rather than go. This modification generates a warning because the meaning of baloney is not defined in the sample remodel sidecar file. The results of executing the example operation with the modified file are shown in the following example.

Text summary of summarize_hed_validation operation on a modified sample data file.

Summary name: AOMIC_sample_validation
Summary type: hed_validation
Summary filename: AOMIC_sample_validation

Summary details:

Dataset: [1 sidecar files, 1 event files]
   task-stopsignal_acq-seq_events.json: 0 issues
   sub-0013_task-stopsignal_acq-seq_events.tsv: 6 issues

Individual files:

   sub-0013_task-stopsignal_acq-seq_events.tsv: 1 sidecar files
      task-stopsignal_acq-seq_events.json has no issues
      sub-0013_task-stopsignal_acq-seq_events.tsv issues:
            HED_UNKNOWN_COLUMN: WARNING: Column named 'onset' found in file, but not specified as a tag column or identified in sidecars.
            HED_UNKNOWN_COLUMN: WARNING: Column named 'duration' found in file, but not specified as a tag column or identified in sidecars.
            HED_UNKNOWN_COLUMN: WARNING: Column named 'response_time' found in file, but not specified as a tag column or identified in sidecars.
            HED_UNKNOWN_COLUMN: WARNING: Column named 'response_accuracy' found in file, but not specified as a tag column or identified in sidecars.
            HED_UNKNOWN_COLUMN: WARNING: Column named 'response_hand' found in file, but not specified as a tag column or identified in sidecars.
            HED_SIDECAR_KEY_MISSING[row=0,column=2]: WARNING: Category key 'baloney' does not exist in column.  Valid keys are: ['succesful_stop', 'unsuccesful_stop', 'go']

This summary was produced using HED schema version hed_version="8.1.0" when creating the dispatcher and using the sample remodel sidecar file in the do_op.

Summarize sidecar from events

The summarize sidecar from events operation generates a sidecar template from the event files in the dataset.

Summarize sidecar from events parameters

The following table lists the parameters required for using the summary.

Parameters for the summarize_sidecar_from_events operation.

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| summary_name | str | A unique name used to identify this summary. |
| summary_filename | str | A unique file basename to use for saving this summary. |
| skip_columns | list | A list of column names to omit from the sidecar. |
| value_columns | list | A list of columns to treat as value columns in the sidecar. |
| append_timecode | bool | (Optional: Default false) If true, append a time code to filename. |

The standard summary parameters, summary_name and summary_filename, are required. The summary_name is the unique key used to identify the particular incarnation of this summary in the dispatcher. Since a remodel file may use a given operation multiple times, care should be taken to make sure that the summary_name is unique.

The summary_filename should also be unique and is used for saving the summary upon request. When the remodeler is applied to full datasets rather than single files, the summaries are saved in the derivatives/remodel/summaries directory under the dataset root. A time stamp and file extension are appended to the summary_filename when the summary is saved.

In addition to the standard parameters, summary_name and summary_filename required of all summaries, the summarize_sidecar_from_events operation requires two additional lists to be supplied. The skip_columns list specifies the names of columns to skip entirely in generating the sidecar template. The value_columns list specifies the names of columns to treat as value columns when generating the sidecar template.

Summarize sidecar from events example

The following example shows the JSON for including this operation in a remodeling file.

A JSON file with a single summarize_sidecar_from_events summarization operation.

[{
    "operation": "summarize_sidecar_from_events",
    "description": "Generate a sidecar from the excerpted events file.",
    "parameters": {
        "summary_name": "AOMIC_generate_sidecar",
        "summary_filename": "AOMIC_generate_sidecar",
        "skip_columns": ["onset", "duration"],
        "value_columns": ["response_time", "stop_signal_delay"]
    }
}]
  

The results of executing this operation on the sample remodel event file are shown in the following example using the text format.

Sample summarize_sidecar_from_events operation results in text format.

Summary name: AOMIC_generate_sidecar
Summary type: events_to_sidecar
Summary filename: AOMIC_generate_sidecar

Dataset: Currently no overall sidecar extraction is available

Individual files:

aomic_sub-0013_excerpt_events.tsv: Total events=6 Skip columns: ['onset', 'duration']
Sidecar:
{
    "trial_type": {
        "Description": "Description for trial_type",
        "HED": {
            "go": "(Label/trial_type, Label/go)",
            "succesful_stop": "(Label/trial_type, Label/succesful_stop)",
            "unsuccesful_stop": "(Label/trial_type, Label/unsuccesful_stop)"
        },
        "Levels": {
            "go": "Here describe column value go of column trial_type",
            "succesful_stop": "Here describe column value succesful_stop of column trial_type",
            "unsuccesful_stop": "Here describe column value unsuccesful_stop of column trial_type"
        }
    },
    "response_accuracy": {
        "Description": "Description for response_accuracy",
        "HED": {
            "correct": "(Label/response_accuracy, Label/correct)"
        },
        "Levels": {
            "correct": "Here describe column value correct of column response_accuracy"
        }
    },
    "response_hand": {
        "Description": "Description for response_hand",
        "HED": {
            "left": "(Label/response_hand, Label/left)",
            "right": "(Label/response_hand, Label/right)"
        },
        "Levels": {
            "left": "Here describe column value left of column response_hand",
            "right": "Here describe column value right of column response_hand"
        }
    },
    "sex": {
        "Description": "Description for sex",
        "HED": {
            "female": "(Label/sex, Label/female)",
            "male": "(Label/sex, Label/male)"
        },
        "Levels": {
            "female": "Here describe column value female of column sex",
            "male": "Here describe column value male of column sex"
        }
    },
    "response_time": {
        "Description": "Description for response_time",
        "HED": "(Label/response_time, Label/#)"
    },
    "stop_signal_delay": {
        "Description": "Description for stop_signal_delay",
        "HED": "(Label/stop_signal_delay, Label/#)"
    }
}

Remodel implementation

Operations are defined as classes that extend BaseOp regardless of whether they are transformations or summaries. However, summaries must also implement an additional supporting class that extends BaseSummary to hold the summary information.

In order to be executed by the remodeling functions, an operation must appear in the valid_operations dictionary.

Each operation class must have a NAME class variable, specifying the operation name (a string), and a PARAMS class variable containing a dictionary of the operation’s parameters represented as a JSON schema. The operation’s constructor extends the BaseOp class constructor by calling:

A remodel operation class must call the BaseOp constructor first.

   super().__init__(parameters)

A remodel operation class must implement the BaseOp abstract methods do_op and validate_input_data.
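Putting these pieces together, the following is a minimal sketch of what a custom transformation operation could look like. It is illustrative only: the operation name, its parameter, and the import path are assumptions, and a real operation would also have to be registered in the valid_operations dictionary.

import pandas as pd
from hed.tools.remodeling.operations.base_op import BaseOp   # import path is an assumption

class AddTrialNumberOp(BaseOp):
    """Hypothetical operation that numbers the rows of an events file."""

    NAME = "add_trial_number"
    PARAMS = {
        "type": "object",
        "properties": {
            "column_name": {"type": "string"}
        },
        "required": ["column_name"],
        "additionalProperties": False
    }

    def __init__(self, parameters):
        super().__init__(parameters)
        self.column_name = parameters["column_name"]

    def do_op(self, dispatcher, df, name, sidecar=None):
        # Transformations return a new DataFrame rather than modifying df in place.
        df_new = df.copy()
        df_new[self.column_name] = range(1, len(df_new) + 1)
        return df_new

    @staticmethod
    def validate_input_data(parameters):
        # No cross-parameter constraints beyond those in the JSON schema.
        return []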

The PARAMS dictionary

The class-wide PARAMS dictionary specifies the required and optional parameters of the operation as a JSON schema. We currently use draft-2020-12. The basic vocabulary allows specifying the type of parameters that are expected and whether a parameter is required or optional.

It is also possible to add dependencies between parameters. More information can be found in the JSON schema documentation.

At the highest level the type should always be specified as an object, since the parameters are always provided as a dictionary or JSON object. Under the properties key, the expected parameters should be listed, along with the datatype expected for each parameter. The specifications can be nested. For example, the rename_columns operation requires a parameter column_mapping, which should be a JSON object whose keys are any valid strings and whose values are also strings. This is represented in the following way:

{
    "type": "object",
    "properties": {
        "column_mapping": {
            "type": "object",
            "patternProperties": {
                ".*": {
                    "type": "string"
                }
            },
            "minProperties": 1
        },
        "ignore_missing": {
            "type": "boolean"
        }
    },
    "required": [
        "column_mapping",
        "ignore_missing"
    ],
    "additionalProperties": false
}

The PARAMS dictionaries for all available operations are read by the validator and compiled into a single JSON schema that represents the specification for remodeler files. The properties dictionary explicitly specifies the parameters that are allowed for this operation. The required list specifies which parameters must be included when calling the operation. Parameters that are not required may be omitted in the operation call. The "additionalProperties": false entry is the JSON schema way of indicating that no other parameters are allowed in the call to the operation.

A limitation of JSON schema representations is that although they can handle specific dependencies between keys in the data, they cannot validate data provided in the JSON file against other data in the same file. For example, if the requirement is a list of elements whose length should be specified by another parameter, JSON schema does not provide a vocabulary for expressing this dependency. Instead, we handle these types of dependencies in the validate_input_data method.

Operation class constructor

All the operation classes have constructors that start with a call to the superclass constructor BaseOp. The following example shows the constructor for the RemoveColumnsOp class.

The constructor for the RemoveColumnsOp class.

    def __init__(self, parameters):
        super().__init__(parameters)
        self.column_names = parameters['column_names']
        ignore_missing = parameters['ignore_missing']
        if ignore_missing:
            self.error_handling = 'ignore'
        else:
            self.error_handling = 'raise'

After the call to the base class constructor, the operation constructor assigns the operation-specific values to class properties. Validation takes place before the operation classes are initialized.

The do_op implementation

The remodeling script is meant to be executed by the Dispatcher, which keeps a compiled version of the remodeling script to execute on each tabular file to be remodeled.

The main method that must be implemented by each operation is do_op, which takes an instance of the Dispatcher class as the first parameter and a Pandas DataFrame representing the tabular file as the second parameter. A third required parameter is a name used to identify the tabular file in error messages and summaries. This name is usually the filename or the filepath from the dataset root. An additional optional argument, a sidecar containing HED annotations, only need be included for HED operations. Note that the Dispatcher is responsible for holding the appropriate version of the HED schema if HED remodeling operations are included.

The following example shows a sample implementation for do_op.

The implementation of do_op for the RemoveColumnsOp class.


    def do_op(self, dispatcher, df, name, sidecar=None):
        return df.drop(self.column_names, axis=1, errors=self.error_handling)

The do_op in this case is a wrapper for the underlying Pandas DataFrame operation for removing columns.

IMPORTANT NOTE: The do_op operation always assumes that n/a values have been replaced by numpy.NaN values in the incoming dataframe df. The Dispatcher class has a static method prep_data that does this replacement. At the end of running all the remodeling operations on a data file, the Dispatcher run_operations method replaces all of the numpy.NaN values with n/a, the value expected by BIDS. This replacement is performed by the Dispatcher static method post_proc_data.
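The following is a conceptual sketch of this n/a handling using plain pandas; it illustrates the behavior described above rather than reproducing the library code.

import numpy as np
import pandas as pd

df = pd.DataFrame({"response_time": ["0.565", "n/a"], "sex": ["female", "n/a"]})

prepped = df.replace("n/a", np.nan)   # roughly what Dispatcher.prep_data does before the operations run
restored = prepped.fillna("n/a")      # roughly what Dispatcher.post_proc_data does afterwards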

The validate_input_data implementation

This method exists to handle additional input data validation that cannot be specified in JSON schema. It is a static method called by the validator. If there is no additional validation to be done, a minimal implementation of this method should take in a dictionary with the operation parameters and return an empty list. If additional validation is required, the method should implement the validation directly and return a list of user-friendly error messages (strings) if validation fails, or an empty list if there are no errors.
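A minimal implementation, for an operation with no extra constraints, might look like this:

@staticmethod
def validate_input_data(parameters):
    # No cross-parameter checks are needed for this operation.
    return []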

The following implementation of validate_input_data, for the factor_hed_tags operation, checks whether the parameter query_names has the same length as the input for parameter queries, since the names specified in the first parameter are meant to label the queries provided in the latter. The check only takes place if query_names exists, since naming is handled automatically otherwise.

@staticmethod
def validate_input_data(parameters):
    errors = []
    if parameters.get("query_names", False):
        if len(parameters.get("query_names")) != len(parameters.get("queries")):
            errors.append("The list in query_names, in the factor_hed_tags operation, should have the same number of items as queries.")
    return errors

The do_op for summarization

The do_op operation for summarization operations has a slightly different form, as it serves primarily as a wrapper for the actual summary information as illustrated by the following example.

The implementation of do_op for SummarizeColumnNamesOp.

    def do_op(self, dispatcher, df, name, sidecar=None):
        summary = dispatcher.summary_dict.get(self.summary_name, None)
        if not summary:
            summary = ColumnNameSummary(self)
            dispatcher.summary_dict[self.summary_name] = summary
        summary.update_summary({"name": name, "column_names": list(df.columns)})
        return df

A do_op operation for a summarization checks the dispatcher to see if the summary name is already in the dispatcher’s summary_dict. If that summary is not yet in the summary_dict, the operation creates a BaseSummary object for its summary (e.g., ColumnNameSummary) and adds it to the dispatcher’s summary_dict; otherwise, the operation fetches the existing BaseSummary object from the summary_dict. It then asks this BaseSummary object to update the summary based on the DataFrame, as explained in the next section.

Additional requirements for summarization

Any summary operation must implement a supporting class that extends BaseSummary. This class is used to hold and accumulate the information specific to the summary. This support class must implement two methods: update_summary and get_summary_details.

The update_summary method is called by its associated BaseOp operation during the do_op to update the summary information based on the current DataFrame. The update_summary method takes a single parameter, which is a dictionary of information specific to this operation.

The update_summary method required to be implemented by all BaseSummary objects.

  def update_summary(self, summary_dict)

In the example do_op for SummarizeColumnNamesOp, the dictionary contains keys for name and column_names.

The get_summary_details returns a dictionary with the summary-specific information currently in the summary. The BaseSummary provides universal methods for converting this summary to JSON or text format.

The get_summary_details method required to be implemented by all BaseSummary objects.

  get_summary_details(self, verbose=True)
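To tie these requirements together, here is a minimal sketch of a summary support class; the constructor signature and import path are assumptions used only for illustration.

from hed.tools.remodeling.operations.base_summary import BaseSummary   # import path is an assumption

class ColumnNameSummarySketch(BaseSummary):
    """Hypothetical support class that accumulates the column names seen in each file."""

    def __init__(self, sum_op):
        super().__init__(sum_op)    # constructor signature is an assumption
        self.file_columns = {}

    def update_summary(self, summary_dict):
        # Called by the operation's do_op with {"name": ..., "column_names": [...]}.
        self.file_columns[summary_dict["name"]] = summary_dict["column_names"]

    def get_summary_details(self, verbose=True):
        # Return the summary-specific information accumulated so far.
        return {"files": self.file_columns}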


Validator implementation

The required input for the remodeler is specified in JSON format and must follow the rules laid out by the JSON schema. The parameters in the remodeler file must conform to the properties specified in the corresponding JSON schema associated with each operation. Errors retrieved from the validator are not passed on directly but are modified for display as user-friendly error messages.

Validation errors are organized by stages as follows.

Stage 0: top-level structure

Stage 0 refers to the top-level structure of the remodel JSON file. As specified by the validator’s BASE_ARRAY, a JSON remodel file must be an array of operation dictionaries containing at least 1 item.

Stage 1: operation dictionary format

Stage 1 validation refers to the structure of the individual operations as specified by the validator’s OPERATION_DICT. Every operation dictionary should have exactly the keys: operation, description, and parameters.

Stage 2: operation dictionary values

Stage 2 validates the values associated with the keys in each operation dictionary. The operation and description keys should have string values, while the parameters key should have a dictionary value.

Stage 2 validation also verifies that the operation value is one of the valid operations as enumerated in the valid_operations dictionary.

Several checks are also applied to the parameters dictionary. The properties listed as required in the schema must appear as keys in the parameters dictionary.

If additional properties are not allowed, as designated by "additionalProperties": false in the JSON schema, the validator verifies that parameters not mentioned in the schema do not appear. Note this is currently true for all operations and recommended for new operations.

If the schema for the operation has a dependentRequired dictionary, the validator verifies that the indicated keys are present if the listed parameter values are also present. For example, the factor_column_op only allows the factor_names parameter if the factor_values parameter is also present. In this case the dependency works only one way, such that factor_values can be provided without factor_names. If factor_values is provided alone, the operation automatically generates the factor names based on the column values; however, without factor_values the names provided in factor_names do not correspond to anything, so factor_names cannot appear on its own.
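For illustration, a PARAMS fragment expressing this one-way dependency with dependentRequired might look like the following; the property names mirror the factor_column_op description above, but the exact schema in the library may differ.

FACTOR_COLUMN_PARAMS_SKETCH = {
    "type": "object",
    "properties": {
        "column_name": {"type": "string"},
        "factor_values": {"type": "array", "items": {"type": "string"}},
        "factor_names": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["column_name"],
    # factor_names is only allowed when factor_values is also given.
    "dependentRequired": {"factor_names": ["factor_values"]},
    "additionalProperties": False
}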

Later validation stages

Later stages in validation concern the values given within the parameters object, which can be nested to an arbitrary level and are handled in a general way. The user is provided with the operation index, name, and the ‘path’ of the value that is invalid. Note that while parameters always contains an object, the values in parameters can be of any type. Thus, parameter values can be objects whose values might also be expected to be objects, arrays, or arrays of objects. The validator has appropriate messages for many of the conditions that can be set with JSON schema, but if a new operation parameter has a condition that has not been used yet, a new error message will need to be added to the validator.

When validation against JSON schema passes, the validator performs additional data-specific validation by calling validate_input_data for each operation to verify that input data satisfies the constraints that fall outside the scope of JSON schema. Also see The validate_input_data implementation and The PARAMS dictionary sections for additional information.