Package 'cohortflow'

Title: Define and Apply Cohort Inclusion/Exclusion Criteria
Description: Define inclusion and exclusion criteria for cohort studies using formulas or functions, apply them to data, and render the resulting attrition flow as CONSORT flow diagrams, tables, or narrative text. Supports hierarchical designs (participants nested in clusters/sites) and flexible grouping for reporting. Criteria pipelines can be serialised to YAML for reproducibility and sharing. Parts of this package were developed with the assistance of GitHub Copilot (powered by Claude), an AI coding assistant.
Authors: Matthew Moore [aut, cre] (ORCID: <https://orcid.org/0000-0003-0730-8027>)
Maintainer: Matthew Moore <[email protected]>
License: MIT + file LICENSE
Version: 0.0.0.9009
Built: 2026-05-28 20:31:32 UTC
Source: https://github.com/mattmoo/cohortflow

Help Index


Apply a criteria pipeline to a data frame

Description

Evaluates each step in a cf_criteria pipeline cumulatively against a data frame, recording which rows are removed at each step. Returns a cf_flow object from which the surviving cohort (cohort()) and excluded rows (excluded()) can be extracted.

Usage

apply_criteria(data, criteria, id = NULL)

Arguments

data

A data frame.

criteria

A cf_criteria object.

id

A single column name (string) to use as the row identifier. If NULL (default), the function looks for .cf_row_id in data; if not found, a sequential integer is added.

Value

A cf_flow object.

Row identity

apply_criteria() needs a stable row identifier to track exclusions. By default it looks for a .cf_row_id column in data. If one is not found and id is NULL, a sequential integer identifier is added automatically (with a message). Supply id to use an existing column as the identifier.
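
To see why a stable identifier matters, the following base-R sketch applies two row-wise steps cumulatively and records which identifiers each step removes. It is an illustration under assumptions, not cohortflow's implementation: the toy data, the steps list, and its pred field are all hypothetical.

```r
toy <- data.frame(
  age      = c(17, 25, 40, 63, 30),
  withdrew = c(FALSE, FALSE, TRUE, FALSE, FALSE)
)

# Add a sequential identifier when none exists (mirrors the documented default).
if (!".cf_row_id" %in% names(toy)) {
  toy$.cf_row_id <- seq_len(nrow(toy))
}

# Hypothetical step list: a label, a type, and a row-wise predicate function.
steps <- list(
  list(label = "Adults only",      type = "include", pred = function(d) d$age >= 18),
  list(label = "Withdrew consent", type = "exclude", pred = function(d) d$withdrew)
)

current <- toy
removed <- list()
for (s in steps) {
  hit  <- s$pred(current)                       # evaluated on the rows still present
  keep <- if (s$type == "include") hit else !hit
  removed[[s$label]] <- current$.cf_row_id[!keep]
  current <- current[keep, , drop = FALSE]
}

removed             # identifiers dropped at each step
current$.cf_row_id  # identifiers of the surviving rows
```

Because each step sees only the rows that survived the previous one, the identifier is what ties a removed row back to its position in the original data.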

Step types

Type            Predicate                              Effect
include         Row-wise logical vector                Keep TRUE rows
exclude         Row-wise logical vector                Drop TRUE rows
group_include   Per-group scalar (summarise context)   Keep rows in groups where result is TRUE
group_exclude   Per-group scalar (summarise context)   Drop rows in groups where result is TRUE
select_within   Per-group logical vector               Keep TRUE rows within each group
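
The five step types can be read as familiar dplyr operations. The sketch below reproduces each effect by hand on a toy tibble; it assumes dplyr is installed and is meant as an analogy, not as a description of how cohortflow evaluates steps internally. The toy data and object names are hypothetical.

```r
library(dplyr)

toy <- tibble(
  id         = 1:6,
  cluster_id = c("a", "a", "a", "b", "b", "c"),
  age        = c(17, 25, 40, 63, 30, 51),
  withdrew   = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# include: keep rows where the row-wise predicate is TRUE
inc <- toy |> filter(age >= 18)

# exclude: drop rows where the row-wise predicate is TRUE
exc <- toy |> filter(!withdrew)

# group_include: one logical per group; keep every row of passing groups
passing <- toy |>
  group_by(cluster_id) |>
  summarise(.pass = n() >= 2, .groups = "drop") |>
  filter(.pass)
grp_inc <- toy |> semi_join(passing, by = "cluster_id")

# group_exclude: one logical per group; drop every row of flagged groups
flagged <- toy |>
  group_by(cluster_id) |>
  summarise(.flag = any(withdrew), .groups = "drop") |>
  filter(.flag)
grp_exc <- toy |> anti_join(flagged, by = "cluster_id")

# select_within: one logical per row, computed within each group
sel <- toy |>
  group_by(cluster_id) |>
  filter(age == min(age)) |>
  ungroup()
```

The group-level types decide the fate of whole groups, whereas select_within decides row by row using group-level context such as min().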

Examples

dat  <- mock_cohortflow(n_participants = 200, seed = 1)
crit <- cf_criteria() |>
  include(~ eligible_screen, label = "Passed screening") |>
  include(~ !is.na(consent_date), label = "Consent recorded") |>
  exclude(~ withdrew, label = "Withdrew consent") |>
  group_include(by = "cluster_id", ~ n() >= 5,
                label = "Cluster size >= 5")

flow <- apply_criteria(dat, crit)
flow
cohort(flow)
excluded(flow)

Format an attrition table from a cohort flow object

Description

Produces a formatted attrition table from a cf_flow object. Three output backends are supported: "flextable" (best for Word), "gt" (best for HTML and LaTeX), and "huxtable". The underlying data is built by as_attrition_tibble(), which you can call directly for custom formatting.

Usage

as_attrition_table(
  flow,
  backend = c("flextable", "gt", "huxtable"),
  show_categories = TRUE,
  assessed_label = "Assessed for eligibility",
  final_label = "Final cohort",
  digits = 1L,
  criterion_col_label = "Criterion",
  n_col_label = "N",
  removed_col_label = "Removed",
  pct_col_label = "% removed"
)

Arguments

flow

A cf_flow object produced by apply_criteria().

backend

Character string: "flextable" (default), "gt", or "huxtable".

show_categories

Logical. When TRUE (default) steps with the same category are collapsed under a single parent row. When FALSE one row per step is produced.

assessed_label

Character string for the header row. Default "Assessed for eligibility".

final_label

Character string for the trailing final-cohort row. Default "Final cohort".

digits

Integer. Number of decimal places for pct_removed. Default 1.

criterion_col_label

Column header for the criterion/label column. Default "Criterion".

n_col_label

Column header for the N column. Default "N".

removed_col_label

Column header for the removed column. Default "Removed".

pct_col_label

Column header for the percentage column. Default "% removed".

Value

A flextable, gt_tbl, or huxtable object, ready to print or include in a document.

Output formats

  • "flextable": uses the flextable and officer packages. Renders natively to Word (.docx) via officer::read_docx(), to PDF via flextable::save_as_image(), and to HTML. Best choice when Word is the primary target.

  • "gt": uses the gt package. Renders to HTML (gt::gtsave(..., "table.html")), LaTeX (gt::as_latex()), and Word (gt::gtsave(..., "table.docx"), which requires the webshot2 package). Best choice when LaTeX or HTML is the primary target.

  • "huxtable": uses the huxtable package. Supports Word, LaTeX, and HTML output.

See Also

as_attrition_tibble() for the underlying plain-tibble data layer.

Examples

## Not run: 
dat  <- mock_cohortflow(n_participants = 200, seed = 1)
crit <- cf_criteria() |>
  include(~ !is.na(age),     label = "Age recorded",    category = "Age") |>
  include(~ age >= 18,       label = "Adults only",      category = "Age") |>
  include(~ eligible_screen, label = "Passed screening") |>
  exclude(~ withdrew,        label = "Withdrew consent")
flow <- apply_criteria(dat, crit)

as_attrition_table(flow)                       # flextable (Word-ready)
as_attrition_table(flow, backend = "gt")       # gt (LaTeX/HTML-ready)
as_attrition_table(flow, backend = "huxtable") # huxtable

# Save to Word
ft <- as_attrition_table(flow)
flextable::save_as_docx(ft, path = "attrition.docx")

# Save to Word via gt
gt_tbl <- as_attrition_table(flow, backend = "gt")
gt::gtsave(gt_tbl, "attrition.docx")

# LaTeX snippet via gt
gt::as_latex(gt_tbl)

## End(Not run)

Build an attrition tibble from a cohort flow object

Description

Produces a flat tibble describing the attrition at each step (or category) of a cohort flow pipeline. This is the underlying data layer used by as_attrition_table(), and is also useful for custom formatting.

Usage

as_attrition_tibble(
  flow,
  show_categories = TRUE,
  assessed_label = "Assessed for eligibility",
  final_label = "Final cohort",
  digits = 1L
)

Arguments

flow

A cf_flow object produced by apply_criteria().

show_categories

Logical. When TRUE (default) steps with the same category are collapsed under a single parent row. When FALSE one row per step is produced.

assessed_label

Character string for the header row. Default "Assessed for eligibility".

final_label

Character string for the trailing final-cohort row. Default "Final cohort".

digits

Integer. Number of decimal places for pct_removed. Default 1.

Value

A tibble::tibble() with columns:

row_type

"header", "category", "step", or "final"

label

Display text for the row

indent_level

0 = header/final, 1 = category or top-level step, 2 = sub-step under a category

n

Number of participants at this point (N entering for category/step rows; N surviving for the final row)

n_removed

Number removed at this step (NA for header/final)

pct_removed

Percentage removed relative to entering N (NA for header; for final row: percentage retained of original N)

Row structure

The tibble always starts with an "assessed" header row and ends with a "final cohort" row. When show_categories = TRUE (the default), steps that share a category are collapsed into a parent row (indent level 1) with individual step sub-rows (indent level 2) nested beneath it. Uncategorised steps appear at indent level 1 without sub-rows. When show_categories = FALSE, one row per step is produced at indent level 1.

Percentage columns

  • pct_removed: percentage of entering N removed at this row's step (or category). For the final cohort row this is the percentage retained relative to the initial N (i.e. n / n_start * 100).

  • Percentages are relative to n_in — the number entering that step or, for a category row, the number entering the first step in that category.
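
As a worked illustration of these rules, the sketch below builds a small table by hand with the documented columns and computes pct_removed from it. The counts are invented for the example, and treating a category row's n_removed as the sum of its sub-steps is an assumption rather than documented behaviour.

```r
rows <- data.frame(
  row_type     = c("header", "category", "step", "step", "step", "final"),
  label        = c("Assessed for eligibility", "Age", "Age recorded",
                   "Adults only", "Withdrew consent", "Final cohort"),
  indent_level = c(0L, 1L, 2L, 2L, 1L, 0L),
  n            = c(200, 200, 200, 190, 170, 153),  # N entering (N surviving for final)
  n_removed    = c(NA, 30, 10, 20, 17, NA)
)

n_start <- rows$n[rows$row_type == "header"]

# Step and category rows: removed as a percentage of the N entering that row.
rows$pct_removed <- round(rows$n_removed / rows$n * 100, 1)

# Final row: percentage retained relative to the initial N.
is_final <- rows$row_type == "final"
rows$pct_removed[is_final] <- round(rows$n[is_final] / n_start * 100, 1)

rows[c("row_type", "label", "n", "n_removed", "pct_removed")]
```

Note that the "Adults only" sub-step is measured against the 190 rows entering it, while the "Age" category row is measured against the 200 rows entering its first step.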

See Also

as_attrition_table() for formatted table output.

Examples

dat  <- mock_cohortflow(n_participants = 200, seed = 1)
crit <- cf_criteria() |>
  include(~ !is.na(age),        label = "Age recorded",       category = "Age") |>
  include(~ age >= 18,          label = "Adults only",         category = "Age") |>
  include(~ eligible_screen,    label = "Passed screening",    category = "Screening") |>
  exclude(~ withdrew,           label = "Withdrew consent")
flow <- apply_criteria(dat, crit)
as_attrition_tibble(flow)

Create an eligibility criteria pipeline

Description

cf_criteria() initialises an empty criteria pipeline. Use the pipe-friendly functions include(), exclude(), group_include(), group_exclude(), and select_within() to add steps.

Usage

cf_criteria(...)

Arguments

...

Optional cf_criterion objects to include at construction time.

Value

A cf_criteria object.

Examples

has_consent <- function(d) !is.na(d$consent_date)

crit <- cf_criteria() |>
  include(~ age >= 18, label = "Adults only") |>
  include(has_consent, label = "Consent recorded") |>
  exclude(~ withdrew, label = "Withdrew consent") |>
  group_include(by = "cluster_id", ~ n() >= 5, label = "Cluster size >= 5") |>
  select_within(by = "participant_id", ~ consent_date == min(consent_date, na.rm = TRUE),
                label = "Index operation")

crit

Create a single cohort criterion

Description

A cf_criterion represents one inclusion or exclusion step in a cohort eligibility pipeline. Five step types are supported (see type). The predicate is always a one-sided formula or a function; the interpretation depends on the step type.

Usage

cf_criterion(
  predicate,
  label,
  type = c("include", "exclude", "group_include", "group_exclude", "select_within"),
  by = NULL,
  category = NULL
)

Arguments

predicate

A one-sided formula or a function.

  • For include / exclude: evaluated row-wise; must return a logical vector of length nrow(data).

  • For group_include / group_exclude: evaluated in a dplyr::summarise() context per group (defined by by); must return a scalar logical. The result is broadcast back to all rows in each group.

  • For select_within: evaluated per group; must return a logical vector identifying which rows within the group to keep.

label

A short human-readable description of this criterion.

type

One of "include", "exclude", "group_include", "group_exclude", or "select_within".

by

For grouped step types (group_include, group_exclude, select_within): a single character string naming the grouping column. Ignored for row-wise types.

category

An optional character string grouping this criterion with others for display purposes (e.g. "Age" to group age-related steps into one box in a CONSORT diagram). NULL (default) leaves the step uncategorised (NA in output).

Value

A cf_criterion object (an S3 list).
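
For the row-wise types, the two predicate forms can be thought of as interchangeable ways of producing one logical value per row. The sketch below shows one way such a predicate could be evaluated in base R; eval_rowwise() is a hypothetical helper written for this illustration and is not part of cohortflow.

```r
# Hypothetical helper: evaluate a one-sided formula or a function against a
# data frame and check that the result is one logical value per row.
eval_rowwise <- function(predicate, data) {
  if (inherits(predicate, "formula")) {
    stopifnot(length(predicate) == 2L)  # one-sided: `~ rhs`
    res <- eval(predicate[[2]], envir = data, enclos = environment(predicate))
  } else {
    res <- predicate(data)
  }
  stopifnot(is.logical(res), length(res) == nrow(data))
  res
}

toy <- data.frame(age = c(17, 30, NA))

from_formula  <- eval_rowwise(~ age >= 18, toy)
from_function <- eval_rowwise(function(d) !is.na(d$age), toy)
```

A formula refers to columns by bare name, whereas a function receives the data frame and indexes into it explicitly.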

Examples

# Row-wise inclusion
cf_criterion(~ age >= 18, label = "Adults only", type = "include")

# Grouped under a category
cf_criterion(~ !is.na(age), label = "Age recorded",
             type = "include", category = "Age")

# Group-level inclusion (clusters with >= 5 participants)
cf_criterion(
  ~ n() >= 5,
  label = "Sufficient cluster size",
  type  = "group_include",
  by    = "cluster_id"
)

# Select first operation per patient
cf_criterion(
  ~ consent_date == min(consent_date, na.rm = TRUE),
  label = "Index operation",
  type  = "select_within",
  by    = "participant_id"
)

Declare the nesting hierarchy for a cohort

Description

A cf_hierarchy describes how observational units are nested within each other (e.g., participants within clusters within sites). It is an ordered named character vector mapping role names to column names in the data.

Usage

cf_hierarchy(...)

Arguments

...

Named character scalars mapping role labels to column names, e.g. participant = "participant_id", cluster = "cluster_id". At least one entry is required.

Details

The order of ... is from the finest (innermost) level to the coarsest (outermost) level. The first role is conventionally the individual participant.
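
The finest-to-coarsest ordering implies that each unit at one level belongs to exactly one unit at the next level. The base-R sketch below checks that property on toy data using a plain named character vector in place of a cf_hierarchy object; is_nested() and the toy data are hypothetical, and whether cohortflow performs such a check itself is not stated here.

```r
# Ordered from finest to coarsest, as a plain named character vector.
h <- c(participant = "pid", cluster = "cluster_id", site = "site_id")

toy <- data.frame(
  pid        = c("p1", "p2", "p3", "p4"),
  cluster_id = c("c1", "c1", "c2", "c3"),
  site_id    = c("s1", "s1", "s1", "s2")
)

# Hypothetical helper: TRUE when every `inner` unit maps to exactly one `outer` unit.
is_nested <- function(data, inner, outer) {
  pairs <- unique(data[c(inner, outer)])
  !anyDuplicated(pairs[[inner]])
}

# Check each adjacent pair of levels in the declared order.
nested_ok <- vapply(
  seq_len(length(h) - 1L),
  function(i) is_nested(toy, h[[i]], h[[i + 1L]]),
  logical(1)
)
names(nested_ok) <- paste(names(h)[-length(h)], "in", names(h)[-1])
nested_ok
```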

Value

A cf_hierarchy object (a named character vector with an additional class attribute).

Examples

# Two-level: participants within clusters
cf_hierarchy(participant = "pid", cluster = "cid")

# Three-level: participants within clusters within sites
cf_hierarchy(participant = "pid", cluster = "cluster_id", site = "site_id")

Extract the surviving cohort from a flow object

Description

Returns the rows that passed all criteria, with the internal .cf_row_id column removed.

Usage

cohort(flow)

Arguments

flow

A cf_flow object produced by apply_criteria().

Value

A tibble.


Add an exclusion criterion to a criteria pipeline

Description

Appends a row-wise exclusion step to a criteria pipeline. Rows for which the predicate is TRUE are dropped.

Usage

exclude(criteria, predicate, label, category = NULL)

Arguments

criteria

A cf_criteria object (or NULL to start a new one).

predicate

A one-sided formula or a function evaluated row-wise.

label

A short human-readable description of this criterion.

category

An optional string grouping this step with others for display. NULL leaves it uncategorised.

Value

The updated cf_criteria object.

Examples

cf_criteria() |>
  exclude(~ withdrew, label = "Withdrew consent")

Extract excluded rows from a flow object

Description

Returns a flat tibble of all rows removed at any step, with additional columns cf_step (integer), cf_label (character), cf_type (character), and cf_category (character, NA if the criterion had no category). Rows are in step order.

Usage

excluded(flow)

Arguments

flow

A cf_flow object produced by apply_criteria().

Value

A tibble with original columns plus cf_step, cf_label, cf_type, cf_category.
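
Because every removed row carries its step number and label, per-step removal counts can be tallied directly. The sketch below uses a hand-built data frame that imitates the documented column layout; the rows are invented for illustration and were not produced by cohortflow.

```r
# Mock of the documented shape of excluded(); the values are invented.
ex <- data.frame(
  participant_id = c("p01", "p07", "p09", "p12"),
  cf_step        = c(1L, 1L, 2L, 3L),
  cf_label       = c("Passed screening", "Passed screening",
                     "Consent recorded", "Withdrew consent"),
  cf_type        = c("include", "include", "include", "exclude"),
  cf_category    = NA_character_
)

# Number of rows removed at each step, in step order.
removed_per_step <- table(ex$cf_step)
removed_per_step
```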


Export a criteria pipeline to YAML

Description

Serialises a cf_criteria object to a human-readable YAML file (or string). The schema is self-describing and can be re-imported with import_criteria().

Usage

export_criteria(criteria, path = NULL, fn_refs = list())

Arguments

criteria

A cf_criteria object.

path

A file path to write to. If NULL (default), the YAML text is returned as a character string.

fn_refs

An optional named list mapping function objects to "pkg::name" reference strings.

Value

Invisibly, the criteria object (or the YAML string if path = NULL).

YAML schema

cohortflow_criteria:
  version: 1
  steps:
    - label:  "Adults only"
      type:   include
      kind:   formula
      by:     null
      expr:   "age >= 18"
      fn_ref: null

Formula-based criteria round-trip cleanly. Anonymous function bodies are deparsed as a fallback; complex closures may not round-trip and a warning is issued. Named exported functions can be stored as "pkg::name" references via fn_refs.
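
The difference between formulas and closures can be seen with base R alone. The sketch below is not the package's serialisation code; it only shows, under that caveat, why the text of a formula is enough to rebuild it while the text of a closure body may not be. The names make_cutoff and adults are hypothetical.

```r
# Formula criterion: the right-hand side survives a trip through text.
f        <- ~ age >= 18
expr_txt <- deparse(f[[2]])                        # text such as an `expr` field could hold
g        <- stats::as.formula(paste("~", expr_txt))

toy <- data.frame(age = c(17, 25, 40))
same_result <- identical(eval(f[[2]], toy), eval(g[[2]], toy))

# Closure: the deparsed body still refers to `min_age`, whose value lives in
# the enclosing environment and is not part of the text.
make_cutoff <- function(min_age) function(d) d$age >= min_age
adults      <- make_cutoff(18)
body_txt    <- deparse(body(adults))
```

Here body_txt mentions min_age but not the value 18, which is the kind of information a deparse-based fallback cannot recover.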


Add a group-level exclusion criterion

Description

The predicate is evaluated per group; groups where the predicate returns TRUE are removed (all their rows dropped).

Usage

group_exclude(criteria, by, predicate, label, category = NULL)

Arguments

criteria

A cf_criteria object (or NULL to start a new one).

by

A single column name (string) to group by.

predicate

A one-sided formula using summary functions, or a function.

label

A short human-readable description of this criterion.

category

An optional string grouping this step with others for display. NULL leaves it uncategorised.

Value

The updated cf_criteria object.

Examples

cf_criteria() |>
  group_exclude(by = "cluster_id", ~ mean(is.na(age)) > 0.5,
                label = "Excessive missing age in cluster")

Add a group-level inclusion criterion

Description

The predicate is evaluated in a dplyr::summarise() context per group (when a formula) or receives a grouped data frame (when a function). Groups where the predicate returns TRUE are kept; all rows in failing groups are removed.

Usage

group_include(criteria, by, predicate, label, category = NULL)

Arguments

criteria

A cf_criteria object (or NULL to start a new one).

by

A single column name (string) to group by.

predicate

A one-sided formula using summary functions (n(), mean(), n_distinct(), etc.) or a function that receives a grouped data frame and returns a tibble with columns <by> and .pass.

label

A short human-readable description of this criterion.

category

An optional string grouping this step with others for display. NULL leaves it uncategorised.

Value

The updated cf_criteria object.

Examples

cf_criteria() |>
  group_include(by = "cluster_id", ~ n() >= 5, label = "Cluster size >= 5")
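
The example above uses the formula form. For the function form, the contract described under predicate (a grouped data frame in, a tibble with the grouping column and .pass out) can be sketched with dplyr as follows. The function at_least_two and the toy data are hypothetical, and the final join only imitates by hand what keeping the passing groups means.

```r
library(dplyr)

toy <- tibble(
  pid        = paste0("p", 1:6),
  cluster_id = c("a", "a", "a", "b", "b", "c")
)

# Function predicate: takes a grouped data frame and returns one row per group
# with the grouping column and a logical .pass column.
at_least_two <- function(grouped) {
  summarise(grouped, .pass = n() >= 2, .groups = "drop")
}

verdict <- toy |> group_by(cluster_id) |> at_least_two()
verdict

# Applying the verdict by hand: keep every row of the passing groups.
kept <- toy |> semi_join(filter(verdict, .pass), by = "cluster_id")
```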

Import a criteria pipeline from YAML

Description

Reads a YAML file (or string) written by export_criteria() and reconstructs a cf_criteria object.

Usage

import_criteria(path = NULL, text = NULL, envir = parent.frame())

Arguments

path

A file path to read from. Exactly one of path or text must be supplied.

text

A YAML string. Exactly one of path or text must be supplied.

envir

The environment in which to evaluate re-parsed expressions.

Value

A cf_criteria object.


Add an inclusion criterion to a criteria pipeline

Description

Appends a row-wise inclusion step to a criteria pipeline. Rows for which the predicate is TRUE are kept.

Usage

include(criteria, predicate, label, category = NULL)

Arguments

criteria

A cf_criteria object (or NULL to start a new one).

predicate

A one-sided formula or a function evaluated row-wise.

label

A short human-readable description of this criterion.

category

An optional string grouping this step with others for display (e.g. "Age"). NULL leaves it uncategorised.

Value

The updated cf_criteria object.

Examples

cf_criteria() |>
  include(~ age >= 18, label = "Adults", category = "Age")

Generate synthetic cohort data for testing and examples

Description

Produces a tibble that mimics a clustered cohort study with optional stepped-wedge period structure. The returned data is deliberately imperfect: some participants have missing values, withdrew consent, or failed screening, so that inclusion/exclusion criteria have something to remove.

Usage

mock_cohortflow(
  n_participants = 500L,
  n_clusters = 10L,
  n_sites = 3L,
  n_periods = 4L,
  seed = 123L
)

Arguments

n_participants

Integer. Total number of participant rows.

n_clusters

Integer. Number of clusters (e.g., GP practices, schools).

n_sites

Integer. Number of sites (clusters are nested in sites; must be <= n_clusters).

n_periods

Integer. Number of time periods (e.g., waves in a stepped-wedge design). Set to 1 for a simple cross-sectional cohort.

seed

Integer. Random seed for reproducibility.

Value

A tibble::tibble() with columns:

participant_id

Character. Unique participant identifier.

event_id

Character. Unique event (row) identifier — useful when the dataset has multiple rows per participant (e.g., one per operation).

cluster_id

Character. Cluster identifier.

site_id

Character. Site identifier (clusters nested in sites).

period

Integer. Study period (1 = first period).

sequence

Integer. Stepped-wedge sequence the cluster is assigned to (NA if n_periods == 1).

age

Numeric. Age in years (some NAs to simulate missing data).

age_group

Character. Age group ("<18", "18-39", "40-64", "65+").

sex

Character. "M" / "F" / "O" (other).

ethnicity

Character. One of five broad ethnic groups.

consent_date

Date. Date of consent (NA = did not consent).

baseline_complete

Logical. Whether the baseline assessment is complete.

withdrew

Logical. Whether the participant withdrew after consent.

eligible_screen

Logical. Whether the participant passed initial eligibility screening (e.g., diagnosis confirmed).

Examples

mock_cohortflow()

# Larger study with three periods
mock_cohortflow(n_participants = 2000, n_clusters = 20, n_periods = 3, seed = 42)

Select rows within groups

Description

The predicate is evaluated separately within each group defined by by and must return a logical vector the same length as the group. Rows where the predicate is FALSE are dropped and recorded in the excluded rows store. This is the natural way to select one (or more) records per unit, e.g. the index operation per patient.

Usage

select_within(criteria, by, predicate, label, category = NULL)

Arguments

criteria

A cf_criteria object (or NULL to start a new one).

by

A single column name (string) defining the grouping (e.g. "participant_id").

predicate

A one-sided formula evaluated within each group, or a function that receives a group's rows as a data frame and returns a logical vector.

label

A short human-readable description of this step.

category

An optional string grouping this step with others for display. NULL leaves it uncategorised.

Value

The updated cf_criteria object.

Examples

cf_criteria() |>
  select_within(by = "participant_id",
                ~ consent_date == min(consent_date, na.rm = TRUE),
                label = "Index operation per patient")
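
A predicate of this form keeps more than one row per group when several rows tie on the minimum. The dplyr sketch below shows the logical vector such a predicate yields on toy data, followed by one possible tie-break. The data are hypothetical, and the tie-break is an assumption about what an analyst might want rather than a cohortflow feature.

```r
library(dplyr)

ops <- tibble(
  participant_id = c("p1", "p1", "p2", "p2", "p3"),
  event_id       = paste0("e", 1:5),
  consent_date   = as.Date(c("2024-01-05", "2024-03-01",
                             "2024-02-10", "2024-02-10",
                             "2024-04-20"))
)

# Evaluate the predicate within each participant.
flagged <- ops |>
  group_by(participant_id) |>
  mutate(keep = consent_date == min(consent_date, na.rm = TRUE)) |>
  ungroup()

flagged$keep  # p2 has two rows on its earliest date, so both are TRUE

# One possible tie-break: order by date and event_id, then keep the first row
# per participant.
index_rows <- ops |>
  arrange(participant_id, consent_date, event_id) |>
  distinct(participant_id, .keep_all = TRUE)
```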