| Title: | Define and Apply Cohort Inclusion/Exclusion Criteria |
|---|---|
| Description: | Define inclusion and exclusion criteria for cohort studies using formulas or functions, apply them to data, and render the resulting attrition flow as CONSORT flow diagrams, tables, or narrative text. Supports hierarchical designs (participants nested in clusters/sites) and flexible grouping for reporting. Criteria pipelines can be serialised to YAML for reproducibility and sharing. Parts of this package were developed with the assistance of GitHub Copilot (powered by Claude), an AI coding assistant. |
| Authors: | Matthew Moore [aut, cre] (ORCID: <https://orcid.org/0000-0003-0730-8027>) |
| Maintainer: | Matthew Moore <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.0.9009 |
| Built: | 2026-05-28 20:31:32 UTC |
| Source: | https://github.com/mattmoo/cohortflow |
Evaluates each step in a cf_criteria pipeline cumulatively against a data
frame, recording which rows are removed at each step. Returns a cf_flow
object from which the surviving cohort (cohort()) and excluded rows
(excluded()) can be extracted.
apply_criteria(data, criteria, id = NULL)
data |
A data frame. |
criteria |
A |
id |
A single column name (string) to use as the row identifier. If
|
A cf_flow object.
apply_criteria() needs a stable row identifier to track exclusions. By
default it looks for a .cf_row_id column in data. If one is not found
and id is NULL, a sequential integer identifier is added automatically
(with a message). Supply id to use an existing column as the identifier.
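The cumulative bookkeeping described above can be sketched in base R. This illustrates the idea only and is not cohortflow's implementation: the toy data, the `steps` list, and the helper `apply_steps()` are hypothetical, and treating an `NA` predicate result as `FALSE` is an assumption of the sketch. The names `.cf_row_id`, `cf_step`, and `cf_label` mirror those used in this manual.

```r
# Toy data; column names are illustrative only.
dat <- data.frame(
  age      = c(25, 17, 40, NA, 33),
  withdrew = c(FALSE, FALSE, TRUE, FALSE, FALSE)
)

# Add a sequential identifier when none is present, as the text describes.
if (!".cf_row_id" %in% names(dat)) {
  dat$.cf_row_id <- seq_len(nrow(dat))
}

# Hypothetical step list: a label, a type, and a row-wise predicate each.
steps <- list(
  list(label = "Age recorded",     type = "include", fn = function(d) !is.na(d$age)),
  list(label = "Adults only",      type = "include", fn = function(d) d$age >= 18),
  list(label = "Withdrew consent", type = "exclude", fn = function(d) d$withdrew)
)

# Hypothetical helper: apply each step to the rows that survived the
# previous steps and record which rows were removed where.
apply_steps <- function(data, steps) {
  removed <- list()
  for (i in seq_along(steps)) {
    s <- steps[[i]]
    hit <- s$fn(data)
    hit[is.na(hit)] <- FALSE            # assumption: NA counts as FALSE
    keep <- if (s$type == "include") hit else !hit
    dropped <- data[!keep, , drop = FALSE]
    dropped$cf_step  <- rep(i, nrow(dropped))
    dropped$cf_label <- rep(s$label, nrow(dropped))
    removed[[i]] <- dropped
    data <- data[keep, , drop = FALSE]  # later steps see survivors only
  }
  list(cohort = data, excluded = do.call(rbind, removed))
}

res <- apply_steps(dat, steps)
res$cohort
res$excluded
```

Because each step sees only the survivors of earlier steps, a row is attributed to the first step that removes it, which is why a stable identifier is needed to trace it back to the input.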
| Type | Predicate | Effect |
|---|---|---|
| include | Row-wise logical vector | Keep TRUE rows |
| exclude | Row-wise logical vector | Drop TRUE rows |
| group_include | Per-group scalar (summarise context) | Keep rows in groups where result is TRUE |
| group_exclude | Per-group scalar (summarise context) | Drop rows in groups where result is TRUE |
| select_within | Per-group logical vector | Keep TRUE rows within each group |
dat <- mock_cohortflow(n_participants = 200, seed = 1)
crit <- cf_criteria() |>
  include(~ eligible_screen, label = "Passed screening") |>
  include(~ !is.na(consent_date), label = "Consent recorded") |>
  exclude(~ withdrew, label = "Withdrew consent") |>
  group_include(by = "cluster_id", ~ n() >= 5, label = "Cluster size >= 5")
flow <- apply_criteria(dat, crit)
flow
cohort(flow)
excluded(flow)
Produces a formatted attrition table from a cf_flow object. Three output
backends are supported: "flextable" (best for Word), "gt" (best for
HTML and LaTeX), and "huxtable". The underlying data is built by
as_attrition_tibble(), which you can call directly for custom formatting.
as_attrition_table(
  flow,
  backend = c("flextable", "gt", "huxtable"),
  show_categories = TRUE,
  assessed_label = "Assessed for eligibility",
  final_label = "Final cohort",
  digits = 1L,
  criterion_col_label = "Criterion",
  n_col_label = "N",
  removed_col_label = "Removed",
  pct_col_label = "% removed"
)
flow |
A |
backend |
Character string: |
show_categories |
Logical. When |
assessed_label |
Character string for the header row. Default
|
final_label |
Character string for the trailing final-cohort row.
Default |
digits |
Integer. Number of decimal places for |
criterion_col_label |
Column header for the criterion/label column.
Default |
n_col_label |
Column header for the N column. Default |
removed_col_label |
Column header for the removed column.
Default |
pct_col_label |
Column header for the percentage column.
Default |
A flextable, gt_tbl, or huxtable object, ready to print or
include in a document.
"flextable" — uses the flextable and officer packages.
Renders natively to Word (.docx) via officer::read_docx(), to PDF
via flextable::save_as_image(), and to HTML. Best choice when Word is
the primary target.
"gt" — uses the gt package. Renders to HTML
(gt::gtsave(..., "table.html")), LaTeX
(gt::as_latex()), and Word (via gt::gtsave(..., "table.docx"),
which requires the webshot2 package). Best choice when LaTeX or HTML
is the primary target.
"huxtable" — uses the huxtable package. Supports Word, LaTeX,
and HTML output.
as_attrition_tibble() for the underlying plain-tibble data layer.
## Not run:
dat <- mock_cohortflow(n_participants = 200, seed = 1)
crit <- cf_criteria() |>
  include(~ !is.na(age), label = "Age recorded", category = "Age") |>
  include(~ age >= 18, label = "Adults only", category = "Age") |>
  include(~ eligible_screen, label = "Passed screening") |>
  exclude(~ withdrew, label = "Withdrew consent")
flow <- apply_criteria(dat, crit)
as_attrition_table(flow)                       # flextable (Word-ready)
as_attrition_table(flow, backend = "gt")       # gt (LaTeX/HTML-ready)
as_attrition_table(flow, backend = "huxtable") # huxtable
# Save to Word
ft <- as_attrition_table(flow)
flextable::save_as_docx(ft, path = "attrition.docx")
# Save to Word via gt
gt_tbl <- as_attrition_table(flow, backend = "gt")
gt::gtsave(gt_tbl, "attrition.docx")
# LaTeX snippet via gt
gt::as_latex(gt_tbl)
## End(Not run)
Produces a flat tibble describing the attrition at each step (or category)
of a cohort flow pipeline. This is the underlying data layer used by
as_attrition_table(), and is also useful for custom formatting.
as_attrition_tibble(
  flow,
  show_categories = TRUE,
  assessed_label = "Assessed for eligibility",
  final_label = "Final cohort",
  digits = 1L
)
flow |
A |
show_categories |
Logical. When |
assessed_label |
Character string for the header row. Default
|
final_label |
Character string for the trailing final-cohort row.
Default |
digits |
Integer. Number of decimal places for |
A tibble::tibble() with columns:
row_type: "header", "category", "step", or "final"
label: Display text for the row
indent_level: 0 = header/final, 1 = category or top-level step,
  2 = sub-step under a category
n: Number of participants at this point (N entering for category/step rows; N surviving for the final row)
n_removed: Number removed at this step (NA for header/final)
pct_removed: Percentage removed relative to entering N (NA for
  header; for final row: percentage retained of original N)
The tibble always starts with an "assessed" header row and ends with a
"final cohort" row. When show_categories = TRUE (the default), steps
that share a category are collapsed into a parent row (indent level 1)
with individual step sub-rows (indent level 2) nested beneath it.
Uncategorised steps appear at indent level 1 without sub-rows. When
show_categories = FALSE, one row per step is produced at indent level 1.
pct_removed: percentage of entering N removed at this row's step (or
category). For the final cohort row this is the percentage retained
relative to the initial N (i.e. n / n_start * 100).
Percentages are relative to n_in — the number entering that step or,
for a category row, the number entering the first step in that category.
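The percentage rules above amount to the following arithmetic. The counts are made up for illustration, and the assumption that a category row's removed count is the sum over its steps is not stated in the source.

```r
# Hypothetical counts: N entering each of three steps and N removed at each.
n_start   <- 200
n_in      <- c(200, 180, 150)
n_removed <- c(20, 30, 15)
digits    <- 1L

# Step rows: percentage removed relative to the N entering that step.
pct_removed <- round(n_removed / n_in * 100, digits)

# Category row spanning steps 1 and 2 (assumed to sum the removals of its
# steps): relative to the N entering the first step in the category.
pct_category <- round(sum(n_removed[1:2]) / n_in[1] * 100, digits)

# Final row: percentage retained relative to the initial N.
n_final      <- n_in[3] - n_removed[3]
pct_retained <- round(n_final / n_start * 100, digits)

pct_removed   # 10.0 16.7 10.0
pct_category  # 25
pct_retained  # 67.5
```

Note that step percentages use a shrinking denominator, so they do not sum to the overall percentage lost.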
as_attrition_table() for formatted table output.
dat <- mock_cohortflow(n_participants = 200, seed = 1)
crit <- cf_criteria() |>
  include(~ !is.na(age), label = "Age recorded", category = "Age") |>
  include(~ age >= 18, label = "Adults only", category = "Age") |>
  include(~ eligible_screen, label = "Passed screening", category = "Screening") |>
  exclude(~ withdrew, label = "Withdrew consent")
flow <- apply_criteria(dat, crit)
as_attrition_tibble(flow)
cf_criteria() initialises an empty criteria pipeline. Use the pipe-friendly
functions include(), exclude(), group_include(), group_exclude(),
and select_within() to add steps.
cf_criteria(...)
... |
Optional |
A cf_criteria object.
has_consent <- function(d) !is.na(d$consent_date)
crit <- cf_criteria() |>
  include(~ age >= 18, label = "Adults only") |>
  include(has_consent, label = "Consent recorded") |>
  exclude(~ withdrew, label = "Withdrew consent") |>
  group_include(by = "cluster_id", ~ n() >= 5, label = "Cluster size >= 5") |>
  select_within(by = "participant_id",
                ~ consent_date == min(consent_date, na.rm = TRUE),
                label = "Index operation")
crit
A cf_criterion represents one inclusion or exclusion step in a cohort
eligibility pipeline. Five step types are supported (see type). The
predicate is always a
one-sided formula or a function; the interpretation depends on
the step type.
cf_criterion(
  predicate,
  label,
  type = c("include", "exclude", "group_include", "group_exclude", "select_within"),
  by = NULL,
  category = NULL
)
predicate |
A one-sided formula or a function.
|
label |
A short human-readable description of this criterion. |
type |
One of |
by |
For grouped step types ( |
category |
An optional character string grouping this criterion with
others for display purposes (e.g. |
A cf_criterion object (an S3 list).
# Row-wise inclusion
cf_criterion(~ age >= 18, label = "Adults only", type = "include")
# Grouped under a category
cf_criterion(~ !is.na(age), label = "Age recorded", type = "include", category = "Age")
# Group-level inclusion (clusters with >= 5 participants)
cf_criterion(
  ~ n() >= 5,
  label = "Sufficient cluster size",
  type = "group_include",
  by = "cluster_id"
)
# Select first operation per patient
cf_criterion(
  ~ consent_date == min(consent_date, na.rm = TRUE),
  label = "Index operation",
  type = "select_within",
  by = "participant_id"
)
A cf_hierarchy describes how observational units are nested within each
other (e.g., participants within clusters within sites). It is an ordered
named character vector mapping role names to column names in the data.
cf_hierarchy(...)
... |
Named character scalars mapping role labels to column names,
e.g. |
The order of ... is from the finest (innermost) level to the
coarsest (outermost) level. The first role is conventionally the
individual participant.
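To make the finest-to-coarsest ordering concrete, the sketch below uses a plain named character vector (not an actual cf_hierarchy object) and checks that each level is nested in the next. The toy data and the helper `is_nested()` are hypothetical and not part of the package.

```r
# Role names map to column names, ordered finest -> coarsest.
hier <- c(participant = "pid", cluster = "cluster_id", site = "site_id")

dat <- data.frame(
  pid        = c("p1", "p2", "p3", "p4"),
  cluster_id = c("c1", "c1", "c2", "c3"),
  site_id    = c("s1", "s1", "s1", "s2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when every unit at `inner` belongs to exactly
# one unit at `outer`, i.e. the two levels are properly nested.
is_nested <- function(data, inner, outer) {
  pairs <- unique(data[, c(inner, outer)])
  !anyDuplicated(pairs[[inner]])
}

# Check each adjacent pair of levels, walking from finest to coarsest.
nested_ok <- vapply(
  seq_len(length(hier) - 1),
  function(i) is_nested(dat, hier[[i]], hier[[i + 1]]),
  logical(1)
)
names(nested_ok) <- paste(names(hier)[-length(hier)], "in", names(hier)[-1])
nested_ok
```

Listing the roles in the reverse order would make this check ask whether clusters sit inside participants, which is why the order of `...` matters.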
A cf_hierarchy object (a named character vector with additional
class attribute).
# Two-level: participants within clusters
cf_hierarchy(participant = "pid", cluster = "cid")
# Three-level: participants within clusters within sites
cf_hierarchy(participant = "pid", cluster = "cluster_id", site = "site_id")
Returns the rows that passed all criteria, with the internal .cf_row_id
column removed.
cohort(flow)
flow |
A |
A tibble.
Add an exclusion criterion to a criteria pipeline
exclude(criteria, predicate, label, category = NULL)
criteria |
A |
predicate |
A one-sided formula or a function evaluated row-wise. |
label |
A short human-readable description of this criterion. |
category |
An optional string grouping this step with others for
display. |
The updated cf_criteria object.
cf_criteria() |>
  exclude(~ withdrew, label = "Withdrew consent")
Returns a flat tibble of all rows removed at any step, with additional
columns cf_step (integer), cf_label (character), cf_type (character),
and cf_category (character, NA if the criterion had no category).
Rows are in step order.
excluded(flow)
flow |
A |
A tibble with original columns plus cf_step, cf_label,
cf_type, cf_category.
Serialises a cf_criteria object to a human-readable YAML file (or string).
The schema is self-describing and can be re-imported with import_criteria().
export_criteria(criteria, path = NULL, fn_refs = list())
criteria |
A |
path |
A file path to write to. If |
fn_refs |
An optional named list mapping function objects to
|
Invisibly, the criteria object (or the YAML string if path = NULL).
cohortflow_criteria:
  version: 1
  steps:
    - label: "Adults only"
      type: include
      kind: formula
      by: null
      expr: "age >= 18"
      fn_ref: null
Formula-based criteria round-trip cleanly. Anonymous function bodies are
deparsed as a fallback; complex closures may not round-trip and a warning
is issued. Named exported functions can be stored as "pkg::name" references
via fn_refs.
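Why formula predicates survive a text round trip can be seen with base R alone. This sketch is an assumption about the general mechanism, not the package's serialisation code: it stores the right-hand side of a formula as a string, comparable to the `expr` field in the schema, and rebuilds the formula from it.

```r
# A formula predicate like those used throughout this manual.
pred <- ~ age >= 18

# Keep only the right-hand side as text, comparable to the `expr` field.
expr_text <- deparse(pred[[2]])

# Rebuild a one-sided formula from the stored text.
restored <- stats::as.formula(paste("~", expr_text), env = globalenv())

# The restored expression evaluates against data just like the original.
toy  <- data.frame(age = c(15, 18, 42))
keep <- eval(restored[[2]], toy)

expr_text
keep
```

A formula's expression refers only to column names, so nothing is lost as text. A closure that captures variables from its defining environment would lose them if only its deparsed body were stored, which is one plausible reason such functions may not round-trip.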
The predicate is evaluated per group; groups where the predicate returns
TRUE are removed (all their rows dropped).
group_exclude(criteria, by, predicate, label, category = NULL)
criteria |
A |
by |
A single column name (string) to group by. |
predicate |
A one-sided formula using summary functions, or a function. |
label |
A short human-readable description of this criterion. |
category |
An optional string grouping this step with others for
display. |
The updated cf_criteria object.
cf_criteria() |>
  group_exclude(by = "cluster_id", ~ mean(is.na(age)) > 0.5,
                label = "Excessive missing age in cluster")
The predicate is evaluated in a dplyr::summarise() context per group
(when a formula) or receives a grouped data frame (when a function). Groups
where the predicate returns TRUE are kept; all rows in failing groups are
removed.
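The per-group semantics can be illustrated in base R: compute one logical value per group, then broadcast it back to the rows. This sketch uses `tapply()` instead of `dplyr::summarise()` and is not the package's implementation; the toy data and the size threshold of 2 are illustrative.

```r
dat <- data.frame(
  cluster_id = c("a", "a", "a", "b", "b", "c"),
  age        = c(30, 41, NA, 25, 52, 19),
  stringsAsFactors = FALSE
)

# One logical value per group: does the cluster have at least 2 rows?
group_ok <- tapply(dat$age, dat$cluster_id, function(x) length(x) >= 2)

# Broadcast the per-group result back to the rows of each group.
row_ok <- as.vector(group_ok[as.character(dat$cluster_id)])

kept    <- dat[row_ok, ]   # group_include-style: keep rows in TRUE groups
dropped <- dat[!row_ok, ]  # all rows of failing groups are removed together

kept
dropped
```

A group_exclude-style step is the mirror image: rows in groups where the per-group value is TRUE are the ones dropped.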
group_include(criteria, by, predicate, label, category = NULL)
criteria |
A |
by |
A single column name (string) to group by. |
predicate |
A one-sided formula using summary functions ( |
label |
A short human-readable description of this criterion. |
category |
An optional string grouping this step with others for
display. |
The updated cf_criteria object.
cf_criteria() |>
  group_include(by = "cluster_id", ~ n() >= 5, label = "Cluster size >= 5")
Reads a YAML file (or string) written by export_criteria() and
reconstructs a cf_criteria object.
import_criteria(path = NULL, text = NULL, envir = parent.frame())
path |
A file path to read from. Exactly one of |
text |
A YAML string. Exactly one of |
envir |
The environment in which to evaluate re-parsed expressions. |
A cf_criteria object.
Add an inclusion criterion to a criteria pipeline
include(criteria, predicate, label, category = NULL)
criteria |
A |
predicate |
A one-sided formula or a function evaluated row-wise. |
label |
A short human-readable description of this criterion. |
category |
An optional string grouping this step with others for
display (e.g. |
The updated cf_criteria object.
cf_criteria() |>
  include(~ age >= 18, label = "Adults", category = "Age")
Produces a tibble that mimics a clustered cohort study with optional stepped-wedge period structure. The returned data is deliberately imperfect: some participants have missing values, withdrew consent, or failed screening, so that inclusion/exclusion criteria have something to remove.
mock_cohortflow(
  n_participants = 500L,
  n_clusters = 10L,
  n_sites = 3L,
  n_periods = 4L,
  seed = 123L
)
n_participants |
Integer. Total number of participant rows. |
n_clusters |
Integer. Number of clusters (e.g., GP practices, schools). |
n_sites |
Integer. Number of sites (clusters are nested in sites;
must be <= |
n_periods |
Integer. Number of time periods (e.g., waves in a
stepped-wedge design). Set to |
seed |
Integer. Random seed for reproducibility. |
A tibble::tibble() with columns:
participant_id: Character. Unique participant identifier.
event_id: Character. Unique event (row) identifier, useful when the dataset has multiple rows per participant (e.g., one per operation).
cluster_id: Character. Cluster identifier.
site_id: Character. Site identifier (clusters nested in sites).
period: Integer. Study period (1 = first period).
sequence: Integer. Stepped-wedge sequence the cluster is assigned
  to (NA if n_periods == 1).
age: Numeric. Age in years (some NAs to simulate missing data).
age_group: Character. Age group ("<18", "18-39", "40-64", "65+").
sex: Character. "M" / "F" / "O" (other).
ethnicity: Character. One of five broad ethnic groups.
consent_date: Date. Date of consent (NA = did not consent).
baseline_complete: Logical. Whether the baseline assessment is complete.
withdrew: Logical. Whether the participant withdrew after consent.
eligible_screen: Logical. Whether the participant passed initial eligibility screening (e.g., diagnosis confirmed).
mock_cohortflow()
# Larger study with three periods
mock_cohortflow(n_participants = 2000, n_clusters = 20, n_periods = 3, seed = 42)
The predicate is evaluated separately within each group defined by by and
must return a logical vector the same length as the group. Rows where the
predicate is FALSE are dropped and recorded in the excluded rows store.
This is the natural way to select one (or more) records per unit, e.g. the
index operation per patient.
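The within-group selection can be sketched in base R with `ave()`, which returns a per-group value for every row. The toy data are illustrative, treating `NA` comparisons as `FALSE` is an assumption of this sketch, and none of it is the package's implementation.

```r
dat <- data.frame(
  participant_id = c("p1", "p1", "p2", "p2", "p2", "p3"),
  consent_date   = as.Date(c("2024-03-01", "2024-01-15",
                             "2024-02-10", "2024-02-10", "2024-05-05",
                             NA)),
  stringsAsFactors = FALSE
)

# Earliest date per participant, repeated on every row of that participant.
date_num   <- as.numeric(dat$consent_date)
first_date <- ave(
  date_num, dat$participant_id,
  FUN = function(x) if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)
)

# Logical vector the same length as the data: TRUE marks rows to keep.
keep <- date_num == first_date
keep[is.na(keep)] <- FALSE  # assumption: NA counts as FALSE

selected <- dat[keep, ]
selected
```

An equality test against the group minimum keeps every tied row, so p2 contributes two rows here. When exactly one record per unit is required, the predicate needs a tie-breaker, for example a second date or sequence column.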
select_within(criteria, by, predicate, label, category = NULL)
criteria |
A |
by |
A single column name (string) defining the grouping (e.g.
|
predicate |
A one-sided formula evaluated within each group, or a function that receives a group's rows as a data frame and returns a logical vector. |
label |
A short human-readable description of this step. |
category |
An optional string grouping this step with others for
display. |
The updated cf_criteria object.
cf_criteria() |>
  select_within(by = "participant_id",
                ~ consent_date == min(consent_date, na.rm = TRUE),
                label = "Index operation per patient")