---
title: "Concepts and conventions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{2. Concepts and conventions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(siera)
```

This vignette steps back from the "how" (covered in the other vignettes) to explain the "why" behind _siera_: what an Analysis Results Dataset (ARD) is good for, how _siera_ fits into the workflow, and the conventions you will see in the generated scripts and their output.

## Why an ARD?

Traditionally, analysis results have lived inside static outputs - the numbers in an RTF or PDF table. An **Analysis Results Dataset (ARD)** instead stores those same results as _machine-readable data_, one row per result, with metadata describing exactly what each number is. Once results are data, several things become a lot easier:

- **QC** - results can be compared programmatically against an independent calculation, instead of by eye.
- **Re-use across outputs** - the same result can feed a table, a figure, or a listing without being re-calculated.
- **Re-use across reporting events** - results computed once can be carried into a later submission or a different deliverable.
- **Creating final TFLs** - downstream packages such as [`tfrmt`](https://gsk-biostatistics.github.io/tfrmt/) and [`gtsummary`](https://www.danieldsjoberg.com/gtsummary/) format ARDs straight into submission-ready tables.

_siera's_ job is to get you to an ARD without writing the analysis code by hand: you supply ARS metadata, and _siera_ writes the R that produces the ARD.

## The siera pipeline

```{r pipeline diagram, echo=FALSE, out.width="100%", fig.alt="ARS metadata is read by readARS(), which writes one R script per output; each script runs against ADaM datasets to produce an ARD."}
knitr::include_graphics("figures/siera-pipeline.svg")
```

The flow is always the same:

1. You start with **ARS metadata** (the Analysis Results Standard description of your reporting event), in either JSON or Excel format.
2. You pass it to **`readARS()`**, which writes **one R script per Output** defined in the metadata.
3. You run a generated script against your **ADaM datasets** - supplied as either CSV (`.csv`) or SAS transport (`.xpt`) files - and the result is an **ARD** - one row per result, ready for downstream use. siera reads each ADaM dataset according to its file extension (`.csv` with `readr::read_csv()`, `.xpt` with `haven::read_xpt()`), so no extra argument is needed.

The statistical computation itself is performed by the [`cards`](https://insightsengineering.github.io/cards/) and [`cardx`](https://insightsengineering.github.io/cardx/) packages, whose functions _siera_ writes into the generated scripts (see the vignette on [using `cards` and `cardx`](using-cards.html)).

## The seven ARS sections siera consumes

ARS metadata describes a whole reporting event, but _siera_ only needs seven sections to generate code. It is worth knowing what each one contributes:

| ARS section | What siera does with it |
|---|---|
| _mainListOfContents_ | Links each Output to its analyses, and sets the row order and indentation of the table stub. |
| _otherListsOfContents_ | Supplies Output-level metadata (the list of planned outputs). |
| _analysisSets_ | Defines the population filter for the Output (e.g. Safety Population, `SAFFL == "Y"`). |
| _dataSubsets_ | Adds row-level filters for individual analyses (e.g. serious, treatment-emergent AEs). |
| _analysisGroupings_ | Defines the columns/subgroups results are split by (e.g. treatment arm), including data-driven groupings discovered at run time. |
| _analyses_ | Ties everything together for one calculation: which method, population, subset and groupings apply. |
| _methods_ | Describes the operations to perform, and carries the dynamic R code template _siera_ fills in (inline, or *referenced* from an external method library - see _Using cards_). |

Each generated script is assembled from these pieces, and every result it produces carries identifiers back to them (see "Reading an ARD row" below).

A method's code template need not be written inline: _siera_ can also resolve it from an external **reference document** (the ARS `codeTemplate.documentRef` mechanism), so an ARS file can point at a shared, tested method library by `id` rather than copy-pasting code. This builds on `referenceDocuments`, which is otherwise outside the seven sections above. See the _Using cards and cardx_ article for how to wire it up.

## JSON and XLSX parity

ARS metadata officially travels as JSON, but _siera_ also accepts an Excel (XLSX) representation of the same information. **The two are semantically equivalent** - `readARS()` produces the same generated scripts either way, so you can choose whichever format fits your tooling. The examples shipped with the package include both (see `ARS_example()`).

## Reading an ARD row: CDISC traceability columns

A core promise of the ARD is _traceability_: every result can be traced back to the metadata that defines it. To make that possible, each row of a _siera_-generated ARD carries identifier columns alongside the statistic itself:

- **`AnalysisId`** - which analysis produced the row.
- **`operationid`** - which operation within the method (e.g. the `n` count vs. the `%`).
- For each grouping applied to the analysis, a set of `group[n]_*` columns:
  - **`group[n]_groupingId`** - the grouping the column belongs to (e.g. the treatment-arm grouping).
  - **`group[n]_groupId`** - for **pre-defined groups** (groups listed explicitly in the metadata), the identifier of the specific group.
  - **`group[n]_groupValue`** - for **data-driven groupings** (`dataDriven: true`, where the categories are discovered from the ADaM data at run time, e.g. cause of death or AE term), the actual value found in the data.

The distinction matters: a treatment-arm grouping is usually pre-defined, so its rows carry `group1_groupId`; a grouping such as "cause of death" is typically data-driven, so its rows carry `group1_groupValue` with whatever categories appeared in the data. A single ARD can contain both. The [ARD program structure](ARD_script_structure.html) vignette shows where these columns are stamped on in the generated code.