Data Lineage¶
Scientific workflows produce chains of computations — each output depends on specific inputs, parameters, and tool versions. Nextflow's data lineage feature captures this provenance automatically, giving you a queryable record of every workflow run, task execution, and output file.
Experimental feature
Data lineage is an experimental feature introduced in Nextflow 25.04. The feature is under active development and its behavior may change in future releases. Consult the official documentation for the latest details.
Learning goals¶
By the end of this side quest, you'll be able to:
- Enable lineage tracking in a Nextflow workflow
- Explore lineage records using the CLI (list, view, find, render, diff)
- Understand the three lineage record types: WorkflowRun, TaskRun, and FileOutput
- Understand why the output {} block is required for file-level provenance tracking
- Compare two workflow runs to identify exactly what changed between them
- Use channel.fromLineage() to create a channel from lineage records and feed tracked outputs into downstream workflows
Prerequisites¶
Before taking on this side quest, you should:
- Have completed the Hello Nextflow tutorial or equivalent beginner's course.
- Be comfortable with basic Nextflow concepts: processes, channels, and configuration.
0. Get started¶
Open the training codespace¶
If you haven't yet done so, make sure to open the training environment as described in the Environment Setup.
Move into the project directory¶
Move into the directory where the files for this tutorial are located.
You can set VSCode to focus on this directory:
Review the materials¶
You'll find a workflow file, a configuration file, and a data directory with a CSV input file.
The workflow in main.nf has two processes.
SAYHELLO writes each greeting to a text file, and CONVERTTOUPPER converts that file to uppercase.
This is the same pipeline logic used in Hello Nextflow.
The greetings.csv file contains four greetings: Hello, Bonjour, Hola, and Ciao.
Review the assignment¶
Your challenge is to add lineage tracking to the workflow, explore the records it produces, and use the lineage CLI to query and compare runs.
For each section:
- Apply the code changes to the workflow or configuration file as shown
- Run the workflow and observe the output
- Use the lineage CLI to inspect the records produced
- Complete the exercises before reading ahead
Readiness checklist¶
Think you're ready to dive in?
- I understand the goal of this course and its prerequisites
- My codespace is up and running
- I've set my working directory appropriately
- I understand the assignment
If you can check all the boxes, you're good to go.
1. Enable lineage tracking¶
Open main.nf to examine the starting workflow:
The workflow reads greetings from a CSV file, writes each one to a text file with SAYHELLO, converts it to uppercase with CONVERTTOUPPER, and publishes the results to a results/ directory.
This is the same pipeline used in Hello Nextflow, extended with an output {} block to publish results.
Run it once to verify everything works before adding lineage tracking:
Command output
1.1. Add lineage configuration¶
Add lineage.enabled = true to nextflow.config:
Changing the lineage store location
By default, lineage data is stored in a .lineage directory within your working directory.
To store it elsewhere, add:
Run the workflow with lineage tracking enabled:
Command output
The workflow output looks the same as before, but Nextflow has now written provenance records to the .lineage directory.
Directory contents — .lineage/
.lineage/
├── .history/
│ └── a1b2c3d4e5f6789012345678 ← one entry per workflow run
├── a1b2c3d4e5f6789012345678/
│ └── .data.json ← WorkflowRun
├── b2c3d4e5f6789012345678a1/
│ └── .data.json ← TaskRun (SAYHELLO/Hello)
├── b3c4d5e6f7890123456789a2/
│ └── .data.json ← TaskRun (SAYHELLO/Bonjour)
├── b4c5d6e7f8901234567890a3/
│ └── .data.json ← TaskRun (SAYHELLO/Hola)
├── b5c6d7e8f9012345678901a4/
│ └── .data.json ← TaskRun (SAYHELLO/Ciao)
├── c3d4e5f6789012345678a1b2/
│ └── .data.json ← TaskRun (CONVERTTOUPPER/Hello)
├── c4d5e6f7890123456789a2b3/
│ └── .data.json ← TaskRun (CONVERTTOUPPER/Bonjour)
├── c5d6e7f8901234567890a3b4/
│ └── .data.json ← TaskRun (CONVERTTOUPPER/Hola)
└── c6d7e8f9012345678901a4b5/
  └── .data.json ← TaskRun (CONVERTTOUPPER/Ciao)
Each record is a .data.json file stored in a directory named after its LID hash.
The .history/ directory contains one file per workflow run, linking run names to LIDs.
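To make the layout concrete, here is a minimal Python sketch that walks a lineage store and counts records per kind. It assumes the on-disk layout shown in the listing above (one directory per LID hash, each holding a .data.json file); the function name count_records_by_kind is made up for this example, and the layout itself may change while the feature is experimental.

```python
import json
from collections import Counter
from pathlib import Path


def count_records_by_kind(store: Path) -> Counter:
    """Count lineage records per `kind`, assuming <store>/<hash>/.data.json."""
    kinds = Counter()
    for record_file in sorted(store.glob("*/.data.json")):
        # Skip bookkeeping directories such as .history/
        if record_file.parent.name.startswith("."):
            continue
        record = json.loads(record_file.read_text())
        kinds[record.get("kind", "unknown")] += 1
    return kinds


if __name__ == "__main__":
    # Prints nothing if there is no .lineage directory here.
    for kind, n in sorted(count_records_by_kind(Path(".lineage")).items()):
        print(f"{kind}: {n}")
```

Reading the files directly like this is only a way to understand the store; for day-to-day queries the nextflow lineage CLI is the supported route.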
Contents of a record file — WorkflowRun
{
"version": "lineage/v1beta1",
"kind": "WorkflowRun",
"spec": {
"workflow": {
"scriptFiles": [
{
"path": "/workspace/side-quests/lineage/main.nf",
"checksum": {
"value": "78910abc",
"algorithm": "nextflow",
"mode": "standard"
}
}
]
},
"sessionId": "4f02559e-9ebd-41d8-8ee2-a8d1e4f09c67",
"name": "happy_dijkstra",
"params": [
{
"type": "Path",
"name": "input",
"value": "data/greetings.csv"
}
],
"config": {
"docker": { "enabled": true },
"lineage": { "enabled": true }
},
"metadata": {
"start": "2026-06-15T10:00:00Z",
"end": "2026-06-15T10:00:45Z",
"nextflow": {
"version": "25.10.4",
"enable": { "dsl": 2 }
}
}
}
}
Every record follows the same envelope: version, kind, and spec.
The kind field identifies the record type (WorkflowRun, TaskRun, FileOutput, etc.).
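As an illustration of the envelope, the following Python sketch parses a record, checks for the three top-level keys, and derives the run duration from spec.metadata.start and spec.metadata.end. The helper names are hypothetical, and the embedded record is a trimmed copy of the WorkflowRun example above, not the full schema.

```python
import json
from datetime import datetime

REQUIRED_KEYS = ("version", "kind", "spec")


def parse_record(text: str) -> dict:
    """Parse a record and check the version/kind/spec envelope."""
    record = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError(f"record is missing envelope keys: {missing}")
    return record


def run_duration_seconds(record: dict) -> float:
    """Elapsed seconds for a WorkflowRun, from spec.metadata.start/end."""
    if record["kind"] != "WorkflowRun":
        raise ValueError(f"expected WorkflowRun, got {record['kind']}")
    meta = record["spec"]["metadata"]
    # Older Python versions reject a trailing "Z", so spell out the UTC offset.
    start = datetime.fromisoformat(meta["start"].replace("Z", "+00:00"))
    end = datetime.fromisoformat(meta["end"].replace("Z", "+00:00"))
    return (end - start).total_seconds()


# Trimmed copy of the WorkflowRun record shown above.
EXAMPLE = """
{
  "version": "lineage/v1beta1",
  "kind": "WorkflowRun",
  "spec": {
    "name": "happy_dijkstra",
    "metadata": {"start": "2026-06-15T10:00:00Z", "end": "2026-06-15T10:00:45Z"}
  }
}
"""

record = parse_record(EXAMPLE)
print(record["spec"]["name"], run_duration_seconds(record))
```

For the example record this reports a 45-second run. Checking kind before touching spec matters because each record type carries different fields inside spec.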
1.2. Lineage record types¶
Nextflow records provenance using three primary record types:
| Type | Stores |
|---|---|
| WorkflowRun | Overall pipeline execution: run name, parameters, start/end times |
| TaskRun | Individual task execution: inputs, script, container, timing |
| FileOutput | A published output file: path, checksum, size, producing task |
Each record is identified by a unique Lineage ID (LID) formatted as lid://[hash].
The LID is the entry point for all lineage queries.
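The relationship between a LID and its record file can be sketched in a few lines of Python. This assumes the lid://[hash] format described here, the optional #output suffix used later in this side quest, and the default store layout shown in section 1.1; split_lid and record_path are illustrative names, not part of Nextflow.

```python
from pathlib import Path

LID_PREFIX = "lid://"


def split_lid(lid: str) -> tuple[str, str]:
    """Split a LID into (hash, fragment); the fragment is "" when absent."""
    if not lid.startswith(LID_PREFIX):
        raise ValueError(f"not a lineage ID: {lid!r}")
    body, _, fragment = lid[len(LID_PREFIX):].partition("#")
    return body, fragment


def record_path(lid: str, store: Path = Path(".lineage")) -> Path:
    """Location of the record file, assuming <store>/<hash>/.data.json."""
    body, _ = split_lid(lid)
    return store / body / ".data.json"


print(record_path("lid://a1b2c3d4e5f6789012345678"))
```

In practice you would pass the LID to nextflow lineage view and let Nextflow resolve it; the sketch only shows why the hash alone is enough to locate a record.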
Takeaway¶
In this section, you've learned:
- How to enable lineage tracking: Set lineage.enabled = true in nextflow.config.
- Record types: WorkflowRun (the overall pipeline execution), TaskRun (each individual task), and FileOutput (each published output file).
- Where records are stored: In .lineage within your working directory by default, unless you set a different location in the configuration.
2. Explore lineage records with the CLI¶
Nextflow provides a lineage subcommand for querying the lineage store directly from the terminal.
It gives you five operations: list to enumerate runs, view to inspect a record by LID, find to search across records, render to generate a graph visualization, and diff to compare two records.
2.1. List workflow runs¶
To see all workflow runs in the lineage store, run:
Command output
Copy the LID for your run: you'll need it in the following steps.
2.2. View a WorkflowRun record¶
Pass the LID to nextflow lineage view to inspect the full record for the workflow run:
Command output
This record captures the run-level context: which parameters were used and when the pipeline ran.
2.3. Find records by field¶
The find subcommand searches the lineage store by record type or field value, without needing to know a LID upfront.
It is useful when you want to locate all records of a given type — for example, all TaskRun records across every workflow run, or all FileOutput records matching a particular path.
Use it to list all TaskRun records:
Command output
lid://b2c3d4e5f6789012345678a1 SAYHELLO 2026-06-15 10:00:10
lid://b3c4d5e6f7890123456789a2 SAYHELLO 2026-06-15 10:00:11
lid://b4c5d6e7f8901234567890a3 SAYHELLO 2026-06-15 10:00:12
lid://b5c6d7e8f9012345678901a4 SAYHELLO 2026-06-15 10:00:13
lid://c3d4e5f6789012345678a1b2 CONVERTTOUPPER 2026-06-15 10:00:20
lid://c4d5e6f7890123456789a2b3 CONVERTTOUPPER 2026-06-15 10:00:22
lid://c5d6e7f8901234567890a3b4 CONVERTTOUPPER 2026-06-15 10:00:24
lid://c6d7e8f9012345678901a4b5 CONVERTTOUPPER 2026-06-15 10:00:26
Copy the LID of any SAYHELLO task — you'll use it in the next step.
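If you would rather pick LIDs programmatically than copy them by hand, a small Python sketch like the one below can group them by process. It assumes the whitespace-separated layout shown in the output above (LID, process name, date, time); the real output may differ between Nextflow versions, and both function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# The find output shown above, pasted verbatim.
FIND_OUTPUT = """\
lid://b2c3d4e5f6789012345678a1 SAYHELLO 2026-06-15 10:00:10
lid://b3c4d5e6f7890123456789a2 SAYHELLO 2026-06-15 10:00:11
lid://b4c5d6e7f8901234567890a3 SAYHELLO 2026-06-15 10:00:12
lid://b5c6d7e8f9012345678901a4 SAYHELLO 2026-06-15 10:00:13
lid://c3d4e5f6789012345678a1b2 CONVERTTOUPPER 2026-06-15 10:00:20
lid://c4d5e6f7890123456789a2b3 CONVERTTOUPPER 2026-06-15 10:00:22
lid://c5d6e7f8901234567890a3b4 CONVERTTOUPPER 2026-06-15 10:00:24
lid://c6d7e8f9012345678901a4b5 CONVERTTOUPPER 2026-06-15 10:00:26
"""


def parse_find_output(text):
    """Turn each non-empty line into a (lid, process, timestamp) tuple."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        lid, process, date, time = line.split()
        stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        rows.append((lid, process, stamp))
    return rows


def lids_by_process(rows):
    """Group LIDs under the process that produced them, keeping input order."""
    grouped = defaultdict(list)
    for lid, process, _ in rows:
        grouped[process].append(lid)
    return dict(grouped)


rows = parse_find_output(FIND_OUTPUT)
print(lids_by_process(rows)["SAYHELLO"][0])
```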
2.4. View a TaskRun record¶
Pass a TaskRun LID to view to inspect the full record:
Command output
This record shows exactly which input value was used for this specific task execution and which workflow run it belongs to.
2.5. Render the lineage graph¶
The render subcommand generates an HTML visualization of the lineage for a given run:
Open the generated HTML file in your browser.
It shows a directed graph connecting WorkflowRun → TaskRun → FileOutput for each published file.
Section 3 explores what this graph looks like when the output {} block is absent.
Takeaway¶
In this section, you've learned how to use:
- nextflow lineage list: View all workflow runs in the store with their LIDs.
- nextflow lineage find: Query records by type or field without knowing the LID in advance.
- nextflow lineage view [LID]: Inspect the full JSON record for any WorkflowRun or TaskRun.
- nextflow lineage render [LID]: Generate a graph visualization of a run's provenance.
3. File outputs and lineage¶
The output {} block enables the creation of FileOutput records in the lineage store.
To understand why it matters, try removing it temporarily and observe what changes.
3.1. Remove the output block¶
Remove the publish: section and output {} block from main.nf:
Run the workflow:
Command output
Now try to view the published outputs of this run using the #output suffix:
The lineage store has no knowledge of the files produced — only that the tasks ran.
Render the graph for this run and you'll see it ends at the TaskRun nodes with no file nodes attached.
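One way to picture where the provenance trail ends is to look at which record kinds a run produced. The Python sketch below is a toy model of that idea: the kinds lists are illustrative (eight tasks as in this workflow, plus an assumed four published files), and deepest_level is a hypothetical helper, not something Nextflow provides.

```python
LEVELS = ("WorkflowRun", "TaskRun", "FileOutput")


def deepest_level(kinds):
    """Return the last level of the WorkflowRun -> TaskRun -> FileOutput chain
    that appears in `kinds`, or None if no known kind is present."""
    present = set(kinds)
    deepest = None
    for level in LEVELS:
        if level in present:
            deepest = level
    return deepest


# Kinds as they might be collected from each record's "kind" field.
without_output_block = ["WorkflowRun"] + ["TaskRun"] * 8
# Assumption for illustration: four published files, one per greeting.
with_output_block = without_output_block + ["FileOutput"] * 4

print(deepest_level(without_output_block))  # TaskRun
print(deepest_level(with_output_block))     # FileOutput
```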
Takeaway¶
In this section, you've seen:
- Without output {} block: The lineage store records WorkflowRun and TaskRun entries, but no FileOutput records. The provenance trail stops at the task level.
- With output {} block: Each published file becomes a FileOutput record with a checksum, size, and a direct link back to the task that produced it.
Before continuing, restore the publish: section and output {} block to main.nf.
4. Comparing runs with diff¶
The diff subcommand compares two lineage records to show exactly what changed between them.
This is most useful when you re-run a workflow after modifying inputs or parameters.
4.1. Modify the input data¶
Edit data/greetings.csv to replace Hello with Hi:
4.2. Run the workflow again¶
Command output
In each process, three tasks were cached (the greetings that didn't change) and one task ran fresh for Hi.
List the runs to find the new LID:
Command output
4.3. Compare two task runs¶
Use nextflow lineage find to locate the SAYHELLO task LIDs from the first run and the third run, then compare them:
The diff identifies exactly what changed between the two task executions.
For tasks with many inputs or complex parameters, diff quickly pinpoints which values differed — without manually comparing JSON records.
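To see the kind of comparison diff saves you from doing by hand, here is a rough Python sketch that pretty-prints two records with sorted keys and runs a unified diff over the text. It is not Nextflow's implementation, and the two records are hypothetical, heavily trimmed stand-ins: the "input" field name in particular is an assumption, not the real TaskRun schema.

```python
import difflib
import json


def diff_records(a: dict, b: dict, name_a: str = "a", name_b: str = "b") -> list[str]:
    """Unified diff of two records after stable pretty-printing."""
    text_a = json.dumps(a, indent=2, sort_keys=True).splitlines()
    text_b = json.dumps(b, indent=2, sort_keys=True).splitlines()
    return list(difflib.unified_diff(text_a, text_b, name_a, name_b, lineterm=""))


# Hypothetical TaskRun records: the "input" field is illustrative only.
first = {"kind": "TaskRun", "spec": {"name": "SAYHELLO", "input": "Hello"}}
third = {"kind": "TaskRun", "spec": {"name": "SAYHELLO", "input": "Hi"}}

for line in diff_records(first, third, "first-run", "third-run"):
    print(line)
```

Sorting keys before diffing keeps the comparison stable, so only genuine value changes (here, Hello versus Hi) show up as removed and added lines.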
Takeaway¶
In this section, you've seen:
- nextflow lineage diff [LID-A] [LID-B]: Compare any two lineage records and surface exactly what changed.
- Practical use: When debugging a pipeline or verifying reproducibility, diff confirms whether two runs processed identical inputs and, if not, what differed.
Summary¶
Data lineage in Nextflow gives you a queryable record of every workflow run, task execution, and published file.
Key commands¶
| Command | Purpose |
|---|---|
| nextflow lineage list | List all workflow runs in the store |
| nextflow lineage view [LID] | Inspect any record as JSON |
| nextflow lineage view [LID]#output | View published file records for a run |
| nextflow lineage find | Query records by type or field |
| nextflow lineage render [LID] | Generate an HTML lineage graph |
| nextflow lineage diff [LID-A] [LID-B] | Compare two records |
Key patterns¶
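- Enable tracking in nextflow.config: lineage.enabled = true
- List all workflow runs in the lineage store: nextflow lineage list
- Inspect any record by LID: nextflow lineage view [LID]
- View published file records for a run using the #output suffix: nextflow lineage view [LID]#output
- Query records by type without knowing the LID: nextflow lineage find
- Render a lineage graph as HTML: nextflow lineage render [LID]
- Compare any two records to surface exactly what changed between them: nextflow lineage diff [LID-A] [LID-B]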
Additional resources¶
What's next?¶
Return to the menu of Side Quests or click the button in the bottom right of the page to move on to the next topic in the list.