Provenance capture
------------------
It is possible to capture the full provenance of a workflow execution to
a folder, including intermediate values:
.. code-block:: sh
cwltool --provenance revsort-run-1/ tests/wf/revsort.cwl tests/wf/revsort-job.json
Who executed the workflow?
^^^^^^^^^^^^^^^^^^^^^^^^^^
Optional parameters are available to capture information about *who* executed the workflow *where*:
.. code-block:: sh
cwltool --orcid https://orcid.org/0000-0002-1825-0097 \
--full-name "Alice W Land" \
--enable-user-provenance --enable-host-provenance \
--provenance revsort-run-1/ \
tests/wf/revsort.cwl tests/wf/revsort-job.json
These parameters are opt-in as they track person-identifiable information.
The options ``--enable-user-provenance`` and ``--enable-host-provenance`` will
pick up account/machine info from where ``cwltool`` is executed (e.g.
UNIX username). This may get the full name of the user wrong, in which case
``--full-name`` can be supplied.
For consistent tracking it is recommended to apply for
an `ORCID `__ identifier and provide it as above,
since ``--enable-user-provenance --enable-host-provenance``
are only able to identify the local machine account.
It is possible to set the shell environment variables
``ORCID`` and ``CWL_FULL_NAME`` to avoid supplying ``--orcid``
or ``--full-name`` for every workflow run,
for instance by augmenting the ``~/.bashrc`` or equivalent:
.. code-block:: sh
export ORCID=https://orcid.org/0000-0002-1825-0097
export CWL_FULL_NAME="Stian Soiland-Reyes"
Care should be taken to preserve spaces when setting `--full-name` or `CWL_FULL_NAME`.
CWLProv folder structure
^^^^^^^^^^^^^^^^^^^^^^^^
The CWLProv folder structure under ``revsort-run-1`` is a
`Research Object `__
that conforms to the `RO BagIt profile `__
and contains `PROV `__
traces detailing the execution of the workflow and its steps.
A rough overview of the CWLProv folder structure:
* ``bagit.txt`` - bag marker for `BagIt `__.
* ``bag-info.txt`` - minimal bag metadata. ``The External-Identifier`` key shows which `arcp `__ can be used as base URI within the folder bag.
* ``manifest-*.txt`` - checksums of files under ``data/`` (algorithms subject to change)
* ``tagmanifest-*.txt`` - checksums of the remaining files (algorithms subject to change)
* ``metadata/manifest.json`` - `Research Object manifest `__ as JSON-LD. Types and relates files within bag.
* ``metadata/provenance/primary.cwlprov*`` - `PROV `__ trace of main workflow execution in alternative PROV and RDF formats
* ``data/`` - bag payload, workflow/step input/output data files (content-addressable)
* ``data/32/327fc7aedf4f6b69a42a7c8b808dc5a7aff61376`` - a data item with checksum ``327fc7aedf4f6b69a42a7c8b808dc5a7aff61376`` (checksum algorithm is subject to change)
* ``workflow/packed.cwl`` - The ``cwltool --pack`` standalone version of the executed workflow
* ``workflow/primary-job.json`` - Job input for use with ``packed.cwl`` (references ``data/*``)
* ``snapshot/`` - Direct copies of original files used for execution, but may have broken relative/absolute paths
See the `CWLProv paper `__ for more details.
Research Object manifest
^^^^^^^^^^^^^^^^^^^^^^^^
The file ``metadata/manifest.json`` follows the structure defined for `Research Object Bundles `_ - but
note that ``.ro/`` is instead called ``metadata/`` as this conforms to the `RO BagIt profile `__.
Some of the keys of the CWLProv manifest are explained below::
"@context": [
{
"@base": "arcp://uuid,67f38794-d24a-435f-bd4a-0242a56a581b/metadata/"
},
"https://w3id.org/bundle/context"
]
This `JSON-LD context `__ enables consumers to alternatively consume the JSON file as Linked Data with absolute identifiers.
The key for that is the ``@base`` which means URIs within this JSON file are relative to the ``metadata/`` folder
within this Research Object bag, and the external JSON-LD .
Output from ``cwltool`` should follow the JSON structure shown beyond; however interested consumer may alternatively parse it as JSON-LD with a RDF triple store like `Apache Jena `__ for further querying.
The manifest lists which software version created the Research Object - we will hear more from this UUID later::
"createdBy": {
"uri": "urn:uuid:7c9d9e88-666b-4977-85f4-c02da08a942d",
"name": "cwltool 1.0.20180416145054"
}
Secondly the manifest lists the person who "authored the run" - that is put the workflow and inputs together with cwltool::
"authoredBy": {
"orcid": "https://orcid.org/0000-0002-1825-0097",
"name": "Stian Soiland-Reyes"
}
Note that the author of the workflow run may differ from the author of the workflow definition.
The list of aggregates are the main resources that this Research Object transports::
"aggregates": [
{
"uri": "urn:hash::sha1:53870991af88a6d678cbeed3255bb65993c52925",
...
},
{ "provenance/primary.cwlprov.xml",
...
},
{
"uri": "../workflow/packed.cwl",
"createdBy": {
"uri": "urn:uuid:7c9d9e88-666b-4977-85f4-c02da08a942d",
"name": "cwltool 1.0.20180416145054"
},
"conformsTo": "https://w3id.org/cwl/",
"mediatype": "text/x+yaml; charset=\"UTF-8\"",
"createdOn": "2018-04-16T18:27:09.513824"
},
{
"uri": "../snapshot/hello-workflow.cwl",
"conformsTo": "https://w3id.org/cwl/",
"mediatype": "text/x+yaml; charset=\"UTF-8\"",
"createdOn": "2018-04-04T13:29:55.717707"
}
Beyond being a listing of file names and identifiers, this also lists formats and light-weight provenance. We note that the
CWL file is marked to conform to the https://w3id.org/cwl/ CWL specification.
Some of the files like ``packed.cwl`` have been created by cwltool as part of the run, while others have been created before the run "outside".
Note that ``cwltool`` is currently unable to extract the original authors and contributors of the original files, this is planned for future versions.
Under ``annotations`` we see that the main point of this whole research object (``/`` aka ``arcp://uuid,67f38794-d24a-435f-bd4a-0242a56a581b/``)
is to describe something called ``urn:uuid:67f38794-d24a-435f-bd4a-0242a56a581b``::
"annotations": [
{
"about": "urn:uuid:67f38794-d24a-435f-bd4a-0242a56a581b",
"content": "/",
"oa:motivatedBy": {
"@id": "oa:describing"
}
},
We will later see that this is the UUID for the workflow run. A workflow run is an *activity*,
something that happens - it can't be directly saved to a file. However it can be *described* in
different ways, in this case as CWLProv provenance::
{
"about": "urn:uuid:67f38794-d24a-435f-bd4a-0242a56a581b",
"content": [
"provenance/primary.cwlprov.xml",
"provenance/primary.cwlprov.nt",
"provenance/primary.cwlprov.ttl",
"provenance/primary.cwlprov.provn",
"provenance/primary.cwlprov.jsonld",
"provenance/primary.cwlprov.json"
],
"oa:motivatedBy": {
"@id": "http://www.w3.org/ns/prov#has_provenance"
}
Finally the research object wants to highlight the workflow file::
{
"about": "workflow/packed.cwl",
"oa:motivatedBy": {
"@id": "oa:highlighting"
}
},
And links the run ID ``67f38794..`` to the ```primary-job.json`` and ``packed.cwl``::
{
"about": "urn:uuid:67f38794-d24a-435f-bd4a-0242a56a581b",
"content": [
"workflow/packed.cwl",
"workflow/primary-job.json"
],
"oa:motivatedBy": {
"@id": "oa:linking"
}
}
Note: ``oa:motivatedBy`` in CWLProv are subject to change.
PROV profile
^^^^^^^^^^^^
The underlying model and information of the `PROV `__
files under ``metadata/provenance`` is the same, but is made available in multiple
serialization formats:
* primary.cwlprov.provn -- `PROV-N `__ Textual Provenance Notation
* primary.cwlprov.xml -- `PROV-XML `__
* primary.cwlprov.json -- `PROV-JSON `__
* primary.cwlprov.jsonld -- `PROV-O `__ as `JSON-LD `__ (``@context`` subject to change)
* primary.cwlprov.ttl -- `PROV-O `__ as `RDF Turtle `__
* primary.cwlprov.nt -- `PROV-O `__ as `RDF N-Triples `__
The below extracts use the PROV-N syntax for brevity.
CWLPROV namespaces
^^^^^^^^^^^^^^^^^^
Note that the identifiers must be expanded with the defined ``prefix``-es when comparing across serializations.
These set which vocabularies ("namespaces") are used by the CWLProv statements::
prefix data
prefix input
prefix cwlprov
prefix wfprov
prefix sha256
prefix schema
prefix wfdesc
prefix orcid
prefix researchobject
prefix id
prefix wf
prefix foaf
Note that the `arcp `__ base URI will correspond to the UUID of each main workflow run.
Account who launched cwltool
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If ``--enable-user-provenance`` was used, the local machine account (e.g. Windows or UNIX user name) who started ``cwltool`` is tracked::
agent(id:855c6823-bbe7-48a5-be37-b0f07f20c495, [foaf:accountName="stain", prov:type='foaf:OnlineAccount', prov:label="stain"])
It is assumed that the account was under the control of the named person (in PROV terms "actedOnBehalfOf")::
agent(id:433df002-2584-462a-80b0-cf90b97e6e07, [prov:label="Stian Soiland-Reyes",
prov:type='prov:Person', foaf:account='id:8815e39c-9711-4105-bf52-dbc016c8028f'])
actedOnBehalfOf(id:8815e39c-9711-4105-bf52-dbc016c8028f, id:433df002-2584-462a-80b0-cf90b97e6e07, -)
However we do not have an identifier for neither the account or the person, so every ``cwltool`` run will yield new UUIDs.
With ``--enable-user-provenance`` it is possible to associate the account with a hostname::
agent(id:855c6823-bbe7-48a5-be37-b0f07f20c495, [cwlprov:hostname="biggie", prov:type='foaf:OnlineAccount', prov:location="biggie"])
Note that the hostname is often non-global or variable (e.g. on cloud instances or virtual machines),
and thus may be unreliable when considering ``cwltool`` executions on multiple hosts.
If the ``--orcid`` parameter or ``ORCID`` shell variable is included, then the person associated
with the local machine account is uniquely identified, no matter where the workflow was executed::
agent(orcid:0000-0002-1825-0097, [prov:type='prov:Person', prov:label="Stian Soiland-Reyes",
foaf:account='id:855c6823-bbe7-48a5-be37-b0f07f20c495'])
actedOnBehalfOf(id:855c6823-bbe7-48a5-be37-b0f07f20c495', orcid:0000-0002-1825-0097, -)
The running of `cwltool` itself makes it the workflow engine. It is the machine account who launched the cwltool (not necessarily the person behind it)::
agent(id:7c9d9e88-666b-4977-85f4-c02da08a942d, [prov:type='prov:SoftwareAgent', prov:type='wfprov:WorkflowEngine', prov:label="cwltool 1.0.20180416145054"])
wasStartedBy(id:855c6823-bbe7-48a5-be37-b0f07f20c495, -, id:9c3d4d1f-473d-468f-a6f2-1ef4de571a7f, 2018-04-16T18:27:09.428090)
Starting a workflow
^^^^^^^^^^^^^^^^^^^
The main job of the cwltool execution is to run a workflow, here the activity for ``workflow/packed.cwl#main``::
activity(id:67f38794-d24a-435f-bd4a-0242a56a581b, 2018-04-16T18:27:09.428165, -, [prov:type='wfprov:WorkflowRun', prov:label="Run of workflow/packed.cwl#main"])
wasStartedBy(id:67f38794-d24a-435f-bd4a-0242a56a581b, -, id:7c9d9e88-666b-4977-85f4-c02da08a942d, 2018-04-16T18:27:09.428285)
Now what is that workflow again? Well a tiny bit of prospective provenance is included::
entity(wf:main, [prov:type='prov:Plan', prov:type='wfdesc:Workflow', prov:label="Prospective provenance"])
entity(wf:main, [prov:label="Prospective provenance", wfdesc:hasSubProcess='wf:main/step0'])
entity(wf:main/step0, [prov:type='wfdesc:Process', prov:type='prov:Plan'])
But we can also expand the ``wf`` identifiers to find that we are talking about
``arcp://uuid,0e6cb79e-fe70-4807-888c-3a61b9bf232a/workflow/packed.cwl#`` - that is
the ``main`` workflow in the file ``workflow/packed.cwl`` of the Research Object.
Running workflow steps
^^^^^^^^^^^^^^^^^^^^^^
A workflow will contain some steps, each execution of these are again nested activities::
activity(id:6c7c04ea-dcc8-40d2-92a4-7705f7286756, -, -, [prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#main"])
wasStartedBy(id:6c7c04ea-dcc8-40d2-92a4-7705f7286756, -, id:67f38794-d24a-435f-bd4a-0242a56a581b, 2018-04-16T18:27:09.430883)
activity(id:a583b025-9a16-49ce-8515-f3249eb2aacf, -, -, [prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#main/step0"])
wasAssociatedWith(id:a583b025-9a16-49ce-8515-f3249eb2aacf, -, wf:main/step0)
Again we see the link back to the workflow plan, the workflow execution of ``#main/step0`` in this case.
Note that depending on scattering etc there might
be multiple activities for a single step in the workflow definition.
Data inputs (usage)
^^^^^^^^^^^^^^^^^^^
This activities uses some data at the input ``message``::
activity(id:a583b025-9a16-49ce-8515-f3249eb2aacf, -, -, [prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#main/step0"])
used(id:a583b025-9a16-49ce-8515-f3249eb2aacf, data:53870991af88a6d678cbeed3255bb65993c52925, 2018-04-16T18:27:09.433743, [prov:role='wf:main/step0/message'])
Data files within a workflow execution are identified using ``urn:hash::sha1:`` URIs derived from their sha1 checksum (checksum algorithm and prefix subject to change)::
entity(data:53870991af88a6d678cbeed3255bb65993c52925, [prov:type='wfprov:Artifact', prov:value="Hei7"])
Small values (typically those provided on the command line may be present as `prov:value`. The corresponding
``data/`` file within the Research Object has a content-addressable filename based on the checksum; but it is also
possible to look up this independent from the corresponding ``metadata/manifest.json`` aggregation::
"aggregates": [
{
"uri": "urn:hash::sha1:53870991af88a6d678cbeed3255bb65993c52925",
"bundledAs": {
"uri": "arcp://uuid,0e6cb79e-fe70-4807-888c-3a61b9bf232a/data/53/53870991af88a6d678cbeed3255bb65993c52925",
"folder": "/data/53/",
"filename": "53870991af88a6d678cbeed3255bb65993c52925"
}
},
Data outputs (generation)
^^^^^^^^^^^^^^^^^^^^^^^^^
Similarly a step typically generates some data, here ``response``::
activity(id:a583b025-9a16-49ce-8515-f3249eb2aacf, -, -, [prov:type='wfprov:ProcessRun', prov:label="Run of workflow/packed.cwl#main/step0"])
wasGeneratedBy(data:53870991af88a6d678cbeed3255bb65993c52925, id:a583b025-9a16-49ce-8515-f3249eb2aacf, 2018-04-16T18:27:09.438236, [prov:role='wf:main/step0/response'])
In the hello world example this is interesting because it is the same data output as-is, but typically the outputs will each have different checksums (and thus different identifiers).
The step is ended::
wasEndedBy(id:a583b025-9a16-49ce-8515-f3249eb2aacf, -, id:67f38794-d24a-435f-bd4a-0242a56a581b, 2018-04-16T18:27:09.438482)
In this case the step output is also a workflow output ``response``, so the data is also generated by the workflow activity::
activity(id:67f38794-d24a-435f-bd4a-0242a56a581b, 2018-04-16T18:27:09.428165, -, [prov:type='wfprov:WorkflowRun', prov:label="Run of workflow/packed.cwl#main"])
wasGeneratedBy(data:53870991af88a6d678cbeed3255bb65993c52925, id:67f38794-d24a-435f-bd4a-0242a56a581b, 2018-04-16T18:27:09.439323, [prov:role='wf:main/response'])
Ending the workflow
^^^^^^^^^^^^^^^^^^^
Finally the overall workflow ``#main`` also ends::
activity(id:67f38794-d24a-435f-bd4a-0242a56a581b, 2018-04-16T18:27:09.428165, -, [prov:type='wfprov:WorkflowRun', prov:label="Run of workflow/packed.cwl#main"])
agent(id:7c9d9e88-666b-4977-85f4-c02da08a942d, [prov:type='prov:SoftwareAgent', prov:type='wfprov:WorkflowEngine', prov:label="cwltool 1.0.20180416145054"])
wasEndedBy(id:67f38794-d24a-435f-bd4a-0242a56a581b, -, id:7c9d9e88-666b-4977-85f4-c02da08a942d, 2018-04-16T18:27:09.445785)
Note that the end of the outer ``cwltool`` activity is not recorded, as cwltool is still running at the point of writing out this provenance.
Currently the provenance trace do not distinguish executions within nested workflows; it is planned that these will be tracked in separate files under ``metadata/provenance/``.