Autoflow
Since apps are defined as regular Python functions, we decided to use Python to define flows as well. The main goal was to spare users from manually declaring dependencies between flow components.
The idea is to turn the processors of apps into function stubs: empty functions that copy the signature of the processor. This enables two important things:
1. The function stub gives the user a hint of what the processor expects as input and what it returns as output. It also copies the docstring, so the function is self-contained and its purpose is clear.
2. As the stub has no implementation, you are not required to install any of the dependencies needed to run the processor. Once you have installed the malevich package, you can use a processor of any complexity (API calls, model inference, etc.) without installing extra packages.
To better understand what happens to apps when they are used in flows, consider the following processor:
import pandas as pd
# App-side imports (assumed path: malevich.square)
from malevich.square import DF, Context, processor
from transformers import AutoTokenizer

@processor()
def tokenize(text: DF, ctx: Context) -> DF:
    """Tokenize texts using BERT tokenizer."""
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    # Text pre-processing
    text = ...
    # Actual tokenization
    tokens = ...
    # Return a dataframe with tokens
    return pd.DataFrame({
        "tokens": ...
    })
Once you install the app, a function stub is created in the malevich package. The stub is a regular Python function that copies the signature of the processor and has no implementation:
from malevich._autoflow.function import autotrace

@autotrace
def tokenize(text, config: dict = {}):
    """Tokenize texts using BERT tokenizer."""
    return OperationNode(...)
So, to define a flow with this processor, you do not need to install the transformers package. Simply import the function stub and use it as a regular Python function:
from malevich import collection, flow  # assumed top-level exports for flow() and collection()
from malevich.my_app import tokenize

@flow()
def tokenize_flow():
    texts = collection(...)
    # Calling the tokenize function stub
    # will create an operation node
    tokens = tokenize(texts)
    return tokens

task = tokenize_flow()
task.interpret()
# Remote execution
print(task())
Note
Such a simple design is achieved by Autoflow, a special engine that enables dependency tracking. Whenever you call one of the special Malevich functions (collection, asset.file, asset.multifile) or a processor stub, you produce a special kind of entity: tracers. They are wrappers around arbitrary objects that hold a reference to the current plan of execution and advance it when passed to other stubs.
To be precise, in the example above, texts is a traced object, and when it is passed to the function stub tokenize, a new dependency (texts → tokenize) is registered within the flow.
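To make this mechanism more concrete, below is a minimal, hypothetical sketch of tracer-based dependency tracking. It is not Malevich's actual implementation; the class and function names are invented for illustration only.

from dataclasses import dataclass, field
from typing import Any

@dataclass
class FlowGraph:
    """The execution plan: a list of (source, target) dependency edges."""
    edges: list = field(default_factory=list)

@dataclass
class Tracer:
    """Wraps an arbitrary object and holds a reference to the current plan."""
    name: str
    graph: FlowGraph
    value: Any = None

def autotrace_sketch(func):
    """Register an edge from every traced argument to the called stub."""
    def wrapper(*args):
        graph = next(a.graph for a in args if isinstance(a, Tracer))
        for arg in args:
            if isinstance(arg, Tracer):
                graph.edges.append((arg.name, func.__name__))
        # The returned tracer is a placeholder for the stub's future result
        return Tracer(name=func.__name__, graph=graph)
    return wrapper

@autotrace_sketch
def tokenize(text):
    ...  # never executed while the flow is being defined

graph = FlowGraph()
texts = Tracer(name="texts", graph=graph)   # roughly what collection(...) yields
tokens = tokenize(texts)                    # registers the ("texts", "tokenize") edge
print(graph.edges)                          # [('texts', 'tokenize')]

In the real engine, collection() and the generated stubs play these roles, and the accumulated plan is what the flow hands over for interpretation and execution.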
Warning
Beware that tokens has nothing to do with the actual result of the tokenize processor. It is just a placeholder used to define a dependency between texts and tokenize. The actual result of the flow is retrieved by returning tokens from the flow function, running the flow, and requesting the results. See Working with Results for more details.