Autoflow

Since apps are defined as regular Python functions, we decided to use Python to define flows as well. The main goal was to spare users from manually defining dependencies between flow components.

The idea is to turn the processors of apps into function stubs: empty functions that copy the signature of the processor. This enables two important things:

  1. The function stub gives the user a hint of what the processor expects as input and what it returns as output. It also copies the docstring, so the function is self-contained and its purpose is clear.

  2. As the stub has no implementation, you are not required to install the dependencies needed to run the processor. Once you have installed the malevich package, you can use a processor of any complexity (API calls, model inference, etc.) without installing extra packages.

To better understand what happens to apps when they are used in flows, consider the following processor:

import pandas as pd
from transformers import AutoTokenizer

# processor, DF and Context come from the Malevich app SDK
from malevich.square import DF, Context, processor

@processor()
def tokenize(text: DF, ctx: Context) -> DF:
    """Tokenize texts using BERT tokenizer."""

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    # Text pre-processing
    text = ...
    # Actual tokenization
    tokens = ...
    # Return a dataframe with tokens
    return pd.DataFrame({
        "tokens": ...
    })

Once you install the app, a function stub is created in the malevich package. The stub is a regular Python function that copies the signature of the processor and has no implementation:

from malevich._autoflow.function import autotrace


@autotrace
def tokenize(text, config: dict = {}):
    """Tokenize texts using BERT tokenizer."""

    return OperationNode(...)

So, to define a flow with this processor, you do not need to install the transformers package. Simply import the function stub and use it as a regular Python function:

from malevich import collection, flow
from malevich.my_app import tokenize

@flow()
def tokenize_flow():
    texts = collection(...)
    # Calling tokenize function stub
    # will create an operation node
    tokens = tokenize(texts)

    return tokens

task = tokenize_flow()
task.interpret()

# Remote execution
print(task())
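
Because dependencies are recorded automatically, the same pattern extends to multi-step flows: every stub call registers its traced inputs as dependencies, so chains of processors are wired without any manual configuration. The sketch below assumes a second, hypothetical stub named count_tokens purely to illustrate chaining:

from malevich import collection, flow
from malevich.my_app import tokenize
# Hypothetical second stub, used here only to illustrate chaining
from malevich.my_app import count_tokens

@flow()
def token_stats_flow():
    texts = collection(...)
    # Registers the dependency texts -> tokenize
    tokens = tokenize(texts)
    # Registers the dependency tokenize -> count_tokens
    stats = count_tokens(tokens)
    return stats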

Note

This simple design is achieved by Autoflow, a special engine that enables dependency tracking. Whenever you call one of the special Malevich functions (collection, asset.file, asset.multifile) or a processor stub, you produce a special kind of entity: a tracer. Tracers are wrappers around arbitrary objects that hold a reference to the current execution plan and advance it when passed to other stubs.

To be precise, in the example above, texts is a traced object, and when it is passed to the tokenize function stub, a new dependency (texts → tokenize) is registered within the flow.
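
For intuition, the tracing mechanism can be sketched with a toy decorator that records an edge in a shared graph whenever a traced value is passed into another traced call. The names below (trace, Tracer, GRAPH) are invented for this illustration and are not part of the malevich package:

# Simplified, self-contained illustration of dependency tracing.
GRAPH = []  # list of (source, target) dependency edges

class Tracer:
    def __init__(self, name):
        self.name = name

def trace(func):
    def wrapper(*args, **kwargs):
        node = Tracer(func.__name__)
        # Every traced argument becomes a dependency of this call
        for arg in list(args) + list(kwargs.values()):
            if isinstance(arg, Tracer):
                GRAPH.append((arg.name, node.name))
        return node  # a placeholder, not the real result
    return wrapper

@trace
def tokenize(text):
    """Stub: no implementation, only dependency tracking."""

texts = Tracer("texts")
tokens = tokenize(texts)
print(GRAPH)  # [('texts', 'tokenize')]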

Warning

Beware that tokens has nothing to do with the actual result of the tokenize processor. It is just a placeholder used to define a dependency between texts and tokenize. The actual result of the flow is retrieved by returning tokens from the flow function, running the flow, and requesting the results. See Working with Results for more details.