==================
Processors
==================

Processors are the core logic units of apps. They are responsible for processing data and generating output. Processors receive input in the form of assets, collections, and combinations of the two. Assets can be understood as files or folders, while the term collection refers to tabular data. A processor can have multiple inputs and can return multiple outputs.

How to define a processor?
++++++++++++++++++++++++++

To define a processor, decorate a function with the :code:`@processor` decorator.

.. code-block:: python

    from malevich.square import processor, Context

    @processor()
    def my_processor(input1, input2, context: Context):
        # Do something with input1 and input2;
        # here they are passed through unchanged.
        output1, output2 = input1, input2
        return output1, output2

The function has to follow these conventions:

1. Every argument except the one annotated with :code:`context: Context` has to be a reference to a particular input. The input can be the output of a previous processor, a collection, or an asset.
2. The function has to return either a single dataframe or a tuple of dataframes.

Each input refers to the output of exactly one previous processor or to exactly one collection. Consider the following pipeline:

.. mermaid::

    graph LR
        A[train_model] -->|"model, metrics"| B[predict]
        C[prediction_data] --> B[predict]

and the following code:

.. code-block:: python

    from malevich.square import processor

    @processor()
    def train_model(data):
        ...
        return model, metrics

    @processor()
    def predict(train_outcome, data_for_prediction):
        # train_outcome holds everything train_model returned
        model, metrics = train_outcome
        ...
        return predictions

In this case, :code:`train_outcome` refers to the output of :code:`train_model`, and :code:`data_for_prediction` refers to the data in the :code:`prediction_data` collection. To access the model and the metrics, you have to unpack the :code:`train_outcome` variable.

DF, DFS, Sink and OBJ
+++++++++++++++++++++

Malevich makes use of specific data types when passing data between processors.
Each of these types denotes a specific entity that a processor can receive as an input or return as an output.

* :class:`DF` - a single instance of tabular data. The table can follow a specific schema.
* :class:`DFS` - a collection of tabular data. The collection can be bound to a specific number of tables or be unlimited. It can also impose a schema on each table.
* :class:`Sink` - a collection of DFS that lets you mark a processor as capable of being linked to an unbounded number of processors.
* :class:`OBJ` - a collection of files that can hold arbitrary binary data.

See how they are applied in the following example:

.. code-block:: python

    from malevich.square import processor, DF, DFS, Sink, OBJ, obj

    @processor()
    def train_model(data: DF["TrainData"]) -> tuple[OBJ, DF["Metrics"]]:
        ...
        return model, metrics

    @processor()
    def predict(
        train_outcome: DFS["obj", "Metrics"],
        data_for_prediction: DF["ValidationData"]
    ) -> DF["Predictions"]:
        # Both outputs of train_model arrive as one DFS
        model, metrics = train_outcome
        ...
        return predictions

Context schema
++++++++++++++

You may define a schema for the context object.

.. code-block:: python

    from malevich.square import processor, Context, schema

    @schema()  # Makes the class pydantic
    class MySchema:
        param1: str
        param2: int

    @processor()
    def my_processor(input1, input2, context: Context[MySchema]):
        context.app_cfg.param1  # Access to the app configuration as a model

Dataframe schema
++++++++++++++++

You may also define a schema for a dataframe. Use primitive types to declare the columns you expect in the dataframe.

.. code-block:: python

    from malevich.square import processor, DF, schema

    @schema()
    class MySchema:
        total: str
        title: int

    @processor()
    def my_processor(input1: DF[MySchema], input2: DF[MySchema]):
        pass

The inputs will be validated and remapped if possible. For example, if the processor is called with a dataframe that has a column :code:`num_elements` of type :code:`int` and a column :code:`text` of type :code:`str`, the columns will be remapped to :code:`title` and :code:`total` respectively.
The rules of remapping are as follows:

1. If the number of columns is the same and the order of column types is the same, the columns will be remapped by position.
2. Columns with the same name will be mapped to each other; the remaining columns will be mapped by the first rule.
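
The two rules above can be sketched in plain Python. This is an illustrative sketch only, not Malevich's actual implementation: the function :code:`remap_columns`, the representation of columns as ordered :code:`(name, type)` pairs, and the choice to raise :code:`ValueError` when no remapping is possible are all assumptions made for the example. The sketch also assumes that same-name columns are matched without comparing their types, which the rules do not specify.

```python
def remap_columns(actual, expected):
    """Return a dict mapping actual column names to expected column names.

    Both arguments are lists of (name, type) pairs in column order.
    Hypothetical helper for illustration; not part of malevich.square.
    """
    if len(actual) != len(expected):
        raise ValueError("column counts differ")

    expected_names = {name for name, _ in expected}

    # Rule 2: columns that already carry an expected name map to themselves.
    mapping = {name: name for name, _ in actual if name in expected_names}

    # Rule 1, applied to the remaining columns: map by position,
    # provided the order of column types is the same.
    rest_actual = [(n, t) for n, t in actual if n not in mapping]
    rest_expected = [(n, t) for n, t in expected if n not in mapping]
    if [t for _, t in rest_actual] != [t for _, t in rest_expected]:
        raise ValueError("column types do not line up")

    for (actual_name, _), (expected_name, _) in zip(rest_actual, rest_expected):
        mapping[actual_name] = expected_name
    return mapping


# Schema from the example above: total is a str, title is an int.
schema_columns = [("total", str), ("title", int)]

# No shared names, same type order: purely positional remapping.
print(remap_columns([("text", str), ("num_elements", int)], schema_columns))
# {'text': 'total', 'num_elements': 'title'}

# One shared name; the leftover column is remapped by the first rule.
print(remap_columns([("total", str), ("count", int)], schema_columns))
# {'total': 'total', 'count': 'title'}
```

Treating name matches first and falling back to positional matching keeps the behaviour predictable: a column that already has the right name never moves, and only the unmatched columns depend on ordering.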