Concepts and Features

Flows, FlowBuilders, and Entities

The two fundamental objects in Bionic are the Entity and the Flow. An entity is any one of the Python objects that make up an analysis: a dataframe, a parameter, a model, a plot, even a database connection. Each entity has a unique name and a value, which may be either fixed or derived from other entities.

For example, a very simple analysis might have four entities:

  • some raw data (defined as a fixed value)

  • the cleaned-up data (a function of the raw data)

  • a statistical model (a function of the clean data)

  • a plot showing the fit of the model (a function of the model and the clean data)

When grouped together in an analysis, these entities make up a flow. The goal of Bionic is to make it easy to assemble flows, run them to compute their component entities, and eventually share and re-use them.

Bionic has two classes for representing flows: Flow and FlowBuilder. They both represent the same data model, but FlowBuilder is mutable and intended for building a flow, while Flow is immutable and intended for running or sharing a flow.

The basic mechanics of building and running flows are illustrated in the Hello World tutorial.

Declaring, Setting, Assigning, and Deriving Entities

Once a FlowBuilder is created, entities can be defined and updated in a few different ways:

import bionic as bn

builder = bn.FlowBuilder('my_flow')

# Creates a new entity and assigns it a fixed value.
builder.assign('greeting', 'Hello')

# Declares an entity but doesn't set any value.
builder.declare('subject')

# Sets the value of an existing entity.  Can overwrite previously-set
# values.
builder.set('subject', 'world')

# Creates a new entity whose value is derived from other entities.
# The entity name and its dependencies are inferred from the function
# name and arguments.
# (Can also overwrite an existing entity definition.)
@builder
def message(greeting, subject):
    return f'{greeting} {subject}!'

The point of the distinction between declare, set, and assign is to make it explicit whether you’re creating a new entity or updating an existing one. In particular, when working with a flow created by someone else, it would otherwise be easy to attempt to change an existing entity value, but mistype the name and create a new, unused entity.

Building And Running Flows

To run a flow, we build the FlowBuilder into a Flow, then use get to compute the value of any of the entities:

flow = builder.build()

# Returns 'Hello world!'
flow.get('message')

Modifying Built Flows

Although Flow objects are immutable, they provide a setting method that can be used to create a modified copy of a flow:

new_flow = flow.setting('greeting', 'Goodbye').setting('subject', 'galaxy')

# Returns "Goodbye galaxy!"
new_flow.get('message')

# Still returns "Hello world!"
flow.get('message')

If more extensive changes are needed (such as creating new entities, or setting derived entities), a Flow can also be converted back to a mutable FlowBuilder:

new_builder = flow.to_builder()

@new_builder
def loud_message(message):
    return message.upper()

# Returns "HELLO WORLD!"
new_builder.build().get('loud_message')

Defining Multiple Outputs Using Decorators

When creating derived entities, Bionic infers the inputs and output of your entity from the Python function you provide. This provides a very convenient way to define relationships between entities – but sometimes we want to specify more complicated behavior. For these cases Bionic provides special decorators.

For example, sometimes we have one function that returns multiple distinct values. These can be assigned to different entities with the @outputs decorator:

@builder
@bn.outputs('first_name', 'last_name')
def split_name(full_name):
    first_name, last_name = full_name.split()
    return first_name, last_name

(Since we’re explicitly providing the names of the output entities, the name of the function is ignored here.)

Bionic provides several built-in decorators that modify how a function is interpreted and converted to an entity (or entities). In the future, it will be possible for users to write their own decorators as well.

Documenting Entities

Each Bionic entity can optionally have a documentation string associated with it. Entities defined by functions can use the regular Python docstring syntax:

@builder
def message(greeting, subject):
    """A nice thing to say to someone."""
    return f'{greeting} {subject}!'

If the function defines multiple entities, the @docs decorator can be used to specify documentation for each one:

@builder
@bn.outputs('first_name', 'last_name')
@bn.docs('The first name.', 'The last name.')
def split_name(full_name):
    first_name, last_name = full_name.split()
    return first_name, last_name

For entities with fixed values, an optional doc argument is available:

builder.assign('greeting', 'Hello', doc="A nice way to start a message.")
builder.declare('subject', doc="The person we're talking to.")

These documentation strings are helpful for people reading your code, and are sometimes visible to the users of your flow. For example, Python’s built-in help function can be used to view an entity’s documentation:

help(flow.get.message)

Configuration with Internal Entities

In addition to the entities defined by the user, each Flow has a collection of “internal” entities which control its behavior. For example, the built-in core__persistent_cache__global_dir entity controls the location of Bionic’s persistent cache. Internal entities are usually omitted from user-facing lists and visualizations, but they can be accessed and modified by name just like regular entities.
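For example, internal entities can be set and read through the same FlowBuilder and Flow methods as user-defined ones (a sketch using the cache-directory entity described later in this document):

import bionic as bn

builder = bn.FlowBuilder('my_flow')

# Internal entities are set just like regular ones.
builder.set('core__persistent_cache__global_dir', 'my_cache_dir')

# They can also be read back from a built flow.
builder.build().get('core__persistent_cache__global_dir')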

Caching and Protocols

Whenever Bionic computes an entity’s value, it automatically caches that value in memory (in case you access it again from the same Flow object) and to disk (in case you want to access it later, perhaps after restarting your script or notebook). Bionic can also be configured to cache to Google Cloud Storage.
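For example (a sketch, reusing the message flow from above):

flow = builder.build()

# Computes the value, then caches it in memory and on disk.
flow.get('message')

# Reuses the in-memory copy on the same Flow object.
flow.get('message')

# A new Flow instance starts with an empty in-memory cache, but can still
# load the value from the persistent cache on disk.
builder.build().get('message')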

Bionic’s caching can be seen in action in the ML tutorial.

Cache Invalidation and Versioning

If parts of your flow change, old cached entries may become invalid and need to be recomputed. This is not an issue for the in-memory cache – it is associated with a specific Flow object, which is immutable, so if you create a new Flow instance its in-memory cache will be empty. However, with the persistent cache, the situation is more involved.

There are three ways for a cached value to become invalid:

  1. A new value is defined for that entity, such as via FlowBuilder.set or Flow.setting.

  2. The entity is defined as a function, and one of its dependencies becomes invalid.

  3. The entity is defined as a function, and the code of that function is changed.

Bionic can detect cases 1 and 2 automatically: if you update the value of any entity in your flow, all downstream cached values will automatically be invalidated, and they will be recomputed from scratch next time they’re requested [1]. However, case 3 is difficult to detect automatically, so we provide a special @version decorator to tell Bionic when a function’s code has changed. For example, if we’ve defined a message entity:

@builder
def message(greeting, subject):
    return f'{greeting} {subject}!'

If we want to change the code that generates message, we attach the decorator:

@builder
@bn.version(1)
def message(greeting, subject):
    return f'{greeting} {subject}!!!'.upper()

If the function has a different version from the cached value, the cached value will be disregarded and a new value will be recomputed. Each subsequent time we change this function, we just increment the version number.
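For example, a later revision of the same function might look like this (a sketch; the new function body is just an illustration):

@builder
@bn.version(2)
def message(greeting, subject):
    return f'{greeting}, {subject}!'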

[1] Bionic detects changes by hashing all of the fixed entity values, and storing each computed value alongside a hash of all its inputs.

Automatic Versioning

New in version 0.5.0.

Note: This feature is somewhat experimental. However, if it proves useful, we may make assisted versioning the default behavior in the future.

By default, Bionic expects you to manually update a version decorator each time you modify a function’s code. However, it can be configured to automatically detect code changes and warn you if the code changes but the version doesn’t. This “assisted versioning” behavior is enabled by changing Bionic’s versioning mode from 'manual' to 'assist':

builder.set('core__versioning_mode', 'assist')

In this mode, if Bionic finds a cached file created by a function with the same version but different code [2], it will raise a CodeVersioningError. You can resolve this error by updating the @version, which tells Bionic to ignore the cached file and compute a new value.

# Trying to compute this new version of message will throw an exception.
@builder
def message(greeting, subject):
    return f'{greeting} {subject}!!!'.upper()

# With the version updated, Bionic knows to recompute this.
@builder
@bn.version(1)
def message(greeting, subject):
    return f'{greeting} {subject}!!!'.upper()

However, some code changes, such as refactoring or performance optimizations, have no effect on the function’s behavior; in this case we might prefer to keep using the cached value. If you’re confident that your change has no effect, you can provide a minor argument to @version. Bionic only uses the first argument (“major”) for cache invalidation; updating the minor argument tells Bionic to ignore the code differences and keep using any cached file as long as the major version matches.

# Even though we changed the code, Bionic won't recompute this.
@builder
@bn.version(major=1, minor=1)
def message(greeting, subject):
    return f'{greeting} {subject}!!!'.upper()

Be aware that Bionic can’t detect every change that can affect your code’s behavior. It only looks at the code of the decorated function itself; if you change any other function or library that your decorated function depends on, Bionic won’t notice. Similarly, if your function is wrapped by a non-Bionic decorator, Bionic won’t detect any code changes in that function at all. That’s why this mode only provides a warning, rather than automatically invalidating the cache for you: to keep you in the habit of thinking carefully about versioning.
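For example (a hypothetical sketch), suppose a decorated function calls a helper defined elsewhere in your module:

def normalize(subject):
    return subject.strip().title()

@builder
@bn.version(1)
def message(greeting, subject):
    # If normalize() changes, Bionic won't see it in this function's own
    # code; we have to bump the major version ourselves.
    return f'{greeting} {normalize(subject)}!'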

However, if you do want to completely automate the versioning process, you can set Bionic to a “fully automatic” mode:

builder.set('core__versioning_mode', 'auto')

In this mode, Bionic will automatically invalidate cached files whenever a function’s code changes, so you don’t need to set a @version at all. (However, you can still update the @version to tell Bionic about external changes that it can’t detect.) This mode is more dangerous, but can be useful when your functions are small, change fast, and have few external dependencies – for example, when your flow is defined in a notebook.

[2] Bionic detects code changes by extracting and hashing the Python bytecode of each function decorated by a FlowBuilder.

Disabling Persistent Caching

In some cases, it doesn’t make sense to make a persistent copy of an entity’s value, either because the value is much cheaper to compute than to store, or because the value has a type that’s difficult to serialize. In these cases, we can disable persistent caching altogether:

@builder
@bn.persist(False)
def message(subject):
    return f'Hello {subject}.'

If your goal is just to force an entity to be recomputed more frequently, you may want @changes_per_run instead.

Disabling In-Memory Caching

In other cases, we may want to avoid keeping an entity’s value in memory, instead storing it on disk and loading it only when a downstream computation needs it. This is helpful when keeping every entity in memory would be too expensive. In these cases, we can disable in-memory caching:

@builder
@bn.memoize(False)
def message(subject):
    return f'Hello {subject}.'

In-memory caching can also be globally disabled:

builder.set('core__memoize_by_default', False)

This only changes the default behavior, so it can be explicitly re-enabled for individual entities:

builder.set('core__memoize_by_default', False)

@builder
@bn.memoize(True)
def message(subject):
    return f'Hello {subject}.'

Non-Deterministic Computation

New in version 0.7.0.

The basic assumption behind Bionic’s caching behavior is that entity functions are deterministic: if you call them multiple times with the same input, they always return the same output. However, some functions are non-deterministic: their output can change even when their input doesn’t. For example, a function that retrieves data from an external database may return different results whenever the database’s contents change. In cases like this, it’s not appropriate to reuse the function’s previous cached values; we want Bionic to recompute the value each time.

You can tell Bionic that a function is non-deterministic by applying the @changes_per_run decorator:

@builder
@bn.changes_per_run
def current_data():
    return download_data()

This causes Bionic to recompute the entity’s value instead of loading a cached value from disk. (However, this recomputation will only happen once for any given Flow instance; after that, the value will be cached in memory and reused [3].)

[3] I.e., the value is computed once per “run”. This is a compromise: although it makes logical sense to recompute the value every single time, it’s much simpler for each entity to have a consistent value within a single flow instance.

@changes_per_run vs @persist

Note that @changes_per_run has a different effect from @persist(False). If an entity is decorated with @persist(False), Bionic will never cache its value to disk, but it will still assume that its output is deterministic. The difference can be seen when we add a downstream entity:

@builder
@bn.persist(False)
def current_data():
    return download_data()

@builder
def summary(current_data):
    return summarize(current_data)

In this case, builder.build().get('current_data') will always recompute current_data, since its value is never persisted. However, builder.build().get('summary') will use a cached value if one is available; Bionic won’t bother to recompute current_data because it assumes its value will be the same anyway. In more complex flows, this incorrect assumption may lead to inconsistent results.

By contrast, if we use the appropriate decorator, @bn.changes_per_run:

@builder
@bn.changes_per_run
def current_data():
    return download_data()

@builder
def summary(current_data):
    return summarize(current_data)

Here builder.build().get('summary') will always recompute current_data first. Then, if current_data’s value has changed, it will recompute summary as well; otherwise it will use a cached value.

As a rule: use @persist(False) for entities whose values are impossible to serialize or not worth serializing. Use @changes_per_run for entities whose values are non-deterministic.

Location of the Cache Directory

By default, Bionic persists cached values on the local disk, in a directory called bndata/$NAME_OF_FLOW. This can be configured by modifying one of two internal entities:

builder = bn.FlowBuilder('my_flow')

# Cache this flow's data in my_cache_dir/my_flow/
builder.set('core__persistent_cache__global_dir', 'my_cache_dir')

# Cache this flow's data in my_cache_dir/
builder.set('core__persistent_cache__flow_dir', 'my_cache_dir')

Caching in Google Cloud Storage

Bionic can be configured to cache to Google Cloud Storage as well as on the local filesystem:

builder = bn.FlowBuilder('my_flow')

# You need to have an existing, accessible GCS bucket already.
builder.set('core__persistent_cache__gcs__bucket_name', 'my-bucket')
builder.set('core__persistent_cache__gcs__enabled', True)

By default, Bionic stores its cached files with a prefix of $NAME_OF_USER/bndata/$NAME_OF_FLOW/; this can be configured by setting the core__persistent_cache__gcs__object_path entity:

builder.set('core__persistent_cache__gcs__bucket_name', 'my-bucket')
builder.set('core__persistent_cache__gcs__object_path', 'my/path/')
builder.set('core__persistent_cache__gcs__enabled', True)

Alternatively, a single GCS URL can be provided:

builder.set('core__persistent_cache__gcs__url', 'gs://my-bucket/my/path/')
builder.set('core__persistent_cache__gcs__enabled', True)

Bionic will load data from the GCS cache whenever it’s not in the local cache, and will write back to both caches. Note that the upload time will make each entity computation a bit slower.

In order to use GCS caching, you must have the gsutil tool installed, and you must have GCP credentials configured. You should also use pip install 'bionic[gcp]' to install the required Python libraries.

Serialization Protocols

In order to persistently cache an entity’s value – which is a Python object – Bionic needs to serialize the value, converting it to a series of bytes which can be stored in a file. Conversely, to retrieve the value from the cache, those bytes need to be deserialized back into a Python object. The best way to serialize and deserialize a given value depends on its type.

Most Python objects can be serialized with Python’s built-in pickle module. However, for some object types it’s more efficient or more idiomatic to use a different format, and some types of objects can’t be pickled at all. Bionic uses pickle by default, but handles some common types specially (for example, Pandas DataFrames are stored as Parquet files).

You can explicitly specify a serialization strategy for an entity by attaching a Protocol to its definition.
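For example, a sketch attaching a protocol explicitly (this assumes the built-in bn.protocol.frame protocol for Pandas DataFrames; custom protocols follow the same pattern):

import pandas as pd

@builder
@bn.protocol.frame
def summary_table(raw_data):
    # Explicitly serialized as a DataFrame rather than via pickle.
    return pd.DataFrame(raw_data)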

Retrieving Persisted Files

In some cases, you’ll want to directly access the persisted file(s) for an entity rather than its in-memory representation. (For example, if you’re writing a paper or report, you may want to access the files containing the plots.) This can be achieved with the mode argument to the Flow.get method. For example:

flow = builder.build().setting('subject', 'Alice')
flow.get('subject', mode='path')

This would return a Path object for the subject entity.
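A common follow-up is to copy a persisted artifact somewhere else, for example into a report directory (a sketch; the plot entity and destination path are hypothetical, and the Plotting section below shows how such an entity might be defined):

import shutil

# Copy the persisted file for a plot entity into a report directory.
plot_path = flow.get('my_plot', mode='path')
shutil.copy(plot_path, 'report/figures/')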

Programmatic Cache Access

New in version 0.8.0.

Note: This API is intentionally quite minimal; we intend to add additional convenience features based on observed usage patterns. If you’d like to add new features, feel free to submit an issue or a PR on GitHub!

Although Bionic attempts to manage the cache for you automatically, it’s sometimes helpful to be able to interact with it directly. Bionic provides a basic API for exploring the cache:

for entry in flow.cache.get_entries():
    print(entry.artifact_url)

The get_entries method returns a sequence of CacheEntry objects, one for each cached entity value. These objects contain information about the cached entity and the location of the cache file itself (which may be either a local file or a cloud blob).

Cached entries can also be safely deleted using the delete method. This can be used to selectively clean up the cache:

for entry in flow.cache.get_entries():
    if entry.tier == 'local' and entry.entity == 'model':
        entry.delete()

Multiplicity

So far we’ve only considered flows where each entity has a single value. However, often we want several instances of a particular part of our flow. To facilitate this, Bionic allows any entity to be assigned multiple values at once:

flow = builder.build()
flow2 = flow.setting('subject', values=['Alice', 'Bob'])

If an entity has multiple values, we have to tell Bionic that we expect a collection of values when we retrieve it:

# Returns `{'Alice', 'Bob'}`.
flow2.get('subject', 'set')

The “multiplicity” of the subject entity is propagated to all downstream entities as well:

# Returns `{'Hello Alice!', 'Hello Bob!'}`.
flow2.get('message', 'set')

This can also be used on multiple entities at once:

flow4 = flow2.setting('greeting', values=['Hello', 'Hi'])

# Returns `{'Hello Alice!', 'Hello Bob!', 'Hi Alice!', 'Hi Bob!'}`.
flow4.get('message', 'set')

The multiplicity feature is illustrated in more detail later in the ML tutorial.

The Relational Model of Multiplicity

Bionic uses a relational model to determine how many instances of each entity to create. In essence, each entity has a “table” of values. For fixed entities, the values are provided explicitly by the user; for derived entities, they are constructed by a join-like operation on the entity’s dependencies’ tables.

For example, in the previous flow, we had two values of greeting and two values of subject, producing four values of message – one for each combination. In other words, we took the Cartesian product of all possible inputs for the message entity.

However, Bionic will only combine values that are “compatible” with each other. For example:

builder.set('full_name', values=['Alice Adams', 'Bob Baker'])

@builder
def first_name(full_name):
    return full_name.split()[0]

@builder
def last_name(full_name):
    return full_name.split()[-1]

@builder
def reversed_name(first_name, last_name):
    return f'{last_name}, {first_name}'

flow = builder.build()

# Returns `{'Adams, Alice', 'Baker, Bob'}`.
flow.get('reversed_name', 'set')

Even though reversed_name depends on first_name and last_name, and they each have two values, we don’t use every possible combination. Since first_name and last_name share an ancestor, we only combine values derived from the same ancestor value. "Alice" and "Baker" are derived from different full_names, so they won’t be combined together.

Gathering

Often, if we have multiple instances of an entity, we eventually want to aggregate those instances together and compare them somehow. This is the function of the @gather decorator.

Returning to the “hello world” example:

builder.set('greeting', values=['Hello', 'Hi'])
builder.set('subject', values=['Alice', 'Bob'])

# Returns `{'Hello Alice!', 'Hello Bob!', 'Hi Alice!', 'Hi Bob!'}`.
builder.build().get('message', 'set')

@builder
@bn.gather(over='subject', also='message', into='gather_df')
def message_for_all_subjects(gather_df):
    messages = gather_df.sort_values('subject')['message']
    return ' '.join(messages)

# Returns `{'Hello Alice! Hello Bob!', 'Hi Alice! Hi Bob!'}`.
builder.build().get('message_for_all_subjects', 'set')

The effect of @gather here is to “gather” together all the different instances of subject into a single dataframe, along with the associated values of message. Our message_for_all_subjects function then combines those messages together into a single message. The final result is an entity with two distinct values.

Essentially, we create multiplicity with the values= keyword, and we remove it with the @gather decorator. In this example, we created multiplicity across two dimensions (greeting and subject), and then removed one dimension (subject), leaving one dimension remaining (greeting).

Notice also that @gather is treating the over argument differently from the also argument; both are included in the dataframe, but only the former affects the multiplicity of the resulting entity. (Incidentally, either of these arguments can also accept a list of strings instead of a single string.)
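For example, a sketch that gathers over both dimensions at once, collapsing them into a single value (the entity name here is hypothetical):

@builder
@bn.gather(over=['greeting', 'subject'], also='message', into='gather_df')
def all_messages(gather_df):
    # Every (greeting, subject) combination appears in one dataframe.
    return ' '.join(sorted(gather_df['message']))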

This model of multiplicity takes some getting used to, but the payoff is that we only have to think about multiplicity in two places: where we create it, and where we remove it. Any intermediate entities are oblivious to how many times they’re instantiated. This quality is also demonstrated in the same tutorial section.

Case-by-Case Assignment

Normally, Bionic infers which entity values can be combined with others based on their ancestry. However, sometimes we want to explicitly specify which values are “compatible” with each other. In these situations, we can assign values by “case” instead of by entity:

builder.declare('color')
builder.declare('animal')

builder.add_case('color', 'black', 'animal', 'cat')
builder.add_case('color', 'brown', 'animal', 'cat')
builder.add_case('color', 'brown', 'animal', 'fox')

@builder
def colored_animal(color, animal):
    return f'{color} {animal}'

# Returns `{'black cat', 'brown cat', 'brown fox'}`.
builder.build().get('colored_animal', 'set')

Other Features

Parallel Execution

New in version 0.8.0.

Requesting an entity value with Flow.get can lead to a long computation, as Bionic may need to compute that entity’s dependencies, and their dependencies, and so on. By default, Bionic computes these values one at a time. However, it can also be configured to compute them in parallel; depending on the structure of your flow, this can be significantly faster.

Parallel execution can be enabled like this:

builder.set("core__parallel_execution__enabled", True)

When parallel execution is enabled, Bionic starts up several worker processes [4], each of which can work on one value at a time. Of course, a worker can only start computing a value once all its dependencies are complete, so the number of processes that can be working at once depends on the dependency graph: if there aren’t many branches in the graph, then most of the processes won’t do much work. It does take extra time to set up the processes and move information between them, so parallel execution is not guaranteed to be faster overall. However, in general, if you have many expensive operations which don’t depend on each other, enabling parallelism will improve performance.

By default, Bionic will create one worker process for each CPU on your machine. This is usually a sensible number, but it can also be set directly:

builder.set("core__parallel_execution__worker_count", 8)

In order to compute an entity value in a separate process, Bionic needs to serialize the entity function and transmit it to the other process; thus, all your functions need to be serializable by cloudpickle. (This shouldn’t be a problem unless your function uses some kind of complex global variable, which is already a bad idea.) The entity value itself doesn’t necessarily need to be picklable; it will be serialized using the protocol specified for the entity. Finally, entities marked with @persist(False) are assumed to be unserializable and will always be computed in the main process rather than being parallelized.

[4] The pool of workers is managed by Loky, which is built on Python’s multiprocessing module. The pool is global and reusable, so it should only need to be initialized once in the lifetime of the main process.

Plotting

Bionic is based on a functional paradigm: the only important thing about a function is the value it returns, rather than any side effects it might have. However, some plotting libraries – most notably Matplotlib – don’t work like this. Instead, they maintain a global, stateful canvas which the user incrementally writes to and then visualizes.

Since plotting is a crucial part of data analysis, Bionic bridges this gap by providing a @pyplot decorator, which translates a function using the Matplotlib API into a regular Bionic entity whose value is a Pillow Image object.

@builder
@bn.pyplot('my_plt')
def my_plot(dataframe, my_plt):
    my_plt.scatter(x=dataframe['time'], y=dataframe['profit'])

# Returns an Image object containing the plot.
builder.build().get('my_plot')

Logging

Bionic uses the built-in Python logging module to log what it does. Currently it doesn’t attempt to configure any log handlers, since that’s conventionally the responsibility of the application rather than a library. This means that you will only see log messages that meet Python’s default severity threshold: WARNING and above. To see a running log of what Bionic is computing, set the threshold to INFO. You can do this with the bionic.util.init_basic_logging convenience function.
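For example (a minimal sketch; init_basic_logging is shown here being called with no arguments, and its exact signature may differ):

import bionic.util

# Lower the logging threshold so Bionic's INFO-level progress messages
# are displayed.
bionic.util.init_basic_logging()

The standard-library call logging.basicConfig(level=logging.INFO) achieves a similar result.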

In the future, Bionic will probably have a configurable option to initialize the logging state itself. It will also provide an easy way for entity functions to access individually-named loggers, rather than having to create their own.

Reloading Flows in Notebooks

One of Bionic’s design goals is to make it easy for flows to be defined in Python module files but accessed in notebooks. However, one challenge is that when a module file is updated, the change is not reflected in the notebook – instead, the module has to be manually reloaded, and then the flow object has to be re-imported.

from my_module import flow

...

import importlib
import my_module
importlib.reload(my_module)
from my_module import flow
flow.get('my_entity')

(Jupyter’s autoreload doesn’t work here, because after reloading we still need to re-import the flow.)

To address this, the Flow.reloading method can be used:

from my_module import flow

...

flow = flow.reloading()
flow.get('my_entity')

This attempts to reload all modules associated with the flow, and then return a re-imported version of the flow. (This is a fairly magical procedure – in complicated cases, it may not be able to figure out how to do this. In these cases it will try to throw an exception rather than fail silently.)

Combining Flows

When building a flow, you can import entities from another flow using the merge method:

builder.merge(flow)

This allows you to extend the functionality of a flow, or to combine multiple flows into one. You can also combine two already-built flows using the analogous merging method.
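For example, a sketch combining two already-built flows (the flow and entity names are hypothetical):

combined_flow = flow_a.merging(flow_b)

# The combined flow can compute entities defined in either source flow.
combined_flow.get('some_entity')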

If the two flows being merged have any entity names in common and Bionic can’t figure out which one to keep, it will throw an exception. You can resolve the conflict by using the keep argument to specify which definitions to keep:

builder.merge(flow, keep='old')

Visualizing Flows

Bionic can visualize any flow as a directed acyclic graph, or “DAG”:

flow.render_dag()

Each entity in the flow is represented as a box, with arrows representing dependencies (the arrow points from the depended-on entity to the depending one). See the ML tutorial for an example. This functionality requires the Graphviz library.