Concepts and Features¶
Flows, FlowBuilders, and Entities¶
The two fundamental objects in Bionic are the Entity and the Flow. An entity is any of the Python objects that makes up an analysis: a dataframe, a parameter, a model, a plot, even a database connection. Each entity has a unique name and a value, which may be either fixed or derived from other entities.
For example, a very simple analysis might have four entities:
some raw data (defined as a fixed value)
the cleaned-up data (a function of the raw data)
a statistical model (a function of the clean data)
a plot showing the fit of the model (a function of the model and the clean data)
When grouped together in an analysis, these entities make up a flow. The goal of Bionic is to make it easy to assemble flows, run them to compute their component entities, and eventually share and re-use them.
Bionic has two classes for representing flows: Flow
and
FlowBuilder
. They both represent the same data
model, but FlowBuilder
is mutable and intended for building a flow,
while Flow
is immutable and intended for running or sharing a flow.
The basic mechanics of building and running flows are illustrated in the Hello World tutorial.
Declaring, Setting, Assigning, and Deriving Entities¶
Once a FlowBuilder is created, entities can be defined and updated in a few different ways:
import bionic as bn
builder = bn.FlowBuilder('my_flow')
# Creates a new entity and assigns it a fixed value.
builder.assign('greeting', 'Hello')
# Declares an entity but doesn't set any value.
builder.declare('subject')
# Sets the value of an existing entity. Can overwrite previously-set
# values.
builder.set('subject', 'world')
# Creates a new entity whose value is derived from other entities.
# The entity name and its dependencies are inferred from the function
# name and arguments.
# (Can also overwrite an existing entity definition.)
@builder
def message(greeting, subject):
return f'{greeting} {subject}!'
The point of the distinction between declare
, set
, and assign
is to
make it explicit whether you’re creating a new entity or updating an existing
one. In particular, when working with a flow created by someone else, it
would otherwise be easy to attempt to change an existing entity value, but
mistype the name and create a new, unused entity.
Building And Running Flows¶
To run a flow, we build
the FlowBuilder
into a Flow
, then use
get
to compute the value of any of the entities:
flow = builder.build()
# Returns 'Hello world!'
flow.get('message')
Modifying Built Flows¶
Although Flow
objects are immutable, they provide a setting
method that can be used to create a modified copy of a
flow:
new_flow = flow.setting('greeting', 'Goodbye').setting('subject', 'galaxy')
# Returns "Goodbye galaxy!"
new_flow.get('message')
# Still returns "Hello world!"
flow.get('message')
If more extensive changes are needed (such as creating new entities, or setting
derived entities), a Flow
can also be converted back to a mutable
FlowBuilder
:
new_builder = flow.to_builder()
@new_builder
def loud_message(message):
return message.upper()
# Returns "HELLO WORLD!"
new_builder.build().get('loud_message')
Defining Multiple Outputs Using Decorators¶
When creating derived entities, Bionic infers the inputs and output of your entity from the Python function you provide. This provides a very convenient way to define relationships between entities – but sometimes we want to specify more complicated behavior. For these cases Bionic provides special decorators.
For example, sometimes we have one function that returns multiple distinct
values. These can be assigned to different entities with the @outputs
decorator:
@builder
@bn.outputs('first_name', 'last_name')
def split_name(full_name):
first_name, last_name = full_name.split()
return first_name, last_name
(Since we’re explicitly providing the names of the output entities, the name of the function is ignored here.)
Bionic provides several built-in decorators that modify how a function is interpreted and converted to an entity (or entities). In the future, it will be possible for users to write their own decorators as well.
Configuration with Internal Entities¶
In additional to the entities defined by the user, each Flow
has a
collection of “internal” entities which control its behavior. For example,
the built-in core__persistent_cache__global_cache_dir
entity controls the
location of Bionic’s persistent cache. Internal entities are usually omitted
from user-facing lists and visualizations, but they can be accessed and
modified by name just like regular entities.
Caching and Protocols¶
Whenever Bionic computes an entity’s value, it automatically caches that value
in memory (in case you access it again in from the same Flow
object) and in
persistent storage (in case you want to access it later, perhaps after
restarting your script or notebook). Currently the only supported location for
persistent storage is your computer’s hard disk.
Bionic’s caching can be seen in action in the ML tutorial.
Cache Invalidation and Versioning¶
If parts of your flow change, old cached entries may become invalid and need to
be recomputed. This is not an issue for the in-memory cache – it is
associated with a specific Flow
object, which is immutable, so if you
create a new Flow
instance its in-memory cache will be empty. However,
with the persistent cache, the situation is more involved.
There are three ways for a cached value to become invalid:
A new value is defined for that entity, such as via
FlowBuilder.set
orFlow.setting
.The entity is defined as a function, and one of its dependencies becomes invalid.
The entity is defined as a function, and the code of that function is changed.
Bionic can detect cases 1 and 2 automatically: if you update the value of any
entity in your flow, all downstream cached values will automatically be
invalidated, and they will be recomputed from scratch next time they’re
requested 1. However, case 3 is difficult to detect automatically, so we
provide a special @version
decorator to tell Bionic
when a function’s code has changed. For example, if we’ve defined a
message
entity:
@builder
def message(greeting, subject):
return f'{greeting} {subject}!'
If we want to change the code that generates message
, we attach the
decorator:
@builder
@bionic.version(1)
def message(greeting, subject):
return f'{greeting} {subject}!!!'.upper()
If the function has a different version
from the cached value, the cached
value will be disregarded and a new value will be recomputed. Each subsequent
time we change this function, we just increment the version number.
- 1
Bionic detects changes by hashing all of the fixed entity values, and storing each computed value alongside a hash of all its inputs.
Automatic Versioning¶
New in version 0.5.0.
Note
This feature is somewhat experimental. However, if it proves useful, we may make assisted versioning the default behavior in the future.
By default, Bionic expects you to manually update a version decorator each time
you modify a function’s code. However, it can be configured to automatically
detect code changes and warn you if the code changes but the version doesn’t.
This “assisted versioning” behavior is enabled by changing Bionic’s versioning
mode from 'manual'
to 'assist'
:
builder.set('core__versioning_mode', 'assist')
In this mode, if Bionic finds a cached file created by a function with the
same version but different code 2, it will raise a
CodeVersioningError
. You can resolve this error by updating the
@version
, which tells Bionic to ignore the cached file
and compute a new value.
# Trying to compute this new version of ``message`` will throw an exception.
@builder
def message(greeting, subject):
return f'{greeting} {subject}!!!'.upper()
# With the version updated, Bionic knows to recompute this.
@builder
@bionic.version(1)
def message(greeting, subject):
return f'{greeting} {subject}!!!'.upper()
However, some code changes, such as refactoring or performance optimizations,
have no effect on the function’s behavior; in this case we might prefer to keep
using the cached value. If you’re confident that your change has no effect,
you can provide a minor
argument to @version
. Bionic only uses the
first argument (“major
”) for cache invalidation; updating the minor
argument tells Bionic to ignore the code differences and keep using any cached
file as long as the major
version matches.
# Even though we changed the code, Bionic won't recompute this.
@builder
@bionic.version(major=1, minor=1)
def message(greeting, subject):
return f'{greeting} {subject}!!!'.upper()
Be aware that Bionic can’t detect every change that can affect your code’s behavior. It only looks at the code of the decorated function itself; if you change any other function or library that your decorated function depends on, Bionic won’t notice. Similarly, if your function is wrapped by a non-Bionic decorator, Bionic won’t detect any code changes in that function at all. That’s why this mode only provides a warning, rather than automatically invalidating the cache for you: to keep you in the habit of thinking carefully about versioning.
However, if you do want to completely automate the versioning process, you can set Bionic to a “fully automatic” mode:
builder.set('core__versioning_mode', 'auto')
In this mode, Bionic will automatically invalidate cached files whenever a
function’s code changes, so you don’t need to set a @version
at all.
(However, you can still update the @version
to tell Bionic about external
changes that it can’t detect.) This mode is more dangerous, but can be useful
when your functions are small, change fast, and have few external dependencies
– for example, when your flow is defined in a notebook.
- 2
Bionic detects code changes by extracting and hashing the Python bytecode of each function decorated by a FlowBuilder.
Disabling Persistent Caching¶
In some cases, it doesn’t make sense to make a persistent copy of an entity’s value, either because the value is much cheaper to compute than to store, or because the value has a type that’s difficult to serialize. In these cases, we can disable persistent caching altogether:
@builder
@bionic.persist(False)
def message(subject):
return f'Hello {subject}.'
Disabling In-Memory Caching¶
In other cases, it might be useful to not keep an entity in memory, but store it on disk and load it only when needed for downstream computation. This might be useful when it is too expensive to keep all entities in memory. In these cases, we can disable in-memory caching:
@builder
@bionic.memoize(False)
def message(subject):
return f'Hello {subject}.'
Location of the Cache Directory¶
By default, Bionic persists cached values on the local disk, in a directory
called bndata/$NAME_OF_FLOW
. This can be configured by modifying one of
two internal entities:
builder = bionic.FlowBuilder('my_flow')
# Cache this flow's data in /my_cache_dir/my_flow/
builder.set('core__persistent_cache__global_dir', 'my_cache_dir')
# Cache this flow's data in /my_cache_dir/
builder.set('core__persistent_cache__flow_dir', 'my_cache_dir')
Caching in Google Cloud Storage¶
Bionic can be configured to cache to Google Cloud Storage as well as on the local filesystem:
builder = bionic.FlowBuilder('my_flow')
# You need to have an existing, accessible GCS bucket already.
builder.set('core__persistent_cache__gcs__bucket_name', 'my-bucket')
builder.set('core__persistent_cache__gcs__enabled', True)
By default, Bionic stores its cached files with a prefix of
$NAME_OF_USER/bndata/$NAME_OF_FLOW/
; this can be configured by setting the
core__persistent_cache__gcs__object_path
entity:
builder.set('core__persistent_cache__gcs__bucket_name', 'my-bucket')
builder.set('core__persistent_cache__gcs__object_path', 'my/path/')
builder.set('core__persistent_cache__gcs__enabled', True)
Alternatively, a single GCS URL can be provided:
builder.set('core__persistent_cache__gcs__url', 'gs://my-bucket/my/path/')
builder.set('core__persistent_cache__gcs__enabled', True)
Bionic will load data from the GCS cache whenever it’s not in the local cache, and will write back to both caches. Note that the upload time will make each entity computation a bit slower.
In order to use GCS caching, you must have the gsutil tool installed, and
you must have GCP credentials configured. You should also use pip install
bionic[gcp]
to install the required Python libraries.
Serialization Protocols¶
In order to persistently cache an entity’s value – which is a Python object – Bionic needs to serialize the value, converting it to a series of bytes which can be stored in a file. Conversely, to retrieve the value from the cache, those bytes need to be deserialized back into a Python object. The best way to serialize and deserialize a given value depends on its type.
Most Python objects can be serialized with Python’s built-in pickle module. However, for some
object types it’s more efficient or more idiomatic to use a different format.
There are also some types of objects that can’t be pickled at all. Bionic uses
pickle
by default, but handles some types specially: Pandas DataFrames are serialized in the Parquet format, while Pillow Images are serialized as PNGs. You can
explictly specify a serialization strategy for an entity by attaching a
Protocol to its definition.
Retrieving Persisted Files¶
In some cases, you’ll want to directly access the persisted file(s) for an
entity rather than its in-memory representation. (For example, if you’re
writing a paper or report, you may want to access the files containing the
plots.) This can be achieved with the mode
argument to
Flow.get
method. For example:
flow = builder.build().setting('subject', 'Alice')
flow.get('subject', mode='path')
This would return a Path
object for the subject
entity.
Multiplicity¶
So far we’ve only considered flows where each entity has a single value. However, often we want several instances of a particular part of our flow. To facilitate this, Bionic allows any entity to be assigned multiple values at once:
flow = builder.build()
flow2 = flow.setting('subject', values=['Alice', 'Bob'])
If an entity has multiple values, we have to tell Bionic that we expect a collection of values when we retrieve it:
# Returns `{'Alice', 'Bob'}`.
flow2.get('subject', 'set')
The “multiplicity” of the subject
entity is propagated to all downstream
entities as well:
# Returns `{'Hello Alice!', 'Hello Bob!'}`.
flow2.get('message', 'set')
This can also be used on multiple entities at once:
flow4 = flow2.setting('greeting', values=['Hello', 'Hi'])
# Returns `{'Hello Alice!', 'Hello Bob!', 'Hi Alice!', 'Hi Bob!}`.
flow4.get('message', 'set')
The multiplicity feature is illustrated in more detail later in the ML tutorial.
The Relational Model of Multiplicity¶
Bionic uses a relational model to determine how many instances of each entity to create. In essence, each entity has a “table” of values. For fixed entities, the values are provided explicitly by the user; for derived entities, they are constructed by a join-like operation on the entity’s dependencies’ tables.
For example, in the previous flow, we had two values of greeting
and two
values of subject
, producing four values of message
– one for each
combination. In other words, we took the Cartesian product of all possible inputs for
the message
entity.
However, Bionic will only combine values that are “compatible” with each other. For example:
builder.set('full_name', values=['Alice Adams', 'Bob Baker'])
@builder
def first_name(full_name):
return full_name.split()[0]
@builder
def last_name(full_name):
return full_name.split()[-1]
@builder
def reversed_name(first_name, last_name):
return f'{last_name}, {first_name}'
flow = builder.build()
# Returns `{'Adams, Alice', 'Baker, Bob'}`.
flow.get('reversed_name', 'set')
Even though reversed_name
depends on first_name
and last_name
, and
they each have two values, we don’t use every possible combination. Since
first_name
and last_name
share an ancestor, we only combine values
derived from the same ancestor value. "Alice"
and "Baker"
are derived
from different full_name
s, so they won’t be combined together.
Gathering¶
Often, if we have multiple instances of an entity, we eventually want to
aggregate those instances together and compare them somehow. This is the
function of the @gather
decorator.
Returning to the “hello world” example:
builder.set('greeting', values=['Hello', 'Hi'])
builder.set('subject', values=['Alice', 'Bob'])
# Returns `{'Hello Alice!', 'Hello Bob!', 'Hi Alice!', 'Hi Bob!}`.
builder.build().get('message', 'set')
@builder
@bn.gather(over='subject', also='message', into='gather_df')
def message_for_all_subjects(gather_df):
messages = gather_df.sort_values('subject')['message']
return ' '.join(messages)
# Return `{'Hello Alice! Hello Bob!', 'Hi Alice! Hi Bob!'}`
builder.build().get('message_for_all_subjects', 'set')
The effect of @gather
here is to “gather” together all the different
instances of subject
into a single dataframe, along with the associated
values of message
. Our message_for_all_subjects
function then combines
those messages together into a single message. The final result is an entity
with two distinct values.
Essentially, we create multiplicity with the values=
keyword, and we remove
it with the @gather
decorator. In this example, we created multiplicity
across two dimensions (greeting
and subject
), and then removed one
dimension (subject
), leaving one dimension remaining (greeting
).
Notice also that @gather
is treating the over
argument differently from
the also
argument; both are included in the dataframe, but only the former
affects the multiplicity of the resulting entity. (Incidentally, either of
these arguments can also accept a list of strings instead of a single string.)
This model of multiplicity takes some getting used to, but the payoff is that we only have to think about multiplicity in two places: where we create it, and where we remove it. Any intermediate entities are oblivious to how many times they’re instantiated. This quality is also demonstrated in the same tutorial section.
Case-by-Case Assignment¶
Normally, Bionic infers which entity values can be combined with others based on their ancestry. However, sometimes we want to explicitly specify which values are “compatible” with each other. In the situations, we can assign values by “case” instead of by entity.
builder.declare('color')
builder.declare('animal')
builder.add_case('color', 'black', 'animal', 'cat')
builder.add_case('color', 'brown', 'animal', 'cat')
builder.add_case('color', 'brown', 'animal', 'fox')
@builder
def colored_animal(color, animal):
return f'{color} {animal}'
# Returns `{'black cat', 'brown cat', 'brown fox'}`.
builder.build().get('colored_animal', 'set')
Other Features¶
Plotting¶
Bionic is based on a functional paradigm: the only important thing about a function is the value it returns, rather than any side effects it might have. However, some plotting libraries – most notably Matplotlib – don’t work like this. Instead, they maintain a global, stateful canvas which the user incrementally writes to and then visualizes.
Since plotting is a crucial part of data analysis, Bionic bridges this gap by
providing a @pyplot
decorator, which translates a
function using the Matplotlib API into a regular Bionic entity whose value is
a Pillow Image object.
@builder
@bn.pyplot('my_plt')
def my_plot(dataframe, my_plt):
my_plt.scatter(x=dataframe['time'], y=dataframe['profit'])
# Returns an Image object containing the plot.
builder.build().get('my_plot')
Logging¶
Bionic uses the built-in Python logging module to log what it does.
Currently it doesn’t attempt to configure any log handlers, since that’s
conventionally the responsibility of the application rather than a library.
This means that you will only see log messages that meet Python’s default
severity threshold: WARNING
and above. To see a running log of what Bionic
is computing, set the threshold to INFO
. You can do this with the
bionic.util.init_basic_logging
convenience function.
In the future, Bionic will probably have a configurable option to initialize the logging state itself. It will also provide an easy way for entity functions to access individually-named loggers, rather than having to create them themselves.
Reloading Flows in Notebooks¶
One of Bionic’s design goals is to make it easy for flows to be defined in Python module files but accessed in notebooks. However, one challenge is that when a module file is updated, the change is not reflected in the notebook – instead, the module has to be manually reloaded, and then the flow object has to be re-imported.
from my_module import flow
...
import my_module
reload(my_module)
from my_module import flow
flow.get('my_entity')
(Jupyter’s autoreload doesn’t work here, because after reloading we still need to re-import the flow.)
To address this, the Flow.reloading
method can
be used:
from my_module import flow
...
flow = flow.reloading()
flow.get('my_entity')
This attempts to reload all modules associated with the flow, and then return a re-imported version of the flow. (This is a fairly magical procedure – in complicated cases, it may not be able to figure out how to do this. In these cases it will try to throw an exception rather than fail silently.)
Combining Flows¶
When building a flow, you can import entities from another flow using the
merge
method:
builder.merge(flow)
This allows you to extend the functionality of a flow, or to combine multiple
flows into one. You can also combine two already-built flows using the
analogous merging
method.
If the two flows being merged have any entity names in common and Bionic can’t
figure out which one to keep, it will throw an exception. You can resolve the
conflict by using the keep
argument to specify which definitions to keep:
builder.merge(flow, keep='old')
Visualizing Flows¶
Bionic can visualize any flow as a directed acyclic graph, or “DAG”:
flow.render_dag()
Each entity in the flow is represented as a box, with arrows representing dependencies (the arrow points from the depended-on entity to the depending one). See the ML tutorial an example. This functionality requires the Graphviz library.