Introduction
Autonomio provides a high-level abstraction layer for building, configuring, and optimizing neural networks, and for using the trained models to make predictions in any environment. Unlike with other similar solutions, there is no need for sign-ups, API keys, cloud instances, or GPUs, and you have 100% control over the model. A typical installation takes a minute, and training a model takes no more than a few minutes, including data transformation from a raw dataset with even thousands of columns, open text, and unstructured labels. Nothing is pre-trained, and only you have access to your data and predictions. There is no commercial entity behind Autonomio; it is maintained by a non-profit research foundation.
This document covers the functionality of Autonomio. If you're looking for a high-level overview of the capabilities, you might find the Autonomio website more useful.
1-Minute Pipeline
To train a model, use the following code:
# do the python imports
from autonomio.commands import data, wrangler, train, predictor
%matplotlib inline
# import the data from csv
df = data('medicare_10k.csv', mode='file', header=None)
# preprocess the data
df = wrangler(df,'z')
# train a neural net
train([2,17],'z',df,epoch=20,loss='logcosh',flatten='median')
NOTE: a list of column indices can be used with 3 or more columns. A list of exactly two integers is interpreted as a range of columns.
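The rule above can be illustrated with a small stand-in helper. Note that `resolve_columns` is a hypothetical function written for this example, not part of Autonomio's API, and the exact range endpoints used internally are an assumption:

```python
def resolve_columns(x):
    """Return the list of column indices implied by an x argument,
    following the documented rule: two integers form a range,
    three or more are an explicit list."""
    if isinstance(x, list) and all(isinstance(i, int) for i in x):
        if len(x) == 2:
            # two integers -> treated as a range of columns
            return list(range(x[0], x[1]))
        return x  # three or more integers -> explicit column list
    return [x]  # a single label or index

print(resolve_columns([2, 5]))     # range -> [2, 3, 4]
print(resolve_columns([1, 2, 4]))  # explicit list -> [1, 2, 4]
```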
Autonomio is very easy to use, and its namespace - just 4 commands and fewer than 40 arguments combined - is easy to memorize. Namespace memorization is one of the key differences between advanced and beginner users. Whereas Autonomio helps lower-skill practitioners dramatically improve their capability, advanced practitioners enjoy significant productivity gains and headache reduction.
Installation
The simplest way to get the latest well-tested version, with the latest features, is to install with pip directly from the repo.
pip install git+https://github.com/autonomio/core-module.git
Training Neural Network
A typical use of the training function is - you guessed it - training a neural network:
train([1,25],'Survived',df,
      flatten='none',
      epoch=250,
      dropout=0,
      batch_size=12,     # example value
      loss='logcosh',
      activation='elu',
      layers=5,          # example value
      shape='funnel',    # example value
      verbose=0)
Autonomio provides a very high level abstraction layer to several deep learning models:
- Multilayer Perceptrons (MLP)
- LSTM
- Regression
These are all accessed through the train() command.
Commands
Train
- loss
- optimization
- activation
- shape
- layers (even thousands of layers)
- dropout rate
- batch_size
Data Ingestion
Compared to TensorFlow, Keras, scikit-learn, and other common libraries, Autonomio provides a highly convenient data ingestion function.
- Automatically through train()
- Configured through train()
- Using the wrangler() utility
# a single column where data is string
train('text', 'neg', data)
# a single column by index
train(5, 'neg', data)
# a single column by label
train(['quality_score'], 'neg', data)
# a range of column index
train([1,5], 'neg', data)
# set of column labels
train(['quality_score', 'reach_score'], 'neg', data)
# a list of column index
train([1,2,4,6,18], 'neg', data)
Data can be input from a dataframe, or from csv, txt, json, or msgpack files. All common transformations take place automatically within the train() command.
- automatic transformation of input (x) variables
  - from text to word vectors
  - from text labels to integers
- automatic transformation of the outcome (y) variable
  - from continuous to categorical
    - based on mean
    - based on median
    - based on quantiles
    - based on a given value
  - from multi-category to binary
    - string values
    - numeric values
Generally speaking, multilayer perceptron neural nets are strongest at solving classification problems, where the outcome variable is either binary categorical (0 or 1) or multi-categorical. This is why Autonomio places strong emphasis on making such transformations available within the train() command.
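The mean/median flattening of a continuous outcome into a binary one, described above, can be sketched in a few lines. This is a conceptual stand-in, not Autonomio's actual implementation, and `flatten_outcome` is a hypothetical name:

```python
import statistics

def flatten_outcome(values, method='mean'):
    """Turn a continuous outcome into a binary one, as the 'flatten'
    parameter does conceptually: 1 above the cutoff, 0 at or below."""
    if method == 'mean':
        cutoff = statistics.mean(values)
    elif method == 'median':
        cutoff = statistics.median(values)
    else:  # a numeric cutoff passed directly
        cutoff = method
    return [1 if v > cutoff else 0 for v in values]

y = [2.0, 4.0, 6.0, 8.0]
print(flatten_outcome(y, 'mean'))  # cutoff is 5.0 -> [0, 0, 1, 1]
```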
BINARY (default)
- X can be text, int, or floating point
- Y can be an int, or floating point
The default settings are optimized for making a 1-or-0 prediction. For example, when predicting sentiment from tweets, Autonomio gives 85% accuracy without any parameter tuning for classifying tweets that rank in the most negative 20% according to NLTK Vader sentiment analysis.
CATEGORICAL
- X can be text, integer
- Y can be an integer or text
- output layer neurons must match number of categories
- change activation_out to something that works with categoricals
It's not a good idea to have too many categories; maybe 10 is pushing it in most cases.
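The requirement that output-layer neurons match the number of categories can be illustrated with a simple one-hot encoding sketch. The helper below is hypothetical, written only for this example:

```python
def one_hot(labels):
    """Encode labels as one-hot vectors; the number of output-layer
    neurons must equal the number of distinct categories."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    encoded = [[1 if index[label] == i else 0
                for i in range(len(categories))]
               for label in labels]
    return encoded, len(categories)

encoded, n_out = one_hot(['cat', 'dog', 'cat', 'bird'])
print(n_out)  # 3 distinct categories -> 3 output neurons needed
```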
Train Query Parameters
ARGUMENT | REQUIRED INPUT | DEFAULT |
---|---|---|
X | string, int, float | NA |
Y | int,float,categorical | NA |
data | data object | NA |
epoch | int | 5 |
flatten | string, float | 'mean' |
dropout | float | .2 |
layers | int (2 through 5) | 3 |
loss | string (any Keras loss) | 'binary_crossentropy' |
save_model | string or False | False |
neuron_first | int | 300 |
neuron_last | int | 1 |
batch_size | int | 10 |
verbose | 0,1,2 | 0 |
shape | string | 'funnel' |
double_check | True or False | False |
validation | True,False,float(0 to 1) | False |
X = The input can be indicated in several ways:
- 'label' = a single column label
- ['a','b'] = multiple column labels
- [1,12] = a range of columns
- [1,2,12] = columns by index

The data can be of multiple dtypes:
- int = any integer values
- float = any float values
- string = raw text or category labels
In case you need to clean up your data first, you can do it with:
from autonomio.commands import wrangler
wrangler(data,outcome_var)
Y = This can be of multiple dtypes:
- int = any integer values
- float = any float values
- string = category labels
See more related to prediction variable below in the 'flatten' section.
data = A pandas dataframe with at least one column for the 'x' independent variable (predictor) and one column for the 'y' dependent variable (the prediction target).
dims = This is selected automatically and does not need to be set manually. NOTE: this needs to be the same as the number of x features.
epoch = How many epochs will be run for training. More epochs take more time.
flatten = For transforming y (outcome) variable. For example if the y input is continuous but prediction is binary, then a flattening of some sort should be used.
OPTIONS: 'mean','median','mode', int, float, 'cat_string', 'cat_numeric', and 'none'
dropout = The fraction of learning that will be "forgotten" on each learning event.
layers = The number of dense layers the model will have. Note that each dense layer is followed by a dropout layer.
model = This is currently not in use. Later, when we add LSTM and other model options, it will be activated.
loss = The loss to be used with the model. All Keras losses are available: https://keras.io/losses/
optimizer = The optimizer to use with the model. All Keras optimizers are available: https://keras.io/optimizers/
activation = Activation for the hidden (non-output) layers. All Keras activations are available: https://keras.io/activations/
activation_out = Same as 'activation' (above), but for the output layer only.
save_model = An option to save the model configuration, weights and parameters.
OPTIONS: the default is False; if True, the model will be saved with the default name ('model'); if a string is given, the model name will be the string value, e.g. 'titanic'.
neuron_max = The maximum number of neurons on any layer.
neuron_last = How many neurons there are in the last layer.
batch_size = The number of samples that are propagated through the network at any given point in time. The smaller the batch_size, the longer the training will take.
verbose = Set to 0 by default. The other options are 1 and 2, which change the amount of information you get during training.
shape = Used for automatically creating a network shape. Currently there are 8 options available: 'funnel', 'rhombus', 'long_funnel', 'brick', 'hexagon', 'diamond', 'triangle', 'stairs'. A diagram is provided for each in the 'Shapes' section.
double_check = Makes a 'manual' check of the results provided by the Keras backend and compares the two. This is useful when you have doubts about the results.
validation = Validates in a more robust way than the usual train/test split by initially splitting the dataset in half, where the first half becomes train and test data, and the second half becomes the validation dataset.
OPTIONS: the default is False; with True, 50% of the data is set apart for validation.
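The split described above can be sketched as follows. The function name and the exact split mechanics are assumptions for illustration; Autonomio's internal logic may differ:

```python
def validation_split(rows):
    """Split a dataset the way the docs describe: the first half is
    used for train/test, the second half is held out for validation."""
    half = len(rows) // 2
    return rows[:half], rows[half:]

train_test, validation = validation_split(list(range(10)))
print(len(train_test), len(validation))  # 5 5
```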
Predictor
predictor(data,'model.json')
Add labels to predictions:
test('text',data,'handle','model.json')
Add an interactive scatter plot visualization with a y-axis variable:
test('text',data,'handle','model.json',y_scatter='influence_score')
To yield the scatter plot, you have to call it specifically:
test_result = test('text',data,'handle','model.json',y_scatter='influence_score')
test_result[1]
Once you've trained a model with train(), you can use it easily on any dataset through the predictor() command. You could use it in a Jupyter notebook, have it run on a server as part of some other process, or make it part of a website that does something interesting for the user based on their input, just to name a few examples. Think of a trained neural net model as what is commonly referred to as AI; it's far easier to have AIs doing various tasks than most people think.
Test Query Parameters
ARGUMENT | REQUIRED INPUT | DEFAULT |
---|---|---|
X | variable/s in dataframe | NA |
data | pandas dataframe | NA |
labels | variable/s in dataframe | NA |
saved_model | filename | NA |
y_scatter | variable in dataframe | NA |
Wrangler
The wrangler() function introduces best-of-class data ingestion capability for maximum convenience in single-file preparation. If you have to work with multiple files, handle each file separately and then merge afterwards. Based on the parameter configuration, wrangler() yields a dataframe where one or more of the following may be true:
from autonomio.commands import data, wrangler
df = data('train.csv','file')
titanic = wrangler(df,'Survived',starts_with_col='Cabin',first_fill_cols='Cabin')
NOTE: Typical kernel examples on Kaggle show that the same dataset requires a data scientist 30 to 100 lines of code to get to exactly the same result we reach here with a single wrangler() command.
- columns are dropped entirely
- rows are dropped
- unstructured columns are transformed into categories
- unstructured columns are transformed into word vectors (floats)
- NaN values are filled
data = A pandas dataframe that needs to be transformed.
y = The feature that will be moved to the first column of the dataframe and will not be transformed in any way.
max_categories = Accepts an integer value. In columns with string values (automatically detected), if there are more unique values than 'max_categories', the column will not be categorized and will be dropped instead. Such a column can be treated with the 'vectorize' parameter instead.
starts_with_col = Accepts a string value. For cases where a column of string values should be transformed into categories based on a shared first character of the string.
treshold = Accepts a floating point value (or 1). Sets the limit at which a column will be dropped entirely because of too many NaN values. For example, .6 means that if more than 60% of a column's values are NaN, the whole column will be dropped.
first_fill_cols = Accepts a column name as value. For cases where a given column's NaN values are filled first, so that the column is not dropped if it does not meet the 'treshold' parameter. This is for cases where some columns should be retained even though they have a high number of NaN values.
fill_with = A string, integer or float value. The value that is used for filling NaNs.
to_string = A column name. For cases where a given column may be needed later as a string value, for example a name to be connected with prediction values later.
vectorize = A column name. Vectorizes the text inputs into 300 features, each representing a value in the word2vec vector.
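The interaction between 'treshold', 'first_fill_cols', and 'fill_with' can be sketched on a plain dict-of-lists table. This is a conceptual reconstruction, not wrangler()'s actual code, and the function name is hypothetical:

```python
def drop_sparse_columns(table, treshold=0.9, first_fill_cols=(), fill_with=0):
    """Mimic two wrangler() behaviors: pre-fill protected columns,
    then drop any column whose NaN (None) share exceeds the treshold."""
    out = {}
    for name, values in table.items():
        if name in first_fill_cols:
            # protected columns are filled first, so they survive the drop
            values = [fill_with if v is None else v for v in values]
        nan_share = sum(v is None for v in values) / len(values)
        if nan_share <= treshold:
            out[name] = values
    return out

table = {'a': [1, None, None, None], 'b': [None, None, None, None]}
print(drop_sparse_columns(table, treshold=0.9))  # 'b' is all-NaN -> dropped
print(drop_sparse_columns(table, treshold=0.9,
                          first_fill_cols=('b',)))  # 'b' filled -> kept
```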
ARGUMENT | REQUIRED INPUT | DEFAULT |
---|---|---|
data | pandas dataframe | NA |
y | the outcome variable | NA |
max_categories | max number of unique categories | 'auto' |
starts_with_col | a column in the dataframe | NA |
treshold | a % treshold of NaN values for dropping whole column | .9 |
first_fill_cols | a column in the dataframe | NA |
fill_with | a string, integer or float value | 0 |
to_string | a column in the dataframe | NA |
vectorize | a column in the dataframe with string values* | NA |
Hyperscan
The hyperscan() function is for scanning through hyperparameter configurations automatically or based on set ranges/lists. Starting a scan is as easy as running the train() command, but instead of training a model with a single set of parameters, it does so with multiple configurations. For a detailed overview of the parameters, see the section for train(). The section below provides an overview of the parameters that are unique to hyperscan().
result = hyperscan([0,8],
'i',
diabetes,
epochs=150,
dropout=0,
scan_mode='selective',
losses='logcosh',
shapes=['brick','long_funnel'],
optimizers='rmsprop',
activations='softsign',
layers=[5,6],
batch_sizes=[14,20])
NOTE: Hyperscan is not a solution for hyperparameter optimization, but a way to automate the most mindless part of model configuration. Currently there are six parameters that can be scanned:
- number of layers
- shape of the NN
- batch_size
- activation
- optimizer
- loss
Each can be scanned in three modes:
- single value
- a list of values
- all values ('auto')
In addition 'batch_size' and 'layers' also support:
- a range of values
- a stepped range of values
For full reference, see the section for train() parameters.
batch_size_step = An integer. The number of values skipped; for example, in a range of 2 to 20, a 'batch_size_step' value of 2 will skip 3, 5, 7, 9, and so on.
layers_step = Same as 'batch_size_step' above but for layers.
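The stepped ranges described above behave like an ordinary stepped integer range. A minimal sketch (hypothetical helper, assuming the range endpoints are inclusive):

```python
def scan_values(start, stop, step=1):
    """Values a stepped scan would cover, e.g. batch sizes 2..20
    with batch_size_step=2 covering every other value."""
    return list(range(start, stop + 1, step))

print(scan_values(2, 20, 2))  # [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
```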
scan_mode = If set to 'auto', all possible options will be scanned through. Note that this will take time, even on a powerful machine. In most cases it's better to use 'selective' with reasonable preset values in lists.
ARGUMENT | REQUIRED INPUT | DEFAULT |
---|---|---|
x | string, int, float | NA |
y | the outcome variable | NA |
data | pandas dataframe | NA |
flatten | string, float | 'none' |
dropout | a float | 0 |
batch_sizes | an integer or list | 15 |
batch_sizes_step | an integer | 1 |
layers | an integer | 5 |
layers_step | an integer | 1 |
activation_out | single, list or 'auto' | 'sigmoid' |
neuron_max | an integer | 'auto' |
scan_mode | 'selective' or 'auto' | 'auto' |
losses | single, list or 'auto' | 'auto' |
optimizers | single, list or 'auto' | 'auto' |
activations | single, list or 'auto' | 'auto' |
shapes | single, list or 'auto' | 'auto' |
Data
The data() command is provided to allow data ingestion from a variety of formats, and to give the user access to unique deep learning datasets. In addition to allowing access to Autonomio datasets, the function also supports importing from csv, json, and excel files, and covers most common use cases.
# loading 'random_tweets' dataset in to a dataframe
df = data('random_tweets')
# loading data.csv in to a dataframe
df = data('data.csv',mode='file')
Supported Formats
- csv
- txt
- json
- msgpack (highly compressed binary format)
Example datasets
Several unique deep learning focused datasets are provided with Autonomio. These datasets have not been released anywhere else, and relate to current affairs such as Twitter bots, ad fraud, US Election 2016, and party politics.
- election_in_twitter
- programmatic_ad_fraud
- parties_and_employment
- tweet_sentiment
- random_tweets
- sites_category_and_vec
Dataset consisting of 10 minute samples of 80 million tweets
data('election_in_twitter')
4,000 ad funded websites with word vectors and 5 categories
data('sites_category_and_vec')
Data from both buy and sell side and over 10 other sources
data('programmatic_ad_fraud')
9 years of monthly poll and unemployment numbers
data('parties_and_employment')
120,000 tweets with sentiment classification from NLTK
data('tweet_sentiment')
20,000 random tweets
data('random_tweets')
Query Parameters
ARGUMENT | REQUIRED INPUT | DEFAULT |
---|---|---|
name | dataset or filename | NA |
mode | string ('file') | 'default' |
sep | string e.g. ',' | ',' |
delimiter | string e.g ',' | None |
header | int, None, or 'infer' | 'infer' |
name = Name of the dataset or file. In the case of a file, this should be csv/txt for comma (or otherwise) separated values, json for a json file, and msgpack for msgpack. Automatic handling of the request will not work unless the filename has the correct extension.
mode = Either 'default' which implies one of the Autonomio datasets, or 'file' which is for loading a file.
sep = ',' by default, but can be any string.
delimiter = Used as a secondary separator in addition to 'sep'. Should be a string, for example ',' when thousands separators are used.
header = Either an integer row number, None for no header, or the default 'infer', which decides automatically (mostly takes the top row).
Examples
Prepare and Train
A typical use case, even with messy datasets with many columns, involves a few lines of code and seconds or minutes of training time on a regular laptop.
Medicare Provider Utilization and Payment Data
# do the python imports
from autonomio.commands import data, wrangler, train, predictor
%matplotlib inline
# import the data from csv
df = data('medicare_10k.csv', mode='file', header=None)
df = wrangler(df,'z')
# train a neural net
train([2,17],'z',df,epoch=20,loss='logcosh',flatten='median')
Shapes
Shapes are used as part of the train() command, in order to dramatically change the network dimensions and shape with a single parameter. There are two parameters that work together to make up the shape and total neuron count of the neural network.
- shape
- neuron_max
Examples:
# produce a long_funnel where the highest neuron per layer is 10
train('text','neg',df,shape='long_funnel',neuron_max=10)
# produce a brick where the highest neuron per layer is 55
train('text','neg',df,shape='brick',neuron_max=55)
NOTE: The shapes function is called from within train() and serves no meaningful purpose when used separately. The function outputs a list with the neuron counts.
Funnel
\ /
\ /
\ /
\ /
| |
Funnel is the default shape. It roughly looks like an upside-down pyramid: the first layer is defined as neuron_max, and each following layer is slightly smaller than the previous one.
As the funnel shape is the default, we do not need to specify anything to use it.
Example input (default setting):
tr = train(1,'neg',temp,layers=5,neuron_max=10)
For a five layer neural net, this will yield 10, 5, 3, 2, 1 neurons respectively.
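The documented counts are consistent with repeated halving rounded up. The sketch below is my own reconstruction of that pattern, not Autonomio's actual shapes code:

```python
import math

def funnel(neuron_max, layers):
    """Reproduce the documented funnel neuron counts by starting at
    neuron_max and repeatedly halving, rounding up each time."""
    counts = [neuron_max]
    for _ in range(layers - 1):
        counts.append(math.ceil(counts[-1] / 2))
    return counts

print(funnel(10, 5))  # [10, 5, 3, 2, 1], matching the example above
```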
Long Funnel
| |
| |
| |
\ /
\ /
\ /
| |
The Long Funnel shape can be applied by defining shape as 'long_funnel'. The first half of the layers have the value of neuron_max, and the rest follow the Funnel shape, decreasing toward the last layer.
Example input:
tr = train(1,'neg',temp,layers=6,neuron_max=10,shape='long_funnel')
For a six layer neural net, this will yield 10, 10, 10, 5, 3, 2 neurons respectively.
Rhombus
/ \
/ \
/ \
/ \
\ /
\ /
\ /
\ /
| |
Rhombus can be applied by defining shape as 'rhombus'. The first layer equals 1 and the following layers slightly increase up to the middle one, which equals the value of neuron_max. The remaining layers mirror the previous ones in reverse order.
Example input:
train(1,'neg',temp,layers=5,neuron_max=10,shape='rhombus')
For a five layer neural net, this will yield 1, 6, 10, 6, 1 neurons respectively.
Diamond
/ \
/ \
\ /
\ /
\ /
\ /
| |
Defining shape as 'diamond' yields an 'opened rhombus': everything is similar to the Rhombus shape, but the layers start from a larger number instead of 1.
Example input:
train(1,'neg',temp,layers=6,neuron_max=10,shape='diamond')
For a six layer neural net, this will yield 6, 6, 10, 5, 3, 2 neurons respectively.
Hexagon
/ \
/ \
/ \
| |
| |
| |
\ /
\ /
\ /
| |
Hexagon, which we get by setting shape to 'hexagon', starts with 1 as the first layer and increases up to the neuron_max value. Some of the following layers keep the maximum value, after which the counts decrease toward the last layer.
Example input:
train(1,'neg',temp,layers=7,neuron_max=10,shape='hexagon')
Output list of neurons (excluding the output layer):
For a seven layer neural net, this will yield 1, 3, 5, 10, 10, 5, 3 neurons respectively.
Brick
| |
| |
| |
| |
---- ----
| |
All layers have the neuron_max value. Called with shape='brick'.
Example input:
tr = train(1,'neg',temp,layers=5,neuron_max=10,shape='brick')
Output list of neurons (excluding the output layer):
For a five layer neural net, this will yield 10, 10, 10, 10, 10 neurons respectively.
Triangle
/ \
/ \
/ \
/ \
/ \
---- ----
| |
This shape, applied by defining shape as 'triangle', starts with 1 and increases up to the last input layer, which is neuron_max.
Example input:
train(1,'neg',temp,layers=5,neuron_max=10,shape='triangle')
Output list of neurons (excluding the output layer):
For a five layer neural net, this will yield 1, 2, 3, 5, 10 neurons respectively.
Stairs
| |
--- ---
| |
--- ---
| |
You can apply it by defining shape as 'stairs'. If the number of layers is more than four, each pair of layers shares the same value before it decreases. If the number of layers is smaller than four, the value decreases at every single layer.
Example input:
train(1,'neg',temp,layers=6,neuron_max=10,shape='stairs')
For a six layer neural net, this will yield 10, 10, 8, 8, 6, 6 neurons respectively.
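One rule consistent with the documented example is to decrease the count by a fixed step every second layer. This is a guess reconstructed from the example above, not Autonomio's actual implementation:

```python
def stairs(neuron_max, layers, step=2):
    """Each pair of layers shares a value, which then drops by a
    fixed step - consistent with the documented stairs example."""
    return [neuron_max - step * (i // 2) for i in range(layers)]

print(stairs(10, 6))  # [10, 10, 8, 8, 6, 6], matching the example above
```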
Language Processing
Unstructured Data
By some estimates, more than 90% of meaningful data is unstructured. Ingestion of unstructured data with Autonomio could not be easier; unstructured data input as 'x' is handled automatically, with the input converted into word2vec word vectors. The way this works is roughly:
1) detect if a single column of x features is text
2) use spaCy NLP to vectorize the text
3) create 300 individual features/columns from the vector
4) use the 300 features as signals for training the model
In addition to doing this automatically when train() has a single x column with text, when one or more text columns need to be vectorized as part of a dataset with other features, this can be done easily by using the 'vectorize' parameter in train().
The wrangler() data preparation function can also be used to vectorize unstructured features (e.g. tweets or names).
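To make the "text becomes 300 feature columns" step concrete, here is a toy stand-in. Note that Autonomio uses spaCy word2vec vectors; the hashing trick below only mimics the shape of the output, not the semantics, and `fake_vectorize` is a hypothetical name:

```python
import hashlib

def fake_vectorize(text, dims=300):
    """Stand-in for the spaCy word2vec step: map text to a fixed-length
    vector of floats usable as model features. This hashing trick only
    mimics the shape, not the meaning, of real word vectors."""
    vec = [0.0] * dims
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    return vec

features = fake_vectorize("an example tweet to vectorize")
print(len(features))  # 300 features, one column each in the training data
```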
Language support
Autonomio's vectorizing engine, spaCy, currently supports 13 languages:
- English
- German
- Chinese
- Spanish
- Italian
- French
- Portuguese
- Dutch
- Swedish
- Finnish
- Hungarian
- Bengali
- Hebrew
NOTE: the spaCy language libraries each have to be downloaded separately.
Adding new languages
spaCy makes it relatively streamlined to create support for any language, and the challenge can (and should) be approached iteratively.