Introduction
Autonomio provides a high-level abstraction layer for building, configuring, and optimizing neural networks, and for using the trained models to make predictions in any environment. Unlike with other similar solutions, there is no need for sign-ups, API keys, cloud instances, or GPUs, and you have 100% control over the model. A typical installation takes a minute, and training a model takes no more than a few minutes, including data transformation from a raw dataset with even thousands of columns, open text, and unstructured labels. Nothing is pre-trained, and only you have access to your data and predictions. There is no commercial entity behind Autonomio; it is maintained by a non-profit research foundation.
This document covers the functionality of Autonomio. If you're looking for a high-level overview of the capabilities, you might find the Autonomio website more useful.
1-Minute Pipeline
To train a model, use the following code:
# do the python imports
from autonomio.commands import data, wrangler, train, predictor
%matplotlib inline
# import the data from csv
df = data('medicare_10k.csv', mode='file', header=None)
# preprocess the data
df = wrangler(df,'z')
# train a neural net
train([2,17],'z',df,epoch=20,loss='logcosh',flatten='median')
NOTE: a list of column indices can be used with 3 or more columns. A list of exactly two integers is interpreted as a range of columns.
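The rule above can be illustrated with a small stand-in helper. Note that `resolve_columns` is a hypothetical function written for this example, not part of Autonomio's API, and the exact range endpoints used internally are an assumption:

```python
def resolve_columns(x):
    """Return the list of column indices implied by an x argument,
    following the documented rule: two integers form a range,
    three or more are an explicit list."""
    if isinstance(x, list) and all(isinstance(i, int) for i in x):
        if len(x) == 2:
            # two integers -> treated as a range of columns
            return list(range(x[0], x[1]))
        return x  # three or more integers -> explicit column list
    return [x]  # a single label or index

print(resolve_columns([2, 5]))     # range -> [2, 3, 4]
print(resolve_columns([1, 2, 4]))  # explicit list -> [1, 2, 4]
```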
Autonomio is very easy to use, and its namespace - just 4 commands and fewer than 40 arguments combined - is easy to memorize. Namespace memorization is one of the key differences between advanced and beginner users. Whereas Autonomio helps lower-skill practitioners dramatically improve their capability, advanced practitioners enjoy significant productivity gains and headache reduction.
Installation
The simplest way to get the latest well-tested version, with the latest features, is to install with pip directly from the repo.
pip install git+https://github.com/autonomio/core-module.git
Training Neural Network
A typical use of the training function is - you guessed it - training a neural network:
train([1,25],'Survived',df,
      flatten='none',
      epoch=250,
      dropout=0,
      batch_size=12,     # example value
      loss='logcosh',
      activation='elu',
      layers=5,          # example value
      shape='funnel',    # example value
      verbose=0)
Autonomio provides a very high level abstraction layer to several deep learning models:
- Multilayer Perceptrons (MLP)
- LSTM
- Regression
These are all accessed through the train() command.
Commands
Train
- loss
- optimization
- activation
- shape
- layers (even thousands of layers)
- dropout rate
- batch_size
Data Ingestion
Compared to TensorFlow, Keras, scikit-learn, and other common libraries, Autonomio provides a highly convenient data ingestion function.
- Automatically through train()
- Configured through train()
- Using the wrangler() utility
# a single column where data is string
train('text', 'neg', data)
# a single column by index
train(5, 'neg', data)
# a single column by label
train(['quality_score'], 'neg', data)
# a range of column index
train([1,5], 'neg', data)
# set of column labels
train(['quality_score', 'reach_score'], 'neg', data)
# a list of column index
train([1,2,4,6,18], 'neg', data)
Data can be input from a dataframe, or from csv, txt, json, or msgpack files. All common transformations take place automatically within the train() command.
- automatic transformation of input (x) variables
  - from text to word vectors
  - from text labels to integers
- automatic transformation of the outcome (y) variable
  - from continuous to categorical
    - based on mean
    - based on median
    - based on quantiles
    - based on a given value
  - from multi-category to binary
    - string values
    - numeric values
Generally speaking, multilayer perceptron neural nets are strongest at solving classification problems, where the outcome variable is either binary categorical (0 or 1) or multi-categorical. This is why Autonomio places strong emphasis on making such transformations available within the train() command.
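The mean/median flattening of a continuous outcome into a binary one, described above, can be sketched in a few lines. This is a conceptual stand-in, not Autonomio's actual implementation, and `flatten_outcome` is a hypothetical name:

```python
import statistics

def flatten_outcome(values, method='mean'):
    """Turn a continuous outcome into a binary one, as the 'flatten'
    parameter does conceptually: 1 above the cutoff, 0 at or below."""
    if method == 'mean':
        cutoff = statistics.mean(values)
    elif method == 'median':
        cutoff = statistics.median(values)
    else:  # a numeric cutoff passed directly
        cutoff = method
    return [1 if v > cutoff else 0 for v in values]

y = [2.0, 4.0, 6.0, 8.0]
print(flatten_outcome(y, 'mean'))  # cutoff is 5.0 -> [0, 0, 1, 1]
```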
BINARY (default)
- X can be text, int, or floating point
- Y can be an int, or floating point
The default settings are optimized for making a 1-or-0 prediction. For example, when predicting sentiment from tweets, Autonomio gives 85% accuracy without any parameter tuning for classifying tweets that rank in the most negative 20% according to NLTK Vader sentiment analysis.
CATEGORICAL
- X can be text, integer
- Y can be an integer or text
- output layer neurons must match number of categories
- change activation_out to something that works with categoricals
It's not a good idea to have too many categories; maybe 10 is pushing it in most cases.
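The requirement that output-layer neurons match the number of categories can be illustrated with a simple one-hot encoding sketch. The helper below is hypothetical, written only for this example:

```python
def one_hot(labels):
    """Encode labels as one-hot vectors; the number of output-layer
    neurons must equal the number of distinct categories."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    encoded = [[1 if index[label] == i else 0
                for i in range(len(categories))]
               for label in labels]
    return encoded, len(categories)

encoded, n_out = one_hot(['cat', 'dog', 'cat', 'bird'])
print(n_out)  # 3 distinct categories -> 3 output neurons needed
```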
Train Query Parameters
ARGUMENT | REQUIRED INPUT | DEFAULT |
---|---|---|
X | string, int, float | NA |
Y | int,float,categorical | NA |
data | data object | NA |
epoch | int | 5 |
flatten | string, float | 'mean' |
dropout | float | .2 |
layers | int (2 through 5) | 3 |
loss | string (any Keras loss) | 'binary_crossentropy' |
save_model | string or False | False |
neuron_first | int | 300 |
neuron_last | int | 1 |
batch_size | int | 10 |
verbose | 0,1,2 | 0 |
shape | string | 'funnel' |
double_check | True or False | False |
validation | True,False,float(0 to 1) | False |
X = The input can be indicated in several ways:
- 'label' = a single column label
- ['a','b'] = multiple column labels
- [1,12] = a range of columns
- [1,2,12] = columns by index

The data can be of multiple dtypes:
- int = any integer values
- float = any float values
- string = raw text or category labels
In case you need to clean up your data first, you can do it with:
from autonomio.commands import wrangler
wrangler(data,outcome_var)
Y = This can be of multiple dtypes:
- int = any integer values
- float = any float values
- string = category labels
See more related to prediction variable below in the 'flatten' section.
data = A pandas dataframe with at least one column for the 'x' independent variable (predictor) and one column for the 'y' dependent variable (the prediction target).
dims = This is selected automatically and does not need to be set manually. NOTE: this needs to be the same as the number of x features.
epoch = How many epochs will be run for training. More epochs take more time.
flatten = For transforming y (outcome) variable. For example if the y input is continuous but prediction is binary, then a flattening of some sort should be used.
OPTIONS: 'mean','median','mode', int, float, 'cat_string', 'cat_numeric', and 'none'
dropout = The fraction of learning that will be "forgotten" on each learning event.
layers = The number of dense layers the model will have. Note that each dense layer is followed by a dropout layer.
model = This is currently not in use. Later, when we add LSTM and other model options, it will be activated.
loss = The loss to be used with the model. All Keras losses are available: https://keras.io/losses/
optimizer = The optimizer to use with the model. All Keras optimizers are available: https://keras.io/optimizers/
activation = Activation for the hidden (non-output) layers. All Keras activations are available: https://keras.io/activations/
activation_out = Same as 'activation' (above), but for the output layer only.
save_model = An option to save the model configuration, weights and parameters.
OPTIONS: the default is False; if True, the model will be saved with the default name ('model'); if a string is given, the model name will be the string value, e.g. 'titanic'.
neuron_max = The maximum number of neurons on any layer.
neuron_last = How many neurons there are in the last layer.
batch_size = The number of samples that are propagated through the network at any given point in time. The smaller the batch_size, the longer the training will take.
verbose = Set to 0 by default. The other options are 1 and 2, which change the amount of information you get during training.
shape = Used for automatically creating a network shape. Currently there are 8 options available: 'funnel', 'rhombus', 'long_funnel', 'brick', 'hexagon', 'diamond', 'triangle', 'stairs'. A diagram is provided for each in the 'Shapes' section.
double_check = Makes a 'manual' check of the results provided by the Keras backend and compares the two. This is useful when you have doubts about the results.
validation = Validates in a more robust way than the usual train/test split by initially splitting the dataset in half, where the first half becomes train and test data, and the second half becomes the validation dataset.
OPTIONS: the default is False; with True, 50% of the data is set apart for validation.
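The split described above can be sketched as follows. The function name and the exact split mechanics are assumptions for illustration; Autonomio's internal logic may differ:

```python
def validation_split(rows):
    """Split a dataset the way the docs describe: the first half is
    used for train/test, the second half is held out for validation."""
    half = len(rows) // 2
    return rows[:half], rows[half:]

train_test, validation = validation_split(list(range(10)))
print(len(train_test), len(validation))  # 5 5
```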
Predictor
predictor(data,'model.json')
Add labels to predictions:
test('text',data,'handle','model.json')
Add an interactive scatter plot visualization with a y-axis variable:
test('text',data,'handle','model.json',y_scatter='influence_score')
To yield the scatter plot, you have to call it specifically:
test_result = test('text',data,'handle','model.json',y_scatter='influence_score')
test_result[1]
Once you've trained a model with train(), you can use it easily on any dataset through the predictor() command. You could use it in a Jupyter notebook, have it run on a server as part of some other process, or make it part of a website that does something interesting for the user based on their input, just to name a few examples. Think of a trained neural net model as what is commonly referred to as AI; it's far easier to have AIs doing various tasks than most people think.
Test Query Parameters
ARGUMENT | REQUIRED INPUT | DEFAULT |
---|---|---|
X | variable/s in dataframe | NA |
data | pandas dataframe | NA |
labels | variable/s in dataframe | NA |
saved_model | filename | NA |
y_scatter | variable in dataframe | NA |
Wrangler
The wrangler() function introduces best-of-class data ingestion capability for maximum convenience in single-file preparation. If you have to work with multiple files, handle each file separately and then merge afterwards. Based on the parameter configuration, wrangler() yields a dataframe where one or more of the following may be true:
from autonomio.commands import data, wrangler
df = data('train.csv','file')
titanic = wrangler(df,'Survived',starts_with_col='Cabin',first_fill_cols='Cabin')
NOTE: Typical kernel examples on Kaggle show that the same dataset requires a data scientist 30 to 100 lines of code to get to exactly the same result we reach here with a single wrangler() command.
- columns are dropped entirely
- rows are dropped
- unstructured columns are transformed into categories
- unstructured columns are transformed into word vectors (floats)
- NaN values are filled
data = A pandas dataframe that needs to be transformed.
y = The feature that will be moved to the first column of the dataframe and will not be transformed in any way.
max_categories = Accepts an integer value. In columns with string values (automatically detected), if there are more unique values than 'max_categories', the column will not be categorized and will be dropped instead. Such a column can be treated with the 'vectorize' parameter instead.
starts_with_col = Accepts a string value. For cases where a column of string values should be transformed into categories based on a shared first character of the string.
treshold = Accepts a floating point value (or 1). Sets the limit at which a column will be dropped entirely because of too many NaN values. For example, .6 means that if more than 60% of a column's values are NaN, the whole column will be dropped.
first_fill_cols = Accepts a column name as value. For cases where a given column's NaN values are filled first, so that the column is not dropped if it does not meet the 'treshold' parameter. This is for cases where some columns should be retained even though they have a high number of NaN values.
fill_with = A string, integer or float value. The value that is used for filling NaNs.
to_string = A column name. For cases where a given column may be needed later as a string value, for example a name to be connected with prediction values later.
vectorize = A column name. Vectorizes the text inputs into 300 features, each representing a value in the word2vec vector.
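The interaction between 'treshold', 'first_fill_cols', and 'fill_with' can be sketched on a plain dict-of-lists table. This is a conceptual reconstruction, not wrangler()'s actual code, and the function name is hypothetical:

```python
def drop_sparse_columns(table, treshold=0.9, first_fill_cols=(), fill_with=0):
    """Mimic two wrangler() behaviors: pre-fill protected columns,
    then drop any column whose NaN (None) share exceeds the treshold."""
    out = {}
    for name, values in table.items():
        if name in first_fill_cols:
            # protected columns are filled first, so they survive the drop
            values = [fill_with if v is None else v for v in values]
        nan_share = sum(v is None for v in values) / len(values)
        if nan_share <= treshold:
            out[name] = values
    return out

table = {'a': [1, None, None, None], 'b': [None, None, None, None]}
print(drop_sparse_columns(table, treshold=0.9))  # 'b' is all-NaN -> dropped
print(drop_sparse_columns(table, treshold=0.9,
                          first_fill_cols=('b',)))  # 'b' filled -> kept
```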
ARGUMENT | REQUIRED INPUT | DEFAULT |
---|---|---|
data | pandas dataframe | NA |
y | the outcome variable | NA |
max_categories | max number of unique categories | 'auto' |
starts_with_col | a column in the dataframe | NA |
treshold | a % treshold of NaN values for dropping whole column | .9 |
first_fill_cols | a column in the dataframe | NA |
fill_with | a string, integer or float value | 0 |
to_string | a column in the dataframe | NA |
vectorize | a column in the dataframe with string values* | NA |
Hyperscan
The hyperscan() function is for scanning through hyperparameter configurations automatically or based on set ranges/lists. Starting a scan is as easy as running the train() command, but instead of training a model with a single set of parameters, it does so with multiple configurations. For a detailed overview of the parameters, see the section for train(). The section below provides an overview of the parameters that are unique to hyperscan().
result = hyperscan([0,8],
'i',
diabetes,
epochs=150,
dropout=0,
scan_mode='selective',
losses='logcosh',
shapes=['brick','long_funnel'],
optimizers='rmsprop',
activations='softsign',
layers=[5,6],
batch_sizes=[14,20])
NOTE: Hyperscan is not a solution for hyperparameter optimization, but a way to automate the most mindless part of model configuration. Currently there are six parameters that can be scanned:
- number of layers
- shape of the NN
- batch_size
- activation
- optimizer
- loss
Each can be scanned in three modes:
- single value
- a list of values
- all values ('auto')
In addition 'batch_size' and 'layers' also support:
- a range of values
- a stepped range of values
For full reference, see the section for train() parameters.
batch_size_step = An integer. The number of values skipped; for example, in a range of 2 to 20, a 'batch_size_step' value of 2 will skip 3, 5, 7, 9, and so on.
layers_step = Same as 'batch_size_step' above but for layers.
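The stepped ranges described above behave like an ordinary stepped integer range. A minimal sketch (hypothetical helper, assuming the range endpoints are inclusive):

```python
def scan_values(start, stop, step=1):
    """Values a stepped scan would cover, e.g. batch sizes 2..20
    with batch_size_step=2 covering every other value."""
    return list(range(start, stop + 1, step))

print(scan_values(2, 20, 2))  # [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
```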
scan_mode = If set to 'auto', all possible options will be scanned through. Note that this will take time, even on a powerful machine. In most cases it's better to use 'selective' with reasonable preset values in lists.
ARGUMENT | REQUIRED INPUT | DEFAULT |
---|---|---|
x | string, int, float | NA |
y | the outcome variable | NA |
data | pandas dataframe | NA |
flatten | string, float | 'none' |
dropout | a float | 0 |
batch_sizes | an integer or list | 15 |
batch_sizes_step | an integer | 1 |
layers | an integer | 5 |
layers_step | an integer | 1 |
activation_out | single, list or 'auto' | 'sigmoid' |
neuron_max | an integer | 'auto' |
scan_mode | 'selective' or 'auto' | 'auto' |
losses | single, list or 'auto' | 'auto' |
optimizers | single, list or 'auto' | 'auto' |
activations | single, list or 'auto' | 'auto' |
shapes | single, list or 'auto' | 'auto' |
Data
The data() command is provided to allow data ingestion from a variety of formats, and to give the user access to unique deep learning datasets. In addition to allowing access to Autonomio datasets, the function also supports importing from csv, json, and excel files, and covers most common use cases.
# loading 'random_tweets' dataset in to a dataframe
df = data('random_tweets')
# loading data.csv in to a dataframe
df = data('data.csv',mode='file')
Supported Formats
- csv
- txt
- json
- msgpack (highly compressed binary format)
Example datasets
Several unique deep learning focused datasets are provided with Autonomio. These datasets have not been released anywhere else, and relate to current affairs such as Twitter bots, ad fraud, US Election 2016, and party politics.
- election_in_twitter
- programmatic_ad_fraud
- parties_and_employment
- tweet_sentiment
- random_tweets
- sites_category_and_vec
Dataset consisting of 10 minute samples of 80 million tweets
data('election_in_twitter')
4,000 ad funded websites with word vectors and 5 categories
data('sites_category_and_vec')
Data from both buy and sell side and over 10 other sources
data('programmatic_ad_fraud')
9 years of monthly poll and unemployment numbers
data('parties_and_employment')
120,000 tweets with sentiment classification from NLTK
data('tweet_sentiment')
20,000 random tweets
data('random_tweets')
Query Parameters
ARGUMENT | REQUIRED INPUT | DEFAULT |
---|---|---|
name | dataset or filename | NA |
mode | string ('file') | 'default' |
sep | string e.g. ',' | ',' |
delimiter | string e.g ',' | None |
header | int, None, or 'infer' | 'infer' |
name = Name of the dataset or file. In the case of a file, this should be csv/txt for comma (or otherwise) separated values, json for a json file, and msgpack for msgpack. Automatic handling of the request will not work unless the filename has the correct extension.
mode = Either 'default' which implies one of the Autonomio datasets, or 'file' which is for loading a file.
sep = ',' by default, but can be any string.
delimiter = Used as a secondary separator in addition to 'sep'. Should be a string, for example ',' when thousands separators are used.
header = Either an integer row number, None for no header, or the default 'infer', which decides automatically (mostly takes the top row).
Examples
Prepare and Train
A typical use case, even with messy datasets with many columns, involves a few lines of code and seconds or minutes of training time on a regular laptop.
Medicare Provider Utilization and Payment Data
# do the python imports
from autonomio.commands import data, wrangler, train, predictor
%matplotlib inline
# import the data from csv
df = data('medicare_10k.csv', mode='file', header=None)
df = wrangler(df,'z')
# train a neural net
train([2,17],'z',df,epoch=20,loss='logcosh',flatten='median')
Shapes
Shapes are used as part of the train() command, in order to dramatically change the network dimensions and shape with a single parameter. There are two parameters that work together to make up the shape and total neuron count of the neural network.
- shape
- neuron_max
Examples:
# produce a long_funnel where the highest neuron per layer is 10
train('text','neg',df,shape='long_funnel',neuron_max=10)
# produce a brick where the highest neuron per layer is 55
train('text','neg',df,shape='brick',neuron_max=55)
NOTE: The shapes function is called from within train() and serves no meaningful purpose when used separately. The function outputs a list with the neuron counts.
Funnel
\ /
\ /
\ /
\ /
| |
Funnel is the default shape. It roughly looks like an upside-down pyramid: the first layer is defined as neuron_max, and each following layer is slightly smaller than the previous one.
As the funnel shape is the default, we do not need to specify anything to use it.
Example input (default setting):
tr = train(1,'neg',temp,layers=5,neuron_max=10)
For a five layer neural net, this will yield 10, 5, 3, 2, 1 neurons respectively.
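The documented counts are consistent with repeated halving rounded up. The sketch below is my own reconstruction of that pattern, not Autonomio's actual shapes code:

```python
import math

def funnel(neuron_max, layers):
    """Reproduce the documented funnel neuron counts by starting at
    neuron_max and repeatedly halving, rounding up each time."""
    counts = [neuron_max]
    for _ in range(layers - 1):
        counts.append(math.ceil(counts[-1] / 2))
    return counts

print(funnel(10, 5))  # [10, 5, 3, 2, 1], matching the example above
```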
Long Funnel
| |
| |
| |
\ /
\ /
\ /
| |
The Long Funnel shape can be applied by defining shape as 'long_funnel'. The first half of the layers have the value of neuron_max, and the rest follow the Funnel shape, decreasing toward the last layer.
Example input:
tr = train(1,'neg',temp,layers=6,neuron_max=10,shape='long_funnel')
For a six layer neural net, this will yield 10, 10, 10, 5, 3, 2 neurons respectively.
Rhombus
/ \
/ \
/ \
/ \
\ /
\ /
\ /
\ /
| |
Rhombus can be applied by defining shape as 'rhombus'. The first layer equals 1 and the following layers slightly increase up to the middle one, which equals the value of neuron_max. The remaining layers mirror the previous ones in reverse order.
Example input:
train(1,'neg',temp,layers=5,neuron_max=10,shape='rhombus')
For a five layer neural net, this will yield 1, 6, 10, 6, 1 neurons respectively.
Diamond
/ \
/ \
\ /
\ /
\ /
\ /
| |
Defining shape as 'diamond' yields an 'opened rhombus': everything is similar to the Rhombus shape, but the layers start from a larger number instead of 1.
Example input:
train(1,'neg',temp,layers=6,neuron_max=10,shape='diamond')
For a six layer neural net, this will yield 6, 6, 10, 5, 3, 2 neurons respectively.
Hexagon
/ \
/ \
/ \
| |
| |
| |
\ /
\ /
\ /
| |
Hexagon, which we get by setting shape to 'hexagon', starts with 1 as the first layer and increases up to the neuron_max value. Some of the following layers keep the maximum value, after which the counts decrease toward the last layer.
Example input:
train(1,'neg',temp,layers=7,neuron_max=10,shape='hexagon')
Output list of neurons (excluding the output layer):
For a seven layer neural net, this will yield 1, 3, 5, 10, 10, 5, 3 neurons respectively.
Brick
| |
| |
| |
| |
---- ----
| |
All layers have the neuron_max value. Called with shape='brick'.
Example input:
tr = train(1,'neg',temp,layers=5,neuron_max=10,shape='brick')
Output list of neurons (excluding the output layer):
For a five layer neural net, this will yield 10, 10, 10, 10, 10 neurons respectively.
Triangle
/ \
/ \
/ \
/ \
/ \
---- ----
| |
This shape, applied by defining shape as 'triangle', starts with 1 and increases up to the last input layer, which is neuron_max.
Example input:
train(1,'neg',temp,layers=5,neuron_max=10,shape='triangle')
Output list of neurons (excluding the output layer):
For a five layer neural net, this will yield 1, 2, 3, 5, 10 neurons respectively.
Stairs
| |
--- ---
| |
--- ---
| |
You can apply it by defining shape as 'stairs'. If the number of layers is more than four, each pair of layers shares the same value before it decreases. If the number of layers is smaller than four, the value decreases at every single layer.
Example input:
train(1,'neg',temp,layers=6,neuron_max=10,shape='stairs')
For a six layer neural net, this will yield 10, 10, 8, 8, 6, 6 neurons respectively.
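One rule consistent with the documented example is to decrease the count by a fixed step every second layer. This is a guess reconstructed from the example above, not Autonomio's actual implementation:

```python
def stairs(neuron_max, layers, step=2):
    """Each pair of layers shares a value, which then drops by a
    fixed step - consistent with the documented stairs example."""
    return [neuron_max - step * (i // 2) for i in range(layers)]

print(stairs(10, 6))  # [10, 10, 8, 8, 6, 6], matching the example above
```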
Language Processing
Unstructured Data
By some estimates, more than 90% of meaningful data is unstructured. Ingestion of unstructured data with Autonomio could not be easier; unstructured data input as 'x' is handled automatically, with the input converted into word2vec word vectors. The way this works is roughly:
1) detect if a single column of x features is text
2) use spaCy NLP to vectorize the text
3) create 300 individual features/columns from the vector
4) use the 300 features as signals for training the model
In addition to doing this automatically when train() has a single x column with text, when one or more text columns need to be vectorized as part of a dataset with other features, this can be done easily by using the 'vectorize' parameter in train().
The wrangler() data preparation function can also be used to vectorize unstructured features (e.g. tweets or names).
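To make the "text becomes 300 feature columns" step concrete, here is a toy stand-in. Note that Autonomio uses spaCy word2vec vectors; the hashing trick below only mimics the shape of the output, not the semantics, and `fake_vectorize` is a hypothetical name:

```python
import hashlib

def fake_vectorize(text, dims=300):
    """Stand-in for the spaCy word2vec step: map text to a fixed-length
    vector of floats usable as model features. This hashing trick only
    mimics the shape, not the meaning, of real word vectors."""
    vec = [0.0] * dims
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    return vec

features = fake_vectorize("an example tweet to vectorize")
print(len(features))  # 300 features, one column each in the training data
```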
Language support
Autonomio's vectorizing engine, spaCy, currently supports 13 languages:
- English
- German
- Chinese
- Spanish
- Italian
- French
- Portuguese
- Dutch
- Swedish
- Finnish
- Hungarian
- Bengali
- Hebrew
NOTE: the spaCy language libraries each have to be downloaded separately.
Adding new languages
spaCy makes it relatively streamlined to create support for any language, and the challenge can (and should) be approached iteratively.