  • Introduction
  • 1-Minute Pipeline
  • Installation
  • Training Neural Network
  • Commands
  • Examples
  • Shapes
  • Language Processing
    Introduction

    Autonomio provides a high-level abstraction layer for building, configuring and optimizing neural networks, and then using the trained models to make predictions in any environment. Unlike with other similar solutions, there is no need for signing up, API keys, cloud instances, or GPUs, and you have 100% control over the model. A typical installation takes a minute, and training a model takes no more than a few minutes, including data transformation from a raw dataset with even thousands of columns, open text, and unstructured labels. Nothing is pre-trained, and only you have access to your data and predictions. There is no commercial entity behind Autonomio, but a non-profit research foundation.

    This document covers the functionality of Autonomio. If you're looking for a high-level overview of the capabilities, you might find the Autonomio website more useful.

    1-Minute Pipeline

    To train a model, use this code:

    # do the python imports 
    from autonomio.commands import data, wrangler, train, predictor
    %matplotlib inline
    
    # import the data from csv
    df = data('medicare_10k.csv', mode='file', header=None)
    
    # preprocess the data
    df = wrangler(df,'z')
    
    # train a neural net
    train([2,17],'z',df,epoch=20,loss='logcosh',flatten='median')
    

    NOTE: a list of column indices is treated as individual columns only when it has 3 or more values; a list of exactly two integers is interpreted as a range of columns.
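
    For example, the following two calls differ only in how the x columns are read (the column indices here are illustrative):

    # two integers are read as a range of columns (2 through 17)
    train([2,17], 'z', df, epoch=20)

    # three or more integers are read as exactly these columns
    train([2,9,17], 'z', df, epoch=20)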

    Autonomio is very easy to use, and its namespace of just 4 commands and fewer than 40 arguments combined is easy to memorize. Namespace memorization is one of the key differences between advanced and beginner users. Whereas Autonomio helps lower-skill practitioners dramatically improve their capability, advanced practitioners enjoy significant productivity gains and headache reduction.

    Installation

    The simplest way to get the latest well-tested version, with the latest features, is to install with pip directly from the repository.

    pip install git+https://github.com/autonomio/core-module.git

    Training Neural Network

    A typical use of the train() function is - you guessed it - training a neural network.

    
    # example values for the variables used in the call below
    batch = 12
    layer = 5
    shape = 'funnel'

    train([1,25],'Survived',df,
          flatten='none',
          epoch=250,
          dropout=0,
          batch_size=batch,
          loss='logcosh',
          activation='elu',
          layers=layer,
          shape=shape,
          verbose=0)
    

    Autonomio provides a very high-level abstraction layer to several deep learning models. These are all accessed through the train() command.

    Commands

    Train

    Data Ingestion

    Compared to TensorFlow, Keras, scikit-learn and other common libraries, Autonomio provides a highly convenient data ingestion function.

    # a single column where data is string
    train('text', 'neg', data)

    # a single column by index
    train(5, 'neg', data)

    # a single column by label
    train(['quality_score'], 'neg', data)

    # a range of column indices
    train([1,5], 'neg', data)

    # a set of column labels
    train(['quality_score', 'reach_score'], 'neg', data)

    # a list of column indices
    train([1,2,4,6,18], 'neg', data)
    

    Data can be inputted from a dataframe, or csv, txt, json or msgpack files. All common transformations take place automatically within the train() command.

    Generally speaking, multilayer perceptron neural nets are strongest in solving classification problems, where the outcome variable is either binary categorical (0 or 1) or multi categorical. This is why there is a strong emphasis in Autonomio on making such transformations available within the train() command.

    BINARY (default)

    The default settings are optimized for making a 1 or 0 prediction. For example, in the case of predicting sentiment from tweets, Autonomio gives 85% accuracy without any parameter setting when classifying tweets that rank in the most negative 20% according to NLTK Vader sentiment analysis.

    CATEGORICAL

    It's not a good idea to have too many categories; around 10 is pushing it in most cases.
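
    A minimal sketch of a multi-categorical setup, assuming a hypothetical 'species' column that holds string category labels; 'cat_string' is one of the flatten options listed further below:

    # y holds string labels, so flatten='cat_string' treats them as categories
    train([1,10], 'species', df, flatten='cat_string')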

    Train Query Parameters

    ARGUMENT      REQUIRED INPUT                 DEFAULT
    X             string, int, float             NA
    Y             int, float, categorical        NA
    data          data object                    NA
    epoch         int                            5
    flatten       string, float                  'mean'
    dropout       float                          .2
    layers        int (2 through 5)              3
    loss          string (Keras losses)          'binary_crossentropy'
    save_model    string                         False
    neuron_first  int                            300
    neuron_last   int                            1
    batch_size    int                            10
    verbose       0, 1, 2                        0
    shape         string                         'funnel'
    double_check  True or False                  False
    validation    True, False, float (0 to 1)    False

    X = The input can be indicated in several ways:

    'label' = a single column label
    ['a','b'] = multiple column labels
    [1,12] = a range of columns
    [1,2,12] = columns by index

    The data can be of multiple dtypes:

    'int' = any integer values
    'float' = any float values
    'string' = raw text or category labels

    In case you need to clean up your data first, you can do it with:

    from autonomio.commands import wrangler

    wrangler(data,outcome_var)

    Y = This can be of multiple dtypes:

    'int' = any integer values
    'float' = any float values
    'string' = category labels

    See more related to prediction variable below in the 'flatten' section.

    data = A pandas dataframe with at least one column for the 'x' independent variable (predictor) and one column for the 'y' dependent variable (prediction).

    dims = This is selected automatically and does not need to be set manually. NOTE: it must match the number of x features.

    epoch = How many epochs will be run for training. More epochs will take more time.

    flatten = For transforming the y (outcome) variable. For example, if the y input is continuous but the prediction is binary, then a flattening of some sort should be used.

    OPTIONS: 'mean','median','mode', int, float, 'cat_string', 'cat_numeric', and 'none'
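
    A short sketch of the two most common cases, assuming the 'median' option uses the column median as the binary cutoff:

    # continuous y flattened to a binary outcome around the column median
    train([2,17], 'z', df, epoch=20, loss='logcosh', flatten='median')

    # y is already 0/1, so no flattening is needed
    train([2,17], 'z', df, epoch=20, flatten='none')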

    dropout = The fraction of learning that will be "forgotten" on each learning event.

    layers = The number of dense layers the model will have. Note that each dense layer is followed by a dropout layer.

    model = This is currently not in use. Once LSTM and some other model options are added later, it will be activated.

    loss = The loss to be used with the model. All Keras losses are available: https://keras.io/losses/

    optimizer = The optimizer to use with the model. All Keras optimizers are available: https://keras.io/optimizers/

    activation = Activation for the hidden (non-output) layers. All Keras activations are available: https://keras.io/activations/

    activation_out = Same as 'activation' (above), but for the output layer only.

    save_model = An option to save the model configuration, weights and parameters.

    OPTIONS: the default is False; if True, the model will be saved with the default name ('model'), and if a string is given, the model name will be the string value, e.g. 'titanic'.
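
    A minimal sketch of saving and then reusing a model; the 'titanic' name is illustrative, and the saved .json filename is assumed based on the 'model.json' example in the Predictor section below:

    # train and save the model under the name 'titanic'
    train([1,25], 'Survived', df, save_model='titanic')

    # later, use the saved model to make predictions on a dataframe
    # (assumes the save produces a 'titanic.json' configuration file)
    predictor(df, 'titanic.json')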

    neuron_max = The maximum number of neurons on any layer.

    neuron_last = How many neurons there are in the last layer.

    batch_size = Changes the number of samples that are propagated through the network at one given point in time. The smaller the batch_size, the longer the training will take.

    verbose = This is set to '0' by default. The other options are '1' and '2' and will change the amount of information you are getting.

    shape = Used for automatically creating a network shape. Currently there are 8 options available: 'funnel', 'rhombus', 'long_funnel', 'brick', 'hexagon', 'diamond', 'triangle', 'stairs'. Diagram is provided for each in the 'Shape' section.

    double_check = Makes a 'manual' check of the results provided by the Keras backend and compares the two. This is good when you have doubts about the results.

    validation = Validates in a more robust way than the usual train/test split by initially splitting the dataset in half, where the first half becomes train and test data, and the second half becomes the validation dataset.

    OPTIONS: default is False; with True, 50% of the data is separated for validation.
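
    A short sketch combining the two checks above (parameter values are illustrative):

    # hold out 50% of the data for validation and re-verify the reported results
    train([1,25], 'Survived', df, validation=True, double_check=True)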

    Predictor

    predictor(data,'model.json')
    

    Add labels to predictions

    test('text', data, labels='handle', saved_model='model.json')
    

    Add an interactive scatter plot visualization with a y-axis variable:

    test('text', data, 'handle', 'model.json', y_scatter='influence_score')
    

    To yield the scatter plot, you have to call it specifically

    test_result = test('text',data,'handle','model.json',y_scatter='influence_score')
    test_result[1]
    

    Once you've trained a model with train(), you can use it easily on any dataset through the predictor() command. You could use it in a Jupyter notebook, have it run on a server as part of some other process, or make it part of a website that does something interesting for the user based on their input, just to name a few examples. Think of a trained neural net model as what is referred to as AI. It's far easier to have AIs doing various tasks than most people think.
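
    A minimal sketch of such a pipeline, assuming a hypothetical 'new_tweets.csv' file and a model saved earlier as 'model.json':

    # load fresh data and score it with the previously trained model
    df_new = data('new_tweets.csv', mode='file')
    predictor(df_new, 'model.json')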

    Test Query Parameters

    ARGUMENT      REQUIRED INPUT             DEFAULT
    X             variable/s in dataframe    NA
    data          pandas dataframe           NA
    labels        variable/s in dataframe    NA
    saved_model   filename                   NA
    y_scatter     variable in dataframe      NA

    Wrangler

    The wrangler() function introduces "best-of-class" data ingestion capability for maximum convenience in single-file preparation. If you have to work with multiple files, handle each file separately and then merge them afterwards. Based on the parameter configuration, wrangler() yields a dataframe with one or more of the transformations described in the parameters below applied, for example:

    
    from autonomio.commands import data, wrangler
    
    df = data('train.csv','file')
    titanic = wrangler(df,'Survived',starts_with_col='Cabin',first_fill_cols='Cabin')
    
    

    NOTE: Typical kernel examples on Kaggle show that the same dataset requires a data scientist 30 to 100 lines of code to get to exactly the same result we get here with a single wrangler() command.

    data = A pandas dataframe that needs to be transformed.

    y = The feature that will be moved to be the first column in the dataframe and will not be transformed in any way.

    max_categories = Accepts an integer value. In columns with string values (automatically detected), if there are more unique values than 'max_categories', the column will not be categorized and will be dropped instead. Such a column could be treated with the 'vectorize' parameter instead.

    starts_with_col = Accepts a string value. For cases where a column of string values needs to be transformed into categories based on a shared first character of the string.

    treshold = Accepts a floating point value (or 1). Sets the limit at which a column will be entirely dropped because of too many NaN values. For example, .6 means that if more than 60% of a column's values are NaN, the whole column will be dropped.

    first_fill_cols = Accepts a column name as value. For cases where a given column's NaN values are filled first, so that it will not be dropped if it does not meet the 'treshold' parameter. This is for cases where some columns should be retained even if they have a high number of NaN values.

    fill_with = A string, integer or float value. The value that is used for filling NaNs.

    to_string = A column name. For the case where a given column may be needed later as a string value, for example a name to be connected with prediction values later.

    vectorize = A column name. Vectorizes the text inputs into 300 features, each representing a value in the word2vec vector.
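
    A short sketch combining several of the parameters above, building on the Titanic example; the 'Name' column is illustrative:

    # drop columns with more than 60% NaN values, but always fill and keep 'Cabin',
    # categorize 'Cabin' by its first character, and keep 'Name' as a string
    titanic = wrangler(df, 'Survived',
                       treshold=.6,
                       first_fill_cols='Cabin',
                       fill_with=0,
                       starts_with_col='Cabin',
                       to_string='Name')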

    ARGUMENT          REQUIRED INPUT                                              DEFAULT
    data              pandas dataframe                                            NA
    y                 the outcome variable                                        NA
    max_categories    max number of unique categories                             'auto'
    starts_with_col   a column in the dataframe                                   NA
    treshold          a % threshold of NaN values for dropping the whole column   .9
    first_fill_cols   a column in the dataframe                                   NA
    fill_with         a string, integer or float value                            0
    to_string         a column in the dataframe                                   NA
    vectorize         a column in the dataframe with string values*               NA

    Hyperscan

    The hyperscan() function is for scanning through hyperparameter configurations automatically or based on set ranges / lists. Starting a scan is as easy as running the train() command, but instead of training a model with a single set of parameters, it does so with multiple configurations. For a detailed overview of the parameters, see the section for train(). The section below provides an overview of the parameters that are unique to hyperscan().

    
    result = hyperscan([0,8], 
                       'i', 
                       diabetes,
                       epochs=150,
                       dropout=0,
                       scan_mode='selective', 
                       losses='logcosh',
                       shapes=['brick','long_funnel'], 
                       optimizers='rmsprop',
                       activations='softsign',
                       layers=[5,6],
                       batch_sizes=[14,20])
    
    

    NOTE: Hyperscan is not a solution for optimization of hyperparameters, but a way to automate the most mindless part of model configuration. Currently there are six parameters that can be scanned: losses, optimizers, activations, shapes, layers, and batch sizes.

    Each can be scanned in three modes: as a single value, as a list of values, or with 'auto'.

    In addition, 'batch_size' and 'layers' also support stepping through a range of values with 'batch_size_step' and 'layers_step'.

    For full reference, see the section for train() parameters.

    batch_size_step = An integer; the step between scanned values. For example, in a range of 2 to 20, a 'batch_size_step' value of 2 will skip 3, 5, 7, 9, and so on.

    layers_step = Same as 'batch_size_step' above but for layers.

    scan_mode = If set to 'auto', all possible options will be scanned through. Note that this will take time, even with a powerful machine. In most cases it's better to use 'selective' with reasonable preset values in lists.
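
    A hedged sketch of a selective scan using the step parameters, reusing the diabetes example above; it assumes that a two-value list such as [10,30] is read as a range when a step is given, which is an assumption based on the 'batch_size_step' description rather than documented behavior:

    # assumption: [10,30] with batch_sizes_step=2 is scanned as 10, 12, 14, ... 30
    result = hyperscan([0,8], 'i', diabetes,
                       scan_mode='selective',
                       layers=[2,6],
                       layers_step=1,
                       batch_sizes=[10,30],
                       batch_sizes_step=2,
                       losses='logcosh',
                       optimizers='rmsprop',
                       activations='softsign',
                       shapes='brick')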

    ARGUMENT           REQUIRED INPUT            DEFAULT
    x                  string, int, float        NA
    y                  the outcome variable      NA
    data               pandas dataframe          NA
    flatten            string, float             'none'
    dropout            a float                   0
    batch_sizes        an integer                15
    batch_sizes_step   an integer                1
    layers             an integer                5
    layers_step        an integer                1
    activation_out     single, list or 'auto'    'sigmoid'
    neuron_max         an integer                'auto'
    scan_mode          'selective' or 'auto'     'auto'
    losses             single, list or 'auto'    'auto'
    optimizers         single, list or 'auto'    'auto'
    activations        single, list or 'auto'    'auto'
    shapes             single, list or 'auto'    'auto'

    Data

    The data() command is provided to allow data ingestion from a variety of formats, and to give the user access to unique deep learning datasets. In addition to giving access to Autonomio datasets, the function also supports importing from csv, json, and excel. The data importing function covers most common cases.

    # loading 'random_tweets' dataset in to a dataframe
    df = data('random_tweets')
    
    # loading data.csv in to a dataframe
    df = data('data.csv',mode='file')
    

    Supported Formats

    Example datasets

    Several unique deep learning focused datasets are provided with Autonomio. These datasets have not been released anywhere else, and relate to current affairs such as Twitter bots, ad fraud, US Election 2016, and party politics.

    Dataset consisting of 10 minute samples of 80 million tweets

    data('election_in_twitter')
    

    4,000 ad funded websites with word vectors and 5 categories

    data('sites_category_and_vec')   
    

    Data from both buy and sell side and over 10 other sources

    data('programmatic_ad_fraud')    
    

    9 years of monthly poll and unemployment numbers

    data('parties_and_employment')   
    

    120,000 tweets with sentiment classification from NLTK

    data('tweet_sentiment')
    

    20,000 random tweets

    data('random_tweets')            
    

    Query Parameters

    ARGUMENT    REQUIRED INPUT                   DEFAULT
    name        dataset or filename              NA
    mode        string ('default' or 'file')     'default'
    sep         string, e.g. ','                 ','
    delimiter   string, e.g. ','                 None
    header      int, None or 'infer'             'infer'

    name = Name of the dataset or file. In the case of a file, the extension should be csv/txt for comma (or otherwise) separated values, json for a json file, and msgpack for msgpack. Automatic handling of the request will not work unless the filename has the corresponding extension.

    mode = Either 'default' which implies one of the Autonomio datasets, or 'file' which is for loading a file.

    sep = By default ',' but can be any string.

    delimiter = This is used as a secondary separator (sep). Should be a string, for example ',' when thousand separators are used.

    header = Either an integer for the row number, None for no header, or the default 'infer', which decides automatically (mostly takes the top row).
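
    A short sketch of loading a delimited file with explicit parameters; the filename and separator are illustrative:

    # load a semicolon-separated file whose first row is the header
    df = data('data.txt', mode='file', sep=';', header=0)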

    Examples

    Autonomio is very easy to use, and its namespace of just 4 commands and fewer than 40 arguments combined is straightforward to memorize. Namespace memorization is one of the key differences between advanced and beginner users. Whereas Autonomio helps lower-skill practitioners dramatically improve their capability, advanced practitioners enjoy significant productivity gains and headache reduction.

    Prepare and Train

    A typical use case, even with messy datasets with many columns, involves a few lines of code and seconds or minutes of training time on a regular laptop.

    Medicare Provider Utilization and Payment Data

    
    # do the python imports 
    from autonomio.commands import data, wrangler, train, predictor
    %matplotlib inline
    
    # import the data from csv
    df = data('medicare_10k.csv', mode='file', header=None)
    
    # preprocess the data
    df = wrangler(df,'z')
    
    # train a neural net
    train([2,17],'z',df,epoch=20,loss='logcosh',flatten='median')
    

    Shapes

    Shapes are used as part of the train() command, in order to dramatically change the network dimensions and shape with a single parameter. There are two parameters that work together to make up the shape and total neuron count of the neural network.

    Examples:

    # produce a long_funnel where the highest neuron per layer is 10 
    train('text','neg',df,shape='long_funnel',neuron_max=10)
    
    # produce a brick where the highest neuron per layer is 55 
    train('text','neg',df,shape='brick',neuron_max=55)
    
    

    NOTE: The shapes function is called from within train() and does not serve a meaningful purpose when used separately. The function outputs a list with the neuron counts per layer.

    Funnel

    \          /
     \        /
      \      /
       \    /
        |  |
    

    Funnel is the default shape. It roughly looks like an upside-down pyramid: the first layer is defined by neuron_max, and each following layer is slightly decreased compared to the previous one.

    As funnel shape is set by default, we do not need to input anything to use it.

    Example input (default setting):

    tr = train(1,'neg',temp,layers=5,neuron_max=10)
    

    For a five layer neural net, this will yield 10, 5, 3, 2, 1 neurons respectively.

    Long Funnel

     |          |
     |          |
     |          |
      \        /
       \      /
        \    /
         |  |
    

    The Long Funnel shape can be applied by defining shape as 'long_funnel'. The first half of the layers have the value of neuron_max, and the remaining layers follow a shape similar to the Funnel shape, decreasing towards the last layer.

    Example input:

    tr = train(1,'neg',temp,layers=6,neuron_max=10,shape='long_funnel')
    

    For a six layer neural net, this will yield 10, 10, 10, 5, 3, 2 neurons respectively.

    Rhombus

         /   \
        /     \
       /       \
      /         \
      \         /
       \       /
        \     /
         \   /
         |   |
    

    Rhombus can be called by defining shape as 'rhombus'. The first layer equals 1, and the following layers slightly increase until the middle layer, which equals the value of neuron_max. The remaining layers mirror the previous ones in reverse order.

    Example input:

    train(1,'neg',temp,layers=5,neuron_max=10,shape='rhombus')
    

    For a five layer neural net, this will yield 1, 6, 10, 6, 1 neurons respectively.

    Diamond

       /       \
      /         \
      \         /
       \       /
        \     /
         \   /
         |   |
    

    Defining shape as 'diamond' gives the shape of an 'opened rhombus': everything is similar to the Rhombus shape, but the layers start from a larger number instead of 1.

    Example input:

    train(1,'neg',temp,layers=6,neuron_max=10,shape='diamond')
    

    For a six layer neural net, this will yield 6, 6, 10, 5, 3, 2 neurons respectively.

    Hexagon

        /    \
       /      \
      /        \
     |          |
     |          |
     |          |
      \        /
       \      /
        \    /
         |  |
    

    Hexagon, which we get by setting shape to 'hexagon', starts with 1 in the first layer and increases until the neuron_max value. The next few layers keep the maximum value until it starts to decrease towards the last layer.

    Example input:

    train(1,'neg',temp,layers=7,neuron_max=10,shape='hexagon')
    

    Output list of neurons (excluding the output layer):

    For a seven layer neural net, this will yield 1, 3, 5, 10, 10, 5, 3 neurons respectively.

    Brick

       |             |
       |             |
       |             |
       |             |
        ----     ----
            |   |
    
    

    All the layers have neuron_max value. Called by shape='brick'.

    Example input:

        tr = train(1,'neg',temp,layers=5,neuron_max=10,shape='brick')
    

    Output list of neurons (excluding the output layer):

    For a five layer neural net, this will yield 10, 10, 10, 10, 10 neurons respectively.

    Triangle

            /    \
           /      \
          /        \
         /          \
        /            \
        ----      ----
            |    |
    

    This shape, which is called by defining shape as 'triangle', starts with 1 and increases until the last hidden layer, which has the neuron_max value.

    Example input:

    train(1,'neg',temp,layers=5,neuron_max=10,shape='triangle')
    

    Output list of neurons (excluding the output layer):

    For a five layer neural net, this will yield 1, 2, 3, 5, 10 neurons respectively.

    Stairs

       |                      |
        ---                ---
           |             |
            ---       ---
               |     |
    

    You can apply it by defining shape as 'stairs'. If the number of layers is more than four, each pair of layers has the same value before the value decreases. If the number of layers is smaller than four, the value decreases at every layer.

    Example input:

    train(1,'neg',temp,layers=6,neuron_max=10,shape='stairs')
    

    For a six layer neural net, this will yield 10, 10, 8, 8, 6, 6 neurons respectively.

    Language Processing

    Unstructured Data

    By some estimates, more than 90% of meaningful data is unstructured. Ingestion of unstructured data with Autonomio could not be easier: unstructured data input as 'x' is handled automatically, with the input converted into word2vec word vectors. The way this works is roughly:

    1) detect if a single column of x features is text
    2) use spaCy NLP to vectorize the text
    3) create 300 individual features/columns from the vector
    4) use the 300 features as signals for training the model

    In addition to this happening automatically when train() has a single x column with text, when one or more text columns need to be vectorized as part of a dataset with other features, this can be done easily by using the 'vectorize' parameter in train().

    Also the wrangler() data preparation function can be used to vectorize unstructured features (e.g. tweets or names).
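
    A minimal sketch of both routes; the 'text' column name is illustrative:

    # a single text column as x is vectorized automatically
    train('text', 'neg', df)

    # vectorize a text column during data preparation with wrangler()
    df = wrangler(df, 'neg', vectorize='text')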

    Language support

    Autonomio's vectorizing engine, spaCy, currently supports 13 languages.

    NOTE: each of the spaCy language models has to be downloaded separately.

    Read spaCy's language page

    Adding new languages

    spaCy makes it relatively streamlined to create support for any language, and the challenge can (and should) be approached iteratively.