Connect with us

Code

PostgresML – End-to-end machine learning system with SQL

PostgresML is an end-to-end machine learning system. It enables you to train models and make online predictions using only SQL, without your data ever leaving your favorite database.

Motivation behind PostgresML

Deploying machine learning models into existing applications is not straight forward. It involves operating new services, which need to be written in specialized languages with libraries outside of the experience of many software engineers. Those services tend to be architected around specialized datastores and hardware that requires additional management and know how. Data access needs to be secure across production and development environments without impeding productivity. This complexity pushes risks and costs beyond acceptable trade off limits for many otherwise valuable use cases.

PostgresML makes ML simple by moving the code to your data, rather than copying the data all over the place. You train models using simple SQL commands, and you get the predictions in your apps via a mechanism you’re already using: a query over a standard Postgres connection.

Our goal is that anyone with a basic understanding of SQL should be able to build, deploy and maintain machine learning models in production, while receiving the benefits of a high performance machine learning platform. Ultimately, PostgresML aims to be the easiest, safest and fastest way to gain value from machine learning.

PostgresML Features

Training models

Given a Postgres table or a view, PostgresML can train a model with many commonly used algorithms. We currently support the following regression and classification models from Scikit-Learn and XGBoost:

XGBoost

Algorithm Regression Classification
xgboost XGBRegressor XGBClassifier
xgboost_random_forest XGBRFRegressor XGBRFClassifier

Scikit Ensembles

Algorithm Regression Classification
ada_boost AdaBoostRegressor AdaBoostClassifier
bagging BaggingRegressor BaggingClassifier
extra_trees ExtraTreesRegressor ExtraTreesClassifier
gradient_boosting_trees GradientBoostingRegressor GradientBoostingClassifier
random_forest RandomForestRegressor RandomForestClassifier
hist_gradient_boosting HistGradientBoostingRegressor HistGradientBoostingClassifier

Support Vector Machines

Algorithm Regression Classification
svm SVR SVC
nu_svm NuSVR NuSVC
linear_svm LinearSVR LinearSVC

Linear Models

Algorithm Regression Classification
linear LinearRegression LogisticRegression
ridge Ridge RidgeClassifier
lasso Lasso
elastic_net ElasticNet
least_angle LARS
lasso_least_angle LassoLars
orthoganl_matching_pursuit OrthogonalMatchingPursuit
bayesian_ridge BayesianRidge
automatic_relevance_determination ARDRegression
stochastic_gradient_descent SGDRegressor SGDClassifier
perceptron Perceptron
passive_aggressive PassiveAggressiveRegressor PassiveAggressiveClassifier
ransac RANSACRegressor
theil_sen TheilSenRegressor
huber HuberRegressor
quantile QuantileRegressor

Other

Algorithm Regression Classification
kernel_ridge KernelRidge
gaussian_process GaussianProcessRegressor GaussianProcessClassifier

Training a model is then as simple as:

SELECT * FROM pgml.train(
    'Human-friendly project name',
    'regression', 
    '<name of the table or view containing the data>',
    '<name of the column containing the y target values>',
    '<algorithm name>', -- optional 
    '<algorithm hyperparams>' -- optional
);

PostgresML will snapshot the data from the table, train the model with the algorithm, and automatically deploy model improvements as measured by key performance metrics to make predictions in production.

Making predictions

Once the model is trained, making predictions is as simple as:

SELECT pgml.predict('Human-friendly project name', ARRAY[...]) AS prediction_score;

where ARRAY[...] is the same list of features for a sample used in training. This score then can be used in normal queries, for example:

SELECT *,
       pgml.predict(
          'Probability of buying our products',
          ARRAY[user.location, NOW() - user.created_at, user.total_purchases_in_dollars]
       ) AS likely_to_buy_score
FROM users
WHERE comapany_id = 5
ORDER BY likely_to_buy_score
LIMIT 25;

Take a look below for an example with real data.

Model and data versioning

As data in your database changes, it is possible to retrain the model again to get better predictions. With PostgresML, it’s as simple as running the pgml.train command again. If the model scores better, it will be automatically used in predictions; if not, the existing model will be kept and continue to score in your queries. There is also a deployment API if you need to manually manage which model is active. We also snapshot the training data, so models can be retrained deterministically to validate and fix any issues.

Deployments

Models are automatically deployed if their key metric (mean_squared_error for regression, f1 for classification) is improved over the currently deployed version during training. If you want to manage deploys manually, you can always change which model is currently responsible for making predictions with:

SELECT pgml.deploy(project_name TEXT, strategy TEXT DEFAULT 'best_score', algorithm_name TEXT DEFAULT NULL)

The default behavior allows any algorithm to qualify, but deployment candidates can be further restricted to a specific algorithm.

strategy description
most_recent The most recently trained model for this project
best_score The model that achieved the best key metric score
rollback The model that was previously deployed for this project
PostgresML Introspect at GitHub: https://github.com/siteline/SwiftUI-Introspect
Platform: Machine Learning
⭐️: 877
Advertisement

Trending