Code
PostgresML – End-to-end machine learning system with SQL

PostgresML is an end-to-end machine learning system. It enables you to train models and make online predictions using only SQL, without your data ever leaving your favorite database.
Motivation behind PostgresML
Deploying machine learning models into existing applications is not straight forward. It involves operating new services, which need to be written in specialized languages with libraries outside of the experience of many software engineers. Those services tend to be architected around specialized datastores and hardware that requires additional management and know how. Data access needs to be secure across production and development environments without impeding productivity. This complexity pushes risks and costs beyond acceptable trade off limits for many otherwise valuable use cases.
PostgresML makes ML simple by moving the code to your data, rather than copying the data all over the place. You train models using simple SQL commands, and you get the predictions in your apps via a mechanism you’re already using: a query over a standard Postgres connection.
Our goal is that anyone with a basic understanding of SQL should be able to build, deploy and maintain machine learning models in production, while receiving the benefits of a high performance machine learning platform. Ultimately, PostgresML aims to be the easiest, safest and fastest way to gain value from machine learning.
PostgresML Features
Training models
Given a Postgres table or a view, PostgresML can train a model with many commonly used algorithms. We currently support the following regression and classification models from Scikit-Learn and XGBoost:
XGBoost
Algorithm | Regression | Classification |
---|---|---|
xgboost |
XGBRegressor | XGBClassifier |
xgboost_random_forest |
XGBRFRegressor | XGBRFClassifier |
Scikit Ensembles
Algorithm | Regression | Classification |
---|---|---|
ada_boost |
AdaBoostRegressor | AdaBoostClassifier |
bagging |
BaggingRegressor | BaggingClassifier |
extra_trees |
ExtraTreesRegressor | ExtraTreesClassifier |
gradient_boosting_trees |
GradientBoostingRegressor | GradientBoostingClassifier |
random_forest |
RandomForestRegressor | RandomForestClassifier |
hist_gradient_boosting |
HistGradientBoostingRegressor | HistGradientBoostingClassifier |
Support Vector Machines
Algorithm | Regression | Classification |
---|---|---|
svm |
SVR | SVC |
nu_svm |
NuSVR | NuSVC |
linear_svm |
LinearSVR | LinearSVC |
Linear Models
Algorithm | Regression | Classification |
---|---|---|
linear |
LinearRegression | LogisticRegression |
ridge |
Ridge | RidgeClassifier |
lasso |
Lasso | – |
elastic_net |
ElasticNet | – |
least_angle |
LARS | – |
lasso_least_angle |
LassoLars | – |
orthoganl_matching_pursuit |
OrthogonalMatchingPursuit | – |
bayesian_ridge |
BayesianRidge | – |
automatic_relevance_determination |
ARDRegression | – |
stochastic_gradient_descent |
SGDRegressor | SGDClassifier |
perceptron |
– | Perceptron |
passive_aggressive |
PassiveAggressiveRegressor | PassiveAggressiveClassifier |
ransac |
RANSACRegressor | – |
theil_sen |
TheilSenRegressor | – |
huber |
HuberRegressor | – |
quantile |
QuantileRegressor | – |
Other
Algorithm | Regression | Classification |
---|---|---|
kernel_ridge |
KernelRidge | – |
gaussian_process |
GaussianProcessRegressor | GaussianProcessClassifier |
Training a model is then as simple as:
SELECT * FROM pgml.train(
'Human-friendly project name',
'regression',
'<name of the table or view containing the data>',
'<name of the column containing the y target values>',
'<algorithm name>', -- optional
'<algorithm hyperparams>' -- optional
);
PostgresML will snapshot the data from the table, train the model with the algorithm, and automatically deploy model improvements as measured by key performance metrics to make predictions in production.
Making predictions
Once the model is trained, making predictions is as simple as:
SELECT pgml.predict('Human-friendly project name', ARRAY[...]) AS prediction_score;
where ARRAY[...]
is the same list of features for a sample used in training. This score then can be used in normal queries, for example:
SELECT *,
pgml.predict(
'Probability of buying our products',
ARRAY[user.location, NOW() - user.created_at, user.total_purchases_in_dollars]
) AS likely_to_buy_score
FROM users
WHERE comapany_id = 5
ORDER BY likely_to_buy_score
LIMIT 25;
Take a look below for an example with real data.
Model and data versioning
As data in your database changes, it is possible to retrain the model again to get better predictions. With PostgresML, it’s as simple as running the pgml.train
command again. If the model scores better, it will be automatically used in predictions; if not, the existing model will be kept and continue to score in your queries. There is also a deployment API if you need to manually manage which model is active. We also snapshot the training data, so models can be retrained deterministically to validate and fix any issues.
Deployments
Models are automatically deployed if their key metric (mean_squared_error
for regression, f1
for classification) is improved over the currently deployed version during training. If you want to manage deploys manually, you can always change which model is currently responsible for making predictions with:
SELECT pgml.deploy(project_name TEXT, strategy TEXT DEFAULT 'best_score', algorithm_name TEXT DEFAULT NULL)
The default behavior allows any algorithm to qualify, but deployment candidates can be further restricted to a specific algorithm.
strategy | description |
---|---|
most_recent | The most recently trained model for this project |
best_score | The model that achieved the best key metric score |
rollback | The model that was previously deployed for this project |
