Wednesday, July 30, 2014

Integrating Machine Learning Models for Real-Time Prediction into your Existing Workflow (using openscoring and PMML)

In today's world, understanding customers and learning from their behavior is a key component of a company's competitive edge. This covers lower user-retention costs in marketing through intelligently timed re-engagement and a higher lifetime value of users through clever item recommendation systems. It also extends to lower operating costs and a better user experience through modern risk and fraud management: discovering fraudulent participants in marketplaces, or payments at risk, is vital to overall performance.

Especially for established companies outside the mobile and web-service space, adopting new practices and integrating lessons learned from a thorough data analysis can be hard. Established workflows and running systems need to be changed, which can often be a painful experience, especially when outside consultants are hired to conduct the initial study and viability analysis. Their tools might not be a good fit for the company's established stack.

A good way to integrate a new layer of data mining and machine learning models is a middleware layer such as openscoring. It runs independently and is accessed through the REST API it provides. Existing software does not need to be extended with new libraries for data analysis; it only needs to communicate via HTTP and pass on XML requests. That is a very low bar that most systems clear without any additions, and the communication can be implemented without dedicated machine learning specialists.

The machine learning models can be created offline, in a variety of languages such as Python or R. An XML description language, the Predictive Model Markup Language (PMML), is then used to deploy the model on an openscoring server in the cloud. There is even a Heroku-ready version that can be set up with a couple of lines of code in a matter of minutes.

In the following, I will outline how a model created using the statistical language R, e.g. by a group of consultants or an in-house team, is deployed as a service, ready to be integrated into your existing frameworks.

The example is very simple: Linear regression, i.e. fitting a linear function to given data points such that a given error function is minimized. You can just open your RStudio or any other environment for R and try it out yourself.
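To make "fitting a linear function such that an error function is minimized" concrete, here is a minimal sketch in Python rather than R. It assumes a single input feature and the sum of squared errors as the error function; the data points and the function names fit_line and squared_error are made up for illustration.

```python
# Ordinary least squares for one feature, written out by hand.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution: covariance of x and y over variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def squared_error(xs, ys, slope, intercept):
    """The error function that fit_line minimizes."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Made-up points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # prints: 2.0 1.0
```

With noisy real-world data the error will not drop to zero, but the fitted line is still the one with the smallest squared error among all lines.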

The package for R is called pmml and can be installed using the command
> install.packages("pmml")
There is good documentation available online. Since the native output of the package is XML, make sure you have the XML library installed.

The following code snippet fits a regression model to a data file on my web server. Note that the glm call uses family="binomial", i.e. a logistic regression, the classification counterpart of the linear model described above. Please be aware that it omits a number of important steps (such as a careful treatment of missing data, or normalizing the data). But it suffices to give a rough idea: a model is created and passed to the pmml function, which in turn creates an XML description. We store the description in a file named glm-pmml.xml.

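library(pmml)  # converts the fitted model to PMML
library(XML)   # provides saveXML()

# Put the URL or path of the CSV file between the quotes.
rawDataDF <- read.csv("")
rawDataDF <- na.omit(rawDataDF)  # drop rows with missing values

target <- rawDataDF$TARGET_Adjusted

# Hold out 500 rows as a test set and train on the rest.
N <- length(target)
M <- N - 500

data.trainingIndex <- sample(N, M)
data.trainingSet <- rawDataDF[data.trainingIndex, ]
data.testSet <- rawDataDF[-data.trainingIndex, ]

# Use the bare column name as the response, so that "." expands to
# all other columns and the target is not also used as a predictor.
glm.model <- glm(TARGET_Adjusted ~ ., data = data.trainingSet, family = "binomial")
glm.pmml <- pmml(glm.model, name = "GLM Model", data = data.trainingSet)

# Write the PMML document to glm-pmml.xml in the working directory.
xmlFile <- file.path(getwd(), "glm-pmml.xml")
saveXML(glm.pmml, file = xmlFile)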
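Before deploying, it can be worth checking which model element the export actually produced, since that element is what the scoring server has to support. The following Python sketch is only an illustration under a few assumptions: it relies on the standard top-level layout of a PMML document (Header, DataDictionary and so on, followed by one or more model elements such as RegressionModel or TreeModel), and the helper names local_name and model_elements are my own.

```python
import xml.etree.ElementTree as ET

# Top-level PMML children that describe the document, not a model.
NON_MODEL_TAGS = {
    "Header",
    "MiningBuildTask",
    "DataDictionary",
    "TransformationDictionary",
    "Extension",
}


def local_name(tag):
    """Strip a namespace prefix: '{http://...}Name' becomes 'Name'."""
    return tag.rsplit("}", 1)[-1]


def model_elements(root):
    """Return the names of the model elements directly below <PMML>."""
    return [
        local_name(child.tag)
        for child in root
        if local_name(child.tag) not in NON_MODEL_TAGS
    ]


# Example call (assumes glm-pmml.xml exists in the working directory):
# print(model_elements(ET.parse("glm-pmml.xml").getroot()))
```

If the list is empty or contains an element your server version does not handle, that is easier to find out here than from a failed request later.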

After creating the model and storing it in a PMML file, the next step is its deployment. There are two choices: 1) the model can be uploaded via the REST interface, or 2) it can be deployed with openscoring's command-line client.
1) Using the command-line tool curl, a single PUT request uploads the file:

> curl -X PUT --data-binary @glm-pmml.xml -H "Content-type: text/xml" http://localhost:8080/openscoring/model/GLMTest

2) Via the command-line client:

> java -cp client-executable-1.1-SNAPSHOT.jar org.openscoring.client.Deployer --model http://localhost:8080/openscoring/model/GLMTest --file glm-pmml.xml
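The upload from option 1 can also be issued from application code instead of curl. The following is a minimal Python sketch using only the standard library. It assumes the same local URL and content type as the curl call above; the function names build_deploy_request and deploy are my own and not part of openscoring.

```python
import urllib.request


def build_deploy_request(pmml_bytes, model_url):
    """Build (but do not send) the PUT request that uploads a PMML document."""
    return urllib.request.Request(
        model_url,
        data=pmml_bytes,
        method="PUT",
        headers={"Content-Type": "text/xml"},
    )


def deploy(pmml_path, model_url):
    """Upload the PMML file and return the HTTP status code of the reply."""
    with open(pmml_path, "rb") as pmml_file:
        request = build_deploy_request(pmml_file.read(), model_url)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx replies.
    with urllib.request.urlopen(request) as response:
        return response.status


# Example call (needs a running openscoring instance):
# deploy("glm-pmml.xml", "http://localhost:8080/openscoring/model/GLMTest")
```

Keeping the request construction separate from the sending makes it easy to check the method, headers and body without a server.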

The setup of openscoring itself is straightforward. Either clone the repository on GitHub and deploy it directly to Heroku, or download and install it locally. The documentation of openscoring and of Maven provides step-by-step instructions.

Using a running instance of openscoring with its model is simple: Just send requests via HTTP. For the sake of simplicity, we can just feed back the whole CSV file we used for training:

> curl -X POST --data-binary @simple_model.csv -H "Content-type: text/plain" http://localhost:8080/openscoring/model/GLMTest/csv

The answer will be a list of input-output values. Instead of using curl to send requests via the command line, you can easily integrate the API with your existing software projects, e.g. to receive a score that estimates the likelihood of fraudulent offers on your marketplace. A prominent user of openscoring is Airbnb: the young company uses decision tree models deployed on openscoring servers to evaluate and catch fraudulent bookings in real time.
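As a rough idea of what such an integration could look like, here is a Python sketch of the same CSV request using only the standard library. The URL and content type are taken from the curl call above. The function names are hypothetical, and the parsing assumes that the reply is comma-separated text with a header row, which you should check against your openscoring version.

```python
import csv
import io
import urllib.request


def build_scoring_request(csv_bytes, csv_url):
    """Build (but do not send) the POST request that submits CSV rows."""
    return urllib.request.Request(
        csv_url,
        data=csv_bytes,
        method="POST",
        headers={"Content-Type": "text/plain"},
    )


def parse_csv_reply(text, delimiter=","):
    """Turn a CSV reply with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


def score_csv(csv_path, csv_url):
    """Send a CSV file to the model's csv endpoint and parse the reply."""
    with open(csv_path, "rb") as csv_file:
        request = build_scoring_request(csv_file.read(), csv_url)
    with urllib.request.urlopen(request) as response:
        return parse_csv_reply(response.read().decode("utf-8"))


# Example call (needs a running openscoring instance with the model deployed):
# rows = score_csv("simple_model.csv",
#                  "http://localhost:8080/openscoring/model/GLMTest/csv")
```

Each returned dict maps column names to string values, so the calling code still has to convert the score column to a number before comparing it to a threshold.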

Are there any drawbacks to this approach? Yes, in some cases: since the machine learning models need to be supported in the PMML language, the newest ideas presented in research papers cannot be moved directly into production with openscoring and PMML. But for the vast majority of use cases, this does not matter a lot: while new models often have a slight edge in their specific application area as presented in papers, the transfer to a company's application will not automatically translate into the same performance advantage over traditional models. The amount of fine-tuning needed to realize any advantage usually outweighs the disadvantage of using a slightly older machine learning model.

Stay tuned for my follow-up article covering the use of PMML for data crunching using Hadoop.