Here's a quick demo of how to fit a binary classification model with caretEnsemble. Please note that I haven't spent as much time debugging caretEnsemble for classification models, so there are probably more bugs than in my last post. Also note that multi-class models are not yet supported.
Modern Toolmaking
Practical tools for predictive modeling, data science, machine learning and web scraping
Saturday, March 16, 2013
Wednesday, March 13, 2013
New package for ensembling R models
I've written a new R package called caretEnsemble for creating ensembles of caret models in R. It currently works well for regression models, and I've written some preliminary support for binary classification models.
Thursday, January 24, 2013
Time series cross-validation 5
The caret package for R now supports time series cross-validation! (Look for version 5.15-052 in the news file). You can use the createTimeSlices function to do time-series cross-validation with a fixed window, as well as a growing window. This function generates a list of indexes for the training set, as well as a list of indexes for the test set, which you can then pass to the trainControl object.
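As a minimal sketch of what that looks like in practice, the snippet below builds fixed-window and growing-window slices and hands them to trainControl. The toy series `y` and the window sizes are assumptions chosen for illustration; only createTimeSlices and trainControl come from caret.

```r
library(caret)

y <- 1:20  # stand-in for a series of 20 observations (illustrative only)

# Fixed window: every training slice contains exactly 10 observations
fixed <- createTimeSlices(y, initialWindow = 10, horizon = 3,
                          fixedWindow = TRUE)

# Growing window: every training slice starts at observation 1
growing <- createTimeSlices(y, initialWindow = 10, horizon = 3,
                            fixedWindow = FALSE)

fixed$train[[2]]    # training indexes for the second slice
fixed$test[[2]]     # the 3 observations that follow it
growing$train[[2]]  # same end point, but starting from observation 1

# Pass the training and test indexes to trainControl
ctrl <- trainControl(index = fixed$train, indexOut = fixed$test)
```

The resulting `ctrl` object can then be supplied to train through its trControl argument.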
Friday, July 6, 2012
Error metrics for multi-class problems in R: beyond Accuracy and Kappa
The caret package for R provides a variety of error metrics for regression models and 2-class classification models, but only calculates Accuracy and Kappa for multi-class models. Therefore, I wrote the following function to allow caret:::train to calculate a wide variety of error metrics for multi-class problems:
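The function from the original post is not reproduced in this excerpt. What follows is only a minimal base R sketch of the general idea, not the author's code: a summary function in the shape caret expects (a data frame with `obs` and `pred` columns, plus `lev` and `model`) that macro-averages one-vs-all metrics across classes. The name `multiClassSketch` and the choice of metrics are assumptions.

```r
# Hypothetical summary function; not the function from the original post.
# `data` must have factor columns `obs` (observed) and `pred` (predicted).
multiClassSketch <- function(data, lev = NULL, model = NULL) {
  if (is.null(lev)) lev <- levels(data$obs)
  obs  <- factor(data$obs,  levels = lev)
  pred <- factor(data$pred, levels = lev)

  cm <- table(pred, obs)       # rows: predicted, columns: observed
  tp <- diag(cm)               # per-class true positives
  fp <- rowSums(cm) - tp       # predicted as the class, but wrong
  fn <- colSums(cm) - tp       # members of the class that were missed
  tn <- sum(cm) - tp - fp - fn

  # One-vs-all metrics, averaged with equal weight per class
  c(Accuracy         = sum(tp) / sum(cm),
    Mean_Sensitivity = mean(tp / (tp + fn)),
    Mean_Specificity = mean(tn / (tn + fp)),
    Mean_Precision   = mean(tp / (tp + fp)))
}

# Toy 3-class example (made-up labels)
toy <- data.frame(
  obs  = factor(c("a", "a", "a", "b", "b", "c", "c", "c")),
  pred = factor(c("a", "a", "b", "b", "b", "c", "c", "a"))
)
metrics <- multiClassSketch(toy)
print(metrics)
```

A function of this shape can be passed to caret through trainControl's summaryFunction argument. Note that a class that is never predicted would produce a NaN precision in this sketch, so a real version would need to handle that case.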
Labels:
caret,
error metrics,
kaggle,
predictive modeling,
R
Monday, June 11, 2012
Time series cross-validation 4: forecasting the S&P 500
I finally got around to publishing my time series cross-validation package to GitHub, and I plan to push it out to CRAN shortly.
Labels:
backtesting,
cross-validation,
finance,
forecasting,
model building,
time-series
Monday, January 23, 2012
My first R package: parallel differential evolution
UPDATE: a better parallel algorithm will be included in a future version of DEoptim, so I've removed my package from CRAN. You can still use the code from this post, but keep Josh's comments in mind.
Last night I was working on a difficult optimization problem, using the wonderful DEoptim package for R. Unfortunately, the optimization was taking a long time, so I thought I'd speed it up using a foreach loop, which resulted in the following function:
Here's what's going on: I divide the bounds for each parameter into n segments, and use a foreach loop to run DEoptim on each segment, collect the results of the loop, and then return the optimization results for the segment with the lowest value of the objective function. Additionally, I defined a "parDEoptim" class to make it easier to combine the results during the foreach loop. All of the work is still being done by the DEoptim algorithm. All I've done is split up the problem into several chunks.
Here is an example, straight out of the DEoptim documentation:
In theory, on a 20-core machine, this should run a bit faster than the serial example. Note that you may need to set itermax for the parallel run at a higher value than (itermax for the serial run)/(number of segments), as you want to make sure the algorithm can find the minimum of each segment. Also note that, in this example, there are 20 segments on the interval c(-10,-10) to c(10,10), which means that 2 of the segments have boundaries at c(1,1), which is the global minimum of the function. The DEoptim algorithm has no trouble finding a solution at the boundary of the parameter space, which is why it's so easy to parallelize.
Rumor has it that the next version of DEoptim will include foreach parallelization, but if you can't wait until then, I rolled up the above function into an R package and posted it to CRAN. Let me know what you think!
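The package code itself is not reproduced in this excerpt. To illustrate the bound-splitting scheme described above, here is a rough base R sketch under stated assumptions: `split_bounds` is a hypothetical helper, base R's optim (L-BFGS-B) stands in for DEoptim, and lapply stands in for the foreach loop, so the sketch runs without extra packages. It is not the parDEoptim function from the post.

```r
# Rosenbrock "banana" function; global minimum of 0 at c(1, 1)
rosenbrock <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2

# Hypothetical helper: split each parameter's bounds into n equal
# segments; segment i pairs the i-th slice of every parameter.
split_bounds <- function(lower, upper, n) {
  width <- (upper - lower) / n
  lapply(seq_len(n), function(i) {
    list(lower = lower + (i - 1) * width,
         upper = lower + i * width)
  })
}

segments <- split_bounds(c(-10, -10), c(10, 10), n = 20)

# Optimize within each segment, starting from its midpoint.
# optim() stands in for DEoptim; lapply() stands in for foreach.
fits <- lapply(segments, function(s) {
  optim((s$lower + s$upper) / 2, rosenbrock, method = "L-BFGS-B",
        lower = s$lower, upper = s$upper)
})

# Keep the result from the segment with the lowest objective value
values <- vapply(fits, function(f) f$value, numeric(1))
best <- fits[[which.min(values)]]

segments[[11]]$upper  # upper corner of segment 11 is c(1, 1)
segments[[12]]$lower  # lower corner of segment 12 is c(1, 1)
```

Segments 11 and 12 both touch c(1, 1), which is why the optimizer must cope with a minimum on the boundary. Swapping the lapply call for a foreach loop with a registered backend is what would actually run the segments in parallel.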
Labels:
DEoptim,
optimization,
packages,
R
Thursday, December 29, 2011
Benchmarking time series models
This is a quick post on the importance of benchmarking time-series forecasts. First we need to reload the functions from my last few posts on time-series cross-validation. (I copied the relevant code at the bottom of this post so you don't have to find it).
Next, we need to load data for the S&P 500. To simplify things, and allow us to explore seasonality effects, I'm going to load monthly data, back to 1980.
Labels:
backtesting,
cross-validation,
finance,
forecasting,
model building,
time-series
Monday, December 12, 2011
Time series cross-validation 3
I've updated my time-series cross-validation algorithm to fix some bugs and allow for a possible xreg term. This allows for cross-validation of multivariate models, so long as they are specified as a function with the following parameters: x (the series to model), xreg (independent variables, optional), newxreg (xregs for the forecast), and h (the number of periods to forecast). Note that h should equal the number of rows in the xreg matrix. Also note that you need to forecast the xreg object BEFORE forecasting your x object. For example, if you wish to forecast 12 months into the future, your xreg object should have 12 extra rows.
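As a rough illustration of a model function with that signature, the sketch below wraps base R's arima. The function name `arima_xreg_forecast`, the AR(1) order, and the simulated data are all assumptions, and reading the row-count note as "newxreg must have h rows" is an interpretation rather than something the post states explicitly.

```r
# Hypothetical model wrapper with the x / xreg / newxreg / h signature
arima_xreg_forecast <- function(x, xreg = NULL, newxreg = NULL, h) {
  # Assumed reading of the note above: one row of newxreg per period
  if (!is.null(newxreg)) stopifnot(nrow(newxreg) == h)
  fit <- arima(x, order = c(1, 0, 0), xreg = xreg)
  as.numeric(predict(fit, n.ahead = h, newxreg = newxreg)$pred)
}

set.seed(42)
n <- 60   # months of history
h <- 12   # months to forecast

# The regressor has n + h rows: the 12 "extra" rows cover the forecast
# period and would themselves need to be forecast in a real application.
xreg_all <- matrix(rnorm(n + h), ncol = 1)
x <- ts(2 * xreg_all[1:n, 1] + rnorm(n), frequency = 12)

fc <- arima_xreg_forecast(
  x,
  xreg    = xreg_all[1:n, , drop = FALSE],
  newxreg = xreg_all[(n + 1):(n + h), , drop = FALSE],
  h       = h
)
length(fc)  # one forecast per period in the horizon
```

The same wrapper also works as a univariate model when xreg and newxreg are left as NULL.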
Labels:
backtesting,
cross-validation,
finance,
forecasting,
model building,
time-series
Monday, December 5, 2011
A pure R poker hand evaluator
There are already a lot of great posts out there about poker hand evaluators, so I'll keep this short. Kenneth J. Shackleton recently released a very slick 5-card and 7-card poker hand evaluator called SpecialK. This evaluator is licensed under GPL 3, and is described in detail in 2 blog posts: part 1 and part 2. Since the provided code is open source, I felt free to hack around with it a bit, and ported the Python source to R.
Labels:
pokeR
Tuesday, November 22, 2011
Time series cross-validation 2
In my previous post, I shared a function for parallel time-series cross-validation, based on Rob Hyndman's code. I thought I'd expand on that example a little bit, and share some additional wrapper functions I wrote to test other forecasting algorithms. Before you try this at home, be sure to load the cv.ts and tsSummary functions from my last post.
Labels:
backtesting,
cross-validation,
finance,
forecasting,
model building,
time-series