[PLUS] Training an SVM Model to Predict Next Day Positive or Negative Returns
Support Vector Machines
Support Vector Machines (SVMs) have been a popular machine learning technique for classifying data since the early 1990s. In basic terms, an SVM attempts to find a boundary that separates the two classes of data into groups based on a set of parameters. Predictions can then be made on new data simply by observing which side of the boundary the new data points fall on. Some of the benefits of using SVM as a machine learning technique are: 1) it is effective at finding classifications with limited amounts of data, 2) it can handle nonlinear relationships in the data, 3) it is robust against overfitting datasets (which would lead to a false sense of prediction confidence), and 4) it can handle datasets with outliers. The goal of the model designed in this write-up is to predict whether the following day will have a positive or negative return.
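The idea above can be shown in a minimal sketch (not the article's codebase): fit an SVM on two synthetic 2-D clusters, then classify new points by which side of the learned boundary they fall on.

```python
# Minimal SVM sketch on toy data; the clusters are synthetic stand-ins
# for the two classes ("positive day" vs. "negative day").
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters, 50 points each
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear")  # the classic linear kernel for a first look
clf.fit(X, y)

# New observations are labeled by which side of the boundary they land on
print(clf.predict([[-2.0, -2.0], [2.0, 2.0]]))
```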
Data Feature Set
We will be analyzing ETHUSDT data from Binance to create the model. One part of the data calculates the percentage return of ETHUSDT each day using the freely available OHLCV time series. The other part of the feature set uses our internal Meta Summary Statistics, which includes meta-statistical data calculated from raw transactional data. This data takes a closer look at volume and transaction data to calculate fields such as: average USD trade size, Volume Weighted Average Price (VWAP), number of buy (or sell) transactions, largest buy (sell) orders, total buy (sell) volumes, and the standard deviation of transaction USD amounts. All data resources can be found on the Binance historical data page. The last column of our dataset is a binary classification value, either 1 or 0: it is 1 if that day's return is positive and 0 if not. This last column will ultimately be used to train the model and test its performance.
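The target construction described above can be sketched as follows; the column names here are illustrative, not the article's exact schema, and a few hard-coded closes stand in for the real OHLCV download.

```python
# Sketch of building the binary target column from daily closes.
import pandas as pd

ohlcv = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=5),
    "close": [1200.0, 1230.0, 1210.0, 1210.0, 1250.0],
})

# Daily percentage return from the close-to-close change
ohlcv["return_pct"] = ohlcv["close"].pct_change() * 100

# Binary classification column: 1 if the day's return is positive, else 0
ohlcv["target"] = (ohlcv["return_pct"] > 0).astype(int)

print(ohlcv[["close", "return_pct", "target"]])
```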
This section will talk through the data setup and model design process at a very high level, and the following section in code will provide more specifics on implementation. One of the first things that needs to be done in order to build the model is to set up the data structure. For this model, we have a set of parameters (often called "features") that we will use as inputs so that the model can "predict" an outcome. The outcome variable is called the "target" variable. Generally, you can think of the data structure as an Excel table of columns, where the farthest-right column is the "desired outcome" or target variable. Once the data is set up, the next step is to split the data between a "training set" and a "testing set". The training set is used to tune the parameters of the model, and the testing set is then used to make predictions to see how good the model is. There are a few schools of thought on the proper percentage split between training and testing, but we will use an 80/20 split in this model setup (this split percentage can be changed in the code). After we split the data, we need to standardize and scale the training and test variables, since we are working with variables of different scales. For example, one variable is "largest trade size" and another is "count of total trades": the largest trade could be 50 million while the trade count could be 10,000, so in order for the model to properly compare these variables, they need to be scaled so that one quantitative variable does not outweigh the others. Finally, we pass the training set to the model to "train" it, then make our test predictions. The last step is to assess how well the model predicted the test set and evaluate how the model does on this "unseen" data.
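The split-and-scale steps above can be sketched on synthetic data; the 80/20 split and the standardizing scaler match the setup described, while the feature values are made up to mimic the scale mismatch between trade size and trade count.

```python
# Sketch of the 80/20 split plus standardization described above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Features on wildly different scales: largest trade size vs. trade count
X = np.column_stack([rng.uniform(1e6, 5e7, 500), rng.uniform(1e3, 1e4, 500)])
y = rng.integers(0, 2, 500)

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data ONLY, then apply it to both sets,
# so no information from the test set leaks into training
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))  # each column centered near 0
```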
The Python code has 3 functions to run the model. When the script is executed, it calls the "main" function first; this function just provides some structure for calling the functions that follow. Data is loaded from the web endpoints in the load_data function and then staged for the model. This staging process includes dropping columns that we do not need and joining the two datasets together. All of the staged data is loaded into Pandas DataFrames. Once the data is staged, the next function, "create_model", is called with the loaded/staged data. There, the data is split between the "features" and the "target" variable, then split into training and testing sets, then scaled to put the quantitative variables on a comparable footing, and ultimately used to train the model and make predictions. Some basic model performance statistics are printed at the end: 1) accuracy, 2) precision, 3) recall, and 4) F1 score. All of this is done in 85 lines of code, and every line is commented so that you can adapt the variables or model to your purpose.
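A condensed sketch of that three-function structure is below. The function names mirror the ones described, but the bodies are illustrative: synthetic data stands in for the Binance web endpoints, and the label is built to loosely depend on the features so the model has something to learn.

```python
# Condensed sketch of the main / load_data / create_model structure.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def load_data():
    """Stage a feature table with a binary 'target' column (synthetic here)."""
    rng = np.random.default_rng(7)
    df = pd.DataFrame(rng.normal(size=(300, 4)),
                      columns=["return_pct", "avg_trade_usd", "vwap", "buy_count"])
    # Synthetic label that loosely depends on the features
    df["target"] = (df["return_pct"] + 0.5 * df["buy_count"] > 0).astype(int)
    return df

def create_model(df):
    """Split, scale, train an SVM, and print basic performance metrics."""
    X, y = df.drop(columns="target"), df["target"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    scaler = StandardScaler()
    X_train, X_test = scaler.fit_transform(X_train), scaler.transform(X_test)
    model = SVC(C=5, kernel="rbf").fit(X_train, y_train)
    preds = model.predict(X_test)
    print("Accuracy :", accuracy_score(y_test, preds))
    print("Precision:", precision_score(y_test, preds))
    print("Recall   :", recall_score(y_test, preds))
    print("F1 Score :", f1_score(y_test, preds))
    return accuracy_score(y_test, preds)

def main():
    create_model(load_data())

if __name__ == "__main__":
    main()
```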
The model shows an accuracy score of ~64% when using an SVM C parameter equal to 5 and a radial basis function kernel. The choice of parameters and "tuning" the model goes beyond the scope of this article, but there are several kernel functions the machine can use to "separate" the data in SVM. The most classic kernel is "linear", but there are other higher-order kernels that project nonlinear and complex data into higher-dimensional spaces, and the radial basis function (RBF) is one of those nonlinear kernels. To summarize: the SVM model, using the Meta Summary Statistics and asset returns as features, has a better-than-coin-flip (50/50) chance of predicting whether the next day will show a positive or negative return. Further parameter tuning and additional features could produce better results, but that also goes beyond the scope of this article.
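The linear-vs-RBF distinction can be seen on a classic toy dataset: concentric circles, which no straight line can separate in the original feature space. C=5 matches the setting reported above; the data here is synthetic, not the article's.

```python
# Illustrative sketch: the same nonlinear data under a linear vs. an RBF kernel.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not separable by a straight line in the original space
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_acc = SVC(kernel="linear", C=5).fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf", C=5).fit(X_train, y_train).score(X_test, y_test)

print(f"linear kernel: {linear_acc:.2f}, rbf kernel: {rbf_acc:.2f}")
```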
SVM Model Performance Results
*Results can vary drastically based on the test "seed", i.e. the starting point used to split the data
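That seed-dependence is easy to demonstrate: the same model on the same (synthetic, noisy) data, scored across a handful of different random_state values for the split.

```python
# Sketch: identical model and data, varying only the split's random seed.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy synthetic classification data (flip_y injects label noise)
X, y = make_classification(n_samples=200, n_features=8, flip_y=0.3, random_state=1)

scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    scores.append(SVC(C=5, kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te))

print([round(s, 2) for s in scores])  # accuracy moves with the seed
```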
Notice: Information contained herein is not and should not be construed as an offer, solicitation, or recommendation to buy or sell securities. The information has been obtained from sources we
believe to be reliable; however, no guarantee is made or implied with respect to its accuracy, timeliness, or completeness. The author does not own any cryptocurrency discussed. The information
and content are subject to change without notice. CryptoDataDownload and its affiliates do not provide investment, tax, legal, or accounting advice.
This material has been prepared for informational purposes only and is the opinion of the author; it is not intended to provide, and should not be relied on for, investment, tax, legal, or
accounting advice. You should consult your own investment, tax, legal, and accounting advisors before engaging in any transaction. All content published by CryptoDataDownload is not an
THE PERFORMANCE OF TRADING SYSTEMS IS BASED ON THE USE OF COMPUTERIZED SYSTEM LOGIC. IT IS HYPOTHETICAL.
PLEASE NOTE THE FOLLOWING DISCLAIMER.
CFTC RULE 4.41: HYPOTHETICAL OR SIMULATED PERFORMANCE RESULTS HAVE CERTAIN LIMITATIONS. UNLIKE AN ACTUAL
PERFORMANCE RECORD, SIMULATED RESULTS DO NOT REPRESENT ACTUAL TRADING. ALSO, SINCE THE TRADES HAVE NOT BEEN
EXECUTED, THE RESULTS MAY HAVE UNDER- OR OVER-COMPENSATED FOR THE IMPACT, IF ANY, OF CERTAIN MARKET FACTORS,
SUCH AS LACK OF LIQUIDITY. SIMULATED TRADING PROGRAMS IN GENERAL ARE ALSO SUBJECT TO THE FACT THAT THEY ARE
DESIGNED WITH THE BENEFIT OF HINDSIGHT. NO REPRESENTATION IS BEING MADE THAT ANY ACCOUNT WILL OR IS LIKELY
TO ACHIEVE PROFIT OR LOSSES SIMILAR TO THOSE SHOWN. U.S. GOVERNMENT REQUIRED DISCLAIMER: COMMODITY FUTURES
TRADING COMMISSION. FUTURES AND OPTIONS TRADING HAS LARGE POTENTIAL REWARDS, BUT ALSO LARGE POTENTIAL RISK.
YOU MUST BE AWARE OF THE RISKS AND BE WILLING TO ACCEPT THEM IN ORDER TO INVEST IN THE FUTURES AND OPTIONS MARKETS.
DON’T TRADE WITH MONEY YOU CAN’T AFFORD TO LOSE. THIS IS NEITHER A SOLICITATION NOR AN OFFER TO BUY/SELL FUTURES
OR OPTIONS. NO REPRESENTATION IS BEING MADE THAT ANY ACCOUNT WILL OR IS LIKELY TO ACHIEVE PROFITS OR LOSSES
SIMILAR TO THOSE DISCUSSED ON THIS WEBSITE. THE PAST PERFORMANCE OF ANY TRADING SYSTEM OR METHODOLOGY IS NOT
NECESSARILY INDICATIVE OF FUTURE RESULTS.