
[PLUS] Implementing Unsupervised Machine Learning to Segment the Ethereum Market


Understanding K-means
K-means is a popular method in unsupervised learning, a type of machine learning where the algorithm discovers patterns without being told what to look for. Imagine you have a bunch of different fruits mixed together and you want to sort them into groups. K-means does something similar with data. It groups data points into clusters based on similarities, like grouping apples with apples and oranges with oranges, but it does this without prior knowledge of the categories. This makes K-means particularly useful for finding hidden structures in data when we don't have predefined labels or categories.

K-means partitions a dataset into K clusters. It identifies K centroids and assigns every data point to the nearest one, while keeping each cluster as tight as possible (that is, minimizing the sum of squared distances between the points and their cluster's centroid). The steps are outlined below, but the basic idea is to group data points with similar characteristics without knowing anything explicit about the desired outcome. Because there is no "targeted outcome", this methodology in machine learning is classified as unsupervised.

Process- We will download and load the proprietary data metrics calculated from ETH tick data, available to Plus+ members, and apply an unsupervised machine learning algorithm, K-means, to the data set. Before running the algorithm, we will scale each column independently to range between 0 and 1 so that columns with large numeric values do not dominate the others (a USDT volume column doesn't outweigh a pct change column, for example). Additionally, because we have so many features (columns), we will apply a dimensionality reduction technique, PCA (principal component analysis), to reduce the features to only two. This way we can visualize the result of the K-means clustering as something intelligible.
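To make the scaling step concrete, here is a minimal NumPy sketch of per-column min-max scaling. It is an illustration only, not the Plus+ codebase: the function name `min_max_scale` and the three sample rows are hypothetical, chosen to show a large-valued volume column sitting next to a small-valued pct change column.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X independently to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard against constant columns, which would otherwise divide by zero.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# Hypothetical rows: [USDT volume, pct change]. These values are made up.
raw = np.array([
    [1_200_000.0,  0.8],
    [3_500_000.0, -2.1],
    [2_000_000.0,  4.0],
])

scaled = min_max_scale(raw)
print(scaled)
```

After scaling, both columns span the same 0 to 1 range, so a distance calculation treats a move across the full volume range the same as a move across the full pct change range.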

  • Initialization- K initial "means" (centroids) are chosen randomly.
  • Assignment Step- Each data point is assigned to its nearest centroid based on the squared Euclidean distance.
  • Update Step- The centroids are recalculated as the mean of all data points belonging to that cluster.
  • Iteration- The assignment and update steps are repeated until the centroids no longer move significantly or a maximum number of iterations is reached.
  • Choosing K- The number of clusters, K, is a crucial parameter. Methods like the Elbow Method or the Silhouette Score can help determine the optimal number of clusters. We demonstrate how to pick the "correct" number of clusters using the elbow method in this post.
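The steps above can be sketched in a few lines of NumPy. This is a bare-bones illustration of the textbook procedure on synthetic data, not the code behind this post: the function `kmeans`, its parameters, and the three made-up blobs are all assumptions for the example. The loop at the bottom prints the total within-cluster squared distance (often called inertia) for K = 1 to 6, which is the quantity you would plot when applying the elbow method.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Basic K-means loop. Returns (labels, centroids, inertia)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Assignment step: squared Euclidean distance to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        # Iteration: stop once the centroids have essentially stopped moving.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment and total within-cluster squared distance (inertia).
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    inertia = dists[np.arange(len(X)), labels].sum()
    return labels, centroids, inertia

# Synthetic data: three made-up blobs in two dimensions.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])

labels, centroids, inertia = kmeans(X, k=3)

# Choosing K: record the best inertia over a few random starts for each K.
inertias = {k: min(kmeans(X, k, seed=s)[2] for s in range(5)) for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(float(value), 2))
```

Because the starting centroids are random, a single run can settle on a poor grouping, which is why the sketch keeps the best of several seeds for each K. The "elbow" is the K after which the printed inertia stops dropping sharply.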


Noteworthy Findings
The K-means algorithm seems to sort the top 2 PCA components into 3 groups neatly. This analysis leads us to believe that the feature set corresponds to specific types of daily outcomes (i.e., certain volume characteristics and buy/sell patterns or counts can be grouped together and point to similar "day structures"). PCA reduces all of the features to 2 dimensions, but it does this by essentially weighting all of the features at once (think of each component as a multi-variable linear equation in which every feature has its own "weight"). Some big questions to consider further: What is different about the features of days in cluster 0 vs cluster 1? Do the same structures appear when segmenting the Bitcoin market with unsupervised machine learning? Do these clusters have any predictive power in terms of daily price change?
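The "weights" idea can be checked directly. Below is a small NumPy sketch that computes the top two principal components with an SVD and then rebuilds one day's first-component value as a plain weighted sum of its features. The function `pca_2d` and the random 200 x 5 matrix are hypothetical stand-ins for the scaled feature set, not the data used in this post.

```python
import numpy as np

def pca_2d(X):
    """Project X onto its top two principal components via SVD.

    Returns (scores, weights): scores has shape (n_samples, 2) and
    weights has shape (2, n_features), one row of weights per component.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    weights = vt[:2]
    scores = centered @ weights.T
    return scores, weights

rng = np.random.default_rng(0)
X = rng.random((200, 5))  # made-up matrix: 200 "days", 5 scaled features
scores, weights = pca_2d(X)

# Day 0's first component is a weighted sum of its mean-centered features.
day0 = X[0] - X.mean(axis=0)
manual = sum(w * f for w, f in zip(weights[0], day0))
print(np.isclose(manual, scores[0, 0]))
```

Inspecting the rows of `weights` is one way to start answering the first question above, since a large absolute weight means that feature contributes heavily to where a day lands on that axis.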





The Code
Our code uses live historical cryptocurrency OHLCV data (it pulls the latest CSV / date available) and combines it with our own CryptoDataDownload Plus+ proprietary volume metrics, calculated from raw transactional data for ETH on Binance. The volume-focused feature set includes such columns as: 1) Average USD Size 2) Number of Buys 3) Number of Sells 4) VWAP 5) Sell Total Volume 6) Buy Total Volume 7) Largest Buy Transaction 8) Largest Sell Transaction 9) Standard Deviation in Transaction Size in USD (and more!). The numerical data is scaled to the range 0 to 1 for the K-means algorithm so that no single column dominates the distance calculation. The K-means model object is created with 3 clusters and then applied to the scaled data. PCA is then also run across the scaled data. Cluster labels (0, 1, or 2) are added back to the data so we know how K-means grouped the feature set, and the result is graphed. Every line of code is commented so you know exactly what each does and can modify it to fit your purpose!
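For readers who want a feel for the overall flow before opening the Plus+ codebase, here is a rough sketch of the same sequence (scale, cluster, reduce, attach labels) using scikit-learn. It is an assumption that scikit-learn is a suitable library for this; the sketch is not the premium code, and the `features` matrix is random synthetic data standing in for the real OHLCV and volume metrics.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the daily feature table (NOT the Plus+ data):
# 300 "days" by 6 made-up features.
features = rng.lognormal(mean=0.0, sigma=1.0, size=(300, 6))

# Scale every column independently to the range 0 to 1.
scaled = MinMaxScaler().fit_transform(features)

# Create a K-means model with 3 clusters and fit it to the scaled data.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(scaled)

# Reduce the scaled features to two principal components for plotting.
components = PCA(n_components=2).fit_transform(scaled)

# One row per day: PC1, PC2, and the cluster label assigned by K-means.
result = np.column_stack([components, labels])
print(result[:5])
```

Note that scikit-learn numbers the clusters from 0, so three clusters come back as 0, 1, and 2. A scatter plot of the first two columns of `result`, colored by the third, gives the kind of chart discussed in the findings.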

This is a premium post. Create Plus+ Account to view the live, working codebase for this article.




Notice: Information contained herein is not and should not be construed as an offer, solicitation, or recommendation to buy or sell securities. The information has been obtained from sources we believe to be reliable; however, no guarantee is made or implied with respect to its accuracy, timeliness, or completeness. The author does not own any cryptocurrency discussed. The information and content are subject to change without notice. CryptoDataDownload and its affiliates do not provide investment, tax, legal or accounting advice.

This material has been prepared for informational purposes only and is the opinion of the author, and is not intended to provide, and should not be relied on for, investment, tax, legal, or accounting advice. You should consult your own investment, tax, legal and accounting advisors before engaging in any transaction. Content published by CryptoDataDownload is not an endorsement whatsoever. CryptoDataDownload was not compensated to submit this article. Please also visit our Privacy policy; disclaimer; and terms and conditions page for further information.

THE PERFORMANCE OF TRADING SYSTEMS IS BASED ON THE USE OF COMPUTERIZED SYSTEM LOGIC. IT IS HYPOTHETICAL. PLEASE NOTE THE FOLLOWING DISCLAIMER. CFTC RULE 4.41: HYPOTHETICAL OR SIMULATED PERFORMANCE RESULTS HAVE CERTAIN LIMITATIONS. UNLIKE AN ACTUAL PERFORMANCE RECORD, SIMULATED RESULTS DO NOT REPRESENT ACTUAL TRADING. ALSO, SINCE THE TRADES HAVE NOT BEEN EXECUTED, THE RESULTS MAY HAVE UNDER-OR-OVER COMPENSATED FOR THE IMPACT, IF ANY, OF CERTAIN MARKET FACTORS, SUCH AS LACK OF LIQUIDITY. SIMULATED TRADING PROGRAMS IN GENERAL ARE ALSO SUBJECT TO THE FACT THAT THEY ARE DESIGNED WITH THE BENEFIT OF HINDSIGHT. NO REPRESENTATION IS BEING MADE THAT ANY ACCOUNT WILL OR IS LIKELY TO ACHIEVE PROFIT OR LOSSES SIMILAR TO THOSE SHOWN. U.S. GOVERNMENT REQUIRED DISCLAIMER: COMMODITY FUTURES TRADING COMMISSION. FUTURES AND OPTIONS TRADING HAS LARGE POTENTIAL REWARDS, BUT ALSO LARGE POTENTIAL RISK. YOU MUST BE AWARE OF THE RISKS AND BE WILLING TO ACCEPT THEM IN ORDER TO INVEST IN THE FUTURES AND OPTIONS MARKETS. DON’T TRADE WITH MONEY YOU CAN’T AFFORD TO LOSE. THIS IS NEITHER A SOLICITATION NOR AN OFFER TO BUY/SELL FUTURES OR OPTIONS. NO REPRESENTATION IS BEING MADE THAT ANY ACCOUNT WILL OR IS LIKELY TO ACHIEVE PROFITS OR LOSSES SIMILAR TO THOSE DISCUSSED ON THIS WEBSITE. THE PAST PERFORMANCE OF ANY TRADING SYSTEM OR METHODOLOGY IS NOT NECESSARILY INDICATIVE OF FUTURE RESULTS.
