Fixing Data Issues With Volumes in Timeseries


Dealing With Erroneous Volume Metrics
We store data for cryptocurrencies and organize it by individual exchange. This data is often extracted directly from the exchange's API, so it should be highly reliable. However, when dealing with timeseries data, you are almost certain to encounter bad metrics from time to time. Sometimes there are perfectly plausible explanations for why an interval's data point is corrupted or empty. Cryptocurrencies trade 24/7, but an exchange may be unavailable for trading due to system maintenance (scheduled or unscheduled). In the timeseries data on CryptoDataDownload, these "errors" often appear as data points with volumes of zero. While often legitimate, the missing volume becomes a potential problem when trying to fit a model to the data.
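Before choosing a fix, it helps to see how many intervals are affected. The snippet below is a minimal sketch using pandas; the `candles` frame, its column names, and its values are made up for illustration and are not the actual CryptoDataDownload file layout.

```python
import pandas as pd

# Hypothetical daily candles; column names and values are assumptions.
candles = pd.DataFrame(
    {
        "close": [100.0, 101.0, 101.0, 102.5, 103.0],
        "volume": [12.5, 0.0, 9.0, 0.0, 11.0],
    },
    index=pd.date_range("2021-01-01", periods=5, freq="D"),
)

# Flag the intervals whose reported volume is exactly zero.
zero_volume = candles["volume"].eq(0)

print(candles[zero_volume])
print(f"{zero_volume.sum()} of {len(candles)} intervals report zero volume")
```

Whether a flagged row is a genuine outage or a real period with no trades cannot be told from the volume alone, so the flag is only a starting point.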

Exploring Imputation Solutions
There are several techniques that can be used either to backfill (impute) the missing volume metrics or to smooth them out by estimating them from the overall volume distribution. Each technique has its own pros and cons, and some require more involved calculation. The list below is a non-exhaustive set of techniques that may be used to impute missing volume metrics in cryptocurrency timeseries data.

    1) Take a simple average between periods. If only one volume data point is missing between two others, you can take a simple average of the two valid points to smooth over the gap. This method is by far the simplest.
    2) Carry forward the previous time interval's volume data point. This is also a simple, straightforward solution: the previous valid volume metric is reused to fill in the missing point.
    3) Discard the missing intervals. Certain models may not need continuity between time intervals in order to be fit. If this is the case, it may make sense to discard the rows with missing values rather than train the model on a "synthetic" volume metric.
    4) Use quantiles from the distribution to fill the missing data point(s). This technique is by far the most involved of the four. The goal is to find the 50th percentile (the median) of the volume distribution and use it as the most "central" value. This requires you to fit a parametric function to the entire distribution of volumes and find the value that corresponds to the middle of that distribution (50%). You would then use this value to impute the missing volumes.
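The four options above can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not a definitive implementation: the `volume` series is made up, zeros are assumed to mark the missing points, and option 4 uses the empirical median of the observed volumes rather than fitting a parametric distribution.

```python
import pandas as pd

# Hypothetical daily volumes; zeros stand in for the missing data points.
volume = pd.Series(
    [10.0, 0.0, 14.0, 12.0, 0.0, 20.0],
    index=pd.date_range("2021-01-01", periods=6, freq="D"),
    name="volume",
)

# Treat zero volume as missing (NaN) so the fill methods can act on it.
missing = volume.mask(volume.eq(0))

# 1) Simple average of the neighbouring valid points. For a single missing
#    point, linear interpolation yields exactly that midpoint.
averaged = missing.interpolate(method="linear")

# 2) Carry the previous valid volume forward.
carried = missing.ffill()

# 3) Discard the intervals with missing volume.
discarded = missing.dropna()

# 4) Fill with the 50th percentile (empirical median) of observed volumes.
median_volume = missing.quantile(0.5)
median_filled = missing.fillna(median_volume)

comparison = pd.DataFrame(
    {
        "raw": volume,
        "averaged": averaged,
        "carried": carried,
        "median_filled": median_filled,
    }
)
print(comparison)
print(f"rows kept after discarding: {len(discarded)} of {len(volume)}")
```

Note that carry-forward cannot fill a gap at the very start of a series, and interpolation spreads a straight line across longer gaps, so both are best suited to isolated missing points.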




    Notice: Information contained herein is not and should not be construed as an offer, solicitation, or recommendation to buy or sell securities. The information has been obtained from sources we believe to be reliable; however, no guarantee is made or implied with respect to its accuracy, timeliness, or completeness. The author does not own any cryptocurrency discussed. The information and content are subject to change without notice. CryptoDataDownload and its affiliates do not provide investment, tax, legal or accounting advice.

    This material has been prepared for informational purposes only and is the opinion of the author. It is not intended to provide, and should not be relied on for, investment, tax, legal, or accounting advice. You should consult your own investment, tax, legal and accounting advisors before engaging in any transaction. No content published by CryptoDataDownload is an endorsement of any kind. CryptoDataDownload was not compensated to submit this article. Please also visit our Privacy policy; disclaimer; and terms and conditions page for further information.
