Griefberg (https://griefberg.me/): A blog about data science, analytics and technology

Statistical comparison of two means (2018-03-29, https://griefberg.me/how-to-compare-two-sample-means)

<p>One common example of statistical hypothesis testing is the comparison of two means. Imagine that you’re analyzing the results of a survey about the relationship between gender and wages. All you want to know is whether any difference in average wages exists. Let’s illustrate this case with data from the survey <a href="https://vincentarelbundock.github.io/Rdatasets/doc/Ecdat/Wages1.html">Wages, Experience and Schooling</a>, which contains hourly wages (in 1980 dollars) of males and females in the USA in 1980.</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt; library(data.table)
&gt; dt &lt;- fread('https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/Ecdat/Wages1.csv')
&gt; head(dt)
   exper    sex school     wage
1:     9 female     13 6.315296
2:    12 female     12 5.479770
3:    11 female     11 3.642170
4:     9 female     14 4.593337
5:     8 female     14 2.418157
6:     9 female     14 2.094058
</code></pre></div></div> <p>The first thing you can do is calculate the average wages for females and males:</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt; dt[, .(avg_wage=mean(wage, na.rm=TRUE)), by='sex']
      sex avg_wage
1: female 5.146924
2:   male 6.313021
</code></pre></div></div> <p>However, you can’t just compare these two numbers and draw a conclusion. First, you need to understand what population you want to make a conclusion about. In our case we’re dealing with two populations: males and females in the USA. However, we don’t have data about the whole populations. All we have is our survey data, which gives us just two samples of randomly selected men and women in the USA. 
We’re using <strong>samples as a way to investigate populations</strong>. When we calculate a sample mean, we use it only as an estimate of the population mean. This estimate has an error, i.e. the population mean actually lies in the interval <em>(sample mean - error, sample mean + error)</em>. Such an interval is called a <strong>confidence interval</strong>.</p> <h2 id="what-are-the-confidence-intervals">What are confidence intervals?</h2> <p>First, we need to familiarize ourselves with probably the two most important theorems in the world of scientific research: the <strong>Law of Large Numbers (LLN)</strong> and the <strong>Central Limit Theorem (CLT)</strong>. The latter is actually a set of theorems, but that doesn’t matter for now. I will give you a brief overview of both, but if you want a deeper understanding, watch <a href="https://youtu.be/OprNqnHsVIA">this</a>.</p> <p><strong>Law of Large Numbers (LLN)</strong><br /> This theorem tells us that if the number of observations in a sample approaches infinity (in practice, we should have at least 30 observations), then the sample mean <script type="math/tex">\bar x</script> is a good approximation of the population mean <script type="math/tex">μ</script>:</p> <p>\begin{equation} \bar X_n \to μ \text{ as } n \to \infty \text{ with probability 1} \end{equation}</p> <p><strong>Central Limit Theorem (CLT)</strong> <br /> Imagine that we have some distribution (it doesn’t matter which!) with known <script type="math/tex">μ</script> (population mean) and <script type="math/tex">\sigma^2</script> (variance). 
If we construct a lot of samples of size <script type="math/tex">n</script> from this distribution and calculate all the sample means <script type="math/tex">\bar x_1, \bar x_2, ..., \bar x_k</script>, then this new random variable of sample means <script type="math/tex">\bar X_n</script> follows a Normal distribution with mean <script type="math/tex">μ</script> and variance <script type="math/tex">\frac{\sigma^2}{n}</script>:</p> <p>\begin{equation} <br /> \bar X_n \sim N(μ, \frac{\sigma^2}{n}) <br /> \end{equation}</p> <p>If we <a href="https://en.wikipedia.org/wiki/Standard_score">standardize</a> <script type="math/tex">\bar X_n</script>, then we get the following:</p> <p>\begin{equation} <br /> \frac{\bar X_n - μ}{\sigma / \sqrt{n}} \sim N(0, 1) <br /> \end{equation}</p> <p>This allows us to use the properties of the standard Normal distribution:</p> <p><img src="https://griefberg.me/assets/images/how-to-compare-means/confidence_interval.png" alt="Confidence Interval" /></p> <p>The picture above shows the Z-score (technically, the standard Normal distribution). We know everything about this distribution, in particular that 95 % of its values lie between -1.96 and 1.96. Stop. Does that mean that 95 % of the values of our random variable <script type="math/tex">\frac{\bar X_n - μ}{\sigma / \sqrt{n}}</script> also lie in the range [-1.96, 1.96]? Yes, exactly! Let’s write this in a more statistical way:</p> <p>\begin{equation} <br /> P(-1.96 \leq \frac{\bar X_n - μ}{\sigma / \sqrt{n}} \leq 1.96) = 0.95 \\ <br /> P(\frac{-1.96 \sigma}{\sqrt{n}} \leq \bar X_n - μ \leq \frac{1.96 \sigma}{\sqrt{n}}) = 0.95 \\ P(\bar X_n - \frac{1.96 \sigma}{\sqrt{n}} \leq μ \leq \bar X_n + \frac{1.96 \sigma}{\sqrt{n}}) = 0.95 \end{equation}</p> <p>So, with 95 % probability the actual mean <script type="math/tex">μ</script> lies in the interval <script type="math/tex">\bar X_n \pm \frac{1.96 \sigma}{\sqrt{n}}</script>. 
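</p>

<p>The CLT is easy to see in a quick simulation. The sketch below (plain R; the distribution and parameters are illustrative choices, not from the post) draws many samples from a skewed exponential distribution and checks that their means behave like N(μ, σ²/n):</p>

```r
# CLT demo: sample means of Exp(rate = 1) (mu = 1, sigma = 1)
# are approximately N(mu, sigma^2 / n) even though Exp(1) is skewed.
set.seed(42)
mu <- 1; sigma <- 1   # population mean and sd of Exp(rate = 1)
n <- 100              # size of each sample
k <- 10000            # number of samples
sample_means <- replicate(k, mean(rexp(n, rate = 1)))
mean(sample_means)    # close to mu = 1
sd(sample_means)      # close to sigma / sqrt(n) = 0.1
# roughly 95 % of sample means fall inside mu +/- 1.96 * sigma / sqrt(n)
mean(abs(sample_means - mu) <= 1.96 * sigma / sqrt(n))
```

<p>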
If we want an interval at another confidence level, e.g. 99 % or 90 %, we just need to take the corresponding <a href="http://users.stat.ufl.edu/~athienit/Tables/Ztable.pdf">Z-value</a>.</p> <h2 id="calculation">Calculation</h2> <p>Cool, we understand the concept of confidence intervals. Let’s come back to our example of comparing wages of males and females in the USA. We’ve already calculated the means, so can we just apply what we’ve learned? Practically yes, but there is one detail we need to take into account. We don’t know anything about the male or female wage distributions except our samples, so we know neither <script type="math/tex">μ</script> nor <script type="math/tex">\sigma</script>. In this case one uses the sample variance (<script type="math/tex">S^2</script>) as an approximation of the population variance (<script type="math/tex">\sigma^2</script>) and a t-statistic instead of a z-statistic. Why a t-statistic? Because the t-distribution converges to the Normal distribution when the sample size is big enough, and it has fatter tails when the sample size is small (n &lt; 30): this makes it more reliable for estimating confidence intervals and prevents underestimating them.</p> <p>The final formula for the confidence interval for <script type="math/tex">μ</script> is the following:</p> <p>\begin{equation} <br /> \bar X_n \pm \frac{t_{1-\frac{\alpha}{2}} S}{\sqrt{n}} \\ \text{where} \\ t_{1-\frac{\alpha}{2}} \text{ – t-statistic value for } 1-\frac{\alpha}{2} \text{ percent of confidence,} \\ \alpha \text{ – the tolerated error level,} \\ S \text{ – sample standard deviation.} <br /> \end{equation}</p> <p>We have rather big samples for both males and females, so in our case the t-statistic value for a 95 % confidence interval will be very close to 1.96 (see <a href="http://www.sjsu.edu/faculty/gerstman/StatPrimer/t-table.pdf">the table</a>). What’s next? 
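</p>

<p>To see why the t-distribution matters, you can compare its 97.5 % quantiles with the Normal one directly in R (the degrees of freedom below are just example values):</p>

```r
# t quantiles exceed the Normal 1.96 for small samples (fatter tails)
# and converge to it as the sample size grows.
qt(0.975, df = 9)     # n = 10: noticeably above 1.96
qt(0.975, df = 29)    # n = 30: much closer
qt(0.975, df = 1568)  # n = 1569 (our female sample): essentially the z-value
qnorm(0.975)          # 1.959964
```

<p>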
We need to calculate the sample size and the sample standard deviation of wages for both males and females:</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt; table(dt$sex)  # sample size
female   male
  1569   1725
&gt; dt[, .(std_wage=sd(wage, na.rm=TRUE)), by='sex']  # standard deviation
      sex std_wage
1: female 2.876237
2:   male 3.498861
</code></pre></div></div> <p>Finally we can calculate the confidence intervals:</p> <p>\begin{equation} <br /> μ_{female} \in [5.15 - \frac{1.96 * 2.88}{\sqrt{1569}}, 5.15 + \frac{1.96 * 2.88}{\sqrt{1569}}] \\ μ_{female} \in [5, 5.29] \\ \end{equation}</p> <p>\begin{equation} <br /> μ_{male} \in [6.31 - \frac{1.96 * 3.5}{\sqrt{1725}}, 6.31 + \frac{1.96 * 3.5}{\sqrt{1725}}] \\ μ_{male} \in [6.15, 6.48] \\ \end{equation}</p> <p>R code for these calculations (note that <code>qt</code> takes the degrees of freedom, n - 1):</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>means &lt;- dt[, .(avg_wage=mean(wage, na.rm=TRUE)), by='sex']
n &lt;- table(dt$sex)  # sample sizes
stds &lt;- dt[, .(std_wage=sd(wage, na.rm=TRUE)), by='sex']  # standard deviations
# confidence intervals (degrees of freedom = n - 1)
female_low &lt;- means[sex == 'female']$avg_wage + qt(0.025, n['female'] - 1) * (stds[sex == 'female']$std_wage / sqrt(n['female']))
female_high &lt;- means[sex == 'female']$avg_wage + qt(0.975, n['female'] - 1) * (stds[sex == 'female']$std_wage / sqrt(n['female']))
male_low &lt;- means[sex == 'male']$avg_wage + qt(0.025, n['male'] - 1) * (stds[sex == 'male']$std_wage / sqrt(n['male']))
male_high &lt;- means[sex == 'male']$avg_wage + qt(0.975, n['male'] - 1) * (stds[sex == 'male']$std_wage / sqrt(n['male']))
</code></pre></div></div> <h2 id="conclusion">Conclusion</h2> <p>Okay, now we see that the average wages of males and females really are different, because the <strong>confidence intervals do not overlap</strong>. With 95 % probability the average wage for females varies from 5 to 5.29, while for males it varies from 6.15 to 6.48. 
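</p>

<p>The same comparison can also be done in one step with Welch’s two-sample t-test (base R’s <code>t.test</code>). The sketch below runs it on simulated stand-ins matching the sample sizes, means and standard deviations reported above, to avoid re-downloading the data; on the real data the call would be <code>t.test(wage ~ sex, data = dt)</code>:</p>

```r
# Welch two-sample t-test: H0 is "equal mean wages".
# Simulated stand-ins matching the reported summary statistics.
set.seed(1)
female <- rnorm(1569, mean = 5.15, sd = 2.88)
male   <- rnorm(1725, mean = 6.31, sd = 3.50)
test <- t.test(female, male)
test$p.value   # tiny: reject the hypothesis of equal means
test$conf.int  # 95 % CI for mean(female) - mean(male); excludes zero
```

<p>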
We can conclude that on average males earned about 22.5 % more than females in the USA in 1980.</p>

Cohort Approach For Customer Lifetime Value (LTV) Calculation (2018-02-10, https://griefberg.me/how-to-calculate-ltv-cohorts)

<p>I hope you all read my previous <a href="http://griefberg.me/how-to-calculate-ltv/">post</a>, where I tried to explain the general LTV concept. Now let’s investigate how to calculate LTV if we have enough historical data (at least one year). Let’s recall the general LTV formula from the previous post:</p> <script type="math/tex; mode=display">\text{ Customer LTV = } \text{R}_0 * \text{AGMPU}_0 + \text{R}_1 * \frac{\text{AGMPU}_1}{\text{ (1 + d)}^1} + \text{ ...} + \text{R}_n * \frac{\text{AGMPU}_n}{\text{ (1 + d)}^n} + \text{ ... } \qquad \text{ (1) } \\\ \text{where} \\\ \text{d – discount rate, i.e. the interest rate you could get by putting your money in the bank } \\\ \text{R}_n \text{ – cohort retention rate in month n (e.g. 100 % in the 0th month, 35 % in the 1st month, etc.) } \\\ \text{ AGMPU}_n \text{ – average gross margin per user in month n (e.g. 5 bucks in the 0th month, 11 bucks in the 1st month, etc.) 
}</script> <p>The algorithm for calculating LTV via the cohort approach is the following:</p> <ol> <li>Calculate historical retention rates and AGMPU for cohorts (of course, if you have historical data; otherwise, use this <a href="http://griefberg.me/how-to-calculate-ltv/">approach</a>).</li> <li>Calculate average historical retention rates and AGMPU, weighted by cohort sizes.</li> <li>Fit statistical models for retention rates and AGMPU versus the cohort lifespan.</li> <li>Predict retention rates and AGMPU for the future using the created models (ideally, you will find an exponential function for the retention rate such that it goes to zero after some lifetime).</li> <li>Calculate LTV using formula (1).</li> </ol> <h3 id="historical-retention-and-agmpu">Historical retention and AGMPU</h3> <p>I use the dataframe <strong>cdnowElog</strong> from the BTYD package for my calculations (you can find the R code I wrote <a href="https://github.com/Griefberg/Griefberg.github.io/tree/master/posts_scripts/how-to-calculate-ltv-cohorts.R">here</a>). 
After the first manipulations the data looks like this:</p> <table> <thead> <tr> <th style="text-align: left">cust</th> <th style="text-align: left">date</th> <th style="text-align: left">price</th> <th style="text-align: left">birth</th> <th style="text-align: left">period</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">1</td> <td style="text-align: left">1997-01-01</td> <td style="text-align: left">29.33</td> <td style="text-align: left">1997-01-01</td> <td style="text-align: left">0</td> </tr> <tr> <td style="text-align: left">1</td> <td style="text-align: left">1997-01-18</td> <td style="text-align: left">29.73</td> <td style="text-align: left">1997-01-01</td> <td style="text-align: left">0</td> </tr> <tr> <td style="text-align: left">1</td> <td style="text-align: left">1997-08-02</td> <td style="text-align: left">14.96</td> <td style="text-align: left">1997-01-01</td> <td style="text-align: left">7</td> </tr> <tr> <td style="text-align: left">…</td> <td style="text-align: left">…</td> <td style="text-align: left">…</td> <td style="text-align: left">…</td> <td style="text-align: left">…</td> </tr> </tbody> </table> <p><strong>Cust</strong>, <strong>date</strong> and <strong>price</strong> were in the initial dataset. What I did was calculate the <strong>birth</strong> field (the starting month of the cohort each customer belongs to) and the <strong>period</strong> field (how many months have passed since the customer’s first order). Then I did some GroupBy work and got all the historical values I needed (reminder: you can find the whole code <a href="https://github.com/Griefberg/Griefberg.github.io/tree/master/posts_scripts/how-to-calculate-ltv-cohorts.R">here</a>). 
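</p>

<p>The birth/period computation can be sketched with data.table along these lines (a toy event log; the field names match the table above, the rest is my assumption about the actual script):</p>

```r
# Toy event log with the post's fields: cust, date, price
library(data.table)
elog <- data.table(
  cust  = c(1, 1, 1, 2, 2),
  date  = as.Date(c('1997-01-01', '1997-01-18', '1997-08-02',
                    '1997-02-05', '1997-03-10')),
  price = c(29.33, 29.73, 14.96, 10.00, 12.50)
)
# birth: the first-order month of the cohort the customer belongs to
elog[, birth := min(format(date, '%Y-%m-01')), by = cust]
# period: whole calendar months elapsed since the birth month
elog[, period := 12 * (year(date) - year(as.Date(birth))) +
                 (month(date) - month(as.Date(birth)))]
elog[cust == 1, period]  # periods 0, 0 and 7 for customer 1
```

<p>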
The first 3 rows look like this:</p> <table> <thead> <tr> <th style="text-align: left">birth</th> <th style="text-align: left">period</th> <th style="text-align: left">retained_users</th> <th style="text-align: left">revenue</th> <th style="text-align: left">orders</th> <th style="text-align: left">cohort_size</th> <th style="text-align: left">retention</th> <th style="text-align: left">agmpu</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">1997-01-01</td> <td style="text-align: left">0</td> <td style="text-align: left">781</td> <td style="text-align: left">28592.70</td> <td style="text-align: left">1.13</td> <td style="text-align: left">781</td> <td style="text-align: left">1.00</td> <td style="text-align: left">10.98</td> </tr> <tr> <td style="text-align: left">1997-01-01</td> <td style="text-align: left">1</td> <td style="text-align: left">124</td> <td style="text-align: left">7003.73</td> <td style="text-align: left">1.45</td> <td style="text-align: left">781</td> <td style="text-align: left">0.16</td> <td style="text-align: left">16.94</td> </tr> <tr> <td style="text-align: left">1997-01-01</td> <td style="text-align: left">2</td> <td style="text-align: left">95</td> <td style="text-align: left">4241.64</td> <td style="text-align: left">1.38</td> <td style="text-align: left">781</td> <td style="text-align: left">0.12</td> <td style="text-align: left">13.39</td> </tr> <tr> <td style="text-align: left">…</td> <td style="text-align: left">…</td> <td style="text-align: left">…</td> <td style="text-align: left">…</td> <td style="text-align: left">…</td> <td style="text-align: left">…</td> <td style="text-align: left">…</td> <td style="text-align: left">…</td> </tr> </tbody> </table> <p>At this point I have historical data for all cohorts:</p> <ul> <li><strong>Retained users</strong> – how many members of the initial cohort made an order in the i-th period.</li> <li><strong>Revenue</strong> – total revenue from all cohort members.</li> 
<li><strong>Orders</strong> – average number of orders per cohort member in the i-th period.</li> <li><strong>Cohort size</strong> – initial cohort size.</li> <li><strong>Retention</strong> – proportion of cohort members who made an order in the i-th period.</li> <li><strong>AGMPU</strong> – average gross margin per user in the i-th period (for a more detailed definition, see my previous <a href="http://griefberg.me/how-to-calculate-ltv/">post</a>).</li> </ul> <h3 id="average-historical-retention-rates-and-agmpu">Average historical retention rates and AGMPU</h3> <p>At this stage we average, by period, everything we got in the previous step (weighted by cohort size, to give more weight to big cohorts):</p> <table> <thead> <tr> <th style="text-align: left">period</th> <th style="text-align: left">avg.retention</th> <th style="text-align: left">avg.agmpu</th> <th style="text-align: left">avg.orders</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">0</td> <td style="text-align: left">1.00</td> <td style="text-align: left">12.07</td> <td style="text-align: left">1.17</td> </tr> <tr> <td style="text-align: left">1</td> <td style="text-align: left">0.15</td> <td style="text-align: left">15.24</td> <td style="text-align: left">1.39</td> </tr> <tr> <td style="text-align: left">2</td> <td style="text-align: left">0.12</td> <td style="text-align: left">13.17</td> <td style="text-align: left">1.30</td> </tr> <tr> <td style="text-align: left">…</td> <td style="text-align: left">…</td> <td style="text-align: left">…</td> <td style="text-align: left">…</td> </tr> </tbody> </table> <h3 id="statistical-models-for-retention-rates-and-agmpu">Statistical models for retention rates and AGMPU</h3> <p>Here we reach the most interesting part: modeling the average retention and average AGMPU (I’m not touching average orders here; I keep it just to show that we can calculate many different metrics by cohort). 
What we need to do now is investigate how the average retention and average AGMPU behave relative to the cohort period and find the mathematical function that explains them best. First, let’s create two plots:</p> <p><img src="https://griefberg.me/assets/images/how-to-calculate-ltv-cohorts/historical_values.png" alt="Historical values" /></p> <p>This data doesn’t look perfect, I know. But I can still see patterns common to these metrics: some kind of exponential decay of the retention and growth of the AGMPU over the cohort lifetime. And it seems quite logical to me: most users who try your service once will stop using it after the first try, while the users who keep using it will likely pay more because they use it more.</p> <p>The next important step is to select appropriate mathematical functions to fit these variables. For retention the answer is quite standard: people usually use an exponential decay. I personally use a mix of two exponential decays:</p> <script type="math/tex; mode=display">f(period) = a * e^{-\text{b } * \text{ period}} + (1 - a) * e^{-\text{c } * \text{ period}}</script> <p>How to fit AGMPU is a less obvious question. 
As for me, I use the following function (you could also use ln(x)):</p> <script type="math/tex; mode=display">f(period) = \frac{\text{a} * \text{period + b}}{\text{period + c}}</script> <p>When you’ve chosen appropriate functions, it’s time to estimate the constants a, b, c by <a href="https://www.youtube.com/watch?v=E1XzT619Eug">minimizing the sum of squared errors</a> between the actual target values and the predicted ones:</p> <script type="math/tex; mode=display">\sum_{i=1}^{N} (y_i - f(period_i))^2 \rightarrow \min\limits_{a, b, c} \\\ \text{where} \\\ \text{y – actual metric values (retention or AGMPU)} \\\ \text{f – predicted metric values (retention or AGMPU)}</script> <p>If this sounds complicated, just imagine that we try to decrease the error sum above by trying different combinations of the constants a, b, c.</p> <p>After finding the optimal a, b, c and plotting the resulting predictions, we get:</p> <p><img src="https://griefberg.me/assets/images/how-to-calculate-ltv-cohorts/predicted_values.png" alt="Predicted values" /></p> <h3 id="retention-rates-and-agmpu-prediction">Retention rates and AGMPU prediction</h3> <p>Now that we’ve fitted retention and AGMPU, let’s choose the period we’re going to predict LTV for. Often this is quite clear: you predict your retention for the next 5-10 years, and usually you can see that the retention rate gets very close to zero at some point. In that case we simply cut off the LTV period where the retention rate first touches values close to zero. If we predict our retention for 5-10 years and see that the cohort doesn’t decay completely, then we need to decide how to limit the LTV calculation period. 
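</p>

<p>The least-squares fit described above can be sketched with base R’s <code>optim</code>; the retention values below are made up for illustration, not taken from the dataset:</p>

```r
# Fit f(t) = a*exp(-b*t) + (1-a)*exp(-c*t) to a retention curve
# by minimizing the sum of squared errors (illustrative data).
period    <- 0:10
retention <- c(1.00, 0.15, 0.12, 0.10, 0.09, 0.08,
               0.07, 0.065, 0.06, 0.055, 0.05)
f <- function(par, t) par[1] * exp(-par[2] * t) + (1 - par[1]) * exp(-par[3] * t)
sse <- function(par) sum((retention - f(par, period))^2)
start <- c(a = 0.5, b = 1, c = 0.1)   # rough initial guess
fit <- optim(start, sse)              # Nelder-Mead minimization of the SSE
fit$par                               # fitted a, b, c
```

<p>Note that this functional form pins f(0) = a + (1 - a) = 1 by construction, matching 100 % retention in period 0.</p>

<p>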
It should suit our calculation goal; often it’s enough just to know when the cohort becomes profitable.</p> <p>In the case of my dataset, I can see that the cohort dies out after 5 years:</p> <p><img src="https://griefberg.me/assets/images/how-to-calculate-ltv-cohorts/five_years_values.png" alt="5Y Predicted values" /></p> <h2 id="ltv-calculation">LTV calculation</h2> <p>Finally we’re ready to calculate LTV. We just sum the predicted AGMPU, multiplied by the retention rate and discounted by the discount rate, over every period of the cohort’s life (in my case 5 years). As a result I got that an average user from my dataset brings <strong>$40.06</strong> of lifetime profit to the company.</p> <p>How can you use this result?</p> <ul> <li>Check that your customer acquisition cost (CAC) is not higher than your customer LTV. It’s common to say that your LTV:CAC ratio should be at least 3:1. But if it’s not lower than 1:1, then at least you’re not completely wasting your money.</li> <li>Calculate LTV by marketing channel. You may find that some of your campaigns are just useless while others really attract wealthy customers.</li> <li>Check the period in which your customers’ cumulative returns (used for the LTV calculation) first exceed CAC. This is your payback period.</li> </ul> <p>That’s all I wanted to share. If you have any questions or corrections, feel free to write a comment.</p>

Understanding Customer Lifetime Value (LTV) (2018-01-27, https://griefberg.me/how-to-calculate-ltv)

<p>There are plenty of articles about LTV that give a general overview of the term. However, if you want to go deeper, it can be quite difficult to understand what’s behind these fancy formulas. 
That’s why, after reading 10+ articles about LTV, I decided to aggregate all the acquired knowledge in this post.</p> <h1 id="what-is-ltv">What is LTV?</h1> <p><strong>LTV</strong> is the sum of all returns a company expects to get from a customer during their current and future relationship. The easiest way to understand this concept is to look at customer cohorts. The term originates from demography and generally means people who performed some action during some time period (for example, people who married in 2015). We could also define a cohort as all customers who made their first orders within a given month. So there are cohorts of December 2017, January 2018, etc. Now, if you want to get a cohort’s lifetime value, you need to sum all expected returns from it:</p> <p>\begin{equation} \text{Cohort LTV} = \text{Cohort Gross Margin in Month 0} + … + \text{Cohort Gross Margin in Month N + …} \end{equation}</p> <p>Sure, you could sum revenue flows, but it’s more appropriate to use gross margin:</p> <p>\begin{equation} \text{ Gross Margin = Revenue } * \text { Gross Margin % = Revenue } * \frac{\text{Revenue - COGS}}{\text{Revenue}} \end{equation}</p> <p>You can see the formula for gross margin above, but what’s COGS? It is the <a href="https://en.wikipedia.org/wiki/Cost_of_goods_sold">cost of goods sold</a>. For example, your gross margin could be just the sum of the commissions your company gets from every order.</p> <p>If you look at the Cohort LTV formula above, the whole point is that we can’t compute LTV exactly; we can only predict it under varying aggregation assumptions. We can simplify it (and lose some accuracy) in the following way:</p> <p>\begin{equation} \text{ Cohort LTV = Cohort Size } * \text{R}_0 * \text{AGMPU}_0 + \text{Cohort Size} * \text{R}_1 * \text{AGMPU}_1 + \text{ …} \\ \text{where} \\ \text{Cohort Size – initial cohort size (e.g. 
100 people who made their first order in some month; constant) } \\ \text{R}_n \text{ – cohort retention rate in month n (e.g. 100 % in the 0th month, 35 % in the 1st month, etc.) } \\ \text{ AGMPU}_n \text{ – average gross margin per user in month n (e.g. 5 bucks in the 0th month, 11 bucks in the 1st month, etc.) } \end{equation}</p> <p>Sometimes people discount it to get the present value of future revenue (yes, 100 dollars next year ≠ 100 dollars today):</p> <p>\begin{equation} \text{ Cohort LTV = Cohort Size } * \text{R}_0 * \text{AGMPU}_0 + \text{Cohort Size} * \text{R}_1 * \frac{\text{AGMPU}_1}{\text{ (1 + d)}} + \text{ …} \\ \text{where} \\ \text{d – discount rate, i.e. the interest rate you could get by putting your money in the bank } \\ \text{ (by the way, we don’t discount the cash flow of the 0th month) } \end{equation}</p> <p>The final step is to drop the cohort terminology and talk about a single customer. Just remove Cohort Size from the formula and we get Customer Lifetime Value:</p> <p>\begin{equation} \text{ Customer LTV = } \text{R}_0 * \text{AGMPU}_0 + \text{R}_1 * \frac{\text{AGMPU}_1}{\text{ (1 + d)}^1} + \text{ …} + \text{R}_n * \frac{\text{AGMPU}_n}{\text{ (1 + d)}^n} + \text{ … } \qquad \text{ (1) } \end{equation}</p> <p>Understanding <script type="math/tex">R_0, R_1</script> may be unclear without the cohort context; just think of them as a kind of probability that a user will bring revenue in the n-th period of their customer life cycle.</p> <p>So, cool! We got it! Let’s move to the more practical side. 
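</p>

<p>Formula (1) translates directly into a few lines of R; the inputs below are the toy numbers from the formula’s comments (100 % retention in month 0, 35 % in month 1, 5 and 11 bucks of AGMPU), not real data:</p>

```r
# Customer LTV = sum over periods n of R_n * AGMPU_n / (1 + d)^n
ltv <- function(retention, agmpu, d) {
  n <- seq_along(retention) - 1  # periods 0, 1, 2, ...
  sum(retention * agmpu / (1 + d)^n)
}
# Toy inputs over a four-month horizon
ltv(retention = c(1, 0.35, 0.20, 0.10),
    agmpu     = c(5, 11, 12, 12),
    d         = 0.02)   # ~12.21
```

<p>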
Technically, I see two ways of calculating LTV in practice, depending on the accuracy we want to achieve and the data we have:</p> <ol> <li><strong>A simple but less accurate approach (more aggregation)</strong>: assume that R and AGMPU are <strong>constant</strong> over a customer’s lifetime</li> <li><strong>A less simple but more accurate approach (less aggregation)</strong>: assume that R and AGMPU are <strong>not constant</strong> over a customer’s lifetime</li> </ol> <h2 id="a-simple-but-less-accurate-approach">A simple but less accurate approach</h2> <p>Okay, let’s make the following assumptions:</p> <ul> <li><strong>AGMPU</strong> is constant: every month a customer brings us the same amount of revenue (we can calculate it on historical data or just assume some amount)</li> <li><strong>R</strong> is constant: each month R % of last month’s customers continue using the service, or simply put, each month a constant percentage (1 - R) of customers churn (we can calculate it on historical data or assume it)</li> </ul> <p>Under these assumptions, we get the following:</p> <p>\begin{equation} \text{ Customer LTV = } \text{AGMPU} + \text{R}^1 * \frac{\text{AGMPU}}{\text{(1 + d)}^1} + \text{ …} + \text{R}^n * \frac{\text{AGMPU}}{\text{ (1 + d)}^n} + \text{ … = } \sum_{i=0}^∞ \text{R}^i * \frac{\text{AGMPU}}{\text{ (1 + d)}^i} \qquad \text{ (2) } \end{equation}</p> <p>It means:</p> <ul> <li>In month 0 a customer brings us just AGMPU</li> <li>In month 1 a customer brings us discounted AGMPU with probability R</li> <li>In month 2 a customer brings us twice-discounted AGMPU with probability R <script type="math/tex">*</script> R (e.g. if 95 % of users were retained in month 1, then 90.25 % of them will remain in month 2)</li> <li>In month 3 …</li> </ul> <p>Now look at the final formula (2) again. Wait, wait, wait. Something very familiar… Damn, this is a <a href="https://en.wikipedia.org/wiki/Geometric_series">geometric series</a>! 
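</p>

<p>Before working out the closed form, it’s easy to check numerically that a long truncated sum of this series matches the textbook geometric-series result a₁/(1 − q); the sketch below uses illustrative values (AGMPU = 50, R = 0.95, d = 0.02):</p>

```r
# Compare a long truncated sum of formula (2) with the
# geometric-series closed form AGMPU / (1 - R/(1+d)).
agmpu <- 50; r <- 0.95; d <- 0.02
i <- 0:10000                                # effectively infinite horizon
series_sum  <- sum(r^i * agmpu / (1 + d)^i)
closed_form <- agmpu / (1 - r / (1 + d))
series_sum   # ~728.57
closed_form  # ~728.57
```

<p>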
The common ratio is <script type="math/tex">\frac{\text{R}}{\text{ (1 + d)}}</script>, while AGMPU is the first term of the series. It means that we can use the standard formula to calculate the total sum of (2):</p> <p>\begin{equation} \text{ Customer LTV = } \frac{\text{AGMPU}}{1 -\frac{\text{R}}{\text{1 + d}}} \text{ = } \frac{\text{AGMPU} * \text{(1+d)}}{\text{1 + d} - \text{R}} \qquad \text{ (3) } \end{equation}</p> <p>If we didn’t discount, we would get the following very common formula:</p> <p>\begin{equation} \text{ Customer LTV = } \frac{AGMPU}{1-R} = \frac{AGMPU}{\text{churn rate}} \end{equation}</p> <p>The manipulations above can also be explained via an <a href="https://en.wikipedia.org/wiki/Exponential_decay">exponential decay constant</a>, but for me the geometric series is the clearest way. However, if you start reading articles about LTV, you may find that some authors just mention it without any explanation.</p> <p>Let’s look at an example. Imagine that:</p> <ul> <li>the discount rate equals 2 % (the current US discount rate)</li> <li>every month your company loses 5 % of its old customers, so the retention rate equals 95 %</li> <li>your average monthly gross margin per user equals $50</li> </ul> <p>Then your customer LTV will be the following:</p> <p>\begin{equation} \text{ Customer LTV = } \frac{$50 * (1+0.02)}{(1 + 0.02) - 0.95} = $728.57 \end{equation}</p> <p>Other useful recommendations regarding approach (3):</p> <ul> <li><a href="http://tomtunguz.com/churn-fallacies/">Some people</a> also multiply the calculated LTV by a factor (e.g. 0.75), because the churn rate can turn out higher in reality.</li> <li>If you’re calculating LTV for SaaS, look <a href="http://www.forentrepreneurs.com/ltv/">here</a>. 
The logic is the same; you just need to make more assumptions.</li> <li>Use approach (3) if you are short of time or data.</li> </ul> <h2 id="a-less-simple-but-more-accurate-approach">A less simple but more accurate approach</h2> <p>The point of the more complicated approach is to take formula (1) as it is, without assuming that the retention rate and AGMPU are constants. The algorithm is the following:</p> <ol> <li>Calculate historical retention rates and AGMPU for cohorts (if you have historical data; otherwise, use the first approach above).</li> <li>Calculate average historical retention rates and AGMPU, weighted by cohort sizes.</li> <li>Fit statistical models for retention rates and AGMPU versus the cohorts’ lifespan.</li> <li>Predict retention rates and AGMPU for the future using the created models (ideally, you will find an exponential function for the retention rate such that it goes to zero after some lifetime).</li> <li>Calculate LTV using formula (1).</li> </ol> <p>Yes, it sounds a bit complicated, but it really isn’t. I will explain this approach in more detail in my <a href="https://griefberg.me/how-to-calculate-ltv-cohorts/">next post</a>. Tschüss!</p>