
A Matter of Probability

November 21, 2010

I was reading an article in Information Week on some scary security thing, and I got to the one and only comment on the post:

Most Individuals and Orgs Enjoy "Security" as a Matter of Luck

Comment by janice33rpm Nov 16, 2010, 13:24 PM EST

I know the perception: there are so many opportunities to, well, improve our security that people think it’s a miracle a TJX-style breach hasn’t happened to them a hundred times over, and that it’s only a matter of time.  But the breach data paints a different story than “luck”.

As I thought about it, that word “luck” got stuck in my brain like some bad ’80s tune mentioned on Twitter.  I started to question what “lucky” really meant.  People who win while gambling could be “lucky”; lottery winners are certainly “lucky”.  Let’s assume that lucky means beating the odds for some favorable outcome, and unlucky means defying the odds for an unfavorable one.  If my definition is correct, then the statement in the comment is a paradox.  “Most” of anything cannot be lucky: if most people who played poker won, then it wouldn’t be lucky to win, it would just be unlucky to lose.  But I digress.

I wanted to understand just how “lucky” or “unlucky” companies are as far as security goes, so I did some research.  According to Wolfram Alpha there are just over 23 million businesses in the 50 United States, and I consider being listed in something like datalossDB.org a measurement of “not enjoying security” (security fail).  Using the three years 2007-2009, I pulled the number of unique businesses from the year-end reports on datalossDB.org (321, 431 and 251).  That means a registered US company has about a 1 in 68,000 chance of ending up on datalossDB.org in a given year.  I would not call those not listed “lucky”; that would be like saying someone is “lucky” if they don’t get a straight flush dealt to them in 5-card poker (about a 1 in 65,000 chance of that).
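For anyone who wants to check the arithmetic, here is a minimal sketch of it. It assumes the 1 in 68,000 figure comes from averaging the three yearly counts and dividing the rounded 23 million by that average; the names `one_in`, `US_BUSINESSES` and `LISTED_PER_YEAR` are made up for illustration. The poker comparison counts all 40 straight flushes, royal flushes included.

```python
from math import comb

# Figures quoted in the post; 23 million is the rounded Wolfram Alpha count.
US_BUSINESSES = 23_000_000
LISTED_PER_YEAR = [321, 431, 251]  # unique businesses on datalossDB.org, 2007-2009


def one_in(population, events):
    """Return N such that the chance is roughly '1 in N'."""
    return population / events


# Assumption: the post averaged the three yearly counts.
avg_listed = sum(LISTED_PER_YEAR) / len(LISTED_PER_YEAR)
breach_odds = one_in(US_BUSINESSES, avg_listed)

# Straight flushes in 5-card poker: 10 possible high cards x 4 suits = 40 hands
# (this count includes the 4 royal flushes).
straight_flush_odds = one_in(comb(52, 5), 40)

print(f"listed on datalossDB: about 1 in {breach_odds:,.0f}")
print(f"straight flush:       about 1 in {straight_flush_odds:,.0f}")
```

Both results land in the same neighborhood, which is the whole point of the comparison.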

But this didn’t sit right with me.  That was a whole lot of companies, and most of them could just be companies on paper and not be on the internet.  I turned to the IRS tax stats, which showed that in 2007, 5.8 million companies filed returns.  Of those, about 1 million listed zero assets, meaning they are probably not on the internet in any measurable way.  Now we have a much more realistic number: 4,852,748 businesses in 2007 listed some assets to the IRS.  If we assume that all the companies in dataloss DB file a return, then there is a 1 in 14,471 chance for a US company to suffer a PII breach in a year (and be listed in the dataloss DB).

Let’s put this in perspective, based on the odds in a year of a US company with assets appearing on dataloss DB being 1 in 14,471:

Aside from being really curious what counts as a grooming device, I didn’t want to stop there, so let’s remove a major chunk of companies whose reported assets were under $500,000.  3.8 million companies listed less than $500k in their returns to the IRS in 2007, so that leaves 982,123 companies in the US with assets over $500k.  I am just going to assume that those “small” companies aren’t showing up in the dataloss stats.
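To see how much each narrowing of the denominator moves the odds, here is a rough sketch that holds the numerator fixed and swaps in the three company counts. It assumes the numerator is the plain average of the three datalossDB counts (321, 431 and 251); `one_in` and the dictionary keys are illustrative names only. With that assumption the results come out within a percent of the 1 in 14,471 and 1 in 2,928 figures, so the numerator actually used for those was presumably slightly different.

```python
from statistics import mean

LISTED_PER_YEAR = [321, 431, 251]  # datalossDB.org unique businesses, 2007-2009

# Denominators quoted in the post.
DENOMINATORS = {
    "all businesses": 23_000_000,   # Wolfram Alpha, rounded
    "some assets": 4_852_748,       # IRS 2007 returns listing non-zero assets
    "assets over $500k": 982_123,   # IRS 2007 returns with assets over $500k
}


def one_in(population, events_per_year):
    """Return N such that the annual chance is roughly '1 in N'."""
    return population / events_per_year


# Assumption: the same average numerator applies to every denominator,
# i.e. every listed company sits inside even the smallest group.
avg_listed = mean(LISTED_PER_YEAR)
odds = {label: one_in(count, avg_listed) for label, count in DENOMINATORS.items()}

for label, value in odds.items():
    print(f"{label:<20} about 1 in {value:,.0f} per year")
```

The numerator never changes here, so the odds tighten purely because the pool of companies shrinks.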

Based on being a US Company with over $500,000 in assets and appearing in dataloss DB at least once (1 in 2,928):

Therefore, I think it’s paradoxically safe to say:

Most Individuals do not participate in a non-traditional triathlon as a Matter of Luck.

Truth is, it all comes down to probability, specifically the probability of a targeted threat event occurring.  In spite of that threat event being driven by an adaptive adversary, the actions of people occur with some measurable frequency.   The examples here are pretty good at explaining this point.  Crimes are committed by adaptive adversaries as well, and we can see that about one out of every 2,500 Hispanic females 12 or older will experience a loss event from purse-snatching or pickpocketing per year.  In spite of being able to make conscious decisions, those adversaries commit these actions with astonishing predictability.  Let’s face it, while there appears to be randomness in why everyone hasn’t been pwned to the bone, the truth is in the numbers and it’s all about understanding the probability.

  1. patrick florer
    November 22, 2010 at 8:34 am

    Interesting, Jay –

    But –

    To calculate a meaningful probability you need a numerator and a denominator that you believe.

    I can believe the denominator from Wolfram Alpha – 23 million companies, and the process by which you reduce this to approx. 4.7 million seems sound to me.

    It’s the numerator that’s the problem – while the information in datalossdb is invaluable, it’s also quite biased – only those companies with publicly reported breaches are listed.

    We all know/suspect that what datalossdb contains is just the tip of the iceberg.

    Therefore, the numerator of breached companies could be off by one or more orders of magnitude, which would change the probability quite a bit.

    Using your calculation, 1 in 14K becomes 1 in 1.4K at one order of magnitude, and 1 in 140 at two orders of magnitude.
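The scaling in this comment is easy to reproduce. The sketch below is only an illustration: `adjusted_one_in` and `undercount_factor` are hypothetical names, and the factors of 10 and 100 are the commenter's what-if values, not measured undercounts.

```python
BASE_ONE_IN = 14_471  # annual odds from the post, companies reporting some assets


def adjusted_one_in(base_one_in, undercount_factor):
    """Odds if the true number of breached companies is
    undercount_factor times the number listed in datalossdb."""
    if undercount_factor <= 0:
        raise ValueError("undercount_factor must be positive")
    return base_one_in / undercount_factor


for factor in (1, 10, 100):
    odds = adjusted_one_in(BASE_ONE_IN, factor)
    print(f"undercount x{factor:<3} -> about 1 in {odds:,.0f} ({1 / odds:.4%} per year)")
```

Multiplying the numerator by a factor divides the "1 in N" figure by the same factor, which is all the order-of-magnitude argument needs.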

    The other problem, it seems to me, is that some companies are much more lucrative targets, and consequently have a much higher probability of being attacked.
    In this age of increasingly sophisticated attack methods, and automation of attacks, it’s very hard for me to correlate any meaningful frequency number with anything.

    Just some thoughts –
    Best regards,
    Patrick Florer

  2. November 22, 2010 at 9:34 am

    @Patrick – all great points. I was very careful to say that those stats were only about appearing in the dataloss db; I didn’t want to make the leap that it was the only record of breaches.
    About 3/4 of the way through I thought of adjusting to account for other sources in the numerator, like the Verizon DBIR and USSS data, which I think was like 900 attacks in a year. And then do some rough estimations about companies who don’t report, etc. But then I wouldn’t be able to mention the accidents by grooming devices. It was an editorial decision.
    I think it would be interesting to spend more time on it and come up with at least a plausible estimation of the probability of a loss event, but as you pointed out it is difficult/problematic to treat all companies as equal targets.
    Thanks for the thoughts!

  3. Patrick Florer
    November 22, 2010 at 10:07 am

    @Jay –

    It’s interesting that datalossdb would come up today –

    I download datalossdb every now and then, most recently last Friday (the MySQL version), and was prompted by your post to do a bit of analytics.

    (BTW – I am a financial contributor to the datalossdb project – just a small one @$20/mo – I encourage everyone who uses these data and maybe takes these folks for granted to contribute a little something now and then!)

    Here is the current state of things, as of the datalossdb update on 11/17/2010:

    Please note that I have factored out the 948 rows where no number of records has been reported – 30% missing values is about right for what I have seen over the past couple of years. For those who may be new to datalossdb, zero records are reported when either: 1) the number has not been reported; or 2) the number cannot be estimated.

    Please also note that, even when the number of records has been reported, in most cases it is an estimate – the bad guys don’t usually leave a note that says “Thanks for the 121,713 records”.

    Total Rows: 3,096
    Rows where records = 0: 948 (30.6%)
    Rows where records > 0: 2,149 (69.4%)

    Statistics based on non-zero records lost:
    Average: 375,180
    Median: 2,500

    Percentiles:
    1st: 2
    5th: 20
    10th: 57
    20th: 200
    50th: 2,500
    70th: 12,000
    80th: 34,863
    90th: 120,000
    95th: 314,000
    99th: 4,000,000

    Minimum records: 1
    Maximum records: 130,000,000 (Heartland)

    As you can see, the data are extremely skewed, with most incidents on the low side – just look at the difference between the mean and the median! And 95% of incidents with a reported count involve fewer than 314K records.

    I am also doing some work to see if I can fit a meaningful parametric or non-parametric distribution to these data, but so far, no real luck.
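As one possible starting point for that kind of fit, here is a minimal sketch that quantile-matches a lognormal to the percentiles listed above and compares what it implies against the reported figures. The lognormal is purely an assumption for illustration, not something the comment says was tried, and matching on the median plus the 10th and 90th percentiles is a crude shortcut compared with a proper fit on the raw rows. The helper names (`fit_lognormal`, `lognormal_quantile`) are hypothetical.

```python
from math import exp, log
from statistics import NormalDist

# Percentiles of records lost per incident, as reported in the comment.
REPORTED = {1: 2, 5: 20, 10: 57, 20: 200, 50: 2_500, 70: 12_000,
            80: 34_863, 90: 120_000, 95: 314_000, 99: 4_000_000}
REPORTED_MEAN = 375_180

Z = NormalDist()  # standard normal, used only for its quantile function


def fit_lognormal(percentiles, low=10, high=90):
    """Quantile-match a lognormal: mu from the median,
    sigma from the spread between the low and high percentiles."""
    mu = log(percentiles[50])
    spread = log(percentiles[high]) - log(percentiles[low])
    sigma = spread / (Z.inv_cdf(high / 100) - Z.inv_cdf(low / 100))
    return mu, sigma


def lognormal_quantile(mu, sigma, pct):
    """Value at the given percentile of a lognormal(mu, sigma)."""
    return exp(mu + sigma * Z.inv_cdf(pct / 100))


mu, sigma = fit_lognormal(REPORTED)
implied_mean = exp(mu + sigma ** 2 / 2)  # mean of a lognormal

for pct, reported in REPORTED.items():
    fitted = lognormal_quantile(mu, sigma, pct)
    print(f"{pct:>2}th  reported {reported:>10,}  lognormal {fitted:>12,.0f}")
print(f"mean  reported {REPORTED_MEAN:>10,}  lognormal {implied_mean:>12,.0f}")
```

Comparing the two columns, and the implied mean against the reported average, gives a quick read on whether a single lognormal can carry both the body and the tail of these data.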


