Re: [Asrg] Unique innovations made to anti-spam system

Richard Clayton <richard@highwayman.com> Mon, 23 January 2006 20:30 UTC

Message-ID: <j97rJ1iUyT1DFAYH@highwayman.com>
Date: Mon, 23 Jan 2006 20:29:08 +0000
To: asrg@ietf.org
From: Richard Clayton <richard@highwayman.com>
Subject: Re: [Asrg] Unique innovations made to anti-spam system
References: <cb84d2fe0601212215i5094f589leef0e29026d5cdcd@mail.gmail.com> <20060122142951.86098.qmail@simone.iecc.com> <cb84d2fe0601220733k592b1e5dn5fedb2035490a403@mail.gmail.com> <1060122182623.ZM26383@candle.brasslantern.com> <cb84d2fe0601221217n60347477i1cc7a4a52a3449a4@mail.gmail.com> <1060122220629.ZM26661@candle.brasslantern.com> <cb84d2fe0601221505r7562a9c4o4ff39785c23386b3@mail.gmail.com>
In-Reply-To: <cb84d2fe0601221505r7562a9c4o4ff39785c23386b3@mail.gmail.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

In article <cb84d2fe0601221505r7562a9c4o4ff39785c23386b3@mail.gmail.com>,
Michael Kaplan <michaelkaplanasrg@gmail.com> writes

>    On 1/22/06, Bart Schaefer <schaefer@brasslantern.com> wrote: 
>>       On Jan 22,  3:17pm, Michael Kaplan wrote:

>>       Many reputable businesses send very large volumes of email.  If
>>       it is economically infeasible for spammers to decode the
>>       CAPTCHAs, why do you believe it will be feasible for other
>>       businesses?

Is it infeasible ?

Where is the evidence ?  I suggest spammers don't decode CAPTCHAs
because they are not yet widely employed... so there's no point.

As it happens, I think they are missing out unnecessarily...  I think
the main difficulty in dealing with CAPTCHAs is more the wide range of
systems offering them, rather than an inherent difficulty in solving
what is on offer today.

I've recently been receiving a lot of C-R response email (the Pharmacy
guys seem to like using my domain for their junk)...  and so I have
started looking at how easy it would be to process automatically.

   I'm currently corresponding with a handful of C-R users who object to
   my responding to the challenges ... apparently they don't think I'm
   behaving myself in arranging for them to read Pharmacy spam which
   they are too lazy to filter for themselves :( One even reported me to
   my own abuse@ address !

Anyway, a lot of the C-R's I am currently receiving merely require 3rd
Grade reading skills and the ability to reply to the email.  These could
be trivially automated since there is no perceptible variation in the
text that is presented :(

Some websites provide the challenge as text embedded in the page -- and
that is ever so easy to move to the POST response.

Most websites provide simple images that are trivial to process (there
are several other researchers breaking the trivial ones on a regular
basis, try Google -- at least one of the breakers is selling a service
to soup up your CAPTCHAs using the knowledge they've got from breaking
others).

There was a paper at last year's CEAS showing that the hard part of
breaking text CAPTCHAs was the glyph separation -- after that computers
were better than humans at distinguishing mangled shapes!

Strong CAPTCHAs are currently the exception. However, if I was going to
process a lot of them I think I'd automate as far as possible and then
spend my money in the Third World...

But first don't forget to allow for stupidity -- a large ISP [from which
I have received several hundred C-R emails] has some pretty strong
looking text-based CAPTCHAs ... unfortunately they only have 30 of them,
so it's easy to provide a dictionary of responses :-(  I'm currently
trying to reverse-engineer how they select amongst the 30 because that
would make it even quicker to respond !
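
As a rough illustration of why a pool of only 30 images is so weak, the
sketch below estimates how many challenges have to be seen before every
image in the pool has turned up at least once. It assumes each challenge
is drawn uniformly at random from the pool, which is an assumption made
here for illustration (how the ISP actually selects amongst the 30 is
not known); the function name is likewise illustrative.

```python
from fractions import Fraction

POOL_SIZE = 30  # from the text: the ISP only has 30 distinct CAPTCHAs


def expected_challenges_to_see_all(pool_size: int) -> float:
    """Coupon-collector expectation: n * (1 + 1/2 + ... + 1/n).

    ASSUMPTION: each challenge is drawn uniformly at random from the pool.
    """
    harmonic = sum(Fraction(1, k) for k in range(1, pool_size + 1))
    return float(pool_size * harmonic)


print(f"{expected_challenges_to_see_all(POOL_SIZE):.0f} challenges on average")
```

Under that assumption the answer is about 120 challenges, well below the
several hundred C-R emails mentioned above.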

[BTW Kaplan's website has most of this information (though not the story
about the ISP with only 30 images). I also note that his CAPTCHAs are
not text based. I'd need to do some more work to comment as to whether
his stick-figures are genuinely harder to solve. They looked as if they
made some cultural assumptions that might not travel well.]

>    On my website I assume that the spammer would spend a tenth of a 
>    cent to manually decode a CAPTCHA and I demonstrate how this would 
>    be a crippling expense.

Just to be clear -- the tenth of a cent is the right sort of number.

A primary (grades 1-4) school headmistress in Tamil Nadu (in rural
India) earns about $15 a week and is solidly in the middle classes.

The particular person I was told of (a colleague's relation) owned a nice
home (worth maybe $6000) in a salubrious leafy suburb.

So one could get appropriate skills for about $10 or so a week [labour
rates are higher for towns with broadband].  For a 50 hour week that
means you're paying about 20 cents an hour.

I've never tried solving CAPTCHAs at speed, so I couldn't predict how
fast I could do them for hours on end. But it looks to me that the cost
is definitely going to be in fractions of a cent/solution.
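
The labour arithmetic can be sketched as follows. The wage and hours are
the figures given above; the solving rates are purely hypothetical,
since no measured rate is available, and the function name is
illustrative.

```python
WEEKLY_WAGE_USD = 10.0  # from the text: "about $10 or so a week"
HOURS_PER_WEEK = 50     # from the text: a 50 hour week

hourly_wage_usd = WEEKLY_WAGE_USD / HOURS_PER_WEEK  # about $0.20 an hour


def cost_per_solution_usd(solutions_per_hour: float) -> float:
    """Labour cost of one solved CAPTCHA at a given solving rate."""
    return hourly_wage_usd / solutions_per_hour


# HYPOTHETICAL solving rates, chosen only to show the order of magnitude.
for rate in (60, 200, 600):
    cents = 100 * cost_per_solution_usd(rate)
    print(f"{rate:4d} solutions/hour -> {cents:.3f} cents each")
```

Even at one solution a minute the cost stays below half a cent; at 200
an hour it is the tenth of a cent assumed by Kaplan.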

Of course you need to add in the cost of the connectivity and the kit
($100 laptops anybody?) but people who think of CAPTCHAs in terms of the
hourly charging rate of their attorney (or plumber!) are entirely
missing the point.


OK... so let's look at Kaplan's analysis which initially assumes that it
doesn't matter if all CAPTCHAs are broken for free:


His assumptions are:

   1 The email service provider can filter 95% of spam.
   2 The CAPTCHA is broken 100% of the time.
   3 The spammer has a 3:1 ratio of bogus to real email addresses.  
   4 A spammer sends 100 million emails using a valid return address.
   5 Users click on a "This Is Spam" button when spam arrives.

   He then shows that the spammer has to send 1600 emails to get one
   spam to its destination.

This sum is based on a key assumption which I think is incorrect.

He assumes that the spammer sends 1600 emails, and just 80 get through
the filter. This is not inconsistent with measured values for filters in
the real world. So far so good.

He then assumes that the spammer solves the 80 CAPTCHAs and resends.
This then results in a further attrition of 95% (ie 4 get through) and
only then is it discovered that 3 are bogus addresses and the final 1 is
delivered.
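
Kaplan's chain of attrition, as described above, can be reproduced in a
few lines. The variable names are illustrative; the rates are his
assumptions 1 and 3.

```python
from fractions import Fraction

FILTER_PASS_RATE = Fraction(5, 100)  # a 95% filter lets 5% through
VALID_ADDRESS_RATE = Fraction(1, 4)  # 3:1 ratio of bogus to real addresses

sent = 1600
passed_first_filter = sent * FILTER_PASS_RATE                  # CAPTCHAs to solve
passed_second_filter = passed_first_filter * FILTER_PASS_RATE  # after resending
delivered = passed_second_filter * VALID_ADDRESS_RATE          # real addresses only

print(sent, passed_first_filter, passed_second_filter, delivered)
# prints: 1600 80 4 1
```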

However, this is dumb by the spammer (and/or magical by the filter).

Why does the filter suddenly improve when the email is sent for the
second time (viz: it starts to discard 95% of the email that it approved
earlier ?).  Or -- same idea but different: why does the spammer send
something that is filterable at the first stage ?

It seems to me that the scheme (which is just filtering and nothing to
do with CAPTCHAs at this point) only ensures that the spammer must send
80 emails to get one delivered (ie: 20x fewer than the 1600 Kaplan
proposes).

Kaplan has a second sum:

Assumptions:

   6 A spammer must pay $0.001 per manually solved CAPTCHA
   7 The spammer wants to successfully deliver one million spam per day

He then calculates a cost of $80,000/day for getting the one million
spam emails delivered.

However, with the adjustment to the sums that I suggest is more
reasonable [not assuming that the filtering of the two stages is
independent], to deliver one million spams 80 million emails must be
sent and 4 million CAPTCHAs must be solved, costing $4,000/day.
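
The two daily costs can be compared side by side. This is only a sketch
of the arithmetic: `filter_stages=2` models Kaplan's version, where the
95% filter bites again on the resent email, and `filter_stages=1` models
the adjusted version. The function and parameter names are illustrative.

```python
from fractions import Fraction

FILTER_PASS_RATE = Fraction(5, 100)       # assumption 1: 95% of spam filtered
VALID_ADDRESS_RATE = Fraction(1, 4)       # assumption 3: 3:1 bogus to real
COST_PER_CAPTCHA_USD = Fraction(1, 1000)  # assumption 6: $0.001 per CAPTCHA
TARGET_DELIVERED_PER_DAY = 1_000_000      # assumption 7


def daily_figures(filter_stages: int):
    """Return (emails sent, CAPTCHAs solved, cost in USD) per day."""
    delivered_fraction = FILTER_PASS_RATE ** filter_stages * VALID_ADDRESS_RATE
    emails_sent = TARGET_DELIVERED_PER_DAY / delivered_fraction
    # A CAPTCHA is issued for every email that survives the first filter.
    captchas_solved = emails_sent * FILTER_PASS_RATE
    return emails_sent, captchas_solved, captchas_solved * COST_PER_CAPTCHA_USD


for stages in (2, 1):
    emails, captchas, cost = daily_figures(stages)
    print(f"{stages} stage(s): {int(emails):,} emails, "
          f"{int(captchas):,} CAPTCHAs, ${int(cost):,}/day")
```

With two stages this gives 80 million CAPTCHAs and $80,000/day; with one
stage it gives 80 million emails, 4 million CAPTCHAs and $4,000/day.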

Kaplan multiplies his number by 365 to make it sound even bigger, but
this just obscures things....

... the question is whether the expense of solving the CAPTCHAs can be
afforded by the spammer.

Note that sending the emails is essentially free -- the spammer will use
zombies to send the emails via innocent (insecure) end users, so there's
no costs for electricity or bandwidth to worry about.

There's some consensus around a response rate (up to a couple of years
ago) of about 0.003% for spam (these figures come via journalists from
Laura Betterley and the Iraqi playing cards interviews).

So the 1 million delivered maps to about 30 customers a day. This means
that you'd need a profit margin of about $133 per sale to make spamming
worthwhile. That's quite a lot (though if you're selling fake pills or
Rolls Royces then you might still press on).
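
The break-even sum works out as follows, taking the $4,000/day CAPTCHA
bill and the 0.003% response rate as given. The variable names are
illustrative.

```python
from fractions import Fraction

DAILY_CAPTCHA_COST_USD = 4000
DELIVERED_PER_DAY = 1_000_000
RESPONSE_RATE = Fraction(3, 100_000)  # 0.003%

customers_per_day = DELIVERED_PER_DAY * RESPONSE_RATE
break_even_margin_usd = DAILY_CAPTCHA_COST_USD / customers_per_day

print(f"{customers_per_day} customers/day, "
      f"${float(break_even_margin_usd):.2f} needed per sale")
# prints: 30 customers/day, $133.33 needed per sale
```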

However, even in Betterley's day there was some filtering and spam
discarding going on -- so we're not comparing like with like.  The one
million spams are DELIVERED SPAM -- ie: they have got through the
filters and are sitting there waiting to be opened by the gullible.

If we assume the 0.003% came from a time when filters were 50% effective
(ie only about 50% of people had any) then the profit margin necessary
drops to about $67/sale.
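
A sketch of that adjustment, under the assumption that the quoted
response rate was measured against everything sent while only the
unfiltered share ever reached a reader. Halving the $133 figure in this
way gives roughly $67 per sale. The function and parameter names are
illustrative.

```python
from fractions import Fraction


def break_even_margin_usd(quoted_response_rate, filtered_share,
                          daily_cost_usd=4000, delivered_per_day=1_000_000):
    """Margin per sale needed to cover the daily CAPTCHA bill.

    ASSUMPTION: the quoted rate is per email sent, so the rate per
    delivered email is higher by a factor of 1 / (1 - filtered_share).
    """
    rate_on_delivered = quoted_response_rate / (1 - filtered_share)
    customers_per_day = delivered_per_day * rate_on_delivered
    return daily_cost_usd / customers_per_day


quoted = Fraction(3, 100_000)  # 0.003%
print(f"{float(break_even_margin_usd(quoted, Fraction(1, 2))):.2f}")
```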

That's still a lot -- but if the spammer weeds out their list better
(only 25% valid addresses isn't too brilliant) then the required profit
margin would drop again.

Also [and this is key to spammer success], if they improved their
message (hire a Madison Avenue executive to teach them how to make their
advert more compelling) then the abysmal response rate would rise [[for
example, the Iraqi playing cards were 4 times more likely to be ordered
than the spammer's usual fare of pills and toner cartridges.]].

BUT this is the spammer working the way the system wants him to. Why on
earth would he do that [even though he may do OK that way with high
profit margin goods, it's still eating into his lifestyle]

There's a much simpler approach that the clever spammer would take.

Instead of solving CAPTCHAs to send spam, he would solve CAPTCHAs to
acquire a valid sub-address.  Once he had this, he would then send as
many different pieces of spam as possible as fast as possible to this
sub-address.  He'd advertise pills and mortgages and anatomy enhancers
and lotto winnings and poker sites and... etc.  ((or he could just sell
it to fellow spammers and they would send the spam...))

Viz: he'd get more than one email delivered per validated sub-address.
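
To see how quickly this restores profitability, the sketch below spreads
the cost of acquiring one validated sub-address over several messages.
It assumes the response rate per message stays at 0.003% and that the
sub-address is not deactivated in the meantime; both are assumptions
made for illustration, as are the message counts and the names.

```python
from fractions import Fraction

COST_PER_CAPTCHA_USD = Fraction(1, 1000)
CAPTCHAS_PER_VALID_ADDRESS = 4        # 3:1 bogus to real, so 4 solved per valid one
RESPONSE_RATE = Fraction(3, 100_000)  # 0.003%, ASSUMED constant per message


def break_even_margin_usd(messages_per_sub_address: int) -> Fraction:
    """Margin per sale needed when one sub-address carries several spams."""
    cost_per_sub_address = COST_PER_CAPTCHA_USD * CAPTCHAS_PER_VALID_ADDRESS
    cost_per_delivered = cost_per_sub_address / messages_per_sub_address
    return cost_per_delivered / RESPONSE_RATE


# HYPOTHETICAL numbers of messages sent to each validated sub-address.
for n in (1, 10, 100):
    print(f"{n:3d} messages -> ${float(break_even_margin_usd(n)):.2f} per sale")
```

One message per sub-address reproduces the $133 figure; ten messages
bring the required margin down to about $13.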

Clearly there are things that could be done to improve end-user software
to counter this, but in the meantime, profitability would be restored.

Bottom line is that I agree that the CAPTCHAs raise spammers' costs, but
I don't agree that they do anything more than freeze out low profit
margin spam (and make the pills more likely to be fake).

Even if challenge-response systems were perfect, I'd not be in favour
because of the damage to innocent third parties. But they are not (on
these assumptions) as effective as claimed :(

Plus of course there are other objections as put forward by others, but
I wanted to concentrate on the economics because I've written about
these before (in the context of proof-of-work schemes)

   http://www.cl.cam.ac.uk/~rnc1/proofwork2.pdf

and most of the analysis carries over just fine.

>    Let's assume that over the course of a year Amazon.com emails 10 
>    million customers.  I'll say that 5% of these sub-addresses are 
>    deactivated without the customers bothering to notify amazon.  I'll 
>    say that it costs Amazon 5 cents to decode a CAPTCHA (fifty times 
>    as expensive as what I assumed the spammer would have to pay!).

Actually Amazon are experimenting with the Mechanical Turk... so they
might be able to manage Third World rates :)

   http://www.mturk.com

[ ah I see that Bart also spotted the double application of the
filtering stage ]

>>       Further, I'd dispute that applying two 95%-effective spam
>>       filters has a net 99.75% success rate.
>     
>    Very well

hmm... I think it needs more than that as a reply :(
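
For what it is worth, the 99.75% figure only follows if the two filter
passes are statistically independent. A small sketch of the two
extremes, with illustrative function names:

```python
FILTER_EFFECTIVENESS = 0.95


def combined_if_independent(e1: float, e2: float) -> float:
    """Spam must slip past both filters, and the misses are unrelated."""
    return 1 - (1 - e1) * (1 - e2)


def combined_if_identical(e1: float, e2: float) -> float:
    """Same filter run twice: whatever passed once passes again."""
    return max(e1, e2)


e = FILTER_EFFECTIVENESS
print(f"independent: {combined_if_independent(e, e):.4f}")  # 0.9975
print(f"identical:   {combined_if_identical(e, e):.4f}")    # 0.9500
```

A real pair of filters looking at the same message will sit somewhere
between the two.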

- -- 
richard                                                   Richard Clayton

Those who would give up essential Liberty, to purchase a little temporary 
Safety, deserve neither Liberty nor Safety. Benjamin Franklin 11 Nov 1755

-----BEGIN PGP SIGNATURE-----
Version: PGPsdk version 1.7.1

iQA/AwUBQ9U8lJoAxkTY1oPiEQLnHACdFqWzT0DPK8AJFjR78jcK2zwoh3EAnj2l
Yg8Crojbjn9/6qgtf+d+q79D
=p+iB
-----END PGP SIGNATURE-----
