Network Working Group                                         D. Crocker
Internet Draft                                               Brandenburg
                                                             28 Apr 2003
Expires: <10-04>


          Technical Considerations for Spam Control Mechanisms

                 draft-crocker-spam-techconsider-00.txt


This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Copyright (C) The Internet Society (2003). All Rights Reserved.


SUMMARY

Internet mail has operated as an open and unfettered channel between originator and recipient. This invites some abuses, called spam, such as burdening recipients with unwanted commercial email. Spam has become an extremely serious problem, is getting much worse, and is proving difficult (or impossible) to eliminate. The most practical goal is to bring spam under reasonable control; it will require an on-going, adaptive effort, with stochastic rather than complete results.

This note discusses available points of control in the Internet mail architecture, considerations in using any of those points, and opportunities for creating Internet standards to aid in spam control efforts. It offers guidance about likely trade-offs (benefits and limitations).


CONTENTS

   1. Spam And Consent
   2. Email Architecture Control Points
   3. Administrative And Legal Mechanisms
   4. Filtering
      4.1. Policies
      4.2. Explicit Lists
      4.3. Content Analysis
      4.4. Negotiation
      4.5. Traffic Analysis
   5. Infrastructure Enhancement
   6. Evaluating Technical Approaches
      6.1. Adoption
      6.2. Burden
      6.3. Scaling
      6.4. Scenarios
   7. Security Considerations
   8. Acknowledgements
   9. Authors' Addresses


1. SPAM AND CONSENT

Internet mail has operated as an open and unfettered channel between originator and recipient. It has always suffered from some degree of abuse, in which originators impose on recipients inappropriately.

In recent years, a version of this abuse has grown substantially. Called spam, its definition varies from "unsolicited commercial email" to "any email the recipient does not want".

Often there are no technical differences between spam and "acceptable" email. Their format, content and even aggregate traffic patterns may be identical. Hence spam is a problem for fundamentally non-technical reasons, yet the Internet technical community must pursue technical responses to it. The lack of strong community consensus on a single, precise definition makes this particularly challenging.

For most working discussions, the term "Unsolicited Bulk Email" is sufficient. The salient point is that it is mass-mailings that are of the broadest concern. More detailed discussion must, of course, be precise in the definition of "unsolicited" and usually must distinguish between different types of mail, such as commercial, religious, political or personal.

The simplistic -- but entirely adequate -- summary of the impact of spam on Internet mail is that it is an extremely serious problem, it is getting much worse, and it is proving difficult (or impossible) to eliminate.

Spam is generated by a wide range of clever originators and it always will be. Instead of thinking of spam as a disease that might be eliminated, it is more useful to think about crime, war and cockroaches. It is not realistic to expect to eliminate any of these, no matter how much anyone might wish otherwise.
Therefore the best we can hope to accomplish is to bring spam under reasonable control and that control will require an on-going, adaptive effort, with stochastic rather than complete results.

We need multiple, adaptive techniques. As spam changes, so must our mechanisms. Different sets of mechanisms will be appropriate for different circumstances.

In other words spam has become a permanent part of the Internet mail experience and efforts to control it may only reduce it to a tolerable level, rather than eliminate it.

It is somewhat comforting to remember that an individual spam is not damaging. Rather it is the quantity of spam that poses a threat. Therefore it is acceptable for spam control mechanisms to be imperfect.

This note discusses available points of control in the Internet mail architecture, considerations in their use, definitions of terminology and opportunities for creating Internet standards. It also offers guidance about likely trade-offs (benefits and limitations).

The note does not offer an analysis of the types of spam or the types of attacks used in sending spam, nor is it intended to specify solutions. Similarly, the note does not discuss fine-grained details, such as the arguments associated with single opt-in mechanisms, versus double opt-in. These points are essential to the engineering of particular solutions, but only as refinements after the larger architectural and system control choices are made.

COMMENT: This document is intended to evolve, based on feedback. Comments are eagerly sought, preferably in the form of suggested text changes, and preferably on the ASRG mailing list.


2. EMAIL ARCHITECTURE CONTROL POINTS

Email transmission sequences can touch many systems, between the originator and the recipient.
However for most discussions about control, only five major components are important:

      Originator         Intermediary         Recipient
       Service             Service             Service
   +---------------+                       +---------------+
   | UA.o -> MTA.o |   ->   ISP.i   ->     | MTA.r -> UA.r |
   +---------------+                       +---------------+

   UA.o:   The originator's user agent, typically operated by the user and under their direct control

   MTA.o:  The mail transfer agent service associated with the originator's environment, possibly operated by the sender and possibly operated under separate control, such as by their employer.

   ISP.i:  The IP and/or mail transfer agent service(s) operated by independent third-part(ies).

   MTA.r:  The mail transfer agent service associated with the recipient's environment

   UA.r:   The recipient's user agent

In many organizations, the MTA service is multi-stage, such as including a department MTA and an Internet "firewall" MTA. This distinction is of fundamental importance for making software and operations decisions, but it does not have a significant impact on a discussion about points of control.

By contrast, the distinction between originator's service, recipient's service and any independent third parties is essential to this larger examination. These are separate, independent administrative environments and are subject to different policies. In particular, note that a discussion about using control points hinges on the scope of the control to be exercised.

Besides constituting a major burden to recipients, the volume of spam traffic has become a serious problem for transit services. Hence a precept in controlling spam is to seek control as close to the source as possible. The fewer downstream resources consumed by spam, the better.

Of course the ideal would be a mechanism in UA.o that would prevent spam from being sent in the first place. Indeed, legal remedies seek to affect a sender's motivations, so that they will not send the spam at all.
Unfortunately software control of spam in UA.o cannot be assumed, because that software is usually under the control of the originator. If they wish to bypass any control mechanisms in UA.o, they will find a way. The same may be true of MTA.o. Hence Internet-wide designs of spam control must assume that UA.o and MTA.o may cooperate to generate and transmit spam. Efforts to control either of these components may be sought as an adjunct, where they are operated by an independent service, but such control must not be relied on.

Wherever the detection mechanism is placed, the critical challenge is to identify spam in real time, if its relaying and delivery are to be stopped.

The other avenue is post-hoc removal of the right to make further use of the MTA service. This may have strong utility for controlling spammers needing to operate within acceptable social bounds. It will have no effect upon spammers who avoid accountability.


3. ADMINISTRATIVE AND LEGAL MECHANISMS

Both government law and service provider contracts can be used for defining unacceptable behavior and the remedies available when there are violations.

There are two major problems with this administrative control of spam. One is that a spammer often cannot be identified. There are many opportunities for anonymous posting of email, such as through Internet cafes, transient access services and free email services.

The second problem is that the sender of spam may not be in the jurisdiction seeking to exercise control, or a jurisdiction responsive to the recipient's jurisdiction. The Internet is global. Unlike postal bulk mail, the cost of sending spam over the Internet does not change as the mail crosses jurisdictional boundaries.

Hence it seems likely that use of administrative procedures can be effective for controlling "responsible" spam.
That is, spam sent by organizations operating as accountable social participants, perhaps indulging in overly aggressive policies, but still desiring to remain socially tolerable. The large number of "rogue" spammers is not similarly burdened.


4. FILTERING

The technical mechanism for real-time detection and handling of spam is a filter, placed at ISP.i, MTA.r and/or UA.r. A filter has two functions: qualification and action. Action is usually either adding a special label to the message or disposing of it. Qualification tests whether a message is spam. Test results are:

   Positive:  Message matches the test criteria.

   Negative:  Message fails to match the test criteria.

When the tests are heuristic or statistical, some portion of the results will be incorrect. These are classed as:

   False Positive (FP):  Message matches test criteria, but the criteria are too aggressive.

   False Negative (FN):  Message fails to match the test criteria, but the criteria are not sufficiently strong.

Filters are used for two, complementary policies:

   Acceptance:  Approves mail for delivery.

   Rejection:   Withholds or refuses permission for relaying or delivery.

Note that rules for acceptance are equally subject to error. However Acceptance rules usually employ simple, explicit criteria rather than heuristics, so that FP and FN results are not usually a concern. Hence FP and FN discussion is usually about Rejection rules.

4.1. Policies

The simplest model for an assessment list is to have entries containing a single, simple attribute, such as sender email address or source system IP address or domain name.

   Standards Opportunity:

      1.  Control protocol between recipient and filtering service server, to permit specifying policies and specific rules.

      2.  Modify SMTP delivery status notifications to avoid flooding innocent mailboxes because of forged senders. [Needs clarification. /ed]

      3.  Codify best current practices of filters to minimize sending DSN. [Cited by VS; needs clarification. /ed]

      4.  Codify DSN and SMTP status message wording, such as saying that rejections resulting from filtering should include a URL for an extended explanation. [Needs clarification. /ed]

      5.  Replace SMTP.

The idea of replacing SMTP is appealing because it permits thinking in terms of creating an infrastructure that has accountability and restrictions built in. Unfortunately an installed base the size of the Internet is not likely to make such a change anytime soon. It seems far more likely that successful spam control mechanisms will be introduced as increments to the existing Internet mail service.

4.2. Explicit Lists

The simplest method of testing is to have explicit lists of simple identifier criteria, such as a From address or IP address. Pre-assessed senders are entered into a:

   Whitelist:  For automatic Acceptance

   Blacklist:  For automatic Rejection.

One approach to maintaining Whitelists and Blacklists is to make explicit entries into them, manually. This is often what a spam control service will propagate to its subscribers. Most such services are for Blacklisting "known" spammers.

A difficulty with listing services is the set of criteria used for adding and removing senders or sites. These policies usually need to be explicit, objective and documented, as well as consistently applied. Even then they are attractive targets for lawsuits claiming inappropriate listing.

For assessments based on the identity of the sender, rather than the content of the message, another concern is validation of the key attribute used for identification. What if the value for that attribute is set falsely? For example, what if email was not sent by the address listed in the From field?

   Standards Opportunity:

      6.  List format and exchange, to permit sharing Whitelist and Blacklist entries

      7.  Format and access to filter logs, such as among MX secondaries. [Suggested by VS; needs clarification. /ed]

4.3. Content Analysis

Filters look for message attributes, such as strings of text in the headers or content of the message being inspected. Other attributes include the address or domain name of the originating system, or the occurrence of the same message content in multiple messages near the same time.

Simple filters look for any occurrence of specific strings. A more powerful approach to content analysis looks for multiple sets of these strings and assigns a score to each occurrence; it then labels a message as spam according to the aggregate score.

Rule creation is done manually, or by a service, or by analysis of a known corpus of messages. A service observes email traffic at many Internet locations and receives reports as recipients see new occurrences of spam. The service then propagates new rules to its subscribers. The analytic approach performs empirical rule creation, using statistical (Bayesian) techniques that discern string occurrences in known spam, versus mail that is known not to be spam.

As rules become common, spammers adapt their messages to bypass filters, so that existing rules quickly become far less effective. Hence long-term filter use must have a base of rules that is continually modified. Empirical rules generation must be repeated, or must operate continuously, analyzing all incoming mail. Manual rule maintenance is simply not viable for typical users; the effort is far too great.

A concern about services is that they are inherently post-hoc. They are always updating the rule-set after an "attack" commences, so that some spam is certain to reach some recipients; however the view that a small amount of spam is not dangerous mitigates this concern.

Lastly, methods using automated analysis rely on heuristics, or guesses. They are certain to have some FNs that permit real spam to reach the recipients, and some FPs that incorrectly label legitimate mail as spam.
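
The scoring approach and its FP and FN results can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the rule strings, the weights, the threshold and the function names are hypothetical and are not taken from any particular filter or from this note.

```python
# Hypothetical rule set: each lowercase string carries a score.
# Strings, weights and threshold are illustrative assumptions only.
RULES = {
    "free money": 3.0,
    "act now": 2.0,
    "unsubscribe": 1.0,
    "meeting agenda": -2.0,  # a string more typical of wanted mail
}
THRESHOLD = 3.0


def score(message, rules=RULES):
    """Aggregate score: each rule's weight, counted once per occurrence."""
    text = message.lower()
    return sum(weight * text.count(needle) for needle, weight in rules.items())


def is_spam(message, threshold=THRESHOLD):
    """Qualification: Positive when the aggregate score reaches the threshold."""
    return score(message) >= threshold


def tally(samples):
    """Count (FP, FN) over pairs of (message, known_to_be_spam)."""
    fp = sum(1 for msg, known in samples if is_spam(msg) and not known)
    fn = sum(1 for msg, known in samples if not is_spam(msg) and known)
    return fp, fn


# A tiny hand-labeled sample; real corpora are far larger.
SAMPLES = [
    ("FREE MONEY!!! Act now", True),                   # Positive, correct
    ("Act now to unsubscribe from the list", False),   # False Positive
    ("Cheap pills, limited offer", True),              # False Negative
    ("Meeting agenda for Monday", False),              # Negative, correct
]

if __name__ == "__main__":
    print(tally(SAMPLES))
```

Raising the threshold in this sketch removes the FP but leaves the FN; lowering it does the reverse for other messages. That is the trade-off a heuristic filter has to manage, and it is why a fixed rule set of this kind loses value as spammers adapt.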
Any effective, long term filtering mechanism must have automatic or semi-automatic rule creation and must upgrade the set of rules continuously or periodically.

   Standards Opportunity:

      8.  Rule format and exchange, to permit sharing effective rules.

      9.  Sample message labeling and exchange, to permit submission of candidate content to remote service

      10. Hash-based identifier of content

4.4. Negotiation

In addition to real-time analysis, a recipient may engage in an explicit negotiation with the sender, to validate them. When this is performed at the time of message receipt, it is called a Challenge-Response (CR) mechanism.

CR introduces delay in message receipt and creates at least one additional email round-trip exchange for every new sender/recipient pair. This is a substantial burden both on participants and on the transit service. Senders often refuse to respond to the challenge, so that the mechanism dissuades senders from all but the most urgent communications. Also the delay imposed by CR can render time-sensitive messages useless.

As with other forms of Internet-based attack, effort is often divided into two phases. The first assesses details about the target and the second uses them. For spam, the assessment phase of the process seeks to discover valid email addresses. CR mechanisms suffer from providing that validation.

   Standards Opportunity:

      11. CR protocol, to permit automated interaction between the recipient's system and the sender's system.

4.5. Traffic Analysis

Spam is often referred to as "unsolicited bulk mail" to highlight that senders typically post very large amounts quickly. Opt-in (subscription) email also demonstrates this traffic pattern. Still there is benefit in measuring aggregate email behavior.

   Standards Opportunity:

      12. Traffic reporting protocol, to permit collaboration among independent administrations.


5. INFRASTRUCTURE ENHANCEMENT

Enhancement of underlying Internet services might reduce the effectiveness of some spam transmission mechanisms. For example many spammers prefer to send to domain name service MX secondaries because secondaries are often not as well filtered as MX primaries. Because of the lack of MX secondary coordination protocols, the best advice for all but large sites is to stop using MX secondaries.

   Standards Opportunity:

      13. MX secondary coordination protocol. [Suggested by VS; might need clarification. /ed]

      14. Best Current Practices (BCP) documentation of preferred MTA operation for spam control

      15. BCPs for other services operating to control spam

Postal mail imposes a fee on the sender for each message that is sent. Such a fee makes the cost of sending significant, and proportional to the amount sent. In contrast, current Internet mail is very nearly free to the sender. Hence there is interest in exploring "sender pays" email.

One form of sender-pays is identical to postal stamping. Another entails "retribution" to the sender, taking the fee for their posting only if the recipient indicates they were unhappy to receive it.

For both models, it is not clear that it is possible to fit the necessary mechanisms to existing Internet mail. The complete absence of fees from the current service, and the existence of anonymous and free email services, may create too much operational inertia. It is also not clear who should accrue the revenues or how they should be disbursed.

   Standards Opportunity:

      16. Billing and accounting protocols to obtain sender fees and track them.


6. EVALUATING TECHNICAL APPROACHES

The complexity of Internet mail service and the nature of spam make it difficult to evaluate proposals for control mechanisms. In this section, the key technical factors affecting viability are examined.

6.1. Adoption

A critical barrier to the success of a new mechanism is the effort it takes to begin using it.
It is essential to look carefully at the adoption process. What will it take for someone to start using the proposed mechanism? What will it take for that person to get some benefit from the mechanism? For example, how many people and/or systems must adopt it before it provides any benefit?

A key construct to this issue is "core-vs-edge". For Internet-scale operations, adoption at the edge of a system is typically easier and quicker than adoption in the core. If a mechanism affects the core (infrastructure) then it usually must be adopted by most or all of the infrastructure before it provides meaningful utility. In something the scale of the Internet, it can take decades to reach that level of adoption, if it ever does. For localized operations, adoption in the core might be quicker, involving a single administrative entity, rather than an array of independent users.

Remember that the Internet comprises a massive number of independent administrations, each with their own politics and funding. What is important and feasible to one might be neither to another. If the latter administration is in the handling path for a spam, then it will not have implemented the necessary control mechanism. Worse, it well might not be possible to change this. For example a proposal that requires a brand new mail service is not likely to gain much traction.

By contrast, some "edge" mechanisms provide utility to the first one, two or three adopters who interact with each other. No one else is needed for the adopters to gain some benefit. Each additional adopter makes the total system incrementally more useful. For example a filter can be useful to the first recipient to adopt it. A consent mechanism can be useful to the first two or three adopters, depending upon the design of the mechanism.

Obviously another concern is the effort it takes to continue using the mechanism.
That is, once a user has chosen to make the change to adopt a mechanism, how much effort does it take to use it regularly?

Equally, the impact on others is important. For example, a challenge-response system is irritating for the person being challenged, and it imposes extra delay on the desired communication. If the originator and the recipient both access the Internet only occasionally (such as through dial-up when mobile) a challenge-response model can impose days of delay. For some communications, this can be disastrous.

6.2. Burden

The purpose of spam control is to cause some email to fail to reach its intended destination. This is, of course, directly at odds with the constructive goal of email. Hence spam control alters the basic model of email service. Effective mechanisms will place some kind of burden on senders and receivers. Hence a challenge for spam control mechanisms is to require enough of a burden to be effective, but not so much that it makes email unacceptably painful to use. When evaluating proposals, the nature and distribution of these burdens must be considered carefully.

6.3. Scaling

How does the proposal scale? What happens if everyone on the Internet engages in a particular behavior? What if the Internet grows by a factor of 1000?

Remember that "everyone" is approximately 100 million users today, and should be expected to grow to 10 billion, if we expect the Internet to be useful for some decades. And it is likely there will be more email users/accounts than there are people on the planet, given that individuals and organizations occupy multiple roles.

So, what will it be like for 100 million or 10 billion users to employ the proposed mechanism?

The other side of the scaling question is to ask how much of the Internet will be affected by a proposal and, therefore, how much spam will be controlled by it?
If a proposal requires substantial effort to adopt and use, but will affect only a small percentage of spam, the efficacy of that proposed mechanism is very much in question. An obvious example of this concern is legal scope, given that spam is global and there is no global law enforcement.

6.4. Scenarios

Almost any proposal will make sense for a particular scenario that is sufficiently constrained. The real test is how the proposal works for other, likely scenarios. Make sure the proposal considers these likely cases carefully. For example, citing the scenario of mailing list participation is an excellent test. There are many others.


7. SECURITY CONSIDERATIONS

This note discusses types of mechanisms for evaluating and filtering email. As such it covers topics with extremely sensitive security concerns. However it does not propose any standards and therefore does not have any direct security effects.


8. ACKNOWLEDGEMENTS

This note is motivated by discussions on the Anti-Spam Research Group (ASRG) mailing list and draws a number of points from discussion there. A number of Standards Opportunity suggestions were taken from an ASRG posting by Vernon Schryver. The sub-section "Burden" is taken from a posting by Dave Hendricks.


9. AUTHORS' ADDRESSES

   Dave Crocker
   Brandenburg InternetWorking
   675 Spruce Drive
   Sunnyvale, CA 94086 USA

   Tel: +1.408.246.8253
   dcrocker@brandenburg.com


10. FULL COPYRIGHT STATEMENT

Copyright (C) The Internet Society (2003). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works.
However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.