I've started reading spam. It's not glamorous, but there sure is a lot of it to read. When you start reading spam, you quickly see some patterns.
- 98.9% of spam comes from people I don't know. About 0.005% is from a friend who forwards me jokes or pyramid schemes. The remaining 1% or so is from websites that I'm registered with that send holiday emails.
- Spam communicates its message in two core ways:
  - Links to websites
  - Images within the text that direct users to call, email, or open a web page
- Some messages have no link or actual content and seem to be trying to confuse spam-blocking software so that future messages can get through.
- The text of a message tends to be semi-random, meaning each message is semi-unique. Sometimes words contain random characters, sometimes there are nonsense words, and other times the text seems to be copied from a random website.
I realized that modern spam filters have some serious problems:
- Bayesian filters get tricked by random messages and by content taken from the web.
- Community filtering can't flag every message if the content is semi-random.
- Image filters can't recognize the text in an image if the image is randomized and the text in it is sometimes random too.
- Dynamic blacklisting can reduce spam but not eliminate it.
How about an alternative?
I'll try to avoid the term 'filter' because the word implies that messages can get through by tailoring their content. No matter how smart the filter gets, it's possible to craft messages that will pass through. Any system that relies on 'spam filters' will have both false positives and false negatives.
Instead of filters, let's start with a basic trust system.
What's a trust system? eBay is a good example. Users establish trust and reputation over time, and anyone can give feedback on the trust relationship. The longer you have an eBay account, the more your trust grows. The same should be true for email.
There are many ways to build such a system, but the easiest way to explain it is by comparing it to Google's PageRank technique. Assume that every email I send is like an outbound link and every email I receive is like an inbound link. With your own email data you could compute the email rank of any individual you correspond with. With an aggregated data set you could compute the email rank across a very wide range of people who use email.
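To make the analogy concrete, here is a minimal sketch of an email rank computed with a PageRank-style power iteration. Everything in it is an assumption made for illustration: the `email_rank` function, the damping factor of 0.85, the fixed iteration count, and the sample addresses are hypothetical, and it is only one of the many ways such a system could be built.

```python
from collections import defaultdict


def email_rank(messages, damping=0.85, iterations=50):
    """PageRank-style power iteration over (sender, recipient) pairs.

    Each email sent is treated as an outbound link from the sender to
    the recipient, following the analogy in the text.
    """
    # Count how many emails each sender sent to each recipient.
    outbound = defaultdict(lambda: defaultdict(int))
    people = set()
    for sender, recipient in messages:
        outbound[sender][recipient] += 1
        people.update((sender, recipient))

    n = len(people)
    rank = {person: 1.0 / n for person in people}

    for _ in range(iterations):
        # Rank held by people who never send mail is spread evenly.
        dangling = sum(rank[p] for p in people if p not in outbound)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {person: base for person in people}
        # Each sender splits their rank across the people they write to.
        for sender, recipients in outbound.items():
            total_sent = sum(recipients.values())
            for recipient, count in recipients.items():
                share = rank[sender] * count / total_sent
                new_rank[recipient] += damping * share
        rank = new_rank
    return rank


messages = [
    ("alice", "bob"), ("bob", "alice"),
    ("alice", "carol"), ("carol", "alice"),
    ("bob", "carol"), ("carol", "bob"),
    # A bulk sender writes to everyone, but nobody writes back.
    ("spammer", "alice"), ("spammer", "bob"), ("spammer", "carol"),
]
ranks = email_rank(messages)
```

In this toy data set the bulk sender ends up with the lowest rank, because rank only flows to people whom others choose to write to. A real system would also have to deal with forged sender addresses, which this sketch ignores.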
Each email you receive allows you to give a basic rating: positive or negative. Really simple.
New Users and Mailing Lists
When a new user enters the system, they would have an email rank of zero. Their initiation into email would be to build a trust relationship with someone who has a higher rank. As the account is used, the person's reputation grows and the credibility of their messages increases.
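One way to picture this bootstrapping is the sketch below, where a newcomer starts at zero and gains rank only when someone with a higher rank rates their email positively. The `TrustLedger` class, the `step` weight, the seeded rank for an established user, and the update rule itself are all hypothetical choices made for illustration, not a specification of how the system must work.

```python
class TrustLedger:
    """Tracks a rank per address; unknown addresses start at zero."""

    def __init__(self, seed_ranks=None, step=0.1):
        # seed_ranks stands in for users who already have a reputation.
        self.ranks = dict(seed_ranks or {})
        self.step = step

    def rank_of(self, address):
        return self.ranks.get(address, 0.0)

    def rate(self, rater, sender, positive):
        """Record the rater's positive or negative rating of one email."""
        # A rating is weighted by the rater's own rank, so feedback from
        # zero-rank accounts carries no weight.
        change = self.step * self.rank_of(rater)
        if not positive:
            change = -change
        # Rank never drops below the starting value of zero.
        self.ranks[sender] = max(0.0, self.rank_of(sender) + change)
        return self.ranks[sender]


ledger = TrustLedger(seed_ranks={"alice@example.com": 5.0})
start = ledger.rank_of("newcomer@example.com")
# A positive rating from another zero-rank account changes nothing.
after_unknown = ledger.rate("other-new@example.com", "newcomer@example.com", True)
# A positive rating from an established user raises the newcomer's rank.
after_alice = ledger.rate("alice@example.com", "newcomer@example.com", True)
```

Weighting each rating by the rater's rank is what keeps a crowd of freshly created accounts from vouching for each other. It also means the system needs some established users to begin with, which is why the sketch seeds one.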
Mailing lists that people subscribe to would have to build the same type of trust. As a list is used and as more people subscribe or opt in, the trust of the list grows. At Raizlabs we have a mailing list product, and I'm always nervous that software designed as a useful tool for news and opt-in mail could be used by someone to send out spam.
Why is it that I know more about the trustworthiness of random people on eBay than I do about the trustworthiness of email in my own inbox?