Forum OpenACS Q&A: Encoded Email for Webpages

Posted by MaineBob OConnor

I'm looking for a TCL function that I can use to encode email addresses on web pages so that they are less likely to be harvested by webpage robots that look for email addresses to target for s-p-a-m lists.

Here is my plain email address:

bob@abc.us

and here it is encoded:

&#098;ob&#064;ab&#099;.us

This is really cool. You can have your '<a href="mailto:...' on your webpage with the above encoding, and clicking on the link will open your email client with the properly translated code, yet the email harvesters won't see an email address (no plain "@").

Do we have such an encoding function already available?

If not, I was planning to write one using:

string map {
            "a"     "&#097;"
            "b"     "&#098;"
            "c"     "&#099;"
            (more not shown)
        } $email
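
(A minimal sketch, not from the post: the full map could also be built in a loop instead of typed out, assuming everything from U+002E "." through U+007A "z" should be encoded.)

    # Build the char -> &#NNN; map for code points 0x2E (".") .. 0x7A ("z")
    set map {}
    for {set c 0x2E} {$c <= 0x7A} {incr c} {
        lappend map [format %c $c] [format "&#%03d;" $c]
    }
    set encoded [string map $map $email]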

Or maybe one can suggest a more elegant function.

-Bob

Posted by Hanjo Pepoulis
Hi Bob,

you may want to encode the letters of Basic Latin, U+0041 - U+007A. Just give it a try, but it should work:

set email "steve-o@jackass.something.com"
regsub -all {([\u0041-\u007A])} $email {\&#[scan "\1" %c];} result
puts [subst $result]

Results in:

&#115;&#116;&#101;&#118;&#101;-&#111;@&#106;&#97;&#99;&#107;&#97;&#115;&#115;.&#115;&#111;&#109;&#101;&#116;&#104;&#105;&#110;&#103;.&#99;&#111;&#109;

which a browser renders as steve-o@jackass.something.com. (Look at the HTML source of this message to see it 😊 )

Hanjo.

Posted by Hanjo Pepoulis
I'll try it again; the board deleted some chars:

regsub -all {([\u0041-\u007A])} $email {\&#[scan "\1" %c];} result

Posted by Hanjo Pepoulis
Ok, Bob, me again 😊

To get the @ encoded, please start somewhere below U+0041... I don't have the character tables at hand right now.
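
For example (a variant of the one-liner above, not from Hanjo's post), starting the range at U+002E catches both, since "." is U+002E and "@" is U+0040:

    set email "steve-o@jackass.something.com"
    regsub -all {([\u002E-\u007A])} $email {\&#[scan "\1" %c];} result
    puts [subst $result]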

Posted by G. Armour Van Horn
Why not *only* change the "@" and "." characters? While my normal mail address would be recognized by robotic address suckers,
     vanhorn[at]whidbey[dot]com
would not be harvested by the robots and would still be recognizable to most human readers.
Posted by Hanjo Pepoulis

Hi G.,

> Why not *only* change the "@" and "." characters? While my normal mail address would be recognized by robotic address suckers,
>
>      vanhorn[at]whidbey[dot]com
>
> would not be harvested by the robots and would still be recognizable to most human readers.

It will be recognizable to most, but not all human readers.

Some people recommend not using mailto: links at all in situations where you want a mail client to be fired up, and instead using something like a little script, or a redirect to a mail-form page, or the like...

It's somewhat useless to try to think like a spam spider, but if I had to code one I would certainly try to integrate the "tricks" above 😊

7: Redirect-mailto: Trick (response to 1)
Posted by James Thornton
I came up with this "Redirect-mailto:" trick that reduces spam by separating e-mail addresses from Web pages while still providing a way for the user to click an e-mail address link and have it open their local mailer.

Mail link:

<a href=/email/james>james@jamesthornton.com</a>

File /email/james:

ns_returnredirect "mailto:james@jamesthornton.com"

Example: james@electricspeed.com

For extra protection, consider putting the /email path in your robots.txt file to exclude robots from crawling it and grabbing the output of the link. You could even include some code that compares the user agent to those listed in http://www.robotstxt.org/wc/active/all.txt, and if it is a robot, return nothing.
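
(A minimal robots.txt entry for that -- standard syntax, not from James's post:)

    User-agent: *
    Disallow: /email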

Even better, write a proc such as [jt_email "james@jamesthornton.com"] or [ad_email $user_id] that compares the user agent to the list, and if it is a robot, doesn't return anything (not even the part between the opening and closing anchor tags). I started writing this proc a while back, but I can't find it -- if I find it, I'll post it.
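
A minimal sketch of what such a proc could look like -- a reconstruction under assumptions, not James's lost code; the proc name jt_email and the robot substrings are only illustrative:

    proc jt_email { email } {
        # Look at the User-Agent header of the current AOLserver connection
        set ua [string tolower [ns_set iget [ns_conn headers] User-Agent]]
        # Illustrative substrings; a real list would be built from
        # http://www.robotstxt.org/wc/active/all.txt
        foreach bot {crawler spider robot harvest} {
            if { [string first $bot $ua] >= 0 } {
                # Robot: return nothing, not even the link text
                return ""
            }
        }
        # Otherwise emit the redirect-style link
        set user [lindex [split $email "@"] 0]
        return "<a href=\"/email/$user\">$email</a>"
    }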

Posted by MaineBob OConnor

Great answers, thanks.

Hanjo, nice simple code to solve my question.

G. Armour, even simpler: only encode "." and "@". Hanjo thinks it might not be readable to everyone, yet here is a line that I think harvesters would miss, that humans who view the webpage would see, and that would open the email client when clicked. I think most email clients convert the encoding:

<a href="mailto:bob&#064;abc&#046;us">bob (at) abc.us</a>

Here is the line again as an actual link to test your email client. (People getting email alerts may not see this properly.... go to the web page version to see it as Bob intended.)

bob (at) abc.us

James, your example still has the "@" in it, and the redirect may add unnecessary complexity. I'm not concerned about standard search-engine robots, but about software that scans a whole website for the express purpose of harvesting email addresses from our users. Heck, most active OpenACS people could write such a script.

Oh-ohh, I hope that these ideas don't get well distributed or the new version of the email "International Harvester" might also use these tricks!

-Bob

Posted by Matthew Burke
It was my impression that many email harvesters have already cottoned on to a number of these tricks....
Posted by James Thornton
Consider using a graphical "@": http://jamesthornton.com/software/graphic@.html
Posted by C. R. Oldham
Thank you, James--that is very interesting!
Posted by James Thornton
For a mailto: redirect to be effective, AOLserver needs this patch to nsd/return.c; otherwise it includes the mailto: link in the server response, in the form of "The requested URL has moved here".

http://jamesthornton.com/software/redirect-mailto.patch.txt

Posted by Ken Mayer
I have an emotional response every time I see this topic come up: first, because I think spammers are evil; second, because in order to thwart harvesting I have to reduce the utility of a web page; and third, because no matter what mapping we make, it is only a matter of code to translate the mapping back to the original address, and once that code is written it can be used everywhere. Meanwhile, we've made our own work more tedious. My personal experiences were pretty painful, too.

I used to get ~1000 spams a month, and since I was checking my e-mail perhaps only once a month, from Internet cafés in Mexico (with speeds varying from 9600 to 54000 bps), POP3 transfer times were significant and costly.

I have since installed SpamAssassin. It seems to work well, scoring spams based on regexp patterns. I have the option of adjusting the threshold level. So far, I toss anything that scores over 14 to /dev/null, and anything greater than 5 goes into my grey bucket. I still get spam, but less than one per day.
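
(For the curious, a minimal procmail sketch of that setup -- assuming SpamAssassin's usual X-Spam-Level header with one "*" per point; not Ken's actual config:)

    # Pass each message through SpamAssassin first
    :0fw
    | spamassassin

    # Score over 14 (15 or more stars): straight to /dev/null
    :0
    * ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*\*\*\*\*\*
    /dev/null

    # Score over 5 (6 or more stars): into the grey bucket
    :0
    * ^X-Spam-Level: \*\*\*\*\*\*
    grey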

I guess my point of all this is that with SpamAssassin, rewriting web pages to protect against spam harvesters proved unnecessary.

YMMV