
Harvesting email addresses from websites

posted on: 14:09, September 28th, 2009

Recently I was asked to create a script to gather all email addresses from a certain website.

The script was to traverse all the pages on the site, find email addresses and save them to a file.


Now, you might think that this is a very complex task, since you would need to write a lot of code to look for links on the pages, follow them to find more pages, and so on. But with smart use of some common Linux utilities it's a very simple job.



If you have used Linux from the console a lot, you probably know wget and grep.

wget is basically a tool that can be used to download files off the internet, but what many don't know is that it has some very advanced recursive fetching features.

grep is a tool used for searching for occurrences of strings in files. It supports regular expressions, so finding email addresses in files is very easy with it.


Can you see where this is going? :sherlock:


With a single command, we can fetch all the HTML files from a site with wget, and with another command, we can go through those files and find all the email addresses with grep! How easy is that?

wget -nv -nH -r -A html --ignore-tags=img,link www.example.com





With that line, wget gets all the HTML pages from a site to the current directory. The parameters do the following:
-nv makes wget's output less verbose, but still displays basic info so you can see what file it's downloading.
-nH stops wget from making a directory under the current one with the name of the site.
-r makes wget perform a recursive fetch.
-A html makes wget only download files with the .html extension.
--ignore-tags=img,link makes wget not look for URLs in img or link tags. This is because it's very unlikely that either of those would contain a link to another HTML file.


Depending on the number of pages, the wget command may take a very long time. In my case, it took 13 hours.
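
If the fetch looks like it will take forever, or you don't want to hammer the server, wget can also limit the recursion depth and wait a moment between requests. Something along these lines should work (the depth of 5 and the one-second wait are just values I picked for illustration):

wget -nv -nH -r -l 5 -w 1 --random-wait -A html --ignore-tags=img,link www.example.com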

After wget is done, you will have all the pages in the current directory, plus some in subdirectories if they were in subdirectories on the server.
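
If you're curious how much material you ended up with, a quick find will count the HTML files that were fetched:

find . -name '*.html' | wc -l

Next, run grep...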


grep -Eiorh '([[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alpha:].]{2,6})' ./ > emails.txt





This makes grep find all the email addresses and write them to the file emails.txt in the current directory.
-E makes grep use extended regular expressions
-i does a case-insensitive match
-o makes grep return only the matching part of the line and not the whole line
-r makes grep search files recursively
-h stops grep from outputting the filename where the match was found

With the -r parameter and the search path given as ./, grep will go through all the files in the current directory and its subdirectories. The weird-looking string in the middle is the regular expression that matches the email addresses.
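
One thing worth knowing: the list will contain plenty of duplicates, because the same address usually appears on many pages. If you only want each address once, pipe grep's output through sort -u before writing the file, something like this:

grep -Eiorh '([[:alnum:]_.+-]+@[[:alnum:]_.-]+\.[[:alpha:].]{2,6})' ./ | sort -u > emails.txt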

