Fun with Python #4: Receiving free stuff in the mail
Welcome to “Fun with Python”, part 4. In this part, we will use web scraping and HTML parsing techniques to get free merch from radio stations.
Theory and Foundations
Recently, I stumbled upon this video, in which Filip Grebowski emailed 50,000 companies asking for free stuff. A lot of companies have merchandise (aka swag), which they give to their employees or hand out at events like job fairs. So Filip scraped the email addresses (sort of) of 50,000 (sort of) companies and asked for a free chunk of swag.
The results were quite impressive. He even got a mini drone! After this video, a couple more popped up, but they used similar techniques, so we will skip them for now.
So how did Filip do this? Let’s break it down to what he needed in order to make it happen:
- Select your target
- Find a source that will give you lists of entities in your target
- Figure out the email addresses
- Compose and send a well-crafted email
Step 1 is really subjective. Filip chose companies. There is a large number of options to choose from. The trick is to pick something that will help you achieve step 2.
Step 2 is the most important. If in step 1 you picked something like “horse ranches” or “skiing centres”, you may hit a dead end. You need to pick a category for which you can probably find a page on the web that lists details and (hopefully) an email address for each of your targets. In order to traverse and download the content of these websites we will use Scrapy. According to the official website, Scrapy is:
An open source and collaborative framework for extracting the data you need from websites.
In a fast, simple, yet extensible way.
For the sake of simplicity, we will use an extra tool to parse the webpages and extract the needed data. This tool is Beautiful Soup:
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
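To get a feel for it, here is a tiny, self-contained example. The HTML snippet is made up, but it mirrors the kind of markup we will meet later on e-radio:

    from bs4 import BeautifulSoup

    html = '<div id="content"><a class="sMeta_Title" href="/radiox">Radio X</a></div>'
    soup = BeautifulSoup(html, 'html.parser')

    # Find the first anchor with the given class and read its text and href.
    link = soup.find('a', {'class': 'sMeta_Title'})
    print(link.text, link.get('href'))  # prints: Radio X /radiox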
Step 3 can be optional. If you find the email addresses of the targets on the webpage, you can skip this step and go directly to step 4. If not, we will demonstrate one way to “guess” the email address.
Step 4 is the simplest. You need to refresh your essay skills. Drain your brain, find the right words and ask for your swag!
Ok enough with the theory. Let’s get down to business.
Implementation
The first step on the list is to find the desired target group. The selection we are making is really dependent on the second step, which is finding a source of the target group. For this reason, I will pick radio stations.
Radio stations, as well as other businesses, have a public relations department. The people who work there are really communicative, often by nature, so the odds of getting replies to our emails are higher.
Next, we need to find a source. In Greece, there is a website called e-radio; I guess the name is pretty self-explanatory. This page contains information about a lot of radio stations located in Greece. It supplies addresses, telephone numbers, fax numbers and websites, but surprisingly it does not provide the email address. We will need to figure that out ourselves.
Every modern organisation has a website. In order to have a website, you need a domain name. We will use this domain name to “guess” the email addresses. But first, what is a domain name? According to Wikipedia:
A domain name is an identification string that defines a realm of administrative autonomy, authority or control within the Internet
In simple words, it is the address of the website without the ‘www’ part, meaning that if the website is www.medium.com, then the domain name is medium.com.
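Just to make this concrete, here is a small helper that turns a full URL into a bare domain name. It is not part of the final script, only an illustration using Python’s standard urllib:

    from urllib.parse import urlparse

    def to_domain(url):
        """Return the domain part of a URL, without the leading 'www.'."""
        # urlparse only fills in netloc when the URL has a scheme, so fall back
        # to the part before the first slash for bare addresses.
        netloc = urlparse(url).netloc or url.split('/', 1)[0]
        return netloc[4:] if netloc.startswith('www.') else netloc

    print(to_domain('https://www.medium.com/some-article'))  # medium.com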
Given the domain name, we will create some email addresses. Most domains come with the info@domain.com email address, so this is our starting point. After taking a closer look, I found out that the most common email addresses of radio stations are:
info@domain.com
contact@domain.com
live@domain.com
Let’s start scraping. The e-radio web page is organized per region: there is a separate page that lists the radio stations of each region, and every station entry has an icon which, when pressed, redirects you to the station’s profile page.
If we click through to the profile page, we will see the radio station’s website waving at us.
Now we have a plan:
- For every region get the region page
- For every station in the region, find the profile URL
- In every profile page, parse the website URL.
As stated before, we will use Scrapy to scrape this information. We will create a spider that navigates to the region pages and downloads the data to our machine. Then, we will use BeautifulSoup to parse out the data we need.
import csv

import scrapy
from bs4 import BeautifulSoup


class DomainNameSpider(scrapy.Spider):
    name = "domains"

    def start_requests(self):
        # One listing page per region of Greece.
        urls = [
            "https://www.e-radio.gr/location/athens",
            "https://www.e-radio.gr/location/thessaloniki",
            "https://www.e-radio.gr/location/crete",
            "https://www.e-radio.gr/location/aegean",
            "https://www.e-radio.gr/location/epirus",
            "https://www.e-radio.gr/location/ionian",
            "https://www.e-radio.gr/location/peloponnesus",
            "https://www.e-radio.gr/location/sterea",
            "https://www.e-radio.gr/location/thessaly",
            "https://www.e-radio.gr/location/thrace",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_next_urls)

    def parse_next_urls(self, response):
        # The last part of the URL is the region name, e.g. 'athens'.
        town = response.url.split('/')[-1]
        soup = BeautifulSoup(response.text, 'html.parser')
        radio_section = soup.find("div", {"id": "content"})
        radios = radio_section.find_all('div', {"class": "stationEntry"})
        with open(f'radios/{town}.csv', 'a') as f:
            radio_writer = csv.writer(f, delimiter=',')
            for radio in radios:
                # The title link holds both the station name and its profile path.
                title_link = radio.find("a", {"class": "sMeta_Title"})
                name = title_link.text.strip()
                profile_path = title_link.get('href')
                radio_writer.writerow([name, profile_path])
As you can see, we create a class and define the start_requests function, in which we list the URLs the spider will scrape, and then we loop over them. Every page is requested and handled by the parse_next_urls function, which is set as the callback argument. In the parsing function we use BeautifulSoup to parse the page downloaded by Scrapy. (Remember that BeautifulSoup is only a parser; it cannot download pages by itself.) For every station, we save the name and the profile page URL in a CSV file.
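The snippet above only defines the spider. One way to run it, assuming the class is saved in a plain Python file (and that a radios/ directory already exists for the CSV output), is Scrapy’s CrawlerProcess; inside a full Scrapy project the scrapy crawl command would do the same job. The settings shown here are just sensible assumptions, not something e-radio requires:

    from scrapy.crawler import CrawlerProcess

    # Assumes DomainNameSpider from the snippet above is defined in the same file.
    process = CrawlerProcess(settings={
        'USER_AGENT': 'Mozilla/5.0',  # assumption: a browser-like User-Agent
        'DOWNLOAD_DELAY': 1,          # be polite: at most one request per second
    })
    process.crawl(DomainNameSpider)
    process.start()  # blocks until the crawl is finished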
Having done that, we construct the URL for every profile page, download it with the requests module and parse it with BeautifulSoup to get the domain name:
import csv
import os

import requests
from bs4 import BeautifulSoup

basedir = os.getcwd() + '/radio_scraping/radios/'
# A browser-like User-Agent, in case the site filters the default requests client.
headers = {'User-Agent': 'Mozilla/5.0'}

for file in os.listdir('radio_scraping/radios'):
    new_rows = []
    with open(basedir + file) as f:
        print(f'Handling file {file}')
        csv_reader = csv.reader(f, delimiter=',')
        for row in csv_reader:
            url = 'https://www.e-radio.gr' + row[1]
            response = requests.get(url, headers=headers)
            soup = BeautifulSoup(response.text, 'html.parser')
            profile_info = soup.find("div", {"class": "profileInfo"})
            try:
                # The first link in the profile box is the station's website.
                website = profile_info.find_all("a")[0].text
                # Drop the scheme and the leading 'www.' to keep only the domain name.
                for prefix in ('http://', 'https://', 'www.'):
                    if website.startswith(prefix):
                        website = website[len(prefix):]
                domain_name = website.split('/', 1)[0]
                print(domain_name)
            except Exception:
                # Some profiles have no website listed; leave the domain empty.
                domain_name = ''
            new_rows.append([row[0], row[1], domain_name])
    with open(basedir + file, mode='w') as f:
        csv_writer = csv.writer(f, delimiter=',')
        csv_writer.writerows(new_rows)
Now, the saved CSV files also contain each radio station’s domain name.
Next, let’s generate the “guessed” email addresses by prepending the strings info@, live@ and contact@:
import csv
import os

if __name__ == '__main__':
    basedir = os.getcwd() + '/radio_scraping/radios/'
    email_addresses = []
    for file in os.listdir('radio_scraping/radios'):
        with open(basedir + file) as f:
            csv_reader = csv.reader(f, delimiter=',')
            for row in csv_reader:
                # Skip stations for which no domain name was found.
                if not row[2]:
                    continue
                email_addresses.append('info@' + row[2])
                email_addresses.append('live@' + row[2])
                email_addresses.append('contact@' + row[2])

    # Deduplicate and write one address per line.
    with open('email_addresses.txt', 'w') as f:
        for item in set(email_addresses):
            f.write(item + '\n')
After generating them, we save them in a .txt file.
Finally, we need to send the emails. To send them, we are using SendGrid, which offers a free plan (with a sending limit, though). Write a nice email asking for your swag and you are ready to go! SendGrid offers a Python SDK which is really easy to use.
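Here is a minimal sketch of the sending loop, based on the SendGrid Python SDK’s quickstart. The sender address, subject and body are placeholders you will want to replace, and the API key is assumed to live in the SENDGRID_API_KEY environment variable:

    import os

    from sendgrid import SendGridAPIClient
    from sendgrid.helpers.mail import Mail

    # Read the addresses generated earlier, one per line.
    with open('email_addresses.txt') as f:
        recipients = [line.strip() for line in f if line.strip()]

    sg = SendGridAPIClient(os.environ['SENDGRID_API_KEY'])

    for recipient in recipients:
        message = Mail(
            from_email='you@yourdomain.com',            # placeholder sender
            to_emails=recipient,
            subject='A small request from a listener',  # placeholder subject
            plain_text_content='Your swag-asking essay goes here.',
        )
        try:
            response = sg.send(message)
            print(recipient, response.status_code)
        except Exception as e:
            print(f'Failed to send to {recipient}: {e}')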
Conclusion
And we are done! It takes some effort to find and parse the email addresses, but the rest can be done in no time. Of course, there are steps you can take to make this script better:
- SendGrid sending limits: SendGrid’s free plan enforces a daily limit on outbound emails, so if you have lots of addresses you need to spread the sending over several days. As we said in Fun with Python #3, we can also put this script on a server and run it daily (a sketch of such batching appears at the end of this section).
- Addresses: Think before you hit send. Make sure you do not include personal information in the email; your address and/or telephone number can be exploited.
- Has anything arrived? Well yes! I have gotten some responses as well. Some of them have even invited me to visit them at their studio! At the time this article is written, I have already received a package.
I will wait a while and then, I will update the article with all the stuff I receive.
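As for the sending limit mentioned in the list above, one simple way to spread the emails over several days is to keep a small cursor file and send only the next batch on each daily run. This is just a sketch; the limit value is an assumption, so check your own plan’s numbers:

    import os

    DAILY_LIMIT = 100  # assumption: adjust to your SendGrid plan's real daily limit
    CURSOR_FILE = 'sent_cursor.txt'

    with open('email_addresses.txt') as f:
        addresses = [line.strip() for line in f if line.strip()]

    # How many addresses have previous runs already handled?
    sent_so_far = 0
    if os.path.exists(CURSOR_FILE):
        with open(CURSOR_FILE) as f:
            sent_so_far = int(f.read().strip() or 0)

    batch = addresses[sent_so_far:sent_so_far + DAILY_LIMIT]
    for address in batch:
        pass  # send to `address` with the SendGrid snippet shown earlier

    # Remember where we stopped for tomorrow's run.
    with open(CURSOR_FILE, 'w') as f:
        f.write(str(sent_so_far + len(batch)))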
This concludes this article. I hope you enjoyed reading it and try it yourself. Let me know your thoughts and your ideas about it! In the meanwhile, you can find the rest of the “Fun with Python” series here.