reading-notes

Class 17: Web Scraping

Web Scrape with Python in 4 minutes

Web scraping is a technique to automatically access and extract large amounts of information from a website, which can save a huge amount of time and effort.

To begin, you need access to the website that publishes the data you want. The article uses the MTA's turnstile data.

Example from the article:

http://web.mta.info/developers/turnstile.html

Websites compile and publish data at different intervals. The example in the article is compiled every week.

Important notes about web scraping:

Read through the website’s Terms and Conditions to understand how you can legally use the data. Most sites prohibit you from using the data for commercial purposes.


Do not download data too rapidly, because this may overload the website. You may also be blocked from the site.

Inspecting the Website

Python Code

Libraries you will want to import:

import requests
import urllib.request
import time
from bs4 import BeautifulSoup

Set the URL to the website and access the site with your requests library.

url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)

If the access was successful, you should see the following output:

Input: response
Output: <Response [200]>

A response code of 200 means that the request succeeded.
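As a small illustration of what the status code carries, the sketch below uses Python's standard `http.HTTPStatus` enum to turn a numeric code into its standard phrase and to check whether it falls in the 2xx success range. The helper names `is_success` and `describe_status` are made up for this example; with a real `requests` response you would pass in `response.status_code`.

```python
from http import HTTPStatus


def is_success(status_code: int) -> bool:
    """Return True for any 2xx status code."""
    return 200 <= status_code < 300


def describe_status(status_code: int) -> str:
    """Return a label such as '200 OK' for a known status code."""
    return f"{status_code} {HTTPStatus(status_code).phrase}"


print(describe_status(200), is_success(200))  # 200 OK True
print(describe_status(404), is_success(404))  # 404 Not Found False
```

With `requests`, calling `response.raise_for_status()` is another way to stop early, since it raises an exception for 4xx and 5xx responses.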

Next, parse the html with BeautifulSoup so that you can work with a nicer, nested BeautifulSoup data structure.

soup = BeautifulSoup(response.text, "html.parser")

Use the method .findAll to locate all of the <a> tags.

soup.findAll('a')
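To see what this call returns without touching the network, here is a minimal sketch that parses a tiny inline page. The HTML string and its file names are invented for the example and are not the real MTA markup. `find_all` is the current spelling of `findAll`; both return a list of tags.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; not the real MTA markup.
html = """
<html><body>
  <a href="index.html">Home</a>
  <a href="data/example_1.txt">Week 1</a>
  <a href="data/example_2.txt">Week 2</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
a_tags = soup.find_all("a")  # findAll is an older alias for the same method

print(len(a_tags))            # 3
print(a_tags[1]["href"])      # data/example_1.txt
print(a_tags[1].get_text())   # Week 1
```

Because the result behaves like a list, a single tag can be picked out by index, and its attributes can be read like dictionary keys.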

Next, let’s extract the actual link that we want. Let’s test out the first link.

one_a_tag = soup.findAll('a')[38]
link = one_a_tag['href']

In the article, the data they were after started at position 38 in the list of <a> tags, hence [38].

The code above saves only part of the path, not the full URL.

Discrepancies like this can be discovered by hovering over the link so that the browser displays the full address, or by clicking the link and looking at the URL bar of that link's landing page.

Provide urllib.request.urlretrieve with two parameters: the file URL and the filename to save it under.

download_url = 'http://web.mta.info/developers/'+ link
urllib.request.urlretrieve(download_url,'./'+link[link.find('/turnstile_')+1:])
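The two string operations above can be tried on their own, without downloading anything. The sketch below assumes a made-up relative href (`data/turnstile_000000.txt`) and uses the standard library's `urljoin` as an alternative to string concatenation; it also runs the same slice expression used for the filename.

```python
from urllib.parse import urljoin

# Hypothetical relative href; the real links on the page may look different.
link = "data/turnstile_000000.txt"
page_url = "http://web.mta.info/developers/turnstile.html"

# urljoin resolves the relative href against the page it was found on.
download_url = urljoin(page_url, link)

# Same slice as in the notes: keep everything after the slash
# that comes right before "turnstile_".
filename = link[link.find("/turnstile_") + 1:]

print(download_url)  # http://web.mta.info/developers/data/turnstile_000000.txt
print(filename)      # turnstile_000000.txt
```

For this sample href, `urljoin` produces the same address as `'http://web.mta.info/developers/' + link`, but it does not depend on the base string ending with a slash.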

Include the line below to pause the code for a second between downloads, so that you are not flooding the website with requests. This helps avoid getting flagged as a spammer.

time.sleep(1)
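One way to keep the pause from being forgotten is to build it into the download loop itself. The sketch below is only an illustration: `download_politely` is a hypothetical helper, and the `fetch` and `sleep` arguments are injected so that the loop can be tried with stand-in functions instead of real downloads.

```python
import time
from typing import Callable, Iterable, List


def download_politely(
    urls: Iterable[str],
    fetch: Callable[[str], None],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call fetch(url) for each URL, pausing after every request."""
    done = []
    for url in urls:
        fetch(url)
        sleep(delay_seconds)  # same idea as time.sleep(1) in the notes
        done.append(url)
    return done


# Demo with a stand-in fetch function, so nothing is actually downloaded.
fetched = []
result = download_politely(
    ["http://example.com/a.txt", "http://example.com/b.txt"],
    fetch=fetched.append,
    delay_seconds=0.01,
)
print(result)
```

In real use, `fetch` could be a small function that wraps `urllib.request.urlretrieve`.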

The code below contains the entire set of code for web scraping the NY MTA turnstile data. (Written by Julia Kho)

# Import libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

# Set the URL you want to webscrape from
url = 'http://web.mta.info/developers/turnstile.html'

# Connect to the URL
response = requests.get(url)

# Parse HTML and save to BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")

# To download the whole data set, let's do a for loop through all a tags
line_count = 1 #variable to track what line you are on
for one_a_tag in soup.findAll('a'):  #'a' tags are for links
    if line_count >= 36: #code for text files starts at line 36
        link = one_a_tag['href']
        download_url = 'http://web.mta.info/developers/'+ link
        urllib.request.urlretrieve(download_url,'./'+link[link.find('/turnstile_')+1:]) 
        time.sleep(1) #pause the code for a sec
    #add 1 for next line
    line_count +=1
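The script above picks the data links by position, which stops working if the page adds or removes links. A possible alternative, sketched here under the assumption that data files are named `turnstile_` plus digits plus `.txt`, is to select links by pattern. The function name `select_data_links` and the sample hrefs are hypothetical.

```python
import re

# Assumed file-name pattern: "turnstile_" followed by digits and ".txt".
DATA_LINK = re.compile(r"turnstile_\d+\.txt$")


def select_data_links(hrefs):
    """Keep only hrefs that look like turnstile data files."""
    return [h for h in hrefs if h and DATA_LINK.search(h)]


# Hypothetical hrefs, as might come from
# [a.get('href') for a in soup.findAll('a')]
sample = [
    "index.html",
    None,  # an <a> tag without an href
    "data/turnstile_000001.txt",
    "about.html",
    "data/turnstile_000002.txt",
]
print(select_data_links(sample))
# ['data/turnstile_000001.txt', 'data/turnstile_000002.txt']
```

Using `a.get('href')` rather than `a['href']` returns `None` for tags that have no `href`, which the filter then skips.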

GitHub for the code written/described in this article

What is Web Scraping?

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.

While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler.

Scraping a web page involves fetching it and then extracting data from it.

Once fetched, the content of a page may be parsed, searched, reformatted, or copied into a spreadsheet, among many other manipulations, to achieve the desired end result.
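As one example of the "copied into a spreadsheet" step, the sketch below writes a few extracted values to CSV with Python's standard `csv` module. The rows are invented placeholders standing in for values pulled out of a parsed page, and an in-memory buffer is used so that no file is created.

```python
import csv
import io

# Hypothetical rows, standing in for values pulled out of a parsed page.
rows = [
    {"title": "Week 1", "href": "data/example_1.txt"},
    {"title": "Week 2", "href": "data/example_2.txt"},
]

# In-memory buffer; swap for open("links.csv", "w", newline="") to write a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "href"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The resulting text has a header row followed by one line per extracted record, and it opens directly in spreadsheet software.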

Web scrapers typically take something out of a page, to make use of it for another purpose somewhere else.

Many uses of Web scraping:

Track Amazon Prices

Beautiful Soup