Web Scraping with Python: Extracting URLs from Websites


There are about 1.7 billion websites on the internet. Almost every company is active on social media, and knowing a company's social handles is an important part of connecting with it. Visiting websites one by one to hunt for those handles is tedious when there are so many sites out there. That is where this code comes in: follow the steps below to find and scrape the social media links of the websites you are interested in. We will extract LinkedIn links in this blog post, but you can scrape any social handle with the same steps (and I will point out where to slightly change the code to do that).

Let's get started!

We will be using Python for this. I am using VS Code for editing and running the code, but you can use any IDE of your choice. If you wish to use VS Code too, you can download it from here.

Here is a detailed explanation of the code:

The code starts by importing several modules and libraries that it will use later on:

import requests                          # imported here, though not used directly in the snippets below
from bs4 import BeautifulSoup            # HTML parsing
import csv                               # writing the output CSV files
from urllib.request import Request, urlopen
import ssl
from pandas import read_csv              # reading the input CSV
import re                                # matching social media URL patterns

  1. Next, the code prompts the user to input the name of a CSV file containing the list of company URLs:

    urlfilename = input("Please enter your CSV file name with the extension (.csv): ")
    
  2. This file is then read into a pandas DataFrame, and the URLs in the 'Company Website' column are extracted into a list called CompUrls:

    data = read_csv(urlfilename)
    CompUrls = data['Company Website'].tolist()
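
    For reference, the script assumes the input CSV has a 'Company Website' column with one URL per row. A small, purely hypothetical way to generate such a file (the file name and URLs are made up for illustration):

    # Hypothetical example: build a small input CSV with the expected column
    from pandas import DataFrame

    sample = DataFrame({'Company Website': [
        'https://asana.com',
        '//www.python.org',   # protocol-relative URLs are handled by the loop below
    ]})
    sample.to_csv('companies.csv', index=False)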
    
  3. The code then initializes lists called foundurl and comUrl, each seeded with a header label as its first element; they will store the LinkedIn URLs and the company URLs, respectively. Another list called notFoundUrls is initialized in the same way to store the URLs for which no LinkedIn URL is found:

    foundurl = ['LINKEDIN URLS']
    comUrl = ['COMPANY URLS']
    notFoundUrls = ['NOT FOUND LINKEDIN URLS']
    
  4. The code then loops through each URL in CompUrls. For each URL, it attempts to make an HTTP request and parse the resulting HTML using BeautifulSoup. If the request fails, a message is printed to the console, the URL is added to the comUrl list, and "server error" is recorded in the foundurl list.

    product = "asana"
    for reqUrl in CompUrls:
        try:
            newUrl = ''
            if reqUrl[:2] == '//':
                newUrl = 'https://' + reqUrl
            else:
                newUrl = reqUrl
    
            reqs = Request(newUrl, headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'})
            webpage = urlopen(reqs, context=ssl.SSLContext()).read()
            soup = BeautifulSoup(webpage, 'html.parser')
    
            urls = []
            allurls = []
        except Exception as e:
            print(reqUrl + " | server error ")
            comUrl.append(reqUrl)
            foundurl.append("server error")
            continue
    
  5. If the request is successful, the code searches the HTML for any links that match the pattern of a LinkedIn URL. If a match is found, the LinkedIn URL is added to the foundurl list, and the original URL is added to the comUrl list. If no match is found, the original URL is added to the comUrl list and a message is printed to the console:

  6. How does this work?

    The code first initializes an empty list called allurls and uses the find_all method of the BeautifulSoup object to find all <a> tags on the page. It then loops through these tags and appends the href attribute of each tag to the allurls list. This creates a list of all URLs on the page.

    Next, the code loops through the URLs in the allurls list and looks for any that match the pattern of a LinkedIn URL. To do this, it uses a regular expression that matches URLs containing "linkedin.com". If a match is found, the code prints the LinkedIn URL to the console, appends it to the foundurl list, appends the original company URL to the comUrl list, and stops searching that page.

    If no link on the page matches, the flag variable is never set to "yes". In that case, once the inner loop finishes, the code appends "not found" to the foundurl list, appends the original company URL to both the comUrl and notFoundUrls lists, and prints a message to the console indicating that no LinkedIn URL was found for that company website.

        # Reset the flag for this company, then collect every href on the page
        flag = ''
        for link in soup.find_all('a'):
            allurls.append(link.get('href'))

        # Look for the first href that matches the LinkedIn pattern
        for url in allurls:
            try:
                if re.match(r"(\s)*([\w-]+\W+)*linkedin\.com(\/).*", url):
                    print(reqUrl + " | " + url)
                    foundurl.append(url)
                    comUrl.append(reqUrl)
                    flag = 'yes'
                    break
            except:
                # Some hrefs are None; skip anything the regex cannot handle
                continue
        if flag != 'yes':
            foundurl.append("not found")
            comUrl.append(reqUrl)
            notFoundUrls.append(reqUrl)
            print(reqUrl + " | not found")
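
    To get a feel for what this pattern accepts, here is a quick standalone check on a couple of made-up hrefs (the URLs below are just illustrations):

    import re

    pattern = r"(\s)*([\w-]+\W+)*linkedin\.com(\/).*"
    print(bool(re.match(pattern, "https://www.linkedin.com/company/asana/")))  # True
    print(bool(re.match(pattern, "https://twitter.com/asana")))                # False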
    
  7. The following code is used to write the results of the web scraping to a CSV file. It first initializes two variables, filename and notFoundfile, which will be used as the names of the output CSV files.

    The code then opens the filename file for writing, creates a csv.DictWriter object, and writes the headers for the file. This creates an empty CSV file with the appropriate column names.

    Next, the code opens the filename file again, this time in append mode with the csv.writer class, to write the rows of data. It zips together the comUrl and foundurl lists to create pairs, where each pair contains the company URL and the LinkedIn URL found for that company. The code then writes these pairs to the CSV file using the writerows method.

    This creates a CSV file that contains the company URLs and the LinkedIn URLs found for those companies.

    filename = 'foundLinkedin.csv'        # assumed output file name
    notFoundfile = 'notfoundLinkedin.csv'

    # Write the header row
    with open(filename, 'w', newline='') as f:
        dw = csv.DictWriter(f, delimiter=',', fieldnames=['Company Url', 'LinkedIn Url'])
        dw.writeheader()

    # Reopen in append mode so the header is not overwritten, then write the rows
    with open(filename, 'a', newline='') as f:
        w = csv.writer(f)
        w.writerows(zip(comUrl, foundurl))
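
    The code declares notFoundfile above but does not show writing it out; a minimal sketch of how the notFoundUrls list could be saved (one URL per row) is:

    # Sketch: save the company URLs for which no LinkedIn link was found
    with open(notFoundfile, 'w', newline='') as f:
        w = csv.writer(f)
        w.writerows([[u] for u in notFoundUrls])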
    

    If you want to scrape other social media handles, just write a regular expression pattern for the respective social network and swap it into the code (a small sketch of that swap follows this list). A few regex patterns I have made are:

    YouTube - (\s)*([\w-]+\W+)*youtube\.com\/channel(\/).*

    Twitter - (\s)*([\w-]+\W+)*twitter\.com(\/).*

    Pinterest - (\s)*([\w-]+\W+)*pinterest\.com(\/).*

    Facebook - (\s)*([\w-]+\W+)*facebook\.com(\/).*

    Instagram - (\s)*([\w-]+\W+)*instagram\.com(\/).*
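
    For example, here is a minimal sketch of that swap, assuming you keep the pattern in a single variable (the helper name below is just for illustration):

    import re

    # Change this one line to target a different social network,
    # e.g. the Twitter pattern from the list above.
    SOCIAL_PATTERN = r"(\s)*([\w-]+\W+)*twitter\.com(\/).*"

    def is_social_link(url):
        # Same check the main loop performs with the LinkedIn pattern
        try:
            return bool(re.match(SOCIAL_PATTERN, url))
        except TypeError:   # href was None
            return False

    print(is_social_link("https://twitter.com/asana"))                # True
    print(is_social_link("https://www.linkedin.com/company/asana/"))  # False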

    Thank you for reading this blog on web scraping with Python. I hope you found the information helpful and that you now have a better understanding of how to extract data from websites using Python. If you have any questions or feedback, please feel free to leave a comment below. Thank you again for reading!