AI in Digital Marketing

"Things are complex till they are mastered." A popular saying from "Gyani Smita" :) According to SlashData, there are around 8.2 Million Python Developers in the world still low penetration in automation of Digital Marketing

Web Scraping in Python || Tool for Digital Marketing - AIMarketer

The fuel of digital marketing, as we all know, is data, and the internet is full of it. It is believed that over 2.5 quintillion bytes (roughly 2.5 billion GB) of data are created every day (source: takeo.ai). With this much data generated daily, it is genuinely difficult to find exactly what you need.

There are multiple ways to collect publicly available data, and web scraping is one of them. In this post I will walk you through, step by step, how to extract the information you need from a website.

 

Use Case: Extract the email IDs of company directors.

Assumptions:

  1. Company CIN - used as input to our script for finding the email.

  2. Chrome driver installed - the script is based on the Chrome driver.

  3. Necessary packages installed (see the install command after this list).

  4. XPaths of the elements used are extracted by inspecting the elements or by using an available XPath extraction tool (ChroPath is what I prefer).
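
If pandas and selenium are not already installed, they can usually be added with pip (assuming a working Python environment; the Chrome driver binary itself is covered by assumption 2):

pip install pandas selenium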

Step 1: Import the necessary packages.

 
import pandas as pd                                       # reading the input CSV and writing the output
from selenium import webdriver                            # browser automation
from selenium.webdriver.support.ui import WebDriverWait   # waiting for elements (see the note after Step 5)
import time                                               # simple sleeps between page loads
 

Step 2: Read the data to pass as input to the script.

 
path = './Data_file.csv'
April_cin_dir = {"CIN": [], "Director": []}   # results are collected here
df = pd.read_csv(path)
cin_list = df.iloc[:, 0].tolist()             # assumes the CINs are in the first column of the CSV
url = 'https://www.zaubacorp.com/'            # change to your desired website
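
For reference, Data_file.csv is assumed to be a plain CSV with the CINs in its first column, something like the sample below (these CINs are purely hypothetical placeholders):

CIN
U12345MH2015PTC000001
U67890DL2016PTC000002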

Step 3: Initialize the Chrome driver to open the webpage.

 
driver = webdriver.Chrome()   # assumes the chromedriver executable is on your PATH
driver.maximize_window()
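
If you would rather not have a visible browser window pop up while the script runs, Chrome can also be started headless. A minimal sketch, assuming a reasonably recent Chrome and Selenium:

options = webdriver.ChromeOptions()
options.add_argument("--headless")                # run Chrome without a visible window
options.add_argument("--window-size=1920,1080")   # fixed size instead of maximize_window()
driver = webdriver.Chrome(options=options)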


Step 4: Define a function for opening the webpage.

 
def open_app(url):
    driver.get(url)   # navigate the browser to the given URL

Step 5: Define a function for extracting the email ID.

 
 
def get_director(cin_list):
    counter = 0
    driver.implicitly_wait(10)   # wait up to 10 seconds for elements to appear before giving up
    try:
        for i in cin_list:
            counter += 1
            search_box = driver.find_element_by_xpath("//input[@id='search-com']")   # find the search box on the page
            search_box.clear()                                                       # remove any previous search text
            search_box.send_keys(i)                                                  # enter the CIN as the search criteria
            driver.find_element_by_xpath("//button[@id='edit-submit--3']").click()   # find the submit button and click it
            time.sleep(5)   # this becomes a savior in most cases, as webpages take time to load after the click
            April_cin_dir["CIN"].append(i)
            April_cin_dir["Director"].append(driver.find_element_by_xpath("/html/body/div[1]/div[1]/div/div/div[2]/div[10]/table/tbody/tr/td[2]/strong/a").text)   # walk down to the email field and extract its text
    except Exception:   # handle the case where the email is not present
        April_cin_dir["Director"].append("Not Present")
        get_director(cin_list[counter:])   # continue with the remaining CINs
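
A note on the waits: time.sleep(5) works, but it always pauses for the full five seconds. An explicit wait pauses only until the element actually appears. A minimal sketch of how the lookup could be rewritten with WebDriverWait, reusing the same XPath and 10-second timeout assumed above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

director_xpath = "/html/body/div[1]/div[1]/div/div/div[2]/div[10]/table/tbody/tr/td[2]/strong/a"
director_link = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, director_xpath))
)   # waits up to 10 seconds, but returns as soon as the element is found
April_cin_dir["Director"].append(director_link.text)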
        
 

Step 6: Call the custom functions to run the email extraction script and store the extracted data in a CSV.

 
open_app(url)
get_director(cin_list)
driver.quit()   # shut the browser down once scraping is done

df_dir = pd.DataFrame.from_dict(April_cin_dir)
df_dir.to_csv("Nov_cin_director_opc1.csv", index=False)   # write the results without the pandas index column

I have assumed that finding XPaths is already known to you. In case you want me to write about that, please comment and I will share insights on it.

 

Thanks for reading through.

 