AI in Digital Marketing

"Things are complex till they are mastered." A popular saying from "Gyani Smita" :) According to SlashData, there are around 8.2 Million Python Developers in the world still low penetration in automation of Digital Marketing

Web Scraping in Python || Tool for Digital Marketing - AIMarketer

The fuel of digital marketing, as we all know, is data, and the internet is full of it. It is believed that over 2.5 quintillion bytes (roughly 2.5 billion GB) of data are created every day (source: takeo.ai). With this much data generated daily, it is genuinely difficult to find exactly what you need.

There are multiple ways to collect publicly available data, and web scraping is one of them. In this post I will walk you through, step by step, how to extract the information you need from a website.

 

Use Case: Extract the email IDs of company directors.

Assumptions:

  1. Company CIN - used as input to our script for finding the email.

  2. Chrome driver installed - the script is based on the Chrome driver.

  3. Necessary packages installed (see the install command after this list).

  4. XPaths of the elements used are extracted by inspecting the elements or by using an available XPath extraction tool (ChroPath is what I prefer).
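
If pandas and selenium are not already installed, they can usually be added with pip (assuming a working Python environment; the Chrome driver binary itself is covered by assumption 2):

pip install pandas selenium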

Step 1: Import the necessary packages.

 
import pandas as pd                                       # reading the input CSV and writing the output
from selenium import webdriver                            # browser automation
from selenium.webdriver.support.ui import WebDriverWait   # waiting for elements (see the note after Step 5)
import time                                               # simple sleeps between page loads
 

Step 2: Read the data to pass as input to the script.

 
path = './Data_file.csv'
April_cin_dir = {"CIN": [], "Director": []}   # results are collected here
df = pd.read_csv(path)
cin_list = df.iloc[:, 0].tolist()             # assumes the CINs are in the first column of the CSV
url = 'https://www.zaubacorp.com/'            # change to your desired website
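
For reference, Data_file.csv is assumed to be a plain CSV with the CINs in its first column, something like the sample below (these CINs are purely hypothetical placeholders):

CIN
U12345MH2015PTC000001
U67890DL2016PTC000002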

Step 3: Initialize the Chrome driver to open the webpage.

 
driver = webdriver.Chrome()   # assumes the chromedriver executable is on your PATH
driver.maximize_window()
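
If you would rather not have a visible browser window pop up while the script runs, Chrome can also be started headless. A minimal sketch, assuming a reasonably recent Chrome and Selenium:

options = webdriver.ChromeOptions()
options.add_argument("--headless")                # run Chrome without a visible window
options.add_argument("--window-size=1920,1080")   # fixed size instead of maximize_window()
driver = webdriver.Chrome(options=options)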


Step 4: Define a function for opening the webpage.

 
def open_app(url):
    driver.get(url)   # navigate the browser to the given URL

Step 5: Define a function for extracting the email ID.

 
 
def get_director(cin_list):
    counter = 0
    driver.implicitly_wait(10)   # wait up to 10 seconds for elements to appear before giving up
    try:
        for i in cin_list:
            counter += 1
            search_box = driver.find_element_by_xpath("//input[@id='search-com']")   # find the search box on the page
            search_box.clear()                                                       # remove any previous search text
            search_box.send_keys(i)                                                  # enter the CIN as the search criteria
            driver.find_element_by_xpath("//button[@id='edit-submit--3']").click()   # find the submit button and click it
            time.sleep(5)   # this becomes a savior in most cases, as webpages take time to load after the click
            April_cin_dir["CIN"].append(i)
            April_cin_dir["Director"].append(driver.find_element_by_xpath("/html/body/div[1]/div[1]/div/div/div[2]/div[10]/table/tbody/tr/td[2]/strong/a").text)   # walk down to the email field and extract its text
    except Exception:   # handle the case where the email is not present
        April_cin_dir["Director"].append("Not Present")
        get_director(cin_list[counter:])   # continue with the remaining CINs
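
A note on the waits: time.sleep(5) works, but it always pauses for the full five seconds. An explicit wait pauses only until the element actually appears. A minimal sketch of how the lookup could be rewritten with WebDriverWait, reusing the same XPath and 10-second timeout assumed above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

director_xpath = "/html/body/div[1]/div[1]/div/div/div[2]/div[10]/table/tbody/tr/td[2]/strong/a"
director_link = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, director_xpath))
)   # waits up to 10 seconds, but returns as soon as the element is found
April_cin_dir["Director"].append(director_link.text)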
        
 

Step 6: Call the custom functions to run the email extraction script and store the extracted data in a CSV.

 
open_app(url)
get_director(cin_list)
driver.quit()   # shut the browser down once scraping is done

df_dir = pd.DataFrame.from_dict(April_cin_dir)
df_dir.to_csv("Nov_cin_director_opc1.csv", index=False)   # write the results without the pandas index column

I have assumed that finding XPaths is already known to you. In case you want me to write about that, please comment and I will share insights on it.

 

Thanks for reading through.

 