Scraping
This module scrapes publication data from Google Scholar.
It uses the selenium package to automate the browser
and stores the scraped data in the 'data/scraped.json' file. For each
author_id provided, it scrapes all of that author's publications listed
on Google Scholar. For each publication, the journal title and list of authors
are extracted.
get_publication_data(author_id)
Retrieves data from Google Scholar.
This function is the primary web-utility that opens a new selenium window to automate visiting Google Scholar and extracting data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| author_id | str | Google Scholar ID of the author information to extract. | required |
Returns:
| Type | Description |
|---|---|
| list[dict[str, str]] | A list of publication data, one dictionary per publication, with the keys `journal_title` and `authors`, both holding string values. |
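To make the return shape concrete, here is a minimal sketch of how the returned list might be consumed. The sample records are invented for illustration, and the helper names `split_authors` and `count_journals` are hypothetical. It also assumes the `authors` string is comma-separated, which the module does not state.

```python
from collections import Counter

# Hypothetical sample in the shape returned by get_publication_data.
publications = [
    {"journal_title": "Journal of Examples", "authors": "A Author, B Writer"},
    {"journal_title": "Example Letters", "authors": "A Author, C Scholar"},
    {"journal_title": "Journal of Examples", "authors": "B Writer"},
]


def split_authors(publication: dict[str, str]) -> list[str]:
    """Split the `authors` string into names (assumes comma separation)."""
    return [name.strip() for name in publication["authors"].split(",") if name.strip()]


def count_journals(pubs: list[dict[str, str]]) -> Counter:
    """Count how many publications appear under each journal title."""
    return Counter(pub["journal_title"] for pub in pubs)


author_lists = [split_authors(pub) for pub in publications]
journal_counts = count_journals(publications)
```

Because both values are plain strings, any further structure, such as a list of author names, has to be derived by the caller.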
Source code in scholar_network/scraping.py

    def get_publication_data(author_id: str) -> list[dict[str, str]]:
        """Retrieves data from Google Scholar.

        This function is the primary web-utility that opens a new selenium
        window to automate visiting Google Scholar and extracting data.

        Args:
            author_id (str): Google Scholar ID of the author information to extract.

        Returns:
            list[dict[str, str]]: A list of publication data, each dictionary containing
            keys for the `journal_title` and `authors` both having string values.
        """
        driver = webdriver.Safari()
        data = []
        profile_link = f"https://scholar.google.com/citations?user={author_id}&hl=en&oi=ao"
        driver.get(profile_link)
        pub_elements = driver.find_elements_by_css_selector("a.gsc_a_at")
        for pub in pub_elements:
            pub_info_link = pub.get_attribute("data-href")
            driver.get(f"https://scholar.google.com{pub_info_link}")
            elements = driver.find_elements_by_class_name("gsc_vcd_value")
            if len(elements) > 3:
                data.append(
                    {"journal_title": elements[2].text, "authors": elements[0].text}
                )
            driver.back()
        driver.close()
        return data
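The `find_elements_by_*` helpers used in the listing above were removed in later Selenium 4 releases in favor of `find_elements(by, value)`. The following is a sketch of how the same scraping loop might look with that call. It is not the module's implementation. The function name `get_publication_data_with` and the choice to pass the driver in as a parameter are assumptions made so the logic can be exercised without a browser. The sketch also reads every `data-href` value before navigating, on the assumption that element references from the profile page may go stale once the browser leaves it.

```python
BASE_URL = "https://scholar.google.com"

# Locator strategy strings; these are the values of selenium's
# By.CSS_SELECTOR and By.CLASS_NAME constants.
CSS_SELECTOR = "css selector"
CLASS_NAME = "class name"


def get_publication_data_with(driver, author_id: str) -> list[dict[str, str]]:
    """Hypothetical variant of get_publication_data that takes an open driver."""
    driver.get(f"{BASE_URL}/citations?user={author_id}&hl=en&oi=ao")

    # Read every link up front, so no element is used after its page is gone.
    links = [
        element.get_attribute("data-href")
        for element in driver.find_elements(CSS_SELECTOR, "a.gsc_a_at")
    ]

    data = []
    for link in links:
        if not link:
            continue  # skip entries without a detail link
        driver.get(f"{BASE_URL}{link}")
        values = driver.find_elements(CLASS_NAME, "gsc_vcd_value")
        # Same positional assumption as the original: authors first, journal third.
        if len(values) > 3:
            data.append({"journal_title": values[2].text, "authors": values[0].text})
    return data
```

With a real browser, the caller would create the driver (for example `webdriver.Safari()`, as in the original), pass it in, and call `driver.quit()` afterwards, ideally in a `finally` block so the window closes even if scraping fails.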
scrape_single_author(scholar_id, scholar_name='')
Scrapes data from Google Scholar and saves it to a JSON file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| scholar_id | str | Google Scholar ID of the author information to extract. | required |
| scholar_name | str | Name of the scholar. Defaults to ''. | '' |
Source code in scholar_network/scraping.py

    def scrape_single_author(scholar_id: str, scholar_name: str = ''):
        """Scrapes data from Google Scholar and saves it to a JSON file.

        Args:
            scholar_id (str): Google Scholar ID of the author information to extract.
            scholar_name (str, optional): Name of the scholar. Defaults to ''.
        """
        pub_data = get_publication_data(scholar_id)
        helpers.append_pub_data_to_json(pub_data)
        print(f"Wrote {scholar_name if scholar_name else scholar_id} to file.")
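The source of `helpers.append_pub_data_to_json` is not shown on this page. As a rough illustration of what an append-to-JSON step could look like, here is a hypothetical stand-in. It assumes the file holds a single JSON list of publication dictionaries. The `path` parameter and the returned record count are additions for illustration, with the default path taken from the module description.

```python
import json
from pathlib import Path


def append_pub_data_to_json(pub_data, path="data/scraped.json"):
    """Hypothetical stand-in: append records to a JSON list stored at `path`."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)

    # Load what is already stored; a missing or empty file counts as an empty list.
    existing = []
    if target.exists():
        text = target.read_text(encoding="utf-8")
        if text.strip():
            existing = json.loads(text)

    existing.extend(pub_data)
    target.write_text(json.dumps(existing, indent=2), encoding="utf-8")
    return len(existing)  # total number of records now in the file
```

Rewriting the whole file on each call keeps it valid JSON at all times, at the cost of re-reading it for every author. The real helper may make a different trade-off.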