Building a Scraping Bot to Fetch Data and Improve Search Capabilities

Phone Numbers Scraping Bot

One of our clients came to us with a request that included a problem, or as we like to call it – a challenge. The client wanted to display information from another website on his own website. He had received the necessary approvals from the website’s owners, but when he asked how to do it, he did not get a clear answer. As a first step, we checked whether that website has an API that allows reading its data, and the answer was no. Accordingly, we needed a solution that would let us read the information without an API.

Objectives

  • To display information from another website on the client’s website
  • To work around the limitation created by the absence of an API on the website the information is read from
  • To build a scraper with Python and Selenium
  • To save the information in the client’s website database in order to reduce Lambda runtime and cut costs – a kind of caching process
  • To improve response times by reading the data from the client’s database when a query has been searched for before

Process

As already mentioned, the client wanted to display information from another website on his website but faced a limitation: the website he wanted to pull the information from had no API that would allow reading it. To solve this challenge, we built a scraper using Python and Selenium and deployed it to AWS Lambda, where it is triggered by an API call. This allowed us to read the information and receive it as the response to a request from the client’s website, where the data is displayed.
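As a rough illustration only – not the project’s actual code – a Selenium scraper of this kind might look like the following sketch. The target URL, the CSS selector, and the function name are placeholders:

```python
from urllib.parse import quote_plus

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def run_scraper(query: str) -> list[dict]:
    # Run Chrome headless: a Lambda container has no display.
    options = Options()
    options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    driver = webdriver.Chrome(options=options)
    try:
        # Placeholder URL and selector; the real ones depend on the
        # site being scraped.
        driver.get("https://example.com/search?q=" + quote_plus(query))
        rows = driver.find_elements(By.CSS_SELECTOR, ".result-item")
        return [{"title": row.text} for row in rows]
    finally:
        driver.quit()
```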

Initially, development was done locally, both to speed up and simplify work on the scraper and to save costs on AWS. The purpose of the scraper is to display search results from another website on the client’s website, so it receives a search query dynamically, opens the website we want to read the data from, performs the search, reads the results, and returns them as a JSON response to the API request. Once the scraper was developed and tested, we deployed it to AWS Lambda and assigned it an endpoint so that we could trigger it and receive the JSON it returns.
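On the Lambda side, the entry point that wires the incoming API call to the scraper could be as simple as the sketch below. It assumes an API Gateway proxy integration and a query-string parameter named q – both assumptions for illustration, not details from the project:

```python
import json

from scraper import run_scraper  # the hypothetical Selenium function sketched above

def lambda_handler(event, context):
    # API Gateway's proxy integration passes query-string parameters
    # inside the event; we assume the query arrives under the key "q".
    query = (event.get("queryStringParameters") or {}).get("q", "")
    if not query:
        return {"statusCode": 400, "body": json.dumps({"error": "missing query"})}

    results = run_scraper(query)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(results),
    }
```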

Next came the stage of connecting it to the website. In the website’s development environment (because it is not recommended to work on a production environment), we wrote the code that takes the search query entered on the client’s website and calls the scraper we built to receive the information from it. When the scraper returns an answer, the website saves the returned data in its database and displays it to the user.
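In essence, the website-side call is an HTTP request to the Lambda endpoint. A minimal Python sketch follows – the endpoint URL is a placeholder, and the client’s site may well implement this step in another language:

```python
import requests

# Placeholder endpoint; the real API Gateway URL belongs to the project.
SCRAPER_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/search"

def fetch_from_scraper(query: str) -> list[dict]:
    # A Selenium-backed Lambda can take a while, so allow a generous timeout.
    response = requests.get(SCRAPER_URL, params={"q": query}, timeout=60)
    response.raise_for_status()
    return response.json()
```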

The information saved in the website’s database lets us know which search queries already have results stored and when each of them was last requested from the scraper. With this, we created a cache mechanism that saves requests to the scraper, which significantly reduces AWS costs and also improves response speed on the client’s website, since a request to the scraper takes time until the scraper itself returns a response.

Using this data, the cache automatically expires every 30 days. That is, when a user searches for a specific query, as long as 30 days have not passed since the scraper last fetched information for that query, the information is read from the client’s website database. If 30 days or more have passed, a new request is sent to the scraper and the website replaces the stored data for that search query with the fresh information, valid for another 30 days.
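The expiry check itself can be sketched as follows. Here db stands in for the site’s database layer – find_one and save are hypothetical methods, not a specific library’s API:

```python
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(days=30)

def get_results(query: str, db) -> list[dict]:
    cached = db.find_one(query=query)  # hypothetical lookup by search query
    now = datetime.now(timezone.utc)
    if cached and now - cached["fetched_at"] < CACHE_TTL:
        # Fresh enough: serve from the local database, no Lambda call.
        return cached["results"]

    # Stale or missing: refresh via the scraper and reset the 30-day clock.
    results = fetch_from_scraper(query)
    db.save(query=query, results=results, fetched_at=now)  # hypothetical upsert
    return results
```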

Results

After integrating the scraper into the client’s website, fetching the information without an API worked successfully and seamlessly, in a way that is not noticeable to users of the client’s website except for the loading time when a query is searched for the first time or its cache is outdated. As we said, for search queries that have already been searched on the website, there is no need to call the scraper. Using Python and Selenium, we managed to bypass the absence of an API and read the information from the website despite this limitation.

In addition, we created a cache system on the client’s website that improved the user experience during searches. Beyond that, the cache also significantly reduced the number of requests to the Lambda hosted on AWS, which significantly reduced its running costs as well.

Risks

Reading information from other websites using a scraper carries several risks that need to be weighed before starting such a project. One of the biggest is that some websites prohibit reading their information using scrapers and bots. Reading the information without explicit written approval from the website owners can expose us to a lawsuit, since we may be violating copyright laws.

Moreover, using scrapers to read information from websites that do not have an API may violate their terms of use. To minimize the risk as much as possible, it is important to get explicit written approval from the owners of the website we want to read from, before we even start working on the project.

Another risk to take into account is the possibility that the website we read from will block us. Many websites and services today can identify bots and scrapers very quickly – by their actions, their usage patterns, and challenge mechanisms such as CAPTCHA. To overcome this, it is important to limit the pace of the bot or scraper so that it mimics human behavior. In addition, embedding random behavior – such as random waiting times between clicks, changing the user agent, and more – reduces the chance that our bot or scraper will be identified as such, and sometimes prevents it entirely.
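Such measures can be sketched roughly as follows; the user-agent strings and the timing range are illustrative only, not values from the project:

```python
import random
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# A small illustrative pool; a real rotation list would be larger.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

def human_pause(low: float = 1.0, high: float = 4.0) -> None:
    # A random wait between actions avoids the fixed cadence of a bot.
    time.sleep(random.uniform(low, high))

def make_driver() -> webdriver.Chrome:
    # Pick a user agent at random so consecutive runs look less uniform.
    options = Options()
    options.add_argument(f"--user-agent={random.choice(USER_AGENTS)}")
    return webdriver.Chrome(options=options)
```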

In this project, we minimized the aforementioned risks by obtaining the necessary approvals from the owners of the website we read the information from. This allowed us to carry out the project without worrying about the legal side, and without fear that the website would block the scraper we built for the client. At the end of each stage, we ran tests to confirm that everything worked as it should, and only once it did, we moved on to the next stage. In addition, all the work was done in a development environment and was transferred to production only after it was fully complete. This let us work and develop without fear of creating errors on the live website, while the client could, in the meantime, continue to perform actions on his website – uploading new content, making updates, and so on – without concern.

Conclusion

By building the scraper with Python and Selenium, we overcame the absence of an API on the website whose information we wanted to display on our client’s website. We also successfully implemented a cache mechanism that both improves the user experience in terms of loading times and lowers the client’s AWS costs.

By obtaining the necessary approvals and working in a development environment, we minimized many risks and reduced almost to zero the production errors and the legal and technical problems that could have arisen had we not done these things.
