Automate Redirects Using Python + Google Colab

Oliver Wood

11th September 2022

Setting up redirects is one of the most painstaking SEO tasks. Yet it is also one of the most important, and it leaves a large margin for error. This is especially true if you are migrating a website with thousands of pages or products.

The SEO specialists at PWD ran into an issue where a client wanted to change the URL structure of their eCommerce website to use product names instead of SKU numbers. The client’s old website was not built on WordPress and contained over seven thousand products.

After speaking to Mark Williams-Cook from Candour in the UK, who had run into the same issue, we decided to build a tool to automate the process.

The solution was to build a Python script in Google Colab that could match up URLs using TF-IDF to compare the similarity of elements in a Screaming Frog or Sitebulb crawl. For example, you could match URLs based on the H1 headings, page titles, meta descriptions, etc.
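As a rough illustration of the approach (a minimal sketch, not the notebook's actual code), here is how TF-IDF plus cosine similarity can pair up pages by their H1 headings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical H1 headings taken from the origin and destination crawls.
origin_h1s = ["Acme Widget 3000 - Blue", "Acme Gadget Pro"]
destination_h1s = ["Acme Gadget Pro", "Acme Widget 3000 (Blue)"]

# Fit one vocabulary across both crawls so the vectors are comparable.
vectorizer = TfidfVectorizer().fit(origin_h1s + destination_h1s)
similarity = cosine_similarity(
    vectorizer.transform(origin_h1s),
    vectorizer.transform(destination_h1s),
)

# Pair each origin page with its most similar destination page.
for i, row in enumerate(similarity):
    best = row.argmax()
    print(f"{origin_h1s[i]} -> {destination_h1s[best]} (score {row[best]:.2f})")
```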

While the process still requires some manual work to ensure the URLs are matched correctly, it usually automates a large number of redirect matches.

How the Redirect Matching Tool Works

1. Crawl the Origin and Destination Websites

Use a tool such as Screaming Frog or Sitebulb to crawl both the origin and destination sites. Export the crawls as .xlsx files; both exports should contain the same columns and column headings.

Tip: Mac users can run two crawls at once by duplicating the Screaming Frog application in the Applications folder.

2. Run the First Cell to Import the Required Python Libraries

Once you have both crawls ready, click the Run button on the first cell. This cell imports the Python libraries the script needs in order to run.
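For reference, the first cell of a script like this would typically import something along these lines (the notebook's actual imports may differ):

```python
# Typical imports for a crawl-matching notebook running in Google Colab;
# the notebook's actual first cell may differ.
import pandas as pd                                    # read the .xlsx crawl exports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from google.colab import files                         # Colab's file-upload helper
```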

3. Load the Origin Crawl Data

Once the first cell has finished executing, hit the Run button on the second cell. This will prompt you to upload your origin crawl file (a .xlsx file). Once the origin crawl data is uploaded, choose the column you want to use to match the pages.
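Under the hood, a cell like this typically combines Colab's upload widget with pandas. A minimal sketch, assuming a standard Screaming Frog export where "H1-1" holds the first H1 (adjust the column name to your own crawl):

```python
import pandas as pd
from google.colab import files

# files.upload() opens Colab's file picker, saves the upload to the working
# directory, and returns a {filename: bytes} dict.
uploaded = files.upload()
origin_df = pd.read_excel(next(iter(uploaded)))  # read the first uploaded file

# Inspect the available columns, then pick the one to match on.
print(list(origin_df.columns))
match_column = "H1-1"  # assumption: Screaming Frog's column for the first H1
origin_texts = origin_df[match_column].fillna("").astype(str)
```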

Tip: If the standard crawl data doesn't give you an appropriate matching target, you can use the custom extraction feature in Screaming Frog to choose something better, such as a SKU number in a span tag. The element you want to match can be selected via XPath or a CSS selector.
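For instance, if each product page rendered its SKU in markup like <span class="sku">AB-1234</span> (the class name here is hypothetical), a custom extraction rule with the XPath //span[@class="sku"] would pull the SKU into its own column of the crawl export, ready to use as a match target.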

4. Upload the Destination Crawl Data

Run the fourth cell, upload the destination crawl data, and select the column you want to use as the match target, just as you did in the previous step.

5. Match URLs Using Crawl Data

Cell number five applies a TF-IDF matching algorithm to the data in the columns you selected. Once the matching is complete, the script also generates a precision-recall graph to give you a visual representation of the results.
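In outline, the matching step scores every origin page against every destination page and keeps the best-scoring destination for each origin URL. A sketch under the same assumptions as above (origin_texts and origin_df from the earlier sketch, plus destination_texts and destination_df prepared the same way from the destination crawl; "Address" is Screaming Frog's URL column):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Fit one vocabulary across both crawls, then score every origin page
# against every destination page.
vectorizer = TfidfVectorizer().fit(pd.concat([origin_texts, destination_texts]))
similarity = cosine_similarity(
    vectorizer.transform(origin_texts),
    vectorizer.transform(destination_texts),
)

# Keep the highest-scoring destination for each origin URL.
best = similarity.argmax(axis=1)
matches = pd.DataFrame({
    "origin_url": origin_df["Address"].values,
    "destination_url": destination_df["Address"].iloc[best].values,
    "score": similarity.max(axis=1),
})
matches.to_excel("redirect_matches.xlsx", index=False)  # export for review
```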

6. Manually Verify the Results

Once the matching is complete, you will need to run your eyes over the output and correct any mismatched URLs manually.
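One way to speed that review up (our suggestion, not a feature of the notebook) is to sort the matches by score and check the weakest ones first, continuing the sketch above:

```python
# Low similarity scores are the most likely mismatches, so review them first.
suspect = matches.sort_values("score").head(50)  # the cut-off of 50 is arbitrary
print(suspect)
```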

If the script doesn’t work for you, or you have some feature suggestions, let us know! We would love your feedback.

The Google Colab Script: https://colab.research.google.com/drive/1tweLXSxdiXDle_kNKTrJNZxlscglq_Xi?usp=sharing