Google Sheets are my new love. That wasn’t always the case – to be honest, I avoided spreadsheets as much as I could. But ever since I started my current job, I have learned to love working with them.
Lately, I have been playing around with Google Sheets to get a quick overview of live webpage data without using any other tools.
Using the IMPORTXML function to scrape live webpage data
Google Sheets offers a wide variety of functions. One of these is IMPORTXML.
This function imports data from any of various structured data types including XML, HTML, CSV, TSV, and RSS and ATOM XML feeds.
With the help of IMPORTXML and XPath queries, you can basically scrape information from webpages and list it in your spreadsheet.
The data you would like to extract from the page does not have to be visible on the page itself – it’s okay if it’s just in the HTML.
How to actually use IMPORTXML to scrape website data
In the following example, we will scrape the HTML <title> from a webpage. All we need to do is know some XPath basics (no worries, I will show you how to do it).
- Open up a new Google Spreadsheet
- Enter a URL into cell A1
- Go to cell B1 and enter the following code
- Press “enter”
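The formula for step 3 looks like this (assuming your URL sits in cell A1, as set up in step 2):

=IMPORTXML(A1, "//title")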
Now you will see a “Loading…” message in cell B1 – and once it’s done loading, you will see the <title> of the page.
See, that was easy :) Now let’s continue and find some more stuff!
Some XPath queries to help you find page information
I have collected some XPath queries that you can use to collect useful information. Simply add these queries to the IMPORTXML function as you did with “//title” in the example above.
Get the meta description
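A query like this should do it (assuming, as above, that the URL is in cell A1):

=IMPORTXML(A1, "//meta[@name='description']/@content")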
Get the heading 1 (h1)
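Same setup, just a different query (URL in cell A1):

=IMPORTXML(A1, "//h1")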
Get the canonical tag
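For the canonical tag you query the href attribute of the link element (again assuming the URL is in A1):

=IMPORTXML(A1, "//link[@rel='canonical']/@href")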
Getting other HTML elements from a webpage
Now let’s say you want to scrape some Wikipedia articles to see when they were last edited.
To get the XPath query for the HTML element you’re looking for, simply open one of the pages in Google Chrome and right click on the element you want to extract. Then, click “Inspect” to open the developer tools.
The element you clicked on should be highlighted in the “Elements” tab in the developer tools – just like you can see in this image:
Now, right click on the highlighted line, and in the context menu select “Copy” -> “Copy XPath”.
Now you have the XPath query in your clipboard – and you can simply paste it into your IMPORTXML function in cell B2 (like in the screenshot below). You might need to replace the double quotes in your copied XPath query with single quotes (around the id selector), since the query itself already sits inside double quotes.
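For reference, the formula for a Wikipedia page’s “last edited” line might look like this (the “footer-info-lastmod” id is taken from Wikipedia’s current footer markup and may change; the article URL is assumed to be in cell A2):

=IMPORTXML(A2, "//*[@id='footer-info-lastmod']")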
After entering the code above and hitting “Enter”, you should see the “last edited” text for the URL in cell A2.
Now, drag cell B2 down to copy the formula for any further URLs you add in column A.
Importing an XML sitemap into Google Sheets
Sometimes, you want to get an XML sitemap overview of some website. You know the sitemap URL, so why not use the IMPORTXML function to do just that?
Important: I will use two functions in adjacent cells to get both the URL of the referenced pages as well as the lastmod date.
So the setup is:
A1: URL of the XML sitemap
B1: FUNCTION 1 to get the referenced URLs
C1: FUNCTION 2 to get the lastmod date
Function 1 (referenced URLs)
=IMPORTXML("https://haensel.pro/post-sitemap.xml", "//*[local-name()='url']/*[local-name()='loc']")
Function 2 (lastmod dates in cell C1)
=IMPORTXML("https://haensel.pro/post-sitemap.xml", "//*[local-name()='url']/*[local-name()='lastmod']")
The output should look like this.
Try it out
You can try this with my sample spreadsheet. Just make sure to make a copy of it before you start editing :)
This is a simple way of getting some useful data rather quickly and without having to use any other – sometimes costly – software.
Let me know what you think
If you have more useful XPath queries or any other improvements, let me know in the comments below and I will add them to this article.