Writing books is easy!! Oh, and knowing basic Python helps!
The Code: https://github.com/fakesamgregory/a-z-web-scraper
The Books: https://thefullstackagency.xyz/shop
What I want to do in this video, is go through how I built a web scraper that effectively wrote my book for me. If you’re new here welcome my name is Samuel Gregory and on this channel we talk about all things full stack agency that is web development, web design, web strategy and launch processes. And beyond we also talk about no code tools like Webflow and Pinegrow.
So, if you’re interested in that kind of stuff and you’re growing a full stack agency then subscribe and oh and I’ll leave a link to my discord server where we talk about all things web related and I’d love to see you there.
Now on, with the show this year, I released two books Lingo Startup and Lingo Agile in a relatively short amount of time I think total, I probably spent about one two months on releasing two books. Um..they’re available on full stack agency and they’re downloadable for free. But in this video, I kind of wanted to go through my processing.
Now, I had help with a web scraper in order to be able to write these books in such a short amount of time oh and the GitHub link is in the description below right below the like button.
Now, I make no secret in the book that I used existing glossaries online to kind of formulate my knowledge. I don’t know every single agile word and I’m hoping to write a lingo series I will not know every single word in the series so I would go online and learn about it and expand my vocabulary and then try and understand and then simplify it for the book I thought it’d be a great idea to go online and scrape existing agile glossaries to kind of give me a starting point.
Now, I’ve not touched python ever this is my first time learning python so I just wanted to give you an idea of the the level I’m coming at, so I came across this library called beautiful soup because I knew that I could make a request in python but I needed a way to traverse the dom like you would in something like jcrew. Just as one point to make some of the content you will be scraping may be copyrighted. Now, I can’t make that decision for you or how you navigate that situation but just be aware that you may be scraping copyrighted information. So, keep that one on top of mind.
So, here’s an example of an agile glossary on agilealliance.org and you can see here that I’ve got a list of basically agile terms and their definitions. So, looking at the code here first of all I created the full URL that obviously is the URL to where the glossary is stored. Now, I’ve broken it up for for various reasons which we’ll get into but ultimately we’re left with a URL that is the full URL lead into where that glossary is, next I make a get request and this is a built-in python library which I can use and I make a get request to that URL I set a timeout in case there’s a problem and I don’t just keep bombarding the server and then finally I set headers to let the server know that I’m a normal browser in some instances it would actually block me and give me an unauthorized response. So, by setting the headers you can kind of bypass the unauthorized response.
So, we get that response and we then pass it into that fabulous library called beautiful suit which is html and then we parse it using the html parser, now this enables me to traverse the object and and treat it like something like jcrew jumping back into here is I want to kind of find the dom element that’s wrapping all of these terms. So, if we inspect the code here and look for the wrapping so it looks like that one there but then I want something a bit more nested okay so it looks like we’ve got so wp wrapper looks like it’s a nesting thing but the thing I’m conscious of here is that we’ve got a list of links which I don’t want to scrape. So, jumping back here what I’ve done is find a div with an id of top which is this one and I’ve got the parent here and then I’m finding all of the divs with a class wpb wrapper so there we go.
So, within this pb wrapper I’m then diving into each of the elements with the each of the divs with another class of pb rapper, and that’s obviously what I’ve identified as being the unique kind of attributes on this page that allows me to kind of traverse this dom so, this is..this is the most manual part of this process and as I built scrapers for different web pages it would just be a case of just finding the classes the attributes and and trying to be as specific as I can by targeting certain elements sometimes it’s as simple as grabbing a wrapping element and then finding all of the links because I want to kind of loop through them all.
But in this case, I’ve gone id top got the parent and then from within the parent found wpb wrapper so with all our links stored in this array we actually loop through that array and there’s a few different ways we can do it but um in this instance I’ve chosen to do it in a for loop um and if we can find the a paragraph and we find a h2 because, I’m guessing that this is potentially not a h2 so there’s a h2 but it doesn’t have the um the paragraph within it uh so we can ignore that one I’m probably guessing the same here so this doesn’t have a this doesn’t have a paragraph but it doesn’t have a h2 so that’s where we’re skipping those first two thumb elements.
So, if we those do exist then I can safely say that the term itself is in the h2 and I can get the text there and the definition is in the p if..if those two exist again just a few safety checks there then it means we found both of those items I store in an object and then push that to…to an array….
Stored in an array if that array exists, I’m connecting through to google sheets here, now this is a separate file which I’ve created which essentially does everything that I need to connect with the google spreadsheet and then get the words or write to that spreadsheet. Get all the words in the google spreadsheet I then take I loop through each of the words in the spreadsheet and check if that word already exists, if it doesn’t already exist, and I’m adding it to the spreadsheet if it does exist to the spreadsheet I am updating and I’ll get specific about what we update um just in a moment so if it does exist then we’re adding this this um term to the update and now this is specific to google spreadsheets.
What I’m essentially doing is telling it which range I want to update um and the values itself which I want to update so I’m not updating any words I’m not updating any different definitions, all I’m doing is checking the box making sure the agile one is checked if it’s not found then what I’m doing is I’m adding um the the actual term and definition to the add list and as you can see here the first element there is the term, the definition, I just set a comment there and tick that agile box and then just for fun, I actually add the URL to the spreadsheet just so I’ve got a link back, which I…I give credit to in the book.
So, then finally once I’ve gone through all of the words then it’s time to write to the google spreadsheet because I don’t want to loop through a word and add it because you can only you only get a certain amount of requests to google sheets before it kind of rejects what it is that you want but I’m going to comment that out because I don’t want to update the spreadsheet because it’s already done and I’ve released the book and then it’s just a simple case in VS Code to hit the run button here which appears if you’re working in python files and you can see it’s found all of those words, uh.
It’s getting all those words and then checking that some of these words already exist so most all of these words should already exist because I’ve um.. because I’ve already written them except that some words are spelled differently and I kind of needed to clean that up so there is a manual process and of course I wanted to write my own definitions but this gave me a solid starting point to be able to kind of work just in the spreadsheet and update the definitions.
So, finally here is an example where I needed to actually click on a link to get the definition itself. So, again if we just head on over to this URL you can see here in this website, that’s just the links to the terms. So, what I need to do is loop through each of those terms click on them and then I’ll get the definition here.
So, looking at the code here I’m obviously getting all of the links within the my the container my five div this is just getting the class there looping through all the words getting the HREF of those links and then making another response getting the content in that response and then…and doing my kind of writing of terms and definitions once I’ve clicked through there.
And so, the result of this is this spreadsheet which uh as you can see I’ve got all these terms and the way I’m working through it is anything red is something I haven’t rewritten the term to and this is essentially how I wrote the book and it was a case of transferring this over to a document and the rest is kind of history there. And like I say this was all brand new to me so I don’t really know what I was doing but it did not take long for me to spin this up it took probably about a couple of hours to kind of pick my way through python and understand it a bit better and as I worked my way through the websites I would kind of learn something new or pick up some some new bit of information and then improve the script there so, I’ll try and include the most up-to-date um cleanest version of this code.
If you do have any suggestions please just let me know you know I’d love to learn more about python. It seems like a really fun easy and clean bit of code to understand. So, thank you for watching, I’ll leave a link to the GitHub repo where all these kind of code is where you can copy it and and use it for yourself in case that’s of interest to you.
If you enjoyed this episode then hit me up with a like and if you want to see more of this kind of stuff then make sure you subscribe. So, all that’s left to say is thank you for tuning in and I’ll see you next time.