After getting injured on my first day of skiing, I had a bit more time on my hands than anticipated over the holidays.
I noticed that this blog has an RSS feed, but it's only for recent items. I've also been blogging across many different platforms over many years, and didn't have an easy way to see all of the posts that I've written.
I used a combination of LLMs to:
- read the RSS feeds of the blogs I have posted to that support full RSS.
- scrape the pages of Hey to get a full list of posts.
- do some fuzzy matching on post titles, and the markdown content I have locally for my old blogs.
- scrape the full text from Hey.
- put all of these into a local SQLite DB.
- visualise that in Datasette.
- create some scripts to read from the local DB to create a single view of all of my posts.
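The matching and storage steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the repo: the function names, the `posts` table and its columns, and the 0.85 similarity threshold are all assumptions made for the example, and the fuzzy matching here uses Python's standard `difflib` rather than whatever the real scripts use.

```python
import sqlite3
from difflib import SequenceMatcher


def normalise(title):
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(title.lower().split())


def best_match(title, candidates, threshold=0.85):
    """Return the candidate title most similar to `title`, or None.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    best, best_score = None, 0.0
    for candidate in candidates:
        score = SequenceMatcher(None, normalise(title), normalise(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


def save_posts(conn, posts):
    """Store post dicts in a hypothetical `posts` table, keyed on URL."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS posts (
               url TEXT PRIMARY KEY,
               title TEXT NOT NULL,
               published TEXT,
               source TEXT,
               body TEXT
           )"""
    )
    conn.executemany(
        "INSERT OR REPLACE INTO posts (url, title, published, source, body) "
        "VALUES (:url, :title, :published, :source, :body)",
        posts,
    )
    conn.commit()


def all_posts(conn):
    """Single view of every post, newest first (ISO dates sort as text)."""
    return conn.execute(
        "SELECT published, title, url FROM posts ORDER BY published DESC"
    ).fetchall()
```

If the database is written to a file (say `posts.db`, a made-up name), running `datasette posts.db` should be enough to browse it.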
I've now added a sub page to my homepage: https://mulvany.net/all-my-posts.html
I can see that I duplicated most, but not all, of my content across the scholarly comms product blog and partiallyattended, so I have a lot of duplicate content in the DB.
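One way to surface that kind of cross-posting is a query that groups posts by title and keeps the titles that show up under more than one blog. The sketch below assumes a hypothetical `posts` table with `title` and `source` columns; those names are illustrative and may not match the real schema.

```python
import sqlite3


def duplicate_titles(conn):
    """Return (normalised title, [sources]) for titles seen on more than one blog.

    Assumes a hypothetical `posts` table with `title` and `source` columns,
    and that source names contain no commas.
    """
    rows = conn.execute(
        """SELECT lower(trim(title)) AS norm_title,
                  group_concat(DISTINCT source) AS sources
           FROM posts
           GROUP BY norm_title
           HAVING count(DISTINCT source) > 1
           ORDER BY norm_title"""
    ).fetchall()
    return [(title, sorted(sources.split(","))) for title, sources in rows]
```

This only catches titles that match exactly after lowercasing and trimming; posts that were retitled when cross-posted would need fuzzier matching.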
Some, actually many, of my posts are very low quality.
It was interesting that I'd not noticed that until I had a page where I could easily see all of the posts in one place.
I lost metadata on the posts from my early blogging days, which is why there is a big jump from 2006 to 2008.
The page which lists all of the blog posts won't include this post for a while, because I still need to set up some automation to make that happen. I'm going to see if I can get that working with GitHub Actions.
Now that I have all of the content in one place I aim to try some classification and clustering, but for now I'm quite happy to have pulled this together.
The code that I wrote to do all of this is very messy, but the iteration loop to get the job done with LLMs (I used Cursor, GPT, and Claude) was effective. I'd consider this a write-once, don't-reuse project. In any case, I've posted the code here (after getting GPT to write the README by passing it all of the Python files to look at): https://github.com/IanMulvany/hey-world-blog-tools