A simple script approach to import posts from an external feed into a Jekyll blog using HEY World's atom feed as an example. Contains some special workarounds to fix the feed's contents, but should work with every well-formed RSS feed out there with a little customisation.
As seen in my test post, I tried HEY World, a new blogging service by Basecamp that works by sending an email that's then published to the web. While it's not the most powerful blogging tool, it's definitely one that's easy to use.
To keep everything in one place and enjoy basic functionality like syntax highlighting even for the posts made in HEY World, I decided to import the feed they provide into my homepage and fix a few things on the way.
I am aware that this might lead to some SEO issues in case Google considers this to be double content, but I'll cross that bridge when I get there.
Edit: Of course I noticed a few problems just after writing this post, e.g. that tags were no longer extracted in all cases. I made my script more fail-safe and updated the parts below. Here's the post on HEY World to see the script in action.
Here's the latest version of the script, I'll go into some of the problems I ran into that are HEY World specific though. Parsing the feed was done using feedjira which is an amazingly easy to use library for all kinds of XML feeds, so I'll skip this part.
As seen in my test post, I tried HEY World, a new blogging service by Basecamp that works by sending an email that's then published to the web. While it's not the most powerful blogging tool, it's definitely one that's easy to use.
To keep everything in one place and enjoy basic functionality like syntax highlighting even for the posts made in HEY World, I decided to import the feed they provide into my homepage and fix a few things on the way.
I am aware that this might lead to some SEO issues in case Google considers this to be double content, but I'll cross that bridge when I get there.
Edit: Of course I noticed a few problems just after writing this post, e.g. that tags were no longer extracted in all cases. I made my script more fail-safe and updated the parts below. Here's the post on HEY World to see the script in action.
Here's the latest version of the script, I'll go into some of the problems I ran into that are HEY World specific though. Parsing the feed was done using feedjira which is an amazingly easy to use library for all kinds of XML feeds, so I'll skip this part.
Converting the Post Content
Trix, the WYSIWYG editor used by Rails' Action Text and consequently by HEY, produces HTML which is used as entry content in the atom feed. It looks like they just insert the default trix layout which still contains e.g. a surrounding `div.trix-content` and a lot of formatting with a few... interesting choices.
To make this as easy for me as possible, I decided to convert the HTML I get from HEY into markdown and get rid of all HTML that's not inside `pre` tags. There's a great gem called reverse_markdown out there that I already used in the past and in a perfect world, all I'd have to do was a one-liner on the content:
To make this as easy for me as possible, I decided to convert the HTML I get from HEY into markdown and get rid of all HTML that's not inside `pre` tags. There's a great gem called reverse_markdown out there that I already used in the past and in a perfect world, all I'd have to do was a one-liner on the content:
ReverseMarkdown.convert(content, github_flavored: true, unknown_tags: :bypass) # lang: ruby
However, I had to work around a few of Trix's behaviours as well as get some additional information out of the post beforehand:
1. Trix and its Lack of Paragraphs
Other editors usually produce a `<p>` tag for a double newline, Trix goes the literal way and inserts two `<br>` tags. Not wrong, but a bit unexpected and a bit difficult when it comes to markdown conversion.
Let's take a look at the `reverse_markdown` conversion rule for `<br>` tags:
class Br < Base def convert(node, state = {}) " \n" end end # lang: ruby
As you can see, it inserts 2 spaces and a newline which is "stay in the same paragraph, but start a new line" in markdown. So, what happens with Trix's `<br><br>` pattern?
"hello<br><br>world".to_markdown => "hello \n \nworld" # lang: ruby
While kramdown, the default markdown parser in Jekyll, ultimately converts this into paragraphs as it ignores the spaces at the end of the lines, Jekyll seems to have problems generating a post teaser out of this. By default, Jekyll takes the first paragraphs of a post for the index page tiles (depending on their length), but was unable to do so with the double `<br>` tags:
It looks like everything up to the code block was interpreted as a single paragraph (which is correct as explained above). To work around this with as little work as possible, I decided to introduce an own pseudo HTML tag as replacement for the double `<br>`s that I could translate to markdown myself:
It looks like everything up to the code block was interpreted as a single paragraph (which is correct as explained above). To work around this with as little work as possible, I decided to introduce an own pseudo HTML tag as replacement for the double `<br>`s that I could translate to markdown myself:
class DoubleBr < Base def convert(*) "\n\n" end end "hello<doublebr />world".to_markdown #=> "hello\n\nworld" # lang: ruby
2. Headings
While the HEY editor provides a "Heading" style, it seems to only be able to produce `H1` tags which is probably not the semantically best way to go on a website.
I therefore decided to add an own markdown converter again which simply translates `H1` into `##`. Works well and since I know that the input will never contain any other headings, this shouldn't cause problems in the future.
3. Producing proper Code Blocks
While Trix has a basic code block functionality, it doesn't support syntax highlighting, but introduces a few formatting tags like `<em>` into the code.
What it doesn't do is inserting a proper `<code>` tag inside its `<pre>`s which would be the semantically correct way to display a block of code. I understand that they intended to simply provide a "we won't touch whitespace here" functionality into their editor, but this behaviour made it harder to further process it.
First, I had to get rid of the extra formatting HTML that's inserted into the code block. Luckily, the actual code contains twice encoded HTML entities while the formatting tags were only encoded once for the feed:
# A comment class << self # lang: ruby
<em> A comment </em> class &lt;&lt; self # lang: html
This way, it's possible to decode it once and let Nokogiri remove all actual HTML tags while keeping the content intact. The easiest way to achieve this was to use `#content` on the `<pre>` itself which returns the pure text content of said node.
Additionally, I needed to alter two things in the HTML I got from HEY World:
- Add a code tag so the markdown converter understands that this is in fact code
- Extract the language to be highlighted
To achieve 2), I added e.g. `# lang: ruby` to the bottom of code blocks in the HEY World editor and extract these lines in my script.
The resulting part of the script was pretty easy in the end:
xml.css("pre").each do |pre| code = Nokogiri::XML::Node.new("code", xml) # Search for a "lang: something" line at the end of the pre. # If there is one, use it for syntax highlighting if (match = pre.content.lines.last.match(/lang: (\w+)/)) code["lang"] = match[1] code.content = pre.content.lines[0...-1].join end pre.content = nil code.parent = pre end # lang: ruby
I had to extend `reverse_markdown` again with an own converter for `<code>` though as by default it doesn't take the `lang` attribute into account:
class TrixPre < Pre def language(node) super.presence || node.at_css("code")["lang"] end end # lang: ruby
4. Assigning Tags to Posts
Since HEY World itself doesn't support tags, I used the same approach as for the code block highlighting language and search for a tag line ("#tag1 #tag2") at the end of the post. I originally did this before turning the post into markdown, but the HTML I got wasn't consistent enough, so I now treat the last markdown line as possible tags.
def tags @tags ||= begin tag_line = markdown_content.lines.last if tag_line.match?(/^((#\w+) ?)+$/) tag_line.scan(/#(\w+)/).flatten else [] end end end # lang: ruby
If a tag line exists in the markdown content, it is removed before writing it to disk.
Front Matter
The front matter for the Jekyll post (the YAML part at the beginning that contains the post's title and other information) can be easily generated now that all relevant information has been extracted from the RSS entry:
<entry> <id>tag:world.hey.com,2005:World::Post/11280</id> <published>2021-05-14T19:48:10Z</published> <updated>2021-05-23T15:15:38Z</updated> <title>Testing HEY world</title> ... </entry> # lang: xml
def front_matter { date: item.published.localtime.to_s, last_modified_at: item.updated.localtime.to_s, layout: "post", tags: tags, title: item.title }.transform_keys(&:to_s) end # lang: ruby
The important part here is to make sure that all Hash keys are strings, otherwise, a YAML dump would generate Symbols which Jekyll doesn't expect.
File Generation
To keep this as simple as possible, I decided to reload all posts from the external feed every time I run the script, deleting every post that might have already been there. In case of stex.codes, this is done when building its docker container for deployment in CI.
I tried using an own collection (= an own folder) in Jekyll for these posts, but those behave slightly different from normal posts and require a bit more configuration to be included in the main feed. Therefore, I decided to simply add a distinct pattern to their file names while still following Jekyll's naming conventions:
This allows me to remove all possibly existing posts before loading the feed again:
I tried using an own collection (= an own folder) in Jekyll for these posts, but those behave slightly different from normal posts and require a bit more configuration to be included in the main feed. Therefore, I decided to simply add a distinct pattern to their file names while still following Jekyll's naming conventions:
This allows me to remove all possibly existing posts before loading the feed again:
FileUtils.rm_rf(POSTS_DIR.join("*--hey--*")) # lang: ruby
Re-Building the Website on Feed Changes
Since changes to HEY World should be synced over to my blog, I wrote a small script / docker image to watch for changes and trigger a docker hub build. I didn't find a better solution than polling yet, unfortunately.
Conclusion
Well, it works. I'll have to adjust a few styles on the website to e.g. prohibit smaller images being displayed in full width, but creating a blog post through my email client is definitely a lot easier than having to code it. And since I'm still able to create more complex posts directly in Jekyll, I'll go with this for all simpler topics for now.
#hey #jekyll #ruby