Posted by:

David Greenwood, Chief of Signal

In this post I will show you how file2stix can be used to process RSS feeds, including dealing with partial feeds and historic posts.

In the last post I introduced file2stix.

It is possible to upload HTML files to file2stix for observable extraction. To upload a single blog post you want to extract observables from, open it in a browser window, click File, and choose Save Page As to save it as an HTML file.

When HTML files are uploaded to file2stix, all HTML tags are removed so that only printed content is considered by the regular expressions used for extraction. For example, if the file contained <a href="URL_INSIDE_HTML_TAG">PRINTED_URL</a>, only PRINTED_URL would be considered by the regular expressions in file2stix.
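To make the tag-stripping idea concrete, here is a minimal sketch using only the Python standard library. It is not the file2stix implementation; the class and function names (PrintedTextExtractor, printed_text) are my own, and skipping script and style content is an assumption about what counts as "printed".

```python
from html.parser import HTMLParser


class PrintedTextExtractor(HTMLParser):
    """Collects only text nodes, ignoring tags and their attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside script/style is never printed on the page, so drop it
        if not self._skip_depth:
            self.chunks.append(data)


def printed_text(html):
    """Return the text content of an HTML string with all tags removed."""
    parser = PrintedTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


sample = '<p>See <a href="URL_INSIDE_HTML_TAG">PRINTED_URL</a></p>'
print(printed_text(sample))  # prints: See PRINTED_URL
```

Note that the URL inside the href attribute never reaches the output, so a regular expression run over the result would only match PRINTED_URL.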

The downside of this approach is that the same process needs to be repeated every time a blog post you want to extract observables from is published. If you follow many threat intelligence focused blogs (of which there are a lot), this flow quickly becomes time consuming.

It is fairly easy to automate this workflow, as I will explain. First though, it is important to understand the technology used to create blog feeds that can be subscribed to (e.g. by feed readers): RSS and ATOM.

Basics of RSS

RSS stands for Really Simple Syndication. Simply put, RSS is a standardized, computer (and human) readable format that shows what has changed on a website. For this reason it is widely used by blogs, podcasts, news sites, etc.

Here is a sample of an RSS feed from The Record by the Recorded Future team; https://therecord.media/feed/.

Note, in many cases a blog will clearly show its RSS (or ATOM) feed URL, but not all do. Not every blog has a feed, but if you open up a browser, navigate to the blog, and click view page source, you can usually find the feed address in the HTML link tag that has rel="alternate" and a type of application/rss+xml or application/atom+xml.
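The page source lookup can also be scripted. The following is a rough sketch using the standard library HTML parser; the names (FeedLinkFinder, find_feeds) and the sample page with its example.com URLs are hypothetical and only for illustration.

```python
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects href values of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rel = attr.get("rel", "").lower().split()
        if "alternate" in rel and attr.get("type", "").lower() in FEED_TYPES:
            self.feeds.append(attr.get("href", ""))


def find_feeds(page_source):
    """Return the feed URLs advertised in a page's HTML source."""
    finder = FeedLinkFinder()
    finder.feed(page_source)
    finder.close()
    return finder.feeds


# Hypothetical page source, for illustration only
SAMPLE_PAGE = """
<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="alternate" type="application/rss+xml" href="https://example.com/feed/" />
</head><body></body></html>
"""
print(find_feeds(SAMPLE_PAGE))
```

Some sites use relative href values, in which case the result would need to be joined with the page URL before use.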

The Record's main RSS feed shows all articles from the blog. The Record also allows you to subscribe to RSS feeds by category. For example, the Cyber Crime category;

https://therecord.media/news/cybercrime/feed/

Generally an RSS feed has an XML structure containing at least the following items;

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">

<channel>
  <title>W3Schools Home Page</title>
  <link>https://www.w3schools.com</link>
  <description>Free web building tutorials</description>
  <item>
    <title>RSS Tutorial</title>
    <link>https://www.w3schools.com/xml/xml_rss.asp</link>
    <description>New RSS tutorial on W3Schools</description>
    <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
  </item>
  <item>
    <title>XML Tutorial</title>
    <link>https://www.w3schools.com/xml</link>
    <description>New XML tutorial on W3Schools</description>
    <pubDate>Tue, 10 Jun 2003 11:34:12 GMT</pubDate>
  </item>
</channel>

</rss>

The <channel> tags capture the entire feed including metadata about the feed (title, link, and description in this case). There are many other optional elements that can be included in the <channel> tags, as defined here.

Each article in the feed is defined inside each <item> tag with sub-elements, generally the most important being:

  • title: The title of the post / article
  • link: The URL of the post / article
  • description: The article content
  • pubDate: The date the article was published

There are many other optional elements that can be included in the <item> tags, as defined here.
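As a rough illustration of how these elements can be read programmatically, here is a minimal sketch using Python's xml.etree.ElementTree. The function name parse_rss is my own, and the sample is a trimmed copy of the feed shown above.

```python
import xml.etree.ElementTree as ET

RSS_SAMPLE = """<rss version="2.0">
<channel>
  <title>W3Schools Home Page</title>
  <link>https://www.w3schools.com</link>
  <description>Free web building tutorials</description>
  <item>
    <title>RSS Tutorial</title>
    <link>https://www.w3schools.com/xml/xml_rss.asp</link>
    <description>New RSS tutorial on W3Schools</description>
    <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
  </item>
  <item>
    <title>XML Tutorial</title>
    <link>https://www.w3schools.com/xml</link>
    <description>New XML tutorial on W3Schools</description>
    <pubDate>Tue, 10 Jun 2003 11:34:12 GMT</pubDate>
  </item>
</channel>
</rss>"""


def parse_rss(xml_text):
    """Return one dict per <item>, holding the four sub-elements listed above."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
            "pubDate": item.findtext("pubDate", default=""),
        })
    return items


for post in parse_rss(RSS_SAMPLE):
    print(post["pubDate"], post["link"])
```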

Basics of ATOM

Atom is a similar format to RSS and used for the same reasons. It is a slightly newer format than RSS (although almost 20 years old) and designed to cover some of the shortcomings of RSS.

Here is a sample of an ATOM feed from the 0patch blog.

An ATOM feed has a similar XML structure to RSS, however, you will notice some of the element names are different.

  <?xml version="1.0" encoding="utf-8"?>
   <feed xmlns="http://www.w3.org/2005/Atom">

     <title>Example Feed</title>
     <link href="http://example.org/"/>
     <updated>2003-12-13T18:30:02Z</updated>
     <author>
       <name>John Doe</name>
     </author>
     <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>

     <entry>
       <title>Atom-Powered Robots Run Amok</title>
       <link href="http://example.org/2003/12/13/atom03"/>
       <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
       <published>2003-12-13T18:30:02Z</published>
       <updated>2003-12-13T18:30:02Z</updated>
       <content>Some text.</content>
     </entry>
   </feed>

The blog information is captured at the top of the document.

Each article in the feed is defined inside each <entry> tag with sub-elements, generally the most important being:

  • title: The title of the post / article
  • id: A unique identifier for the post (a UUID in this example)
  • link: The URL of the post / article
  • published: The date the article was published
  • content: The article content

There are many other optional elements that can be included in the <entry> tags, as defined here.
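Reading an ATOM feed works much the same way as RSS, with two differences worth showing: the elements live in the Atom XML namespace, and the link is stored in an href attribute rather than as element text. This is a minimal sketch; parse_atom is a name I have made up and the sample is a trimmed copy of the feed above.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <link href="http://example.org/"/>
  <updated>2003-12-13T18:30:02Z</updated>
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
  <entry>
    <title>Atom-Powered Robots Run Amok</title>
    <link href="http://example.org/2003/12/13/atom03"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <published>2003-12-13T18:30:02Z</published>
    <updated>2003-12-13T18:30:02Z</updated>
    <content>Some text.</content>
  </entry>
</feed>"""


def parse_atom(xml_text):
    """Return one dict per <entry>, holding the five sub-elements listed above."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        entries.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "id": entry.findtext("atom:id", default="", namespaces=ATOM_NS),
            # Atom stores the URL in the href attribute, not as element text
            "link": link.get("href", "") if link is not None else "",
            "published": entry.findtext("atom:published", default="", namespaces=ATOM_NS),
            "content": entry.findtext("atom:content", default="", namespaces=ATOM_NS),
        })
    return entries


for post in parse_atom(ATOM_SAMPLE):
    print(post["published"], post["link"])
```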

Dealing with partial content in feeds

The description field (in RSS feeds) and content field (in ATOM feeds) can contain the entirety of the raw article, including the html formatting. You can see this in The Record’s RSS feed.

However, some blogs choose to include only snippets in their feed content, for example only the first paragraph, requiring a subscriber to read the full content outside of their feed aggregator.

Here is a sequence diagram showing what a feed reader will typically do when you subscribe to feeds from two web sites, Website 1 and Website 2:

[Figure: RSS partial feed]

There are various services that can be used to retrieve the full text of a post when a feed contains only partial content. Full Text RSS from FiveFilters is one such service.

[Figure: Full Text RSS sequence]

Full Text RSS is a proxy for partial feeds. Rather than subscribing to the feed from Website 1 directly, you subscribe to a new feed created by Full Text RSS, which contains the full content for observables to be extracted from.

This can be done using the FiveFilters API and the following parameters:

  • url: the feed url
  • key: your FiveFilters API key
  • format: json (the format to be returned)
    • You can also set this to rss xml, but it is easier to parse JSON
  • max: the maximum number of results you want to return. Note, this will not fix the problem of limited feed items
  • summary: 1 (true) to return both a description field (short, non-HTML) and a content_encoded field
  • xss: 1 to protect from xss vulnerabilities
  • images: 1 to keep images
  • links: preserve to keep links

Here is an example for The Record (RSS feed) using the Full Text RSS makefulltextfeed endpoint;

http://ftr-premium.fivefilters.org/makefulltextfeed.php?url=https://therecord.media/feed&key=<REDACTED>&format=json&max=5&summary=1&xss=1&images=1&links=preserve
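A request URL like the one above can be assembled from the parameters listed earlier. This is only a sketch: build_feed_url is a hypothetical helper name and the API key is a placeholder. Note that urlencode percent-encodes the feed URL inside the query string, which is the standard way to pass one URL as a parameter of another.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL above
FULL_TEXT_RSS_ENDPOINT = "http://ftr-premium.fivefilters.org/makefulltextfeed.php"


def build_feed_url(feed_url, api_key, max_items=5):
    """Build a makefulltextfeed request URL using the parameters described above."""
    params = {
        "url": feed_url,
        "key": api_key,
        "format": "json",
        "max": max_items,
        "summary": 1,
        "xss": 1,
        "images": 1,
        "links": "preserve",
    }
    return f"{FULL_TEXT_RSS_ENDPOINT}?{urlencode(params)}"


# "<YOUR_API_KEY>" is a placeholder, not a real key
print(build_feed_url("https://therecord.media/feed", "<YOUR_API_KEY>"))
```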

Here is what the response looks like.

Here is an example for 0Patch (ATOM feed);

http://ftr-premium.fivefilters.org/makefulltextfeed.php?url=https://blog.0patch.com/feeds/posts/default&key=<REDACTED>&format=json&max=5&summary=1&xss=1&images=1&links=preserve

Here is what the response looks like.

You can see that, regardless of the feed type (either ATOM or RSS), Full Text RSS parses it into a JSON document (format=json) with some key fields;

  • item.title
  • item.link
  • item.guid
  • item.description
  • item.content_encoded
  • item.category
  • item.pubDate

You will see the item.content_encoded field values contain the full article, rendered in HTML.

The content inside the item.content_encoded field can then be passed to file2stix as a html document.

By regularly polling the Full Text RSS Feed Conversion API, you can get a copy of each new blog post in HTML format automatically.

To identify new articles since the last poll, the item.pubDate can be checked. Any item with an item.pubDate later than the last poll time and an item.link that has not been seen before is considered new and ingested. One known issue with this approach is that it misses updated content: an updated post keeps the same item.link, so it is not unique and is not ingested again.
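The new-item check described above could look something like the following sketch. It assumes each item is a dict with link and pubDate keys (the field names listed earlier), that pubDate is an RFC 822 style date as used in RSS, and that last_poll is a timezone-aware datetime. The function name select_new_items is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def select_new_items(items, last_poll, seen_links):
    """Return items published after last_poll whose link has not been seen before.

    seen_links is a set that is updated in place, so it can be reused across polls.
    """
    new_items = []
    for item in items:
        published = parsedate_to_datetime(item["pubDate"])
        if published.tzinfo is None:
            # Assumption: treat dates without a timezone as UTC
            published = published.replace(tzinfo=timezone.utc)
        if published > last_poll and item["link"] not in seen_links:
            seen_links.add(item["link"])
            new_items.append(item)
    return new_items


items = [
    {"link": "https://example.com/a", "pubDate": "Tue, 03 Jun 2003 09:39:21 GMT"},
    {"link": "https://example.com/b", "pubDate": "Tue, 10 Jun 2003 11:34:12 GMT"},
]
last_poll = datetime(2003, 6, 5, tzinfo=timezone.utc)
print(select_new_items(items, last_poll, set()))
```

As noted above, an updated post reuses its link, so a check like this will skip it once the link is in seen_links.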

There is a slight risk that more than 5 posts will be published within the 3 hour polling window (we use the parameter max=5 in Full Text RSS). This is most likely on news sites, in which case the max parameter can be increased manually to account for it.

This brings us on to the second limitation of RSS feeds: history.

Getting feed history

It is common for feeds to only include a limited number of posts. I generally see the last 3 - 5 posts of a blog in a feed.

There is no way to page through historic articles using an RSS or ATOM feed, which means the first poll of the feed will only contain the limited number of articles in the feed. This limit is defined by the blog owner.

There are two ways I came up with to get historic posts from a blog;

  1. Scrape the blog for historic posts. This is the most accurate way to do it, though given the different structure of blogs and websites, this can become complex, requiring a fair bit of manual scraping logic to be written for each blog you want to follow
  2. Use the Wayback Machine’s archive. Often the Wayback Machine will have captured snapshots of a feed (though not always). For example, https://therecord.media/feed/ has been captured 187 times between November 1, 2020 and August 12, 2022.

The Wayback Machine (WBM) will completely miss some blog archives (a particular problem for smaller sites that are less likely to be regularly indexed by the WBM), and can miss certain feed items where the RSS feed updates faster than the WBM re-indexes the site. Even so, I chose this approach as it is currently the most scalable way I could come up with to backfill history.

Waybackpack is a command-line tool that lets you download the entire Wayback Machine archive for a given URL for this purpose.

Here is an example of how to use it with The Record Feed;

python3 -m venv waybackpack
source waybackpack/bin/activate
pip3 install waybackpack
waybackpack https://therecord.media/feed/ -d ~/Downloads/therecord_media_feed --from-date 2015 --uniques-only  

In the above command I am requesting all unique feed pages captured by the Wayback Machine (--uniques-only) from 2015 onwards (--from-date 2015) for the feed URL (https://therecord.media/feed/).

This produces about 100 unique index.html files (where index.html is the actual RSS feed). They are nested in folders named with the capture datetime (the time captured by the WBM) in the format YYYYMMDDHHMMSS, like so;


~/Downloads/therecord_media_feed
├── 20220808162900
│   └── therecord.media
│       └── feed
│           └── index.html
├── 20220805213430
│   └── therecord.media
│       └── feed
│           └── index.html
...
└── 20201101220102
    └── therecord.media
        └── feed
            └── index.html

It is important to point out that unique just means the index.html files have at least one difference. That is to say, much of each file can actually be the same (and include the same articles).

Take 20220808162900 > therecord.media > feed > index.html and 20220805213430 > therecord.media > feed > index.html.

Both of these files contain the same item;

<item>
    <title>Twitter confirms January breach, urges pseudonymous accounts to not add email or phone number</title>
    <link>https://therecord.media/twitter-confirms-january-breach-urges-pseudonymous-accounts-to-not-add-email-or-phone-number/</link>

Essentially, on first download of a blog you can use Waybackpack and grab all unique <link> elements in the downloaded index.html files.
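Collecting the unique links can be sketched as below. This assumes the directory layout produced by the waybackpack command above and that each index.html is an RSS feed (an ATOM feed would need the namespace-aware handling of <entry> and href instead). The function name collect_unique_links is my own.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def collect_unique_links(archive_dir):
    """Return each article <link> found in the downloaded snapshots, once only."""
    links = []
    seen = set()
    # Snapshot folders are named YYYYMMDDHHMMSS, so sorted paths run oldest first
    for feed_file in sorted(Path(archive_dir).expanduser().rglob("index.html")):
        try:
            root = ET.parse(feed_file).getroot()
        except ET.ParseError:
            continue  # skip captures that are not well-formed XML
        # Only <link> elements inside <item> are articles; the channel-level
        # <link> points at the site itself and is ignored here
        for item in root.iter("item"):
            link = (item.findtext("link") or "").strip()
            if link and link not in seen:
                seen.add(link)
                links.append(link)
    return links


# Example call, using the download directory from the waybackpack command:
# historic_links = collect_unique_links("~/Downloads/therecord_media_feed")
```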

Once a list of historic URLs has been determined, Full Text RSS can be used again, this time in single article mode.

Now, instead of providing a feed URL, I can pass a single article URL, which Full Text RSS will turn into a feed item.

Here is an example for the link; https://therecord.media/twitter-confirms-january-breach-urges-pseudonymous-accounts-to-not-add-email-or-phone-number/.

For single articles I need to use the Full Text RSS extract endpoint like so;

http://ftr-premium.fivefilters.org/extract.php?url=https://therecord.media/twitter-confirms-january-breach-urges-pseudonymous-accounts-to-not-add-email-or-phone-number/&key=<REDACTED>&xss=1&images=1&links=preserve

Here is what the response looks like.

Full Text RSS parses the article into a JSON document with some key fields;

  • title (maps to item.title)
  • date (maps to item.pubDate)
  • effective_url (maps to item.link)
  • url (maps to item.link)
  • content (maps to item.content_encoded)
  • excerpt (maps to item.description)

Above I have shown how these fields map to those returned by the makefulltextfeed endpoint, so that historic articles can be saved in the same way as newly polled blog posts.
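The field mapping can be expressed as a small function. This is a sketch only: extract_response_to_item is a hypothetical name, the input is assumed to be the parsed JSON object from the extract endpoint, and preferring effective_url over url (both map to item.link above) is my own choice.

```python
def extract_response_to_item(article):
    """Rename extract endpoint fields to the makefulltextfeed item field names."""
    return {
        "title": article.get("title"),
        "pubDate": article.get("date"),
        # Both effective_url and url map to link; this sketch prefers
        # effective_url and falls back to url when it is missing
        "link": article.get("effective_url") or article.get("url"),
        "content_encoded": article.get("content"),
        "description": article.get("excerpt"),
    }


# Hypothetical response values, for illustration only
sample_response = {
    "title": "Example post",
    "date": "2022-08-08T16:29:00Z",
    "url": "https://example.com/post",
    "content": "<p>Full article HTML</p>",
    "excerpt": "Short summary",
}
print(extract_response_to_item(sample_response))
```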

All that is needed now is to pass the historic links collected by Waybackpack one-by-one to the Full Text RSS extract endpoint and ingest the responses. This backfills historic blog posts for observable extraction by file2stix (using the HTML content in the content field).

Importing a single html page

Above I showed how the Full Text RSS extract endpoint can be used to backfill posts.

In the same way, Full Text RSS allows users to pass the URL of any single web page and turn it into a feed item, providing the same output as shown above. This allows automated ingestion of any webpage, blog or not, into file2stix.

The biggest benefit of this approach (vs. saving raw html in the browser) is that Full Text RSS attempts to strip all non-content related information (ads, header links, footer links, etc) in the response.

Try file2stix now

file2stix also understands a range of other file formats in addition to html.

To really understand the power of file2stix, take a look at the user documentation here.

file2stix is available to download on Github here.



