Posted by: David Greenwood, Chief of Signal

If you are reading this blog post via a third-party source it is very likely that many parts of it will not render correctly. Please view the post on signalscorps.com for the full interactive viewing experience.

In this post I will show you some of my early attempts to productise what OpenAI can do to aid the work of Intelligence Analysts.

It's a fair assumption that you, too, have played with chat.openai.com over the last few months.

Before I continue, I do want to take full credit for this post and let you know it was not generated by OpenAI (although believe me, I tried).

Beyond writing blog posts, though, GPT-3 clearly has huge potential for the operational side of threat intelligence. For a long time I've had it in the back of my mind as a potentially very useful addition to file2stix's logic.

Currently file2stix uses regular expressions to extract observables from reports. As discussed in this post, this is fine, but misses lots of contextual information that could also be captured as observables.
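For context, here is a minimal sketch of what that regex approach looks like. The patterns below are illustrative only (file2stix's real rule set is far more extensive):

import re

# Illustrative patterns for defanged indicators -- not file2stix's actual rules
DEFANGED_IPV4 = re.compile(r"\b\d{1,3}(?:[\[\(]?\.[\]\)]?\d{1,3}){3}\b")
DEFANGED_DOMAIN = re.compile(r"\b(?:[a-zA-Z0-9-]+[\[\(]?\.[\]\)]?)+[a-zA-Z]{2,}\b")

text = (
    "This second IP (158.247.222[.]6) hosted the domain "
    "www[.]delldrivers[.]in from July 7, 2022 to Oct. 11, 2022."
)

print(DEFANGED_IPV4.findall(text))    # ['158.247.222[.]6']
print(DEFANGED_DOMAIN.findall(text))  # ['www[.]delldrivers[.]in']

Patterns like these find the observables themselves, but they have no idea that the domain was hosted on that IP between two dates, or that it is associated with a Turian backdoor sample – which is exactly the contextual information GPT-3 looks able to help with.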

I’ve also talked about this problem in my post about MITRE TRAM previously, so I was interested to see how GPT-3 compares.

Here is how my initial research went…

Observable Extraction

I wanted to start simply: how would GPT-3 perform when asked to extract indicators of compromise (observables) from text?

Input:

Please extract all the indicators of compromise from the following text: The first IP contains a historical certificate reference to portal-Share.mfa[.]new, which suggests an ambiguous “Ministry of Foreign Affairs (MFA)” nexus. The second is a virtual private server (VPS) owned by The Constant Company. This second IP (158.247.222[.]6) hosted the domain www[.]delldrivers[.]in from July 7, 2022 to Oct. 11, 2022. This domain is associated with a Turian backdoor sample.

Source.

Output:

  • IP: portal-Share.mfa[.]new
  • IP: 158.247.222[.]6
  • Domain: www[.]delldrivers[.]in
  • Timeframe: July 7, 2022 to Oct. 11, 2022
  • Malware: Turian backdoor sample

Not bad.

Let’s see how GPT-3 deals with fanged data.

Input:

Now please remove the fangs from these indicators of compromise

Output:

  • portal-Share.mfa.new
  • IP: 158.247.222.6
  • Domain: www.delldrivers.in
  • Timeframe: July 7, 2022 to Oct. 11, 2022
  • Malware: Turian backdoor sample

Nice! Now I'll try to structure these indicators as STIX 2.1 Objects.

Input:

Finally, please turn these indicators of compromise into STIX 2.1 objects

Output:

I’m sorry, I am not able to turn these indicators of compromise into STIX 2.1 objects. STIX (Structured Threat Information eXpression) is a language used to represent cyber threat intelligence (CTI) data and is used to share information between different security systems. It would require a specific software or tool that supports STIX 2.1 to create those objects and also the knowledge of STIX 2.1 language.

No biggie, that’s actually the easiest part here and something file2stix could do quite easily from the previous output.
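To illustrate that last step, here is a rough sketch of refanging the extracted values and wrapping them in STIX 2.1 Cyber-observable Objects with the stix2 Python library (pip3 install stix2). This is a hand-rolled illustration, not how file2stix actually does it:

from stix2 import DomainName, IPv4Address

def refang(value):
    # Remove the most common fangs; real-world refanging needs more cases
    return value.replace("[.]", ".").replace("(.)", ".")

# Values taken from the GPT-3 output above
observables = [
    IPv4Address(value=refang("158.247.222[.]6")),
    DomainName(value=refang("www[.]delldrivers[.]in")),
    DomainName(value=refang("portal-Share.mfa[.]new")),
]

for sco in observables:
    print(sco.serialize(pretty=True))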

It would appear GPT-3 is pretty good at extracting indicators of compromise, but what about more nuanced references to threat intelligence concepts in reports?

MITRE ATT&CK

TRAM, mentioned earlier, is built to classify sentences in reports as MITRE ATT&CK data. How will GPT-3 compare?

I'll start with an easy example where ATT&CK Tactics and Techniques are explicitly mentioned in the text provided.

Input:

Please extract all the ATT&CK techniques from the following text: In February 2022, the threat actors exploited Log4Shell [T1190] for initial access [TA0001] to the organization’s unpatched VMware Horizon server. As part of their initial exploitation, CISA observed a connection to known malicious IP address 182.54.217[.]2 lasting 17.6 seconds.

Output:

  • T1190: Log4Shell
  • TA0001: Initial access

GPT-3 understands ATT&CK – we're off to a good start.

Input:

Please explain these ATT&CK Tactics and Techniques in more detail

Output:

  • T1190: Log4Shell is a technique used by attackers to gain access to a system by exploiting a vulnerability in the logging component of a target system. This technique could be used to gain access to a system by injecting malicious code into log files or by manipulating log data to gain access to sensitive information.
  • TA0001: Initial access refers to the means by which attackers initially gain access to a system or network. This could include exploiting a vulnerability in software or hardware, phishing or social engineering tactics, or other means of gaining access to a system without the knowledge or consent of the user. In this example, the attackers used the Log4Shell technique to gain initial access to the organization’s unpatched VMware Horizon server.

The last sentence is the most impressive here as it actually references how the Tactic was used in this incident!

One missing piece in the response above is where these ATT&CK Tactics and Techniques were referenced in the report, something that will be important later when an analyst is researching the threat.

Input:

In the original text, can you markup the ATT&CK Tactics and Techniques using square brackets please?

Output:

In February 2022, the threat actors exploited [T1190: Log4Shell] for initial access [TA0001: Initial access] to the organization’s unpatched VMware Horizon server. As part of their initial exploitation, CISA observed a connection to known malicious IP address 182.54.217[.]2 lasting 17.6 seconds.

And a bit more…

Input:

And also provide the link to the MITRE ATT&CK website in markdown format please?

Output:

In February 2022, the threat actors exploited T1190: Log4Shell for initial access TA0001: Initial access to the organization’s unpatched VMware Horizon server. As part of their initial exploitation, CISA observed a connection to known malicious IP address 182.54.217[.]2 lasting 17.6 seconds.

Let’s now try another example where the input has less explicit references to ATT&CK Tactics and Techniques.

Input:

Please extract all the ATT&CK techniques and their IDs from the following text: Tying this all together, we identified Iranian government infrastructure establishing connections with a known Playful Taurus command and control (C2) server. Pivoting on one of the Iranian government IPs, we then identified additional infrastructure hosting certificates that overlap with a second Playful Taurus C2 server.

Output:

The MITRE ATT&CK framework categorizes techniques used by adversaries into different tactics, and each technique is given a unique ID.

  • Command and Control (C2) communication: T1043
  • Pivoting: T1575
  • Infrastructure manipulation: T1585
  • Certificates overlap with a second C2 server: T1546

Please note that these techniques are not directly linked to the mentioned threat actor, but these are techniques that can be used by any threat actor.

I’d agree with the output here (though I’m not an Intel Analyst!).

However, the last sentence is important. You almost need to feed GPT-3 individual sentences or paragraphs. In a much larger report, the Tactics and Techniques might be specific to numerous actors being discussed (hence why marking up the input as shown before becomes useful).
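In practice that means chunking a report before prompting it. A rough sketch of the idea, assuming the report is a plain text file and splitting on blank lines (the prompt wording and chunking rule here are my own, not a tested recipe):

def chunk_report(report_text):
    # Split on blank lines so each prompt covers a single paragraph
    return [p.strip() for p in report_text.split("\n\n") if p.strip()]

with open("report.txt") as f:  # hypothetical plain-text copy of the report
    report = f.read()

prompts = [
    "Please extract all the ATT&CK techniques and their IDs "
    f"from the following text: {chunk}"
    for chunk in chunk_report(report)
]

# Each prompt is then sent to the model separately, keeping the extracted
# Tactics and Techniques tied to the paragraph (and actor) they came from.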

Lookup

Let’s now try to enrich the report data with more external context.

Input:

Please identify the threat actors in the following text: Tying this all together, we identified Iranian government infrastructure establishing connections with a known Playful Taurus command and control (C2) server. Pivoting on one of the Iranian government IPs, we then identified additional infrastructure hosting certificates that overlap with a second Playful Taurus C2 server.

Output:

The threat actors in this text are the Iranian government and the group known as Playful Taurus.

Input:

What Malware is associated with the Iranian government actors?

Output:

Iranian government actors have been associated with several types of malware, including:

  • Shamoon: A destructive malware that was used in 2012 to target Saudi Arabian oil companies.
  • Dustman: A wiper malware that was used in 2016 to target organizations in the Middle East.
  • OilRig: A cyber-espionage group that has been active since at least 2015 and has been linked to Iran. The group has been known to use a variety of tools and techniques, including malware and spear-phishing.

This answer does confuse Malware and Groups. I will try again;

Input:

Some of the entries in this list are not malware, please only show malware

Output:

Sure, here is a list of malware that have been associated with Iranian government actors:

  • Shamoon: A destructive malware that was used in 2012 to target Saudi Arabian oil companies. It overwrote the master boot record (MBR) of infected computers, rendering them inoperable.
  • Dustman: A wiper malware that was used in 2016 to target organizations in the Middle East. It is used to delete files and render systems inoperable.
  • APT33 (also known as Elfin) : A sophisticated malware toolkit that is used for cyber espionage. APT33 has been known to target aerospace, energy, and petrochemical industries.
  • APT34 (also known as Helix Kitten and OilRig) : APT34 is a cyber-espionage group that has been known to use malware and spear-phishing to target organizations in the financial, government and energy sector.
  • CopyKittens: A cyber-espionage group that was active since 2013 and known to use malware and spear-phishing to target organizations in the middle east, including Israel and Palestine.

Slightly better, but not perfect.

Let me dig into one of these entries.

Input:

In the MITRE ATT&CK framework, what is the software object ID of the software Shamoon? Please also link to the MITRE ATT&CK website for this entry in your answer.

Output:

I’m sorry, but Shamoon malware is not included in the MITRE ATT&CK framework, which is a curated knowledge base and model for cyber adversary behavior, used to enhance the understanding of an adversary’s tactics, techniques, and procedures (TTPs). The MITRE ATT&CK framework is a resource that can be used to understand the TTPs of various types of cyber adversaries, but it is not exhaustive and does not include every known threat or piece of malware.

Which I know is incorrect, so I will inform GPT-3;

Input:

That is not correct. See Shamoon is listed here: https://attack.mitre.org/software/S0140/

Output:

I apologize for the previous incorrect statement, you are correct that Shamoon malware is included in MITRE ATT&CK framework. The software object ID of Shamoon malware in the MITRE ATT&CK framework is: S0140

You can find more information about the Shamoon malware on the MITRE website, here is the link: https://attack.mitre.org/software/S0140/
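Which is a good reminder that anything GPT-3 asserts about ATT&CK should be verified against the dataset itself. A quick sketch of one way to do that, using the ATT&CK STIX bundle that MITRE publishes in its mitre/cti GitHub repository (the lookup logic is my own):

import requests

ATTACK_BUNDLE = (
    "https://raw.githubusercontent.com/mitre/cti/master/"
    "enterprise-attack/enterprise-attack.json"
)

objects = requests.get(ATTACK_BUNDLE).json()["objects"]

# Software in ATT&CK is modelled as STIX malware and tool objects
for obj in objects:
    if obj.get("type") in ("malware", "tool") and obj.get("name") == "Shamoon":
        for ref in obj.get("external_references", []):
            if ref.get("source_name") == "mitre-attack":
                print(ref["external_id"], ref["url"])
                # S0140 https://attack.mitre.org/software/S0140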

Automating everything

So far I’ve only been using the web UI at chat.openai.com to generate these responses.

GPT-3 comes with an API that can be used for this purpose too, specifically the Text Completion endpoint.

The completions endpoint can be used for a wide variety of tasks. It provides a simple but powerful interface to any of our models. You input some text as a prompt, and the model will generate a text completion that attempts to match whatever context or pattern you gave it.

All you need is an API key (create it here) to start making requests as I will show you.

First let me set up my environment;

python3 -m venv openai_tutorial
source openai_tutorial/bin/activate
pip3 install openai

Now I’ll create the script called ioc_extract_prompt.py.

In this I will use the openai.Completion.create() function with the following parameters:

import openai

# Replace YOUR_API_KEY with your OpenAI API key
openai.api_key = "YOUR_API_KEY"

# Set the model and prompt
model_engine = "text-davinci-003"
prompt = "Please extract all the indicators of compromise from the following text. The first IP contains a historical certificate reference to portal-Share.mfa[.]new, which suggests an ambiguous “Ministry of Foreign Affairs (MFA)” nexus. The second is a virtual private server (VPS) owned by The Constant Company. This second IP (158.247.222[.]6) hosted the domain www[.]delldrivers[.]in from July 7, 2022 to Oct. 11, 2022. This domain is associated with a Turian backdoor sample."

# Set the maximum number of tokens to generate in the response
max_tokens = 2048

# Generate a response
completion = openai.Completion.create(
    engine=model_engine,
    prompt=prompt,
    max_tokens=max_tokens,
    temperature=0.5,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)

# Print the response
print(completion.choices[0].text)

Where I set the following:

  • model_engine: The name of the model you want to use. We will be using the text-davinci-003 model for this tutorial. This is the most capable GPT-3 model (see them all here). Note: during this exercise I realised that the latest ChatGPT model isn't available yet via the OpenAI API.
  • prompt: The prompt or context for the conversation. This can be a single line of text or a multi-line prompt separated by newlines. Put another way, this is where you put the questions being asked earlier in this post.
  • max_tokens: The maximum number of tokens (words and punctuation) to generate in the response. The minimum is 1 and the maximum is 2048.

There are a few more parameters that I use in the above request, but I will let you figure out what they are used for in the OpenAI documentation should you want to tweak them yourself.

Now I will run my script;

python3 ioc_extract_prompt.py

Which prints;

Indicators of Compromise: 
1. portal-Share.mfa[.]new 
2. 158.247.222[.]6 
3. www[.]delldrivers[.]in 
4. Turian backdoor sample

Again, the query format will need some tuning, but even using the older model, simple indicator extraction is working as expected.
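One tuning option is to ask for machine-readable output directly, so the response can be parsed without guessing at the list formatting. A rough, untested sketch of the idea (the prompt wording and JSON structure are mine):

import json
import openai

openai.api_key = "YOUR_API_KEY"

prompt = (
    "Please extract all the indicators of compromise from the following text "
    "and return them as a JSON list of objects with 'type' and 'value' keys: "
    "This second IP (158.247.222[.]6) hosted the domain www[.]delldrivers[.]in."
)

completion = openai.Completion.create(
    engine="text-davinci-003",
    prompt=prompt,
    max_tokens=512,
    temperature=0,  # deterministic output is easier to parse
)

# The model does not guarantee valid JSON, so handle parse failures
try:
    iocs = json.loads(completion.choices[0].text.strip())
    print(iocs)
except json.JSONDecodeError:
    print("Response was not valid JSON:", completion.choices[0].text)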

In the coming months I'm going to try to fine-tune the queries for intelligence extraction so that OpenAI can be used in file2stix.

In the meantime, the above should be enough to get you started on your own endeavours!

Oh, and one final word of warning: keep track of the free credits you are using up during development.

For English text, 1 token is approximately 4 characters or 0.75 words.

At the time of writing, usage is charged at $0.02 per 1,000 tokens (roughly 750 words). You get $18 of free credits (good for around 675,000 words) – but don't be complacent, this quota can be used up very quickly (as I have found!).
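Using those approximations, here is a quick back-of-the-envelope helper for estimating what a report will cost to process:

def estimate_cost(word_count, price_per_1k_tokens=0.02):
    # ~0.75 words per token, so tokens = words / 0.75
    tokens = word_count / 0.75
    return (tokens / 1000) * price_per_1k_tokens

# A 1,500 word threat report works out at roughly 4 cents
print(f"${estimate_cost(1500):.4f}")  # $0.0400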



