Offensive Security Tool: Troll-A

by | Dec 29, 2023 | Tools

Join our Patreon Channel and Gain access to 70+ Exclusive Walkthrough Videos.

Patreon
Reading Time: 2 Minutes

Troll-A

Troll-A by crissyfield is a command line tool for extracting secrets such as passwords, API keys, and tokens from WARC (Web ARChive) files. Troll-A is an easy-to-use, comprehensive, and fast solution for finding secrets in web archives.

 

Features

  • Protocols: Supports retrieving web archives directly from a network server via HTTP/HTTPS, from the Amazon S3 object storage service, or from the local file system.
  • Compression: Supports web archives compressed with GZipBZip2, or ZStd. For ZStd, it also supports custom dictionaries prepended to the compressed data stream (as used by *.megawarc.warc.zst files).
  • Comprehensive: Uses the battle-tested ruleset from the Gitleaks project to detect up to 166 different types of secrets, tokens, keys, or other sensitive information.
  • Performance: Works concurrently and optionally uses optimized regular expressions (via go-re2) to process a typical Common Crawl web archive (~34.000 pages) in less than 30 seconds on AWS c7g.12xlarge.

See Also: So you want to be a hacker?
Offensive Security and Ethical Hacking Course

Installation

Download

Troll-A is available in binary form for macOS and Linux on the releases page.

Building

For better performance, it is recommended to build Troll-A from source, as this allows to use optimized regular expression engine provided by go-re2. For this to work, the RE2 dependency must be installed first.

macOS

# Install dependencies

brew install re2

# Install with RE2 activated

go install -tags re2_cgo github.com/crissyfield/[email protected]

 

Debian / Ubuntu

# Install dependencies

sudo apt install -u build-essential libre2-dev

# Install with RE2 activated

go install -tags re2_cgo github.com/crissyfield/[email protected]

Usage

Usage:
  troll-a [flags] url

This tool allows to extract (potential) secrets such as passwords, API keys, and tokens
from WARC (Web ARChive) files. Extracted information is output as structured text org
JSON, which simplifies further processing of the data.

"url" can be either a regular HTTP or HTTPS reference ("https://domain/path"), an Amazon
S3 reference ("s3://bucket/path"), or a file path (either "file:///path" or simply
"path"). If the data is compressed with either GZip, BZip2, or ZStd it is automatically
decompressed. ZStd with a prepended custom dictionary (as used by "*.megawarc.warc.zstd")
is also handled transparently.

This tool uses rules from the Gitleaks project (https://gitleaks.io) to detect secrets.

Flags:
  -e, --enclosed               only report secrets that are enclosed within their context
  -h, --help                   help for troll-a
  -j, --jobs uint              detect secrets with this many concurrent jobs (default 8)
  -s, --json                   output detected secrets as JSON
  -p, --preset rules-preset    rules preset to use. This could be one of the following:
                               all:         All known rules will be applied, which can
                                            result in a significant amount of noise for
                                            large data sets.
                               most:        Most of the rules are applied, skipping the
                                            biggest culprits for false positives.
                               secret:      Only rules are applied that are most likely
                                            to result in an actual leak of a secret.
                               No other values are allowed. (default secret)
  -q, --quiet                  suppress success message(s)
  -r, --retry retry-strategy   retry strategy to use. This could be one of the following:
                               never:       This strategy will fail after the first fetch
                                            failure and will not attempt to retry.
                               constant:    This strategy will attempt to retry up to 5
                                            times, with a 5s delay after each attempt.
                               exponential: This strategy will attempt to retry for 15
                                            minutes, with an exponentially increasing
                                            delay after each attempt.
                               always:      This strategy will attempt to retry forever,
                                            with no delay at all after each attempt.
                               No other values are allowed. (default never)
  -t, --timeout duration       fetching timeout (does not apply to files) (default 30m0s)
  -v, --version                version for troll-a

 

Examples

 

Common Crawl

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone. The Common Crawl corpus contains petabytes of data collected regularly since 2008.

For example, to extract secrets from all of the 3.35 billion pages of the November/December 2023 crawl (called CC-MAIN-2023-50), you can do this:

# Download the list of all 90.000 WARC paths

curl -sSL -O https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/warc.paths.gz

# Iterate through all paths using 64 scanning jobs, output matches as JSON

gzcat warc.paths.gz | \

xargs -I{} -- troll-a -e -s -j64 https://data.commoncrawl.org/{} > secrets.json

 

This will take a long time! Depending on your hardware and Internet connection, this can take anywhere from a week to several months. You may want to run this example only for the first few lines of warc.paths.gz.

 

Internet Archive

The Archive Team is a group dedicated to digital preservation and web archiving founded in 2009. Web archives are stored as WARC files (more specifically, in MegaWARC format) and made available through the Internet Archive.

For example, to extract secrets from the 113.372 pages the Archive Team crawled from pastebin.com in April of 2023 (here’s the corresponding publication on the Internet Archive), you can do this:

# Call troll-a directly with the MegaWARC URL
troll-a -e https://archive.org/download/archiveteam_pastebin_20230421003309_a3b951b4/pastebin_20230421003309_a3b951b4.1603050931.megawarc.warc.zst

 

…which results in…

 

Clone the repo from here: GitHub Link

 

Merch

Recent Tools

Offensive Security & Ethical Hacking Course

Begin the learning curve of hacking now!


Information Security Solutions

Find out how Pentesting Services can help you.


Join our Community

Share This