From Shadows to Signals: Building a Cyber Threat Intelligence Pipeline Over Telegram at Scale

Authorized research only. Everything described here was conducted in a controlled environment against data I had lawful authorization to analyze. Golliath does not facilitate unauthorized access, credential use, or victim contact of any kind.

Every day, hundreds of Telegram channels publish stealer logs: archives packed with credentials, cookies, and system fingerprints from compromised machines. Most teams consume this data passively through commercial feeds, by which point the intelligence is already stale. I wanted to get closer to the source: build something that scraped channels in real time, parsed the archive formats properly, and gave me a structured, searchable view of what was being shared.

That project became Golliath, a six-subsystem platform written in Go and Python, backed by Kafka, OpenSearch, Neo4j, MinIO, and PostgreSQL, with a Next.js frontend for exploration. This post is a deep dive into how it works, why it's built the way it is, and what the numbers look like after running it against a real-world corpus.

1. Why Telegram? Why Now?

Telegram has become the dominant distribution channel for cybercriminal activity and stealer-log marketplaces. The reasons are structural: channels support large file attachments (up to 2 GB per file, or 4 GB for Premium accounts), messages are persistent and indexed by Telegram's own servers, and invite-link-based access creates a discoverable social graph of threat actor infrastructure.

Unlike paste sites or dark-web forums, Telegram does not require Tor and responds well to programmatic access via MTProto. The channel topology itself is intelligence: who forwards whom, which channels share an admin, how a new log pack propagates across the ecosystem within hours of publication.

The challenge is volume. A single active channel might post dozens of archives per day, and a monitoring operation watching 50+ channels accumulates data faster than any manual triage workflow can handle. The only way to turn that volume into signal is automation, and automation requires understanding the archive formats.

2. Golliath at a Glance

Golliath is not a single application. It is six loosely-coupled subsystems that share storage and communicate through Kafka and HTTP:

Subsystem	Language	Role
data_lake	Go + Python	Collection, scraping, file download, event streaming, indexing
parser_logs	Python	Archive parsing pipeline: stealer family grammars, field extractors, NLP enrichment
intelligence-analysis	Python	ML benchmarking, IOC extraction, threat classification, clustering
frontend	Next.js 14	Dashboards, search, graph explorer, source management
infra	Docker Compose	Kafka KRaft, OpenSearch, MinIO, Neo4j, Redis, PostgreSQL, Prometheus, MLflow
sdk	Go	Shared HTTP client, job types, retry logic

Figure: Golliath Platform Architecture. Services are grouped into five layers: Source, Collection, Transport, Processing, and Storage.

The collection subsystem (data_lake) is the entry point. It scrapes channels, downloads files, and publishes events. Everything downstream (parser, analysis, frontend) consumes from shared storage and search indices. This separation means I can run the parser against the same corpus multiple times as the grammar logic evolves, without re-scraping anything.

3. Source Coverage

Before building any parser, you need data, and that means maintaining a curated source registry. Golliath currently monitors 643 Telegram sources across ten categories:

Category	Sources	Description
logs	325	Stealer-log channels, archive dumps, credential shares
chatter	108	General cybercrime discussion, threat actor communication
ULP	55	URL:login:password combo lists
combos	35	Pre-formatted credential stuffing lists
actors	33	Known threat actor personal channels
leaks	26	Database breach announcements and dumps
marketplaces	26	Credential sale listings
infrastructure	17	C2, bulletproof hosting, proxy sale channels
exploit-sharing	13	PoC and weaponized exploit distribution
malware	5	Malware distribution and update channels

Regional breakdown: the majority are tagged global (502), with significant Russian-language coverage (RU: 44), Middle East and North Africa (MENA: 18), and Southeast Asia (SEA: 6). Each source carries a scrape priority (1-10), a configurable interval, and per-source file-download flags controlling whether audio, documents, images, or videos are fetched.

4. Scraping at Scale

4.1 The Session Pool

Telegram's MTProto API does not expose a simple REST endpoint; you authenticate as a user account (or bot) and maintain a session. The session-manager service (Python/Telethon) maintains a pool of authenticated sessions and exposes a small HTTP API that the Go services call.

Scraping a channel means calling iter_messages with a checkpoint cursor:

async for message in client.iter_messages(
    entity,
    min_id=checkpoint_id,
    reverse=True,
    limit=3000,
):
    await publish_to_worker(message)

reverse=True walks forward from the checkpoint, so resuming a scrape after a restart does not re-process old messages. The checkpoint is stored in PostgreSQL and updated atomically after each batch commits to Kafka.
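
The ordering is what makes the resume safe: the checkpoint only moves once a batch has been handed off. A minimal sketch of that loop follows, with in-memory stand-ins for Kafka and PostgreSQL. The names `scrape_with_checkpoint`, `publish_batch`, `save_checkpoint`, and the batch size are hypothetical and not taken from Golliath's code.

```python
from typing import Callable, Iterable

Message = tuple[int, str]  # (message_id, text)


def scrape_with_checkpoint(
    messages: Iterable[Message],            # oldest first, as with reverse=True
    checkpoint_id: int,
    publish_batch: Callable[[list[Message]], None],
    save_checkpoint: Callable[[int], None],
    batch_size: int = 100,
) -> int:
    """Publish messages newer than checkpoint_id and return the new checkpoint.

    The checkpoint is saved only after publish_batch returns, so a crash
    mid-batch replays that batch on restart instead of dropping it.
    """
    batch: list[Message] = []
    for msg_id, text in messages:
        if msg_id <= checkpoint_id:
            continue  # already handled in an earlier run
        batch.append((msg_id, text))
        if len(batch) >= batch_size:
            publish_batch(batch)
            checkpoint_id = batch[-1][0]
            save_checkpoint(checkpoint_id)
            batch = []
    if batch:  # flush the final partial batch
        publish_batch(batch)
        checkpoint_id = batch[-1][0]
        save_checkpoint(checkpoint_id)
    return checkpoint_id
```

Saving after publishing gives at-least-once delivery: a batch can be replayed after a crash, so downstream consumers need to tolerate duplicate message IDs.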

File downloads use the same session pool. Telethon streams file data in chunks; the session-manager relays those chunks to the download-worker over HTTP. I settled on 512 KB chunks after testing: larger chunks caused session timeouts on slow connections; smaller chunks introduced too much HTTP overhead per file.

4.2 Job Dispatch with SKIP LOCKED

The download-worker claims scrape jobs from PostgreSQL using a pattern that avoids distributed locks:

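-- Run inside a transaction: the row lock is held only until COMMIT, so the
-- worker should mark the job as claimed before committing.
SELECT id, source_id, channel_identifier
FROM scrape_jobs
WHERE status = 'pending'
  AND scheduled_at <= NOW()
ORDER BY priority DESC, scheduled_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;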

SKIP LOCKED means multiple workers can race for jobs without blocking each other. Jobs that are locked by another worker are transparently skipped. This gives horizontal scalability for free: spin up more download-worker replicas and the queue drains faster with zero coordination overhead.

4.3 Kafka and File Priority Tiers

Once a message is fetched, the download-worker publishes it to the messages.discovered Kafka topic. Files attached to that message get routed to one of two priority topics:

files.priority: text files (credential dumps, plain-text logs)
files.archives: ZIP, RAR, 7z, TAR

The file-worker consumes these topics in priority order. Text files land in MinIO's txt-dumps/ bucket for immediate parsing; archives go to archives/; executables and unknown formats are quarantined. SHA-256 is computed at download time and stored in PostgreSQL; duplicate files are detected before they are written, keeping the bucket clean.
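
The routing and deduplication rules above can be sketched in a few lines. This is an illustration only: the extension sets, the `quarantine` label, and the in-memory `Deduplicator` are assumptions standing in for the real topic router and the PostgreSQL hash table.

```python
import hashlib
from pathlib import PurePosixPath

# Assumed extension sets; the post only names "text files" and ZIP/RAR/7z/TAR.
TEXT_EXTS = {".txt"}
ARCHIVE_EXTS = {".zip", ".rar", ".7z", ".tar"}


def route_file(filename: str) -> str:
    """Pick a destination for a downloaded file based on its extension."""
    ext = PurePosixPath(filename.lower()).suffix
    if ext in TEXT_EXTS:
        return "files.priority"
    if ext in ARCHIVE_EXTS:
        return "files.archives"
    return "quarantine"  # executables and unknown formats


class Deduplicator:
    """In-memory stand-in for the SHA-256 lookup stored in PostgreSQL."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_new(self, data: bytes) -> bool:
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

Note that a suffix check like this sends `logs.tar.gz` to quarantine, because its final suffix is `.gz`; a production router would more likely inspect magic bytes than trust the file name.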

4.4 Hunting Invite Links

Telegram invite links (t.me/+HASH) appear constantly in channel messages: operators cross-post content, advertise related services, and link to downstream distribution channels. The message-indexer extracts every invite link from indexed messages using a regex that covers all common patterns:

var reInviteLink = regexp.MustCompile(
    `(?i)https?://(?:t\.me|telegram\.me|telegram\.dog)/(?:\+|joinchat/)([A-Za-z0-9_\-]{10,})`)

Matched hashes are stored in a PostgreSQL invite_links table with their source context. The frontend's /hunt view resolves these hashes via CheckChatInviteRequest, Telegram's RPC for probing private channel metadata without joining. This surfaces channel title, member count, description, and admin signatures from a hash alone.
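
For readers following along in Python, the same pattern translates directly to the `re` module. The helper name `extract_invite_hashes` is hypothetical; the sketch deduplicates hashes while keeping the order of first appearance.

```python
import re

RE_INVITE_LINK = re.compile(
    r"https?://(?:t\.me|telegram\.me|telegram\.dog)/(?:\+|joinchat/)([A-Za-z0-9_\-]{10,})",
    re.IGNORECASE,
)


def extract_invite_hashes(text: str) -> list[str]:
    """Return unique invite hashes in order of first appearance."""
    seen: dict[str, None] = {}
    for match in RE_INVITE_LINK.finditer(text):
        # Hashes are case-sensitive, so they are stored exactly as matched.
        seen.setdefault(match.group(1), None)
    return list(seen)
```

The `{10,}` length floor drops short public-username links that happen to start with `+`.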

The result is a ranked expansion surface: a small channel under monitoring often contains invite links to a much larger aggregator or supply channel that was never manually added to the source registry. One invite-link pivot in the right direction can add 10 new monitored sources.

5. Message Indexing and LLM-Assisted Channel Categorization

5.1 Real-Time Indexing

The message-indexer service (Go) consumes the messages.discovered Kafka topic and writes every message to OpenSearch. At 1,958.7 messages/second, new content from any monitored channel is searchable within seconds of being published.

After 234,757 indexed messages across the monitored source set, the OpenSearch index becomes a live intelligence feed: analysts can query for a specific domain, an IP range, a malware family name, or a channel handle and get ranked results with full message context.

5.2 Sampling and LLM Pre-Labeling

Manually reviewing hundreds of channel messages to decide a source's correct category does not scale. Instead, Golliath uses a corpus-first approach:

1. Sample a random subset of messages from each unclassified source (typically 20-50 messages).
2. Pass the sample to Gemini 2.5 Flash for pre-labeling: the model extracts structured fields (URL, USERNAME, PASSWORD, DOMAIN, SUBDOMAIN) from each line.
3. A human analyst reviews the labels in Doccano, correcting errors.
4. The labeled dataset trains a Gradient Boosting Classifier (85/15 train/test split, confidence threshold 0.70) used to auto-categorize future sources.

The prompt instructs Gemini to return a strict JSON array with no markdown, handling all four common credential formats: email:password, url:user:pass, url|user|pass, and variations where passwords contain colons (joined as trailing fields). Labels for the hardest 1,000 cases were fully reviewed before being committed to the training set.

This approach reduced the manual categorization burden by ~80% compared to reading raw channel content. The residual 20% (sources where the classifier confidence falls below the threshold or the content is ambiguous) is flagged for human review.

6. The Archive Problem

Open any stealer-log archive and you will find a surprisingly consistent internal structure. The stealer client runs on the victim machine, collects credentials from browsers, writes everything to a local directory, and then compresses it before exfiltrating to the C2 server or dropping it directly to a Telegram channel.

Each stealer family has evolved its own layout convention. A Lumma archive looks like this:

LummaC2_2024_01_15_SESSION_12345/
├── domain_detect.txt          <- family fingerprint
├── Passwords/
│   ├── Chrome_Default.txt
│   └── Firefox.txt
├── Cookies/
│   ├── Chrome_Default.txt
│   └── network.txt
├── Autofill/
│   └── Chrome_Default.txt
└── System Info.txt

A WhiteSnake archive looks entirely different:

System_HOSTNAME_WIN10PRO/
├── Browsers/
│   ├── Passwords/
│   └── Cookies/
├── Cookies/                   <- root-level cookies too
├── BankCards.txt
└── System.txt

These are not cosmetic differences. A parser that assumes Lumma layout on a WhiteSnake archive will misroute the password files and produce nothing. Before I could write a single extractor, I needed a way to identify the family and select the right parsing logic.

7. The Parser Pipeline

The parser_logs subsystem implements a six-stage pipeline. Archives go in, structured JSON records come out.

Figure: parser_logs processing pipeline, six stages from raw archive to enriched structured intelligence.

The first stage is the Walker (recursive traversal). It recursively walks the extracted archive tree, enforces safety limits before touching anything, and outputs a stream of file paths plus metadata. Implementation:

- Extracts ZIP/RAR/7z/TAR with format-specific decompressors
- Archive bomb guard: aborts when the ratio exceeds 20x above 1 MB
- Entry count cap: aborts above 100,000 entries
- Nested archive depth: max 3 levels
- Zip-slip protection: resolves all symlinks before extraction
- Emits { path, size, mtime } records for the classifier

7.1 Safety Before Parsing

The walker enforces hard limits before extracting a single byte of content. This matters because attackers have started distributing archive bombs (intentionally malformed archives that expand to hundreds of gigabytes) and zip-slip payloads that try to write outside the extraction directory.

ARCHIVE_BOMB_RATIO = 20        # uncompressed:compressed > 20x -> abort
ARCHIVE_BOMB_MIN_SIZE = 1_048_576  # only apply ratio check above 1 MB
MAX_ENTRY_COUNT = 100_000
MAX_NESTED_DEPTH = 3
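
As an illustration of how these limits can be checked before any byte is decompressed, here is a minimal sketch for the ZIP case using only the standard library. `ArchiveRejected` and `check_zip_limits` are hypothetical names, and the sketch assumes the 1 MB floor applies to the declared uncompressed size, which the post does not specify.

```python
import zipfile

ARCHIVE_BOMB_RATIO = 20
ARCHIVE_BOMB_MIN_SIZE = 1_048_576
MAX_ENTRY_COUNT = 100_000


class ArchiveRejected(Exception):
    """Raised when an archive trips one of the pre-extraction limits."""


def check_zip_limits(zf: zipfile.ZipFile) -> None:
    """Inspect the central directory only; no entry is decompressed here."""
    infos = zf.infolist()
    if len(infos) > MAX_ENTRY_COUNT:
        raise ArchiveRejected(f"too many entries: {len(infos)}")

    compressed = sum(info.compress_size for info in infos)
    uncompressed = sum(info.file_size for info in infos)
    # Assumption: the 1 MB floor is measured on the declared uncompressed size.
    # Multiplying instead of dividing avoids a zero-division on empty payloads.
    if (
        uncompressed > ARCHIVE_BOMB_MIN_SIZE
        and uncompressed > ARCHIVE_BOMB_RATIO * compressed
    ):
        raise ArchiveRejected(
            f"suspicious ratio: {uncompressed} bytes from {compressed} compressed"
        )
```

The sizes come from the archive's own headers, which a hostile archive can misstate, so a check like this is a first filter. Counting bytes actually written during extraction is still needed as a second guard.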

Zip-slip protection resolves all paths relative to the extraction root and rejects any entry whose resolved path escapes it:

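from pathlib import Path

class ZipSlipError(Exception):
    pass

def _safe_extract_path(archive_root: Path, entry_name: str) -> Path:
    root = archive_root.resolve()
    target = (root / entry_name).resolve()
    # Compare path components, not string prefixes: "/tmp/out_evil" starts with "/tmp/out".
    if not target.is_relative_to(root):  # Python 3.9+
        raise ZipSlipError(f"Attempted path traversal: {entry_name}")
    return target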

These are not theoretical risks. Zip-slip payloads appear in the wild corpus.

7.2 Layout Classification and Grammar Selection

After safe extraction, the layout classifier surveys the directory tree without reading file contents. It looks for family-specific structural signatures:

domain_detect.txt or domaindetect.txt at root: Lumma signal
Network/ containing UserName_* subdirectories: RisePro signal
System_HOSTNAME/ directories at root: WhiteSnake signal
log.txt at root with a Passwords/ sibling: RedLine signal
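
A minimal sketch of that survey, working purely on file paths. It assumes the paths are relative to the session root (so the outer wrapper directory is already stripped) and that only files, not bare directories, are listed. The function name and signal labels are hypothetical.

```python
from pathlib import PurePosixPath


def detect_family_signals(file_paths: list[str]) -> set[str]:
    """Return structural family signals found in a list of relative file paths."""
    parts_list = [PurePosixPath(p).parts for p in file_paths]
    root_files = {parts[0].lower() for parts in parts_list if len(parts) == 1}
    root_dirs = {parts[0] for parts in parts_list if len(parts) > 1}

    signals: set[str] = set()
    if root_files & {"domain_detect.txt", "domaindetect.txt"}:
        signals.add("lumma")
    if any(
        len(parts) >= 3 and parts[0] == "Network" and parts[1].startswith("UserName_")
        for parts in parts_list
    ):
        signals.add("risepro")
    if any(name.startswith("System_") for name in root_dirs):
        signals.add("whitesnake")
    if "log.txt" in root_files and "Passwords" in root_dirs:
        signals.add("redline")
    return signals
```

More than one signal can fire on a repacked archive, which is why the signals are scored rather than treated as a verdict.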

These signals are passed to the GrammarRouter, which calls matches() on every registered grammar and picks the highest scorer above 0.60:

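CONFIDENCE_THRESHOLD = 0.60

class GrammarRouter:
    def __init__(self, grammars: list[FamilyGrammar], generic: FamilyGrammar) -> None:
        # Highest priority first: max() keeps the first of equal scores,
        # so ties go to the higher-priority (more specific) grammar.
        self._grammars = sorted(grammars, key=lambda g: g.priority, reverse=True)
        self._generic = generic  # fallback, passed in explicitly

    def route(self, layout: ArchiveLayout) -> FamilyGrammar:
        scores = [(g, g.matches(layout)) for g in self._grammars]
        best_grammar, best_score = max(
            scores, key=lambda x: x[1], default=(self._generic, 0.0)
        )
        if best_score >= CONFIDENCE_THRESHOLD:
            return best_grammar
        return self._generic  # unconditional fallback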

Figure: Grammar Router confidence competition. Each FamilyGrammar scores the layout evidence, and the highest score above 0.60 wins.

Example shown: WhiteSnakeGrammar (WhiteSnake Stealer aggregator repacks) scores 0.95, above the 0.60 threshold, so it wins routing. Primary trigger: System_* + Cookies/. Evidence scored by matches():

- System_<HOSTNAME>/ directories in root
- Cookies/ directory at root level
- BankCard*.txt files present

Fields extracted: passwords (colon-separated), autofill, credit cards.

The threshold of 0.60 was chosen empirically: below it, the grammar was wrong more often than the generic fallback would be. Tie-breaking favors more-specific grammars; GenericGrammar always returns 0.0, so it never wins a competition.

8. Family Grammars

Every stealer family gets its own FamilyGrammar subclass. The abstract base defines a simple contract:

Listing: FamilyGrammar ABC (Python)

8.1 LummaGrammar

Lumma is the dominant stealer-as-a-service family in the current corpus. Its archives are identified by domain_detect.txt, a list of domains that the stealer was configured to target, one per line.

Listing: LummaGrammar.matches() (Python)

The domain_detect.txt file is also intelligence in itself: it reveals which domains the affiliate configured the stealer to prioritize, often including specific banking, crypto exchange, and corporate SSO domains.

8.2 RiseProGrammar

RisePro organizes each victim session into a named subdirectory under Network/. The directory name encodes a username hash: Network/UserName_a3f9b2/. This layout is distinctive enough that the classifier scores it at ~0.88.

Listing: RiseProGrammar.matches() (Python)

8.3 WhiteSnakeGrammar

WhiteSnake is the highest-confidence detection at 0.95 when both signals co-occur: System_HOSTNAME/ directories (one per victim session in aggregator repacks) and a root-level Cookies/ directory.

8.4 GenericGrammar (The Fallback)

When no family grammar scores above 0.60, GenericGrammar takes over. It runs all four password format parsers in cascade and tries the best-effort cookie parser. Coverage is lower than a matched grammar, but it is still far better than discarding the archive entirely.

The fallback handles roughly 15% of the corpus in practice, mostly aggregator repacks that strip or rename the family-specific files before redistribution.

9. Field Extractors

Once the grammar routes the archive, per-role extractors run against the classified file list. The most complex is the password extractor, which handles four distinct formats across stealer families.

9.1 Password Format Cascade

Passwords are parsed in priority order, stopping at the first format that yields results:

1. Key:value blocks (Lumma, some RedLine variants):

URL: https://example.com/login
UserName: alice@corp.com
Password: hunter2

URL: https://bank.example/auth
...

2. TSV format (RisePro):

https://mail.google.com	alice@gmail.com	••••••••
https://github.com	alice	gh_token_abc123

3. Pipe-separated (RedLine v22+):

https://twitter.com/|alice|P@ssw0rd!

4. Colon-separated (WhiteSnake, generic):

https://corp.okta.com:alice@corp.com:S3cur3P@ss

Each format parser is implemented as a standalone function that returns list[PasswordRecord] | None. The cascade tries them in order and short-circuits on the first non-empty result. If all four fail, the file is logged as unparseable and added to the backlog.
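
A compact sketch of that cascade is shown below. The parser names, the `PasswordRecord` fields, and the colon-splitting heuristic are assumptions made for illustration; in particular, the colon parser does not handle URLs that carry an explicit port.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PasswordRecord:
    url: str
    username: str
    password: str


def parse_key_value(text: str) -> Optional[list[PasswordRecord]]:
    records, current = [], {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if not sep or key not in ("url", "username", "password"):
            continue
        if key == "url":
            current = {}  # a new block starts; drop any incomplete one
        current[key] = value.strip()
        if len(current) == 3:
            records.append(PasswordRecord(
                current["url"], current["username"], current["password"]))
            current = {}
    return records or None


def _parse_delimited(text: str, sep: str) -> Optional[list[PasswordRecord]]:
    records = []
    for line in text.splitlines():
        parts = line.split(sep, 2)  # maxsplit keeps separators inside passwords
        if len(parts) == 3 and parts[0] and parts[1]:
            records.append(PasswordRecord(*parts))
    return records or None


def parse_tsv(text: str) -> Optional[list[PasswordRecord]]:
    return _parse_delimited(text, "\t")


def parse_pipe(text: str) -> Optional[list[PasswordRecord]]:
    return _parse_delimited(text, "|")


def parse_colon(text: str) -> Optional[list[PasswordRecord]]:
    records = []
    for line in text.splitlines():
        scheme, sep, rest = line.partition("://")  # protect the scheme's colon
        if not sep:
            continue
        parts = rest.split(":", 2)
        if len(parts) == 3 and parts[0] and parts[1]:
            records.append(PasswordRecord(f"{scheme}://{parts[0]}", parts[1], parts[2]))
    return records or None


PARSERS = (parse_key_value, parse_tsv, parse_pipe, parse_colon)


def parse_passwords(text: str) -> Optional[list[PasswordRecord]]:
    """Try each format in priority order; stop at the first non-empty result."""
    for parser in PARSERS:
        records = parser(text)
        if records:
            return records
    return None  # caller logs the file as unparseable
```

Splitting with a maximum of two separators means a password containing the delimiter stays intact as the trailing field.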

9.2 Cookie Parsing

Cookies follow the Netscape 7-column TSV format that browsers export. The parser enforces strict column counts; the most common corruption in wild archives is lines with missing tab separators, which a lenient parser would silently misparse into wrong fields.

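NETSCAPE_COLUMNS = 7
# curl-style exports mark HttpOnly cookies with this prefix; without the
# check below, such lines would be discarded as comments.
HTTPONLY_PREFIX = "#HttpOnly_"

def parse_netscape_cookies(text: str) -> list[CookieRecord]:
    records = []
    for line in text.splitlines():
        http_only = line.startswith(HTTPONLY_PREFIX)
        if http_only:
            line = line[len(HTTPONLY_PREFIX):]
        elif line.startswith("#") or not line.strip():
            continue  # comment or blank line
        cols = line.split("\t")
        if len(cols) != NETSCAPE_COLUMNS:
            continue  # malformed line: skip, do not attempt partial parsing
        domain, include_subdomains, path, secure, expiry, name, value = cols
        records.append(CookieRecord(
            domain=domain.lstrip("."),
            path=path,
            name=name,
            value=value,
            secure=secure.upper() == "TRUE",
            http_only=http_only,
            expires=int(expiry) if expiry.isdigit() else None,
        ))
    return records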

The 14.9 million cookies in the benchmark corpus made this the highest-volume extractor by a wide margin. Session cookies are the primary target: replaying a valid session cookie resumes an already-authenticated session, which sidesteps MFA.

9.3 System Info and the Alias Map

Every stealer writes a system fingerprint file, but the field names are not standardized. Lumma calls it System Info.txt with IP: 1.2.3.4 lines; WhiteSnake uses System.txt with Ip = lines; RedLine uses log.txt with IP Address: lines.

The system info extractor maintains an alias map:

FIELD_ALIASES: dict[str, list[str]] = {
    "hostname": ["computer name", "hostname", "pc name", "computername"],
    "ip": ["ip", "ip address", "external ip", "wan ip"],
    "os": ["os", "operating system", "windows version"],
    "hardware": ["hardware", "cpu", "gpu", "ram", "processor"],
    "country": ["country", "location", "geo"],
    "hwid": ["hwid", "machine id", "hardware id", "uuid"],
}

Normalization means downstream queries (WHERE system_info.country = 'US') work regardless of which stealer family produced the record.
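
To make the lookup concrete, here is a small sketch that inverts the alias map and normalizes `key: value` and `key = value` lines. The function name and the first-match-wins rule are assumptions for illustration.

```python
import re

FIELD_ALIASES: dict[str, list[str]] = {
    "hostname": ["computer name", "hostname", "pc name", "computername"],
    "ip": ["ip", "ip address", "external ip", "wan ip"],
    "os": ["os", "operating system", "windows version"],
    "hardware": ["hardware", "cpu", "gpu", "ram", "processor"],
    "country": ["country", "location", "geo"],
    "hwid": ["hwid", "machine id", "hardware id", "uuid"],
}

# Inverted index: alias -> canonical field name.
_ALIAS_TO_FIELD = {
    alias: field for field, aliases in FIELD_ALIASES.items() for alias in aliases
}

# Key, then ":" or "=", then value. The key itself may not contain ":" or "=".
_LINE = re.compile(r"^\s*([^:=]+?)\s*[:=]\s*(.*?)\s*$")


def normalize_system_info(text: str) -> dict[str, str]:
    """Map stealer-specific field names onto canonical ones (first match wins)."""
    out: dict[str, str] = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if not match:
            continue
        field = _ALIAS_TO_FIELD.get(match.group(1).lower())
        if field and field not in out:
            out[field] = match.group(2)
    return out
```

Because several aliases map to `hardware`, a first-match rule keeps only one of the CPU, GPU, or RAM lines; a fuller version would probably collect them into a list.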

10. NLP Enrichment

After extraction, every session passes through the NLP enrichment pipeline. Two things happen here: IOC extraction and named-entity recognition on the free-text fields.

10.1 IOC Extraction and IntelOwl Integration

IOC extraction uses compiled regex patterns with domain TLD validation:

import re

IOC_PATTERNS = {
    "domain":  re.compile(r"\b(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}\b"),
    "ipv4":    re.compile(r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b"),
    "url":     re.compile(r"https?://[^\s\"'<>]+"),
    # Non-capturing prefix group, so findall() returns the whole address.
    "wallet":  re.compile(r"\b(?:bc1|[13])[a-zA-HJ-NP-Z0-9]{25,62}\b|\b0x[a-fA-F0-9]{40}\b"),
    "cve":     re.compile(r"CVE-\d{4}-\d{4,7}", re.IGNORECASE),
}

The domain extractor runs every match through a TLD allowlist before accepting it. This eliminates the long tail of false positives from free-form text; words like example.log or victim.txt look like domains to a naive regex but are discarded after TLD rejection.
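
A short sketch of that filter follows. The allowlist here is a tiny illustrative subset and `extract_domains` is a hypothetical name; a real deployment would presumably load the full IANA TLD list.

```python
import re

DOMAIN_RE = re.compile(r"\b(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}\b")

# Illustrative subset only, not the list Golliath ships with.
TLD_ALLOWLIST = frozenset({"com", "net", "org", "io", "ru", "br", "in", "id"})


def extract_domains(text: str) -> list[str]:
    """Return domain candidates whose final label is an allowed TLD."""
    domains = []
    for match in DOMAIN_RE.finditer(text):
        candidate = match.group(0).lower()
        tld = candidate.rsplit(".", 1)[1]
        if tld in TLD_ALLOWLIST:
            domains.append(candidate)
    return domains
```

Some real TLDs double as file extensions (`.zip` and `.mov` are both delegated), so which of those to allow is a tuning decision rather than a lookup.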

High-priority IOCs (IPs and domains that appear across multiple sessions or match watchlist entries) are optionally enriched via IntelOwl, an open-source threat intelligence orchestration platform that aggregates reputation data from VirusTotal, AbuseIPDB, Shodan, and dozens of other sources. The integration is fire-and-forget: IOCs above a configurable session-frequency threshold are submitted to IntelOwl's REST API, and results are written back to the sessions index as enrichment fields. This turns raw IOC extraction into enriched, context-bearing threat data without blocking the extraction pipeline.

10.2 Named Entity Recognition

The enrichment pipeline supports three NER configurations:

Model	F1	Speed	Best for
DistilBERT-NER	0.81	Fast	Default; English stealer logs
CyNER	0.74	Fast	CTI-specific entity types (TTP, malware)
XLM-RoBERTa	0.76	Slow	Non-English archives (Russian, Turkish)

CyNER was developed specifically for cybersecurity text and handles entity types that general NER models miss, but its overall F1 on the benchmark (0.74) was below both DistilBERT (0.81) and XLM-RoBERTa (0.76). XLM-RoBERTa is the right choice for archives from Eastern European operator groups where the system info and file paths contain Cyrillic text.

10.3 Session Deduplication via HDBSCAN

Aggregators routinely repackage the same victim sessions under different branding. A session that appeared in three separate archive releases would count three times without deduplication.

HDBSCAN clustering on multilingual embeddings catches near-duplicate sessions: if two sessions have the same hostname, IP, and browser profile fingerprint, they cluster tightly regardless of the archive they came from. The benchmark showed ~12% of sessions in the corpus were duplicates by this measure.

Golliath Benchmark Results

Numbers from a single pipeline run on a real-world corpus.

Metric	Value
Credentials (URL + username + password)	314,717
Cookies (Netscape-format records)	14.9 M
Autofill records (form field name/value pairs)	28,409
Credit cards	1,204
Throughput (end-to-end archive parsing)	10.5 MB/s
Msg indexing (IOC extraction baseline)	1,958.7 msg/s
IOCs extracted (from 234,757 messages)	277,965
Family match rate	85%
Generic fallback	15%
Session dedup	~12% reduction

Families covered: Lumma C2, RisePro, WhiteSnake, RedLine, plus 8 aggregator brands handled by GenericGrammar.

11. The Aggregator Problem

A significant fraction of Telegram stealer-log channels are not operators; they are aggregators. They buy log packs from multiple stealer networks, strip identifying metadata, rebrand the archives, and resell them to credential stuffers.

This matters for two reasons. First, the rebranding often removes or renames the family-specific files (like domain_detect.txt), pushing more archives into the generic fallback. Second, aggregators sometimes use the password field itself as advertising space:

URL: https://mail.google.com
UserName: alice@gmail.com
Password: JOIN @BESTLOGS FOR MORE  <- channel promotion, not a credential

The extractor detects and discards these entries using a heuristic: if the password field contains an @handle, a t.me/ URL, or common promotional phrases, the record is flagged and excluded from the credential index.
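
A rough sketch of such a heuristic is below. The exact patterns and the phrase list are assumptions, not the rules Golliath uses; the handle pattern only fires when `@` starts a token, so passwords like `P@ssw0rd!` pass through.

```python
import re

_PROMO_PATTERNS = (
    # Telegram @handle as its own token (a letter plus 4-31 more characters).
    re.compile(r"(?<!\S)@[A-Za-z][A-Za-z0-9_]{4,31}\b"),
    # t.me / telegram.me links, not preceded by another alphanumeric character.
    re.compile(r"(?<![A-Za-z0-9])(?:t|telegram)\.me/\S+", re.IGNORECASE),
    # Hypothetical phrase list; the real one would be tuned on the corpus.
    re.compile(r"\b(?:join us|subscribe|for more|free logs)\b", re.IGNORECASE),
)


def is_promotional(password: str) -> bool:
    """Return True when a password field looks like channel advertising."""
    return any(pattern.search(password) for pattern in _PROMO_PATTERNS)
```

A genuine password such as `@Summer2024` would also trip the handle rule, which is one reason to flag such records for exclusion rather than delete them outright.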

Known aggregator brands encountered in the corpus: Exodus, AzoriX, MOAB Stealer, PureLog, Shadow Logs, UnknownStealer, Meduza, StealC repacks, and several unnamed ones identified only by watermark patterns.

The watermark removal is a separate post's worth of content. In short: most watermarks are added as directory name prefixes ([EXODUS] SESSION_123/) or injected into System Info.txt headers. Both are stripped during layout classification before the grammar sees them.
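
For the directory-prefix case, the stripping step might look like the sketch below. The bracket pattern and the function name are assumptions based on the single `[EXODUS] SESSION_123/` example, and header watermarks inside System Info.txt are not covered.

```python
import re
from typing import Optional

# A leading "[BRAND]" tag of up to 40 characters, plus surrounding whitespace.
_WATERMARK_PREFIX = re.compile(r"^\s*\[([^\]\[/]{1,40})\]\s*")


def strip_watermark(dirname: str) -> tuple[str, Optional[str]]:
    """Return (clean_name, brand); brand is None when no watermark is found."""
    match = _WATERMARK_PREFIX.match(dirname)
    if not match:
        return dirname, None
    remainder = dirname[match.end():]
    if not remainder:
        return dirname, None  # the whole name is a tag; leave it untouched
    return remainder, match.group(1).strip()
```

Returning the brand alongside the cleaned name keeps the attribution available as metadata even though the grammar never sees it.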

12. Intelligence at the Top

With parsed data in OpenSearch and the relationship graph in Neo4j, meaningful intelligence queries become fast.

12.1 Domain Exposure Reports

The most common operational query is "Is domain X in the credential set?" It runs as a two-pass OpenSearch aggregation:

First pass: count sessions containing credentials for the target domain
Second pass: enumerate unique username/password pairs, deduplicated by credential hash

Results feed directly into the /hunt API endpoint and surface in the frontend's domain search.

12.2 Channel Topology via Neo4j

The graph model captures how threat actor infrastructure connects:

MATCH (a:Source)-[:FORWARDED_FROM_SOURCE]->(b:Source)
WHERE b.identifier = '@target_channel'
RETURN a.title, a.member_count, a.created_at
ORDER BY a.member_count DESC

This surfaces every channel that has forwarded content from a target channel, revealing distribution networks and shared admin infrastructure that are invisible from message-level analysis.

12.3 Actor Pivoting

MATCH (u:TelegramUser)-[:ACTIVE_IN]->(s:Source)
WHERE s.identifier IN ['@chan_a', '@chan_b']
WITH u, collect(s.identifier) AS channels
WHERE size(channels) > 1
RETURN u.username, channels

A user active in multiple monitored channels is a pivot point, potentially an operator, reseller, or admin cross-posting content.

13. TLD Analytics and Geographic Density

315M+ credentials across 60 TLDs reveal a clear geographic picture of who is being targeted. The .com namespace dominates at 148M credentials, unsurprising given that most global services use .com domains. Among country-code TLDs, Brazil leads by a wide margin (7.1M), followed by India (4.8M) and Indonesia (3.6M).

The per-capita view is more revealing. When normalized against population, smaller countries can surface as disproportionately targeted. Countries in Latin America (Peru, Chile) and Southeast Asia (Vietnam, Indonesia) rank higher per-capita than their absolute numbers suggest, consistent with the known geographic distribution of stealer-malware campaigns that favor regions with high smartphone penetration but lower security awareness.

Figure: Credential Density by Country. 315M+ credentials across 60 TLDs, aggregated to country level via ccTLD mapping (log scale, ccTLD only; scale maximum 7.1M).

The TLD analytics layer is powered by a full OpenSearch aggregation over the credential URL domain field, grouped by ccTLD suffix. The query is expensive (full index scan across tens of millions of documents) and is cached for one hour. The result feeds the world heatmap and ranked table in the frontend's /tld view.
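
The one-hour caching can be illustrated with a small in-process TTL cache. This is a stand-in sketch: the post does not say where the cached result lives, and `TTLCache` with its injectable clock is a hypothetical helper.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Cache computed values for a fixed time-to-live."""

    def __init__(
        self,
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, Any]] = {}

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh
        value = compute()  # the expensive aggregation runs only on a miss
        self._entries[key] = (now, value)
        return value
```

With this shape, the expensive aggregation is wrapped in a zero-argument callable and runs at most once per hour per key.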

14. Results

One benchmark run against a real-world corpus:

Metric	Value
Credentials	314,717
Cookies	14,900,000
Autofill records	28,409
Credit card records	1,204
Parser throughput	10.5 MB/s
Messages indexed	234,757
Message indexing rate	1,958.7 msg/s
IOCs extracted	277,965
IOC type breakdown	68.1% domains, 18.4% URLs, 10.4% IPs, 2.2% wallets, 0.8% CVEs

The grammar router correctly identified the family for 85% of archives. The remaining 15% fell through to GenericGrammar, which still extracted credentials from most of them, just with lower field completeness.

What's Next

The parser backlog has a few open items: email extraction (deferred because regexes produce too many false positives without a verification step), browser history parsing (the files exist in most archives but are not yet structured), and MLflow integration for tracking grammar performance across corpus versions.

On the collection side, I want to add reaction-weighted scrape priority: channels where file posts get heavy reactions (indicating active buyers) should be scraped more aggressively than quiet channels.

The frontend's /explorer view is partially built; it can display sessions and credentials but does not yet surface the HDBSCAN cluster view or the actor-pivot graph inline. That is the next frontend sprint.

If you are working on something in the same space (authorized Telegram CTI, stealer-log parsing, or threat actor graph analysis), I am happy to talk through the grammar design or the Kafka topology in more detail.

References & Further Reading

Telethon documentation: MTProto client for Python
OWASP ZIP Slip Vulnerability: archive traversal attack surface
IntelOwl: open-source threat intelligence orchestration
Bianco, D. The Pyramid of Pain (referenced in CTI Foundations)
CyNER: A Python Library for Cybersecurity Named Entity Recognition, Alam et al., 2022
HDBSCAN: Density-Based Clustering, Campello et al., 2013
Grammar-Based Stealer Log Parsing: previous post on the parsing approach that Golliath's grammar system extends

Authorized research only. Everything described here was conducted in a controlled environment against data I had lawful authorization to analyze. Golliath does not facilitate unauthorized access, credential use, or victim contact of any kind.

1. Why Telegram? Why Now?

2. Golliath at a Glance

Golliath is not a single application. It is six loosely-coupled subsystems that share storage and communicate through Kafka and HTTP:

Subsystem	Language	Role
data_lake	Go + Python	Collection, scraping, file download, event streaming, indexing
parser_logs	Python	Archive parsing pipeline: stealer family grammars, field extractors, NLP enrichment
intelligence-analysis	Python	ML benchmarking, IOC extraction, threat classification, clustering
frontend	Next.js 14	Dashboards, search, graph explorer, source management
infra	Docker Compose	Kafka KRaft, OpenSearch, MinIO, Neo4j, Redis, PostgreSQL, Prometheus, MLflow
sdk	Go	Shared HTTP client, job types, retry logic

Golliath Platform Architecture

Click any service to inspect its role, implementation, and data outputs.

Source

Collection

Transport

Processing

Storage

3. Source Coverage

Before building any parser, you need data, and that means maintaining a curated source registry. Golliath currently monitors 643 Telegram sources across ten categories:

Category	Sources	Description
logs	325	Stealer-log channels, archive dumps, credential shares
chatter	108	General cybercrime discussion, threat actor communication
ULP	55	URL:login:password combo lists
combos	35	Pre-formatted credential stuffing lists
actors	33	Known threat actor personal channels
leaks	26	Database breach announcements and dumps
marketplaces	26	Credential sale listings
infrastructure	17	C2, bulletproof hosting, proxy sale channels
exploit-sharing	13	PoC and weaponized exploit distribution
malware	5	Malware distribution and update channels

4. Scraping at Scale

4.1 The Session Pool

Scraping a channel means calling iter_messages with a checkpoint cursor:

async for message in client.iter_messages(
    entity,
    min_id=checkpoint_id,
    reverse=True,
    limit=3000,
):
    await publish_to_worker(message)

4.2 Job Dispatch with SKIP LOCKED

The download-worker claims scrape jobs from PostgreSQL using a pattern that avoids distributed locks:

SELECT id, source_id, channel_identifier
FROM scrape_jobs
WHERE status = 'pending'
  AND scheduled_at <= NOW()
ORDER BY priority DESC, scheduled_at ASC
FOR UPDATE SKIP LOCKED
LIMIT 1;

4.3 Kafka and File Priority Tiers

Once a message is fetched, the download-worker publishes it to the messages.discovered Kafka topic. Files attached to that message get routed to one of two priority topics:

files.priority: text files (credential dumps, plain-text logs)
files.archives: ZIP, RAR, 7z, TAR

4.4 Hunting Invite Links

var reInviteLink = regexp.MustCompile(
    `(?i)https?://(?:t\.me|telegram\.me|telegram\.dog)/(?:\+|joinchat/)([A-Za-z0-9_\-]{10,})`)

5. Message Indexing and LLM-Assisted Channel Categorization

5.1 Real-Time Indexing

5.2 Sampling and LLM Pre-Labeling

Manually reviewing hundreds of channel messages to decide a source's correct category does not scale. Instead, Golliath uses a corpus-first approach:

Sample a random subset of messages from each unclassified source (typically 20-50 messages)
Pass the sample to Gemini 2.5 Flash for pre-labeling: the model extracts structured fields (URL, USERNAME, PASSWORD, DOMAIN, SUBDOMAIN) from each line
Human analyst reviews the labels in Doccano, correcting errors
The labeled dataset trains a Gradient Boosting Classifier (85/15 train/test split, confidence threshold 0.70) used to auto-categorize future sources
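
Two pieces of this workflow are easy to show in isolation: drawing the random sample and applying the 0.70 confidence gate to the classifier's output. The sketch below assumes the classifier returns a category-to-probability mapping and that low-confidence sources fall back to manual review; neither detail is stated in the post, and both function names are hypothetical.

```python
import random
from typing import Optional, Sequence

CONFIDENCE_THRESHOLD = 0.70  # from the post
SAMPLE_SIZE = 50             # upper end of the 20-50 range in the post


def sample_messages(
    messages: Sequence[str], k: int = SAMPLE_SIZE, seed: Optional[int] = None
) -> list[str]:
    """Draw up to k messages uniformly at random, without replacement."""
    if len(messages) <= k:
        return list(messages)
    return random.Random(seed).sample(list(messages), k)


def pick_category(
    probabilities: dict[str, float], threshold: float = CONFIDENCE_THRESHOLD
) -> Optional[str]:
    """Return the top category if it clears the threshold, else None.

    None means "leave the source for a human analyst" (an assumption).
    """
    if not probabilities:
        return None
    label, score = max(probabilities.items(), key=lambda kv: kv[1])
    return label if score >= threshold else None
```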

6. The Archive Problem

Each stealer family has evolved its own layout convention. A Lumma archive looks like this:

LummaC2_2024_01_15_SESSION_12345/
├── domain_detect.txt          <- family fingerprint
├── Passwords/
│   ├── Chrome_Default.txt
│   └── Firefox.txt
├── Cookies/
│   ├── Chrome_Default.txt
│   └── network.txt
├── Autofill/
│   └── Chrome_Default.txt
└── System Info.txt

A WhiteSnake archive looks entirely different:

System_HOSTNAME_WIN10PRO/
├── Browsers/
│   ├── Passwords/
│   └── Cookies/
├── Cookies/                   <- root-level cookies too
├── BankCards.txt
└── System.txt

7. The Parser Pipeline

The parser_logs subsystem implements a six-stage pipeline. Archives go in, structured JSON records come out.

parser_logs Processing Pipeline: six stages from raw archive to enriched structured intelligence.

Stage: Walker (Recursive Traversal)

Recursively walk the extracted archive tree. Enforce safety limits before touching anything.

Output: file path + metadata stream

Implementation:

- Extracts ZIP/RAR/7z/TAR with format-specific decompressors
- Archive bomb guard: ratio > 20x above 1 MB aborts
- Entry count cap: > 100,000 entries aborts
- Nested archive depth: max 3 levels
- Zip-slip protection: resolves all symlinks before extraction
- Emits { path, size, mtime } records for classifier

7.1 Safety Before Parsing

ARCHIVE_BOMB_RATIO = 20        # uncompressed:compressed > 20x -> abort
ARCHIVE_BOMB_MIN_SIZE = 1_048_576  # only apply ratio check above 1 MB
MAX_ENTRY_COUNT = 100_000
MAX_NESTED_DEPTH = 3

Zip-slip protection resolves all paths relative to the extraction root and rejects any entry whose resolved path escapes it:

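from pathlib import Path

def _safe_extract_path(archive_root: Path, entry_name: str) -> Path:
    root = archive_root.resolve()
    target = (root / entry_name).resolve()
    # Compare path components, not string prefixes: "/data/out_evil" starts with "/data/out"
    if not target.is_relative_to(root):  # Python 3.9+
        raise ZipSlipError(f"Attempted path traversal: {entry_name}")
    return target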

These are not theoretical risks. Zip-slip payloads appear in the wild corpus.
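
For the bomb and entry-count limits, a pre-extraction check on a ZIP can be sketched with the standard library alone. This is an illustration, not the parser_logs code: `ArchiveRejected` and `check_zip_limits` are hypothetical names, and applying the 1 MB floor to the declared uncompressed total is an assumption.

```python
import zipfile

ARCHIVE_BOMB_RATIO = 20
ARCHIVE_BOMB_MIN_SIZE = 1_048_576
MAX_ENTRY_COUNT = 100_000


class ArchiveRejected(Exception):
    """Hypothetical error for archives that fail the pre-extraction checks."""


def check_zip_limits(zf: zipfile.ZipFile) -> None:
    """Raise ArchiveRejected if the ZIP exceeds the entry or ratio limits."""
    infos = zf.infolist()
    if len(infos) > MAX_ENTRY_COUNT:
        raise ArchiveRejected(f"too many entries: {len(infos)}")

    uncompressed = sum(info.file_size for info in infos)
    compressed = sum(info.compress_size for info in infos)
    # Assumption: the ratio check only applies once the declared
    # uncompressed total exceeds the 1 MB floor.
    if uncompressed > ARCHIVE_BOMB_MIN_SIZE:
        if uncompressed > ARCHIVE_BOMB_RATIO * max(compressed, 1):
            raise ArchiveRejected(
                f"compression ratio too high: {uncompressed} / {compressed}"
            )
```

The sizes read here are the ones declared in the archive headers, which a hostile archive can falsify, so a byte counter during extraction is still needed as a second line of defense.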

7.2 Layout Classification and Grammar Selection

After safe extraction, the layout classifier surveys the directory tree without reading file contents. It looks for family-specific structural signatures:

domain_detect.txt or domaindetect.txt at root: Lumma signal
Network/ containing UserName_* subdirectories: RisePro signal
System_HOSTNAME/ directories at root: WhiteSnake signal
log.txt at root with a Passwords/ sibling: RedLine signal
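
The four signatures above can be checked against nothing more than a list of relative file paths. The sketch below assumes paths are already relative to the session root and use forward slashes; `detect_signals` and the signal labels are hypothetical names, not the classifier's real interface.

```python
from pathlib import PurePosixPath


def detect_signals(paths: list[str]) -> set[str]:
    """Return the family signals present in a list of archive-relative file paths."""
    parts = [PurePosixPath(p).parts for p in paths]
    root_files = {p[0].lower() for p in parts if len(p) == 1}
    root_dirs = {p[0] for p in parts if len(p) > 1}

    signals: set[str] = set()
    if root_files & {"domain_detect.txt", "domaindetect.txt"}:
        signals.add("lumma")
    # Network/UserName_*/<file>: UserName_* must itself be a directory
    if any(
        len(p) >= 3 and p[0] == "Network" and p[1].startswith("UserName_")
        for p in parts
    ):
        signals.add("risepro")
    if any(d.startswith("System_") for d in root_dirs):
        signals.add("whitesnake")
    if "log.txt" in root_files and "Passwords" in root_dirs:
        signals.add("redline")
    return signals
```

No file is opened at any point, which keeps this step cheap and safe to run on untrusted archives.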

These signals are passed to the GrammarRouter, which calls matches() on every registered grammar and picks the highest scorer that reaches the 0.60 threshold:

CONFIDENCE_THRESHOLD = 0.60

class GrammarRouter:
    def __init__(self, grammars: list[FamilyGrammar], generic: FamilyGrammar) -> None:
        # Highest priority first, so max() resolves score ties in priority order
        self._grammars = sorted(grammars, key=lambda g: g.priority, reverse=True)
        # Fallback grammar; passing it in explicitly is an assumption of this listing
        self._generic = generic

    def route(self, layout: ArchiveLayout) -> FamilyGrammar:
        scores = [(g, g.matches(layout)) for g in self._grammars]
        best_grammar, best_score = max(
            scores, key=lambda x: x[1], default=(self._generic, 0.0)
        )
        if best_score >= CONFIDENCE_THRESHOLD:
            return best_grammar
        return self._generic  # unconditional fallback

Grammar Router: Confidence Competition

Each FamilyGrammar scores the layout evidence, and the highest score above 0.60 wins routing. The WhiteSnakeGrammar entry shows what a family signature looks like:

Confidence: 0.95 (above the 0.60 threshold, wins routing)
Primary trigger: System_* + Cookies/
Example: WhiteSnake Stealer aggregator repacks
Evidence scored by matches(): System_<HOSTNAME>/ directories in root; Cookies/ directory at root level; BankCard*.txt files present
Fields extracted: passwords (colon-separated), autofill, credit cards

8. Family Grammars

Every stealer family gets its own FamilyGrammar subclass. The abstract base defines a simple contract:

FamilyGrammar ABC (Python)
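
A contract of this kind might look like the sketch below. Only the names FamilyGrammar, ArchiveLayout, matches() and priority appear in the post; the fields of ArchiveLayout, the parse() method, and the WhiteSnakeSketch subclass are assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ArchiveLayout:
    """Hypothetical layout summary: names seen at the archive root."""
    root_files: set[str] = field(default_factory=set)
    root_dirs: set[str] = field(default_factory=set)


class FamilyGrammar(ABC):
    name: str = "base"
    priority: int = 0  # the router sorts grammars by this value

    @abstractmethod
    def matches(self, layout: ArchiveLayout) -> float:
        """Return a confidence in [0.0, 1.0] that this family produced the archive."""

    @abstractmethod
    def parse(self, root: str) -> list[dict]:
        """Hypothetical: turn an extracted archive into structured records."""


class WhiteSnakeSketch(FamilyGrammar):
    name = "whitesnake"
    priority = 10  # arbitrary value for the example

    def matches(self, layout: ArchiveLayout) -> float:
        has_system = any(d.startswith("System_") for d in layout.root_dirs)
        has_cookies = "Cookies" in layout.root_dirs
        if has_system and has_cookies:
            return 0.95  # the score the post quotes when both signals co-occur
        return 0.0  # partial-match scores are not given in the post

    def parse(self, root: str) -> list[dict]:
        return []  # extraction is out of scope for this sketch
```

Keeping matches() free of file I/O lets the router score every grammar against the same layout object before any parsing starts.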

8.1 LummaGrammar

Lumma is the dominant stealer-as-a-service family in the current corpus. Its archives are identified by domain_detect.txt, a list of domains that the stealer was configured to target, one per line.

LummaGrammar.matches() (Python)

8.2 RiseProGrammar

RiseProGrammar.matches() (Python)

8.3 WhiteSnakeGrammar

WhiteSnake is the highest-confidence detection at 0.95 when both signals co-occur: System_HOSTNAME/ directories (one per victim session in aggregator repacks) and a root-level Cookies/ directory.

8.4 GenericGrammar (The Fallback)

The fallback handles roughly 15% of the corpus in practice, mostly aggregator repacks that strip or rename the family-specific files before redistribution.

9. Field Extractors

9.1 Password Format Cascade

Passwords are parsed in priority order, stopping at the first format that yields results:

1. Key:value blocks (Lumma, some RedLine variants):

URL: https://example.com/login
UserName: alice@corp.com
Password: hunter2

URL: https://bank.example/auth
...

2. TSV format (RisePro):

https://mail.google.com	alice@gmail.com	••••••••
https://github.com	alice	gh_token_abc123

3. Pipe-separated (RedLine v22+):

https://twitter.com/|alice|P@ssw0rd!

4. Colon-separated (WhiteSnake, generic):

https://corp.okta.com:alice@corp.com:S3cur3P@ss
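
The cascade over these four formats can be sketched as an ordered list of parsers where the first one that yields records wins. This is a simplified illustration, not the extractor itself: the function names and the Credential type are hypothetical, and the colon parser assumes the URL carries a scheme so it can split from the right.

```python
import re
from typing import Callable, NamedTuple


class Credential(NamedTuple):
    url: str
    username: str
    password: str


_KV_BLOCK = re.compile(
    r"^URL:[ \t]*(?P<url>\S.*)\n"
    r"UserName:[ \t]*(?P<user>.*)\n"
    r"Password:[ \t]*(?P<pw>.*)$",
    re.MULTILINE | re.IGNORECASE,
)


def parse_key_value(text: str) -> list[Credential]:
    return [
        Credential(m["url"].strip(), m["user"].strip(), m["pw"].strip())
        for m in _KV_BLOCK.finditer(text)
    ]


def parse_delimited(text: str, sep: str) -> list[Credential]:
    records = []
    for line in text.splitlines():
        # maxsplit=2 keeps any further separators inside the password
        parts = line.strip().split(sep, 2)
        if len(parts) == 3 and all(parts):
            records.append(Credential(*parts))
    return records


def parse_colon(text: str) -> list[Credential]:
    records = []
    for line in text.splitlines():
        # Split from the right so the "://" in the URL is left alone
        parts = line.strip().rsplit(":", 2)
        if len(parts) == 3 and all(parts) and "://" in parts[0]:
            records.append(Credential(*parts))
    return records


PARSERS: list[tuple[str, Callable[[str], list[Credential]]]] = [
    ("key_value", parse_key_value),
    ("tsv", lambda t: parse_delimited(t, "\t")),
    ("pipe", lambda t: parse_delimited(t, "|")),
    ("colon", parse_colon),
]


def parse_passwords(text: str) -> tuple[str, list[Credential]]:
    """Try each format in priority order; stop at the first that yields records."""
    text = text.replace("\r\n", "\n")
    for name, parser in PARSERS:
        records = parser(text)
        if records:
            return name, records
    return "unknown", []
```

The colon format is the most ambiguous of the four (a password containing ":" will be split in the wrong place), which is one reason to try it last.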

9.2 Cookie Parsing

NETSCAPE_COLUMNS = 7
HTTPONLY_PREFIX = "#HttpOnly_"  # Netscape/curl marker for HttpOnly cookies

def parse_netscape_cookies(text: str) -> list[CookieRecord]:
    records = []
    for line in text.splitlines():
        # "#HttpOnly_" lines are real cookies, not comments: strip the marker, keep the flag
        http_only = line.startswith(HTTPONLY_PREFIX)
        if http_only:
            line = line[len(HTTPONLY_PREFIX):]
        elif line.startswith("#") or not line.strip():
            continue
        cols = line.split("\t")
        if len(cols) != NETSCAPE_COLUMNS:
            continue  # log and skip, do not attempt partial parsing
        domain, _include_subdomains, path, secure, expiry, name, value = cols
        records.append(CookieRecord(
            domain=domain.lstrip("."),
            path=path,
            name=name,
            value=value,
            secure=secure.upper() == "TRUE",
            http_only=http_only,
            expires=int(expiry) if expiry.isdigit() else None,
        ))
    return records

The 14.9 million cookies in the benchmark corpus made this the highest-volume extractor by a wide margin. Session cookies are the primary target: a session cookie that is still valid lets whoever holds it resume an authenticated session without going through login or MFA again.

9.3 System Info and the Alias Map

The system info extractor maintains an alias map:

FIELD_ALIASES: dict[str, list[str]] = {
    "hostname": ["computer name", "hostname", "pc name", "computername"],
    "ip": ["ip", "ip address", "external ip", "wan ip"],
    "os": ["os", "operating system", "windows version"],
    "hardware": ["hardware", "cpu", "gpu", "ram", "processor"],
    "country": ["country", "location", "geo"],
    "hwid": ["hwid", "machine id", "hardware id", "uuid"],
}

Normalization means downstream queries (WHERE system_info.country = 'US') work regardless of which stealer family produced the record.
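
Applying the alias map comes down to inverting it once and looking up each key. The sketch below assumes system info files are "Key: Value" lines and that the first value seen for a canonical field wins; `normalize_system_info` is a hypothetical name.

```python
FIELD_ALIASES: dict[str, list[str]] = {
    "hostname": ["computer name", "hostname", "pc name", "computername"],
    "ip": ["ip", "ip address", "external ip", "wan ip"],
    "os": ["os", "operating system", "windows version"],
    "hardware": ["hardware", "cpu", "gpu", "ram", "processor"],
    "country": ["country", "location", "geo"],
    "hwid": ["hwid", "machine id", "hardware id", "uuid"],
}

# Reverse index: alias -> canonical field name
_ALIAS_TO_FIELD = {
    alias: canonical
    for canonical, aliases in FIELD_ALIASES.items()
    for alias in aliases
}


def normalize_system_info(text: str) -> dict[str, str]:
    """Map 'Key: Value' lines onto canonical field names, ignoring unknown keys."""
    record: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # first colon only, so IPv6 values survive
        if not sep:
            continue
        canonical = _ALIAS_TO_FIELD.get(key.strip().lower())
        value = value.strip()
        if canonical and value:
            record.setdefault(canonical, value)  # first occurrence wins (assumption)
    return record
```

Because several aliases (cpu, gpu, ram) collapse into hardware, a first-wins policy drops the later ones; collecting them into a list would be the obvious alternative.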

10. NLP Enrichment

After extraction, every session passes through the NLP enrichment pipeline. Two things happen here: IOC extraction and named-entity recognition on the free-text fields.

10.1 IOC Extraction and IntelOwl Integration

IOC extraction uses compiled regex patterns with domain TLD validation:

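import re

IOC_PATTERNS = {
    "domain":  re.compile(r"\b(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}\b"),
    "ipv4":    re.compile(r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b"),
    "url":     re.compile(r"https?://[^\s\"'<>]+"),
    # Non-capturing group so findall() returns the whole address, not just its prefix;
    # the trailing \b stops the 0x branch from matching the start of a longer hex string
    "wallet":  re.compile(r"\b(?:bc1|[13])[a-zA-HJ-NP-Z0-9]{25,62}\b|\b0x[a-fA-F0-9]{40}\b"),
    "cve":     re.compile(r"CVE-\d{4}-\d{4,7}", re.IGNORECASE),
}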
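
The domain pattern on its own also matches filenames such as passwords.txt, which is where TLD validation comes in. A minimal sketch, assuming validation means checking the last label against a known-TLD set; `KNOWN_TLDS` here is a tiny illustrative subset, and a real deployment would load the full IANA list:

```python
import re

DOMAIN_RE = re.compile(r"\b(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}\b")

# Illustrative subset only; not the list Golliath uses.
KNOWN_TLDS = frozenset({"com", "net", "org", "io", "ru", "de", "uk"})


def extract_domains(text: str) -> list[str]:
    """Return unique, lowercased domain candidates whose TLD is recognized."""
    found: list[str] = []
    for match in DOMAIN_RE.finditer(text):
        domain = match.group(0).lower()
        tld = domain.rsplit(".", 1)[-1]
        if tld in KNOWN_TLDS and domain not in found:
            found.append(domain)
    return found
```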

10.2 Named Entity Recognition

The enrichment pipeline supports three NER configurations:

Model	F1	Speed	Best for
DistilBERT-NER	0.81	Fast	Default; English stealer logs
CyNER	0.74	Fast	CTI-specific entity types (TTP, malware)
XLM-RoBERTa	0.76	Slow	Non-English archives (Russian, Turkish)

10.3 Session Deduplication via HDBSCAN

Aggregators routinely repackage the same victim sessions under different branding. A session that appeared in three separate archive releases would count three times without deduplication.

Golliath Benchmark Results

Numbers from a single pipeline run on a real-world corpus.

Metric	Value	Notes
Credentials	314,717	URL + username + password
Cookies	14.9 M	Netscape-format records
Autofill records	28,409	Form field name/value pairs
Credit cards	1,204	
Throughput	10.5 MB/s	End-to-end archive parsing
Msg indexing	1,958.7 msg/s	IOC extraction baseline
IOCs extracted	277,965	From 234,757 messages
Family match rate	85%	
Generic fallback	15%	
Session dedup	~12% reduction	

Families covered: Lumma C2, RisePro, WhiteSnake, RedLine, plus 8 aggregator brands handled by GenericGrammar.

11. The Aggregator Problem

URL: https://mail.google.com
UserName: alice@gmail.com
Password: JOIN @BESTLOGS FOR MORE  <- channel promotion, not a credential
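
One way to catch injected promotion like this is a heuristic on the password field itself. The sketch below is an assumption about how such a filter could work, not Golliath's actual rule set: it flags values that contain a t.me link, or that have several whitespace-separated tokens including a standalone @handle.

```python
import re

_TME_LINK = re.compile(r"(?:https?://)?t\.me/\S+", re.IGNORECASE)
_HANDLE_TOKEN = re.compile(r"(?:^|\s)@[A-Za-z][A-Za-z0-9_]{3,}(?=\s|$)")


def looks_like_promo(value: str) -> bool:
    """Heuristic: does this 'password' read like channel promotion?"""
    if _TME_LINK.search(value):
        return True
    # A lone "@dmin123" is a plausible password; a handle inside a phrase is not.
    return len(value.split()) >= 2 and bool(_HANDLE_TOKEN.search(value))
```

Flagged records are better quarantined for review than silently dropped, since any heuristic of this kind will occasionally hit a real password.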

12. Intelligence at the Top

With parsed data in OpenSearch and the relationship graph in Neo4j, meaningful intelligence queries become fast.

12.1 Domain Exposure Reports

The most common operational query: "Is domain X in the credential set?" Two-pass OpenSearch aggregation:

First pass: count sessions containing credentials for the target domain
Second pass: enumerate unique username/password pairs, deduplicated by credential hash

Results feed directly into the /hunt API endpoint and surface in the frontend's domain search.
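
The two passes can be sketched as a pair of OpenSearch request bodies. The aggregation types (cardinality, terms, top_hits) are standard query DSL; every field name below (domain, session_id, credential_hash, username, password) is a hypothetical stand-in for whatever the real index mapping uses.

```python
from typing import Any


def session_count_query(domain: str) -> dict[str, Any]:
    """First pass: how many distinct sessions hold credentials for the domain."""
    return {
        "size": 0,
        "query": {"term": {"domain": domain}},
        "aggs": {"sessions": {"cardinality": {"field": "session_id"}}},
    }


def unique_credentials_query(domain: str, limit: int = 1000) -> dict[str, Any]:
    """Second pass: one representative document per credential hash."""
    return {
        "size": 0,
        "query": {"term": {"domain": domain}},
        "aggs": {
            "by_hash": {
                "terms": {"field": "credential_hash", "size": limit},
                "aggs": {
                    "sample": {
                        "top_hits": {"size": 1, "_source": ["username", "password"]}
                    }
                },
            }
        },
    }
```

The cardinality aggregation returns an approximate count, which is usually acceptable for an exposure report; the second pass is where exact, deduplicated pairs come from.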

12.2 Channel Topology via Neo4j

The graph model captures how threat actor infrastructure connects:

MATCH (a:Source)-[:FORWARDED_FROM_SOURCE]->(b:Source)
WHERE b.identifier = '@target_channel'
RETURN a.title, a.member_count, a.created_at
ORDER BY a.member_count DESC

This surfaces every channel that has forwarded content from a target channel, revealing distribution networks and shared admin infrastructure that are invisible from message-level analysis alone.

12.3 Actor Pivoting

MATCH (u:TelegramUser)-[:ACTIVE_IN]->(s:Source)
WHERE s.identifier IN ['@chan_a', '@chan_b']
WITH u, collect(s.identifier) AS channels
WHERE size(channels) > 1
RETURN u.username, channels

A user active in multiple monitored channels is a pivot point, potentially an operator, reseller, or admin cross-posting content.

13. TLD Analytics and Geographic Density

Credential Density by Country

315M+ credentials across 60 TLDs. Country-level aggregation via ccTLD mapping.

Map legend: 7.1M (log scale, ccTLD only).
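
Country-level aggregation via ccTLD mapping can be sketched as a counter over credential URLs. The mapping table below is a small illustrative subset, and skipping generic TLDs such as .com is an assumption based on the "ccTLD only" note; `country_density` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable
from urllib.parse import urlsplit

# Illustrative subset: ccTLD -> ISO 3166-1 alpha-2 country code
CCTLD_TO_COUNTRY = {
    "de": "DE",
    "br": "BR",
    "fr": "FR",
    "in": "IN",
    "tr": "TR",
    "ru": "RU",
    "uk": "GB",  # the ccTLD and the ISO code differ here
}


def country_density(urls: Iterable[str]) -> Counter:
    """Count credentials per country, using only the ccTLD of each URL's host."""
    counts: Counter = Counter()
    for url in urls:
        host = urlsplit(url).hostname  # None when the URL has no scheme
        if not host:
            continue
        tld = host.rsplit(".", 1)[-1]
        country = CCTLD_TO_COUNTRY.get(tld)
        if country:
            counts[country] += 1
    return counts
```

This measures where the affected services are registered, not where victims live, so it is best read as a rough proxy.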

14. Results

One benchmark run against a real-world corpus:

Metric	Value
Credentials	314,717
Cookies	14,900,000
Autofill records	28,409
Credit card records	1,204
Parser throughput	10.5 MB/s
Messages indexed	234,757
Message indexing rate	1,958.7 msg/s
IOCs extracted	277,965
IOC type breakdown	68.1% domains, 18.4% URLs, 10.4% IPs, 2.2% wallets, 0.8% CVEs

What's Next

References & Further Reading

Telethon documentation: MTProto client for Python
OWASP ZIP Slip Vulnerability: archive traversal attack surface
IntelOwl: open-source threat intelligence orchestration
Bianco, D. The Pyramid of Pain (referenced in CTI Foundations)
CyNER: Cybersecurity Named Entity Recognition, Ranade et al., 2021
HDBSCAN: Density-Based Clustering, Campello et al., 2013
Grammar-Based Stealer Log Parsing: previous post on the parsing approach that Golliath's grammar system extends
