Authorized research only. Everything described here was conducted in a controlled environment against data I had lawful authorization to analyze. Golliath does not facilitate unauthorized access, credential use, or victim contact of any kind.
Every day, hundreds of Telegram channels publish stealer logs: archives packed with credentials, cookies, and system fingerprints from compromised machines. Most teams consume this data passively, via commercial feeds, and by the time it reaches them the intelligence is already stale. I wanted to get closer to the source: build something that scraped channels in real time, parsed the archive formats properly, and gave me a structured, searchable view of what was being shared.
That project became Golliath, a six-subsystem platform written in Go and Python, backed by Kafka, OpenSearch, Neo4j, MinIO, and PostgreSQL, with a Next.js frontend for exploration. This post is a deep dive into how it works, why it's built the way it is, and what the numbers look like after running it against a real-world corpus.
1. Why Telegram? Why Now?
Telegram has become the dominant distribution channel for cybercrime activity and stealer-log marketplaces. The reasons are structural: channels support large file attachments (up to 4 GB), messages are persistent and indexed by Telegram's own servers, and invite-link-based access creates a discoverable social graph of threat actor infrastructure.
Unlike paste sites or dark-web forums, Telegram does not require Tor and responds well to programmatic access via MTProto. The channel topology itself is intelligence: who forwards whom, which channels share an admin, how a new log pack propagates across the ecosystem within hours of publication.
The challenge is volume. A single active channel might post dozens of archives per day, and a monitoring operation watching 50+ channels accumulates data faster than any manual triage workflow can handle. The only way to turn that volume into signal is automation, and automation requires understanding the archive formats.
2. Golliath at a Glance
Golliath is not a single application. It is six loosely-coupled subsystems that share storage and communicate through Kafka and HTTP:
| Subsystem | Language | Role |
|---|---|---|
| data_lake | Go + Python | Collection, scraping, file download, event streaming, indexing |
| parser_logs | Python | Archive parsing pipeline: stealer family grammars, field extractors, NLP enrichment |
| intelligence-analysis | Python | ML benchmarking, IOC extraction, threat classification, clustering |
| frontend | Next.js 14 | Dashboards, search, graph explorer, source management |
| infra | Docker Compose | Kafka KRaft, OpenSearch, MinIO, Neo4j, Redis, PostgreSQL, Prometheus, MLflow |
| sdk | Go | Shared HTTP client, job types, retry logic |
The collection subsystem (data_lake) is the entry point. It scrapes channels, downloads files, and publishes events. Everything downstream (parser, analysis, frontend) consumes from shared storage and search indices. This separation means I can run the parser against the same corpus multiple times as the grammar logic evolves, without re-scraping anything.
3. Source Coverage
Before building any parser, you need data, and that means maintaining a curated source registry. Golliath currently monitors 643 Telegram sources across ten categories:
| Category | Sources | Description |
|---|---|---|
| logs | 325 | Stealer-log channels, archive dumps, credential shares |
| chatter | 108 | General cybercrime discussion, threat actor communication |
| ULP | 55 | URL:login:password combo lists |
| combos | 35 | Pre-formatted credential stuffing lists |
| actors | 33 | Known threat actor personal channels |
| leaks | 26 | Database breach announcements and dumps |
| marketplaces | 26 | Credential sale listings |
| infrastructure | 17 | C2, bulletproof hosting, proxy sale channels |
| exploit-sharing | 13 | PoC and weaponized exploit distribution |
| malware | 5 | Malware distribution and update channels |
Regional breakdown: the majority are tagged global (502), with significant Russian-language coverage (RU: 44), Middle East and North Africa (MENA: 18), and Southeast Asia (SEA: 6). Each source carries a scrape priority (1-10), a configurable interval, and per-source file-download flags controlling whether audio, documents, images, or videos are fetched.
4. Scraping at Scale
4.1 The Session Pool
Telegram's MTProto API does not expose a simple REST endpoint; you authenticate as a user account (or bot) and maintain a session. The session-manager service (Python/Telethon) maintains a pool of authenticated sessions and exposes a small HTTP API that the Go services call.
Scraping a channel means calling iter_messages with a checkpoint cursor:
```python
async for message in client.iter_messages(
    entity,
    min_id=checkpoint_id,
    reverse=True,
    limit=3000,
):
    await publish_to_worker(message)
```
reverse=True walks forward from the checkpoint, so resuming a scrape after a restart does not re-process old messages. The checkpoint is stored in PostgreSQL and updated atomically after each batch commits to Kafka.
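The checkpoint discipline can be illustrated without Telegram at all. The sketch below is a toy model rather than Golliath's code: `fetch_after`, `publish_batch`, and the in-memory `store` are hypothetical stand-ins for `iter_messages`, the Kafka producer, and the PostgreSQL checkpoint row. The point it shows is the ordering: the checkpoint only advances after a batch has been published, so a crash mid-batch replays that batch instead of losing it.

```python
from typing import Callable, Iterable


def scrape_from_checkpoint(
    fetch_after: Callable[[int], Iterable[int]],
    publish_batch: Callable[[list[int]], None],
    store: dict[str, int],
    batch_size: int = 100,
) -> int:
    """Walk message IDs forward from the stored checkpoint.

    The checkpoint is advanced only after publish_batch returns, so a
    failure leaves it pointing at the last fully published batch.
    """
    published = 0
    batch: list[int] = []
    for msg_id in fetch_after(store.get("checkpoint_id", 0)):
        batch.append(msg_id)
        if len(batch) == batch_size:
            publish_batch(batch)
            store["checkpoint_id"] = batch[-1]
            published += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        publish_batch(batch)
        store["checkpoint_id"] = batch[-1]
        published += len(batch)
    return published
```

Because a batch can be replayed after a failure, this gives at-least-once delivery, and downstream consumers should tolerate duplicate message IDs.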
File downloads use the same session pool. Telethon streams file data in chunks; the session-manager relays those chunks to the download-worker over HTTP. I settled on 512 KB chunks after testing: larger chunks caused session timeouts on slow connections; smaller chunks introduced too much HTTP overhead per file.
4.2 Job Dispatch with SKIP LOCKED
The download-worker claims scrape jobs from PostgreSQL using a pattern that avoids distributed locks:
```sql
SELECT id, source_id, channel_identifier
FROM scrape_jobs
WHERE status = 'pending'
  AND scheduled_at <= NOW()
ORDER BY priority DESC, scheduled_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED;
```
SKIP LOCKED means multiple workers can race for jobs without blocking each other. Jobs that are locked by another worker are transparently skipped. This gives horizontal scalability for free: spin up more download-worker replicas and the queue drains faster with zero coordination overhead.
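The row lock only lasts until the transaction ends, so a worker has to mark the job as taken before committing. A minimal sketch of one way to do that in a single statement follows. It assumes `conn` is a DB-API 2.0 connection to PostgreSQL (psycopg2, for example), and the `'running'` status value and the `claim_next_job` name are hypothetical; only the table and column names come from the query above.

```python
CLAIM_SQL = """
UPDATE scrape_jobs
SET status = 'running'              -- hypothetical status value
WHERE id = (
    SELECT id
    FROM scrape_jobs
    WHERE status = 'pending'
      AND scheduled_at <= NOW()
    ORDER BY priority DESC, scheduled_at ASC
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, source_id, channel_identifier;
"""


def claim_next_job(conn):
    """Claim one pending job, or return None when the queue is empty.

    The inner SELECT locks a single row that no other worker holds; the
    outer UPDATE flips its status in the same transaction, so the job is
    no longer 'pending' by the time the lock is released at commit.
    """
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL)
        row = cur.fetchone()
        conn.commit()
        return row
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

A worker loop would call `claim_next_job(conn)` and sleep briefly whenever it returns `None`.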
4.3 Kafka and File Priority Tiers
Once a message is fetched, the download-worker publishes it to the messages.discovered Kafka topic. Files attached to that message get routed to one of two priority topics:
- files.priority: text files (credential dumps, plain-text logs)
- files.archives: ZIP, RAR, 7z, TAR
The file-worker consumes these topics in priority order. Text files land in MinIO's txt-dumps/ bucket for immediate parsing; archives go to archives/; executables and unknown formats are quarantined. SHA-256 is computed at download time and stored in PostgreSQL; duplicate files are detected before they are written, keeping the bucket clean.
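The routing and deduplication steps can be sketched in a few lines. This is an illustration under assumptions, not the file-worker itself: the extension sets, the `"quarantine"` label, and the in-memory `seen` set (standing in for the PostgreSQL hash lookup) are all hypothetical. Only the two topic names come from the text.

```python
import hashlib
from pathlib import PurePosixPath
from typing import BinaryIO

TEXT_EXTS = {".txt"}                              # assumed text-tier extensions
ARCHIVE_EXTS = {".zip", ".rar", ".7z", ".tar"}    # formats named in the text


def route_file(filename: str) -> str:
    """Pick a destination from the file extension alone."""
    ext = PurePosixPath(filename.lower()).suffix
    if ext in TEXT_EXTS:
        return "files.priority"
    if ext in ARCHIVE_EXTS:
        return "files.archives"
    return "quarantine"  # executables and unknown formats


def sha256_stream(stream: BinaryIO, chunk_size: int = 512 * 1024) -> str:
    """Hash a file incrementally so large archives never sit fully in memory."""
    digest = hashlib.sha256()
    while chunk := stream.read(chunk_size):
        digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(digest: str, seen: set[str]) -> bool:
    """Return True if the digest was seen before; otherwise record it."""
    if digest in seen:
        return True
    seen.add(digest)
    return False
```

Hashing while the chunks arrive means the duplicate check can run before anything is written to the bucket.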
4.4 Hunting Invite Links
Telegram invite links (t.me/+HASH) appear constantly in channel messages: operators cross-post content, advertise related services, and link to downstream distribution channels. The message-indexer extracts every invite link from indexed messages using a regex that covers all common patterns:
```go
var reInviteLink = regexp.MustCompile(
	`(?i)https?://(?:t\.me|telegram\.me|telegram\.dog)/(?:\+|joinchat/)([A-Za-z0-9_\-]{10,})`)
```
Matched hashes are stored in a PostgreSQL invite_links table with their source context. The frontend's /hunt view resolves these hashes via CheckChatInviteRequest, Telegram's RPC for probing private channel metadata without joining. This surfaces channel title, member count, description, and admin signatures from a hash alone.
The result is a ranked expansion surface: a small channel under monitoring often contains invite links to a much larger aggregator or supply channel that was never manually added to the source registry. One invite-link pivot in the right direction can add 10 new monitored sources.
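For readers who want to try the pattern outside Go, here is a Python port of the same regex with order-preserving deduplication. The function name `extract_invite_hashes` is hypothetical, and the sketch covers only extraction; it does not resolve the hashes against Telegram.

```python
import re

# Same pattern as the Go regex above, ported to Python's re module.
INVITE_RE = re.compile(
    r"https?://(?:t\.me|telegram\.me|telegram\.dog)/(?:\+|joinchat/)([A-Za-z0-9_\-]{10,})",
    re.IGNORECASE,
)


def extract_invite_hashes(text: str) -> list[str]:
    """Return unique invite hashes in first-seen order."""
    seen: dict[str, None] = {}  # dict keeps insertion order
    for match in INVITE_RE.finditer(text):
        seen.setdefault(match.group(1), None)
    return list(seen)
```

Public channel links such as `https://t.me/some_channel` are deliberately not matched, because they carry a username rather than an invite hash.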
5. Message Indexing and LLM-Assisted Channel Categorization
5.1 Real-Time Indexing
The message-indexer service (Go) consumes the messages.discovered Kafka topic and writes every message to OpenSearch. At 1,958.7 messages/second, new content from any monitored channel is searchable within seconds of being published.
After 234,757 indexed messages across the monitored source set, the OpenSearch index becomes a live intelligence feed: analysts can query for a specific domain, an IP range, a malware family name, or a channel handle and get ranked results with full message context.
5.2 Sampling and LLM Pre-Labeling
Manually reviewing hundreds of channel messages to decide a source's correct category does not scale. Instead, Golliath uses a corpus-first approach:
- Sample a random subset of messages from each unclassified source (typically 20-50 messages)
- Pass the sample to Gemini 2.5 Flash for pre-labeling: the model extracts structured fields (URL, USERNAME, PASSWORD, DOMAIN, SUBDOMAIN) from each line
- Human analyst reviews the labels in Doccano, correcting errors
- The labeled dataset trains a Gradient Boosting Classifier (85/15 train/test split, confidence threshold 0.70) used to auto-categorize future sources
The prompt instructs Gemini to return a strict JSON array with no markdown, handling all four common credential formats: email:password, url:user:pass, url|user|pass, and variations where passwords contain colons (joined as trailing fields). Labels for the hardest 1,000 cases were fully reviewed before being committed to the training set.
This approach reduced the manual categorization burden by ~80% compared to reading raw channel content. The residual 20% (sources where the classifier confidence falls below threshold or the content is ambiguous) are flagged for human review.
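The threshold step is simple enough to sketch. The example below assumes the classifier's output has already been turned into a label-to-probability mapping (with scikit-learn that would come from `predict_proba`); the `triage` function and its return shape are hypothetical.

```python
CONFIDENCE_THRESHOLD = 0.70  # value quoted in the text


def triage(
    probabilities: dict[str, float],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> tuple[str, bool]:
    """Return (best_label, auto_accept).

    auto_accept is False when the top probability is below the threshold,
    which is the signal to queue the source for human review.
    """
    label, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    return label, confidence >= threshold
```

For instance, `triage({"logs": 0.55, "chatter": 0.45})` still proposes `logs` but marks it for review rather than committing the label.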
6. The Archive Problem
Open any stealer-log archive and you will find a surprisingly consistent internal structure. The stealer client runs on the victim machine, collects credentials from browsers, writes everything to a local directory, and then compresses it before exfiltrating to the C2 server or dropping it directly to a Telegram channel.
Each stealer family has evolved its own layout convention. A Lumma archive looks like this:
```
LummaC2_2024_01_15_SESSION_12345/
├── domain_detect.txt          <- family fingerprint
├── Passwords/
│   ├── Chrome_Default.txt
│   └── Firefox.txt
├── Cookies/
│   ├── Chrome_Default.txt
│   └── network.txt
├── Autofill/
│   └── Chrome_Default.txt
└── System Info.txt
```
A WhiteSnake archive looks entirely different:
```
System_HOSTNAME_WIN10PRO/
├── Browsers/
│   ├── Passwords/
│   └── Cookies/
├── Cookies/                   <- root-level cookies too
├── BankCards.txt
└── System.txt
```
These are not cosmetic differences. A parser that assumes Lumma layout on a WhiteSnake archive will misroute the password files and produce nothing. Before I could write a single extractor, I needed a way to identify the family and select the right parsing logic.
7. The Parser Pipeline
The parser_logs subsystem implements a six-stage pipeline. Archives go in, structured JSON records come out.
The first stage, the walker, recursively walks the extracted archive tree and enforces safety limits before touching anything:
- Extracts ZIP/RAR/7z/TAR with format-specific decompressors
- Archive bomb guard: abort when the expansion ratio exceeds 20x on archives above 1 MB
- Entry count cap: abort above 100,000 entries
- Nested archive depth: max 3 levels
- Zip-slip protection: resolves all symlinks before extraction
- Emits { path, size, mtime } records for the classifier
7.1 Safety Before Parsing
The walker enforces hard limits before extracting a single byte of content. This matters because attackers have started distributing archive bombs (intentionally malformed archives that expand to hundreds of gigabytes) and zip-slip payloads that try to write outside the extraction directory.
```python
ARCHIVE_BOMB_RATIO = 20            # uncompressed:compressed > 20x -> abort
ARCHIVE_BOMB_MIN_SIZE = 1_048_576  # only apply ratio check above 1 MB
MAX_ENTRY_COUNT = 100_000
MAX_NESTED_DEPTH = 3
```
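As an illustration of how limits like these can be checked before extraction, here is a minimal sketch for ZIP files using the standard library. It is not the walker's code: `check_zip_limits` and `UnsafeArchiveError` are hypothetical names, and it assumes the 1 MB floor applies to the declared uncompressed size. The sizes come from the archive's own headers, which a hostile archive can falsify, so a production guard would also count bytes while extracting.

```python
import zipfile

ARCHIVE_BOMB_RATIO = 20
ARCHIVE_BOMB_MIN_SIZE = 1_048_576
MAX_ENTRY_COUNT = 100_000


class UnsafeArchiveError(Exception):
    """Hypothetical error type for archives that trip a safety limit."""


def check_zip_limits(zf: zipfile.ZipFile) -> None:
    """Raise UnsafeArchiveError if the ZIP's declared metadata exceeds a limit."""
    infos = zf.infolist()
    if len(infos) > MAX_ENTRY_COUNT:
        raise UnsafeArchiveError(f"too many entries: {len(infos)}")

    uncompressed = sum(info.file_size for info in infos)
    compressed = max(sum(info.compress_size for info in infos), 1)  # avoid /0

    # Small archives are exempt: tiny text files compress very well.
    if uncompressed > ARCHIVE_BOMB_MIN_SIZE and uncompressed > ARCHIVE_BOMB_RATIO * compressed:
        raise UnsafeArchiveError(
            f"expansion ratio {uncompressed / compressed:.0f}x exceeds {ARCHIVE_BOMB_RATIO}x"
        )
```

The check reads only the central directory, so it runs before a single entry is decompressed.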
Zip-slip protection resolves all paths relative to the extraction root and rejects any entry whose resolved path escapes it:
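```python
from pathlib import Path

class ZipSlipError(Exception):
    """Raised when an archive entry would land outside the extraction root."""

def _safe_extract_path(archive_root: Path, entry_name: str) -> Path:
    root = archive_root.resolve()
    target = (root / entry_name).resolve()
    # Compare path components, not string prefixes: "/tmp/out_evil" starts with "/tmp/out".
    if not target.is_relative_to(root):  # Python 3.9+
        raise ZipSlipError(f"Attempted path traversal: {entry_name}")
    return target
```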
These are not theoretical risks. Zip-slip payloads appear in the wild corpus.
7.2 Layout Classification and Grammar Selection
After safe extraction, the layout classifier surveys the directory tree without reading file contents. It looks for family-specific structural signatures:
- domain_detect.txt or domaindetect.txt at root: Lumma signal
- Network/ containing UserName_* subdirectories: RisePro signal
- System_HOSTNAME/ directories at root: WhiteSnake signal
- log.txt at root with a Passwords/ sibling: RedLine signal
These signals are passed to the GrammarRouter, which calls matches() on every registered grammar and picks the highest scorer above 0.60:
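```python
class GrammarRouter:
    def __init__(self, grammars: list[FamilyGrammar]) -> None:
        # Highest priority first: max() returns the first maximal element,
        # so score ties resolve in favor of the higher-priority grammar.
        self._grammars = sorted(grammars, key=lambda g: g.priority, reverse=True)
        self._generic = GenericGrammar()  # assumed: fallback when nothing clears the threshold

    def route(self, layout: ArchiveLayout) -> FamilyGrammar:
        scores = [(g, g.matches(layout)) for g in self._grammars]
        best_grammar, best_score = max(scores, key=lambda x: x[1])
        if best_score >= CONFIDENCE_THRESHOLD:  # 0.60
            return best_grammar
        return self._generic  # unconditional fallback
```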
For example, the WhiteSnake grammar (System_* + Cookies/) keys on these structural signals:
- System_<HOSTNAME>/ directories in root
- Cookies/ directory at root level
- BankCard*.txt files present
The threshold of 0.60 was chosen empirically: below it, the grammar was wrong more often than the generic fallback would be. Tie-breaking favors more-specific grammars; GenericGrammar always returns 0.0, so it never wins a competition.
8. Family Grammars
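Every stealer family gets its own FamilyGrammar subclass. The abstract base defines a simple contract: a priority used for tie-breaking, and a matches() method that scores an archive layout from its structure alone.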
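A minimal sketch of what such a contract might look like is below. The class names follow the ones used in this post, but everything else is an assumption for illustration: the `ArchiveLayout` fields, the `name` attribute, the priority value, and the simplification that the WhiteSnake grammar returns 0.0 unless both of its signals are present.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ArchiveLayout:
    """Names seen at the archive root; a simplified stand-in for the real survey."""
    root_dirs: frozenset[str] = field(default_factory=frozenset)
    root_files: frozenset[str] = field(default_factory=frozenset)


class FamilyGrammar(ABC):
    name: str = "base"
    priority: int = 0  # higher wins ties in the router

    @abstractmethod
    def matches(self, layout: ArchiveLayout) -> float:
        """Return a confidence score in [0.0, 1.0] from structure alone."""


class GenericGrammar(FamilyGrammar):
    name = "generic"

    def matches(self, layout: ArchiveLayout) -> float:
        return 0.0  # never wins a competition; used only as the fallback


class WhiteSnakeGrammar(FamilyGrammar):
    name = "whitesnake"
    priority = 10  # hypothetical value

    def matches(self, layout: ArchiveLayout) -> float:
        has_system_dir = any(d.startswith("System_") for d in layout.root_dirs)
        has_root_cookies = "Cookies" in layout.root_dirs
        # 0.95 when both signals co-occur; partial matches are ignored here.
        return 0.95 if has_system_dir and has_root_cookies else 0.0
```

Keeping `matches()` free of file reads is what lets the router score every grammar cheaply before any extractor runs.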
8.1 LummaGrammar
Lumma is the dominant stealer-as-a-service family in the current corpus. Its archives are identified by domain_detect.txt, a list of domains that the stealer was configured to target, one per line.
The domain_detect.txt file is also intelligence in itself: it reveals which domains the affiliate configured the stealer to prioritize, often including specific banking, crypto exchange, and corporate SSO domains.
8.2 RiseProGrammar
RisePro organizes each victim session into a named subdirectory under Network/. The directory name encodes a username hash: Network/UserName_a3f9b2/. This layout is distinctive enough that the classifier scores it at ~0.88.
8.3 WhiteSnakeGrammar
WhiteSnake is the highest-confidence detection at 0.95 when both signals co-occur: System_HOSTNAME/ directories (one per victim session in aggregator repacks) and a root-level Cookies/ directory.
8.4 GenericGrammar (The Fallback)
When no family grammar scores above 0.60, GenericGrammar takes over. It runs all four password format parsers in cascade and tries the best-effort cookie parser. Coverage is lower than a matched grammar, but it is still far better than discarding the archive entirely.
The fallback handles roughly 15% of the corpus in practice, mostly aggregator repacks that strip or rename the family-specific files before redistribution.
9. Field Extractors
Once the grammar routes the archive, per-role extractors run against the classified file list. The most complex is the password extractor, which handles four distinct formats across stealer families.
9.1 Password Format Cascade
Passwords are parsed in priority order, stopping at the first format that yields results:
1. Key:value blocks (Lumma, some RedLine variants):

```
URL: https://example.com/login
UserName: alice@corp.com
Password: hunter2
URL: https://bank.example/auth
...
```

2. TSV format (RisePro):

```
https://mail.google.com alice@gmail.com ••••••••
https://github.com alice gh_token_abc123
```

3. Pipe-separated (RedLine v22+):

```
https://twitter.com/|alice|P@ssw0rd!
```

4. Colon-separated (WhiteSnake, generic):

```
https://corp.okta.com:alice@corp.com:S3cur3P@ss
```
Each format parser is implemented as a standalone function that returns list[PasswordRecord] | None. The cascade tries them in order and short-circuits on the first non-empty result. If all four fail, the file is logged as unparseable and added to the backlog.
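The cascade described above can be sketched as follows. This is an illustrative reimplementation under assumptions, not the extractor's code: the `PasswordRecord` fields and all function names are hypothetical, the key:value parser only recognizes the exact labels shown in the example, and the colon parser assumes the URL carries no explicit port.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PasswordRecord:
    url: str
    username: str
    password: str


# Three consecutive lines: URL / UserName / Password.
KV_BLOCK = re.compile(
    r"^URL:[ \t]*(?P<url>.+)\nUserName:[ \t]*(?P<user>.*)\nPassword:[ \t]*(?P<pw>.*)$",
    re.MULTILINE | re.IGNORECASE,
)


def parse_key_value(text: str) -> list[PasswordRecord] | None:
    records = [
        PasswordRecord(m["url"].strip(), m["user"].strip(), m["pw"].strip())
        for m in KV_BLOCK.finditer(text)
    ]
    return records or None


def _parse_delimited(text: str, sep: str) -> list[PasswordRecord] | None:
    records = []
    for line in text.splitlines():
        parts = line.strip().split(sep, 2)  # maxsplit keeps separators inside the password
        if len(parts) == 3 and all(parts[:2]):
            records.append(PasswordRecord(*parts))
    return records or None


def parse_tsv(text: str) -> list[PasswordRecord] | None:
    return _parse_delimited(text, "\t")


def parse_pipe(text: str) -> list[PasswordRecord] | None:
    return _parse_delimited(text, "|")


def parse_colon(text: str) -> list[PasswordRecord] | None:
    records = []
    for line in text.splitlines():
        line = line.strip()
        scheme, sep, rest = line.partition("://")  # set the scheme colon aside
        if not sep:
            rest = line
        parts = rest.split(":", 2)  # trailing colons stay in the password
        if len(parts) == 3 and all(parts[:2]):
            host, user, password = parts
            url = f"{scheme}://{host}" if sep else host
            records.append(PasswordRecord(url, user, password))
    return records or None


PARSERS = (parse_key_value, parse_tsv, parse_pipe, parse_colon)


def parse_passwords(text: str) -> list[PasswordRecord] | None:
    """Try each format in priority order; stop at the first that yields records."""
    text = text.replace("\r\n", "\n")
    for parser in PARSERS:
        records = parser(text)
        if records:
            return records
    return None  # caller logs the file as unparseable
```

Putting the colon parser last matters: colons appear in URLs and in key:value labels, so it is the format most likely to produce false matches if tried early.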
9.2 Cookie Parsing
Cookies follow the Netscape 7-column TSV format that browsers export. The parser enforces strict column counts; the most common corruption in wild archives is lines with missing tab separators, which a lenient parser would silently misparse into wrong fields.
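```python
NETSCAPE_COLUMNS = 7
HTTPONLY_PREFIX = "#HttpOnly_"  # Netscape-format marker for HttpOnly cookies

def parse_netscape_cookies(text: str) -> list[CookieRecord]:
    records = []
    for line in text.splitlines():
        # "#HttpOnly_" lines are real cookies, not comments; strip the marker.
        http_only = line.startswith(HTTPONLY_PREFIX)
        if http_only:
            line = line[len(HTTPONLY_PREFIX):]
        elif line.startswith("#") or not line.strip():
            continue  # comment or blank line
        cols = line.split("\t")
        if len(cols) != NETSCAPE_COLUMNS:
            continue  # log and skip, do not attempt partial parsing
        domain, include_subdomains, path, secure, expiry, name, value = cols
        records.append(CookieRecord(
            domain=domain.lstrip("."),
            path=path,
            name=name,
            value=value,
            secure=secure.upper() == "TRUE",
            http_only=http_only,
            expires=int(expiry) if expiry.isdigit() else None,
        ))
    return records
```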
The 14.9 million cookies in the benchmark corpus made this the highest-volume extractor by a wide margin. Session cookies are the primary target: a valid session cookie lets an attacker resume an already-authenticated session, which bypasses MFA entirely.
9.3 System Info and the Alias Map
Every stealer writes a system fingerprint file, but the field names are not standardized. Lumma calls it System Info.txt with IP: 1.2.3.4 lines; WhiteSnake uses System.txt with Ip = lines; RedLine uses log.txt with IP Address: lines.
The system info extractor maintains an alias map:
```python
FIELD_ALIASES: dict[str, list[str]] = {
    "hostname": ["computer name", "hostname", "pc name", "computername"],
    "ip":       ["ip", "ip address", "external ip", "wan ip"],
    "os":       ["os", "operating system", "windows version"],
    "hardware": ["hardware", "cpu", "gpu", "ram", "processor"],
    "country":  ["country", "location", "geo"],
    "hwid":     ["hwid", "machine id", "hardware id", "uuid"],
}
```
Normalization means downstream queries (WHERE system_info.country = 'US') work regardless of which stealer family produced the record.
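To make the normalization step concrete, here is a small sketch that applies an alias map of this shape to raw system-info text. It is an illustration, not the extractor itself: `normalize_system_info` is a hypothetical name, it accepts both `:` and `=` as separators, and when several lines map to the same canonical field it simply keeps the first.

```python
import re

FIELD_ALIASES: dict[str, list[str]] = {
    "hostname": ["computer name", "hostname", "pc name", "computername"],
    "ip":       ["ip", "ip address", "external ip", "wan ip"],
    "os":       ["os", "operating system", "windows version"],
    "hardware": ["hardware", "cpu", "gpu", "ram", "processor"],
    "country":  ["country", "location", "geo"],
    "hwid":     ["hwid", "machine id", "hardware id", "uuid"],
}

# Invert the map once: "ip address" -> "ip", "pc name" -> "hostname", ...
ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in FIELD_ALIASES.items()
    for alias in aliases
}

# key, then the first ":" or "=", then the value (which may itself contain colons)
LINE_RE = re.compile(r"^\s*([^:=]+?)\s*[:=]\s*(.*?)\s*$")


def normalize_system_info(text: str) -> dict[str, str]:
    """Map family-specific field names onto canonical keys."""
    result: dict[str, str] = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        canonical = ALIAS_TO_CANONICAL.get(match.group(1).lower())
        if canonical and canonical not in result:  # first occurrence wins
            result[canonical] = match.group(2)
    return result
```

With this in place, `IP: 1.2.3.4`, `Ip = 1.2.3.4`, and `IP Address: 1.2.3.4` all end up under the same `ip` key.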
10. NLP Enrichment
After extraction, every session passes through the NLP enrichment pipeline. Two things happen here: IOC extraction and named-entity recognition on the free-text fields.
10.1 IOC Extraction and IntelOwl Integration
IOC extraction uses compiled regex patterns with domain TLD validation:
```python
import re

IOC_PATTERNS = {
    "domain": re.compile(r"\b(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}\b"),
    "ipv4":   re.compile(r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b"),
    "url":    re.compile(r"https?://[^\s\"'<>]+"),
    # Non-capturing group so findall() returns the whole address, not just the prefix.
    "wallet": re.compile(r"\b(?:bc1|[13])[a-zA-HJ-NP-Z0-9]{25,62}\b|\b0x[a-fA-F0-9]{40}\b"),
    "cve":    re.compile(r"CVE-\d{4}-\d{4,7}", re.IGNORECASE),
}
```
The domain extractor runs every match through a TLD allowlist before accepting it. This eliminates the long tail of false positives from free-form text; words like example.log or victim.txt look like domains to a naive regex but are discarded after TLD rejection.
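A minimal sketch of that filter is below. The allowlist here is a tiny illustrative subset chosen for the example; a real deployment would load the full IANA TLD list. The function name `extract_domains` is hypothetical.

```python
import re

DOMAIN_RE = re.compile(r"\b(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}\b")

# Illustrative subset only; not a complete TLD list.
VALID_TLDS = frozenset({"com", "net", "org", "io", "ru", "br", "in", "id"})


def extract_domains(text: str, valid_tlds: frozenset[str] = VALID_TLDS) -> list[str]:
    """Return unique, lowercased domains whose final label is an allowed TLD."""
    found: list[str] = []
    for match in DOMAIN_RE.finditer(text):
        candidate = match.group(0).lower()
        tld = candidate.rsplit(".", 1)[1]
        if tld in valid_tlds and candidate not in found:
            found.append(candidate)
    return found
```

File names such as `victim.txt` still match the regex, but they are dropped at the TLD check because `txt` is not in the allowlist.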
High-priority IOCs (IPs and domains that appear across multiple sessions or match watchlist entries) are optionally enriched via IntelOwl, an open-source threat intelligence orchestration platform that aggregates reputation data from VirusTotal, AbuseIPDB, Shodan, and dozens of other sources. The integration is fire-and-forget: IOCs above a configurable session-frequency threshold are submitted to IntelOwl's REST API, and results are written back to the sessions index as enrichment fields. This turns raw IOC extraction into enriched, context-bearing threat data without blocking the extraction pipeline.
10.2 Named Entity Recognition
The enrichment pipeline supports three NER configurations:
| Model | F1 | Speed | Best for |
|---|---|---|---|
| DistilBERT-NER | 0.81 | Fast | Default; English stealer logs |
| CyNER | 0.74 | Fast | CTI-specific entity types (TTP, malware) |
| XLM-RoBERTa | 0.76 | Slow | Non-English archives (Russian, Turkish) |
CyNER was developed specifically for cybersecurity text and handles entity types that general NER models miss, but its overall F1 on the benchmark (0.74) was below DistilBERT's (0.81). XLM-RoBERTa is the right choice for archives from Eastern European operator groups where the system info and file paths contain Cyrillic text.
10.3 Session Deduplication via HDBSCAN
Aggregators routinely repackage the same victim sessions under different branding. A session that appeared in three separate archive releases would count three times without deduplication.
HDBSCAN clustering on multilingual embeddings catches near-duplicate sessions: if two sessions have the same hostname, IP, and browser profile fingerprint, they cluster tightly regardless of the archive they came from. The benchmark showed ~12% of sessions in the corpus were duplicates by this measure.
11. The Aggregator Problem
A significant fraction of Telegram stealer-log channels are not operators; they are aggregators. They buy log packs from multiple stealer networks, strip identifying metadata, rebrand the archives, and resell them to credential stuffers.
This matters for two reasons. First, the rebranding often removes or renames the family-specific files (like domain_detect.txt), pushing more archives into the generic fallback. Second, aggregators sometimes use the password field itself as advertising space:
```
URL: https://mail.google.com
UserName: alice@gmail.com
Password: JOIN @BESTLOGS FOR MORE     <- channel promotion, not a credential
```
The extractor detects and discards these entries using a heuristic: if the password field matches a @ handle, a t.me/ URL, or common promotional phrases, the record is flagged and excluded from the credential index.
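A heuristic of this kind might look like the sketch below. The three pattern groups mirror the ones named above, but the specific phrases and the `is_promotional` name are assumptions for illustration. The handle pattern requires that `@` is not preceded by a word character, so ordinary passwords like `P@ssw0rd!` are left alone; a password that genuinely starts with `@` would still be a false positive.

```python
import re

PROMO_PATTERNS = (
    # Telegram handle: "@" not glued to a preceding word, 5 to 32 characters.
    re.compile(r"(?<![\w.])@[A-Za-z][A-Za-z0-9_]{4,31}\b"),
    # Telegram link.
    re.compile(r"(?:t\.me|telegram\.me)/", re.IGNORECASE),
    # Hypothetical promotional phrases.
    re.compile(r"\b(?:for more|free logs|join us)\b", re.IGNORECASE),
)


def is_promotional(password: str) -> bool:
    """Return True if the password field looks like channel advertising."""
    return any(pattern.search(password) for pattern in PROMO_PATTERNS)
```

Flagged records are better kept in a side table than silently deleted, since the handle itself identifies the aggregator.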
Known aggregator brands encountered in the corpus: Exodus, AzoriX, MOAB Stealer, PureLog, Shadow Logs, UnknownStealer, Meduza, StealC repacks, and several unnamed ones identified only by watermark patterns.
The watermark removal is a separate post's worth of content. In short: most watermarks are added as directory name prefixes ([EXODUS] SESSION_123/) or injected into System Info.txt headers. Both are stripped during layout classification before the grammar sees them.
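For the directory-prefix case, the stripping step can be as small as the sketch below. It handles only a bracketed prefix of the form shown above; the function name and the choice to return the brand alongside the cleaned name are assumptions.

```python
import re
from typing import Optional

# A leading "[BRAND]" tag followed by optional whitespace.
WATERMARK_PREFIX = re.compile(r"^\s*\[([^\]]+)\]\s*")


def strip_watermark(dirname: str) -> tuple[str, Optional[str]]:
    """Return (clean_name, brand); brand is None when no prefix is present."""
    match = WATERMARK_PREFIX.match(dirname)
    if not match:
        return dirname, None
    return dirname[match.end():], match.group(1).strip()
```

Returning the brand rather than discarding it keeps the aggregator attribution available for later analysis.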
12. Intelligence at the Top
With parsed data in OpenSearch and the relationship graph in Neo4j, meaningful intelligence queries become fast.
12.1 Domain Exposure Reports
The most common operational query: "Is domain X in the credential set?" Two-pass OpenSearch aggregation:
- First pass: count sessions containing credentials for the target domain
- Second pass: enumerate unique username/password pairs, deduplicated by credential hash
Results feed directly into the /hunt API endpoint and surface in the frontend's domain search.
12.2 Channel Topology via Neo4j
The graph model captures how threat actor infrastructure connects:
```cypher
MATCH (a:Source)-[:FORWARDED_FROM_SOURCE]->(b:Source)
WHERE b.identifier = '@target_channel'
RETURN a.title, a.member_count, a.created_at
ORDER BY a.member_count DESC
```
This surfaces every channel that has forwarded content from a target channel, revealing distribution networks and shared admin infrastructure that are invisible from message-level analysis.
12.3 Actor Pivoting
```cypher
MATCH (u:TelegramUser)-[:ACTIVE_IN]->(s:Source)
WHERE s.identifier IN ['@chan_a', '@chan_b']
WITH u, collect(s.identifier) AS channels
WHERE size(channels) > 1
RETURN u.username, channels
```
A user active in multiple monitored channels is a pivot point, potentially an operator, reseller, or admin cross-posting content.
13. TLD Analytics and Geographic Density
315M+ credentials across 60 TLDs reveal a clear geographic picture of who is being targeted. The .com namespace dominates at 148M credentials, unsurprising given that most global services use .com domains. Among country-code TLDs, Brazil leads by a wide margin (7.1M), followed by India (4.8M) and Indonesia (3.6M).
The per-capita view is more revealing. When normalized against population, smaller countries can surface as disproportionately targeted. Countries in Latin America (Peru, Chile) and Southeast Asia (Vietnam, Indonesia) rank higher per-capita than their absolute numbers suggest, consistent with the known geographic distribution of stealer-malware campaigns that favor regions with high smartphone penetration but lower security awareness.
The TLD analytics layer is powered by a full OpenSearch aggregation over the credential URL domain field, grouped by ccTLD suffix. The query is expensive (full index scan across tens of millions of documents) and is cached for one hour. The result feeds the world heatmap and ranked table in the frontend's /tld view.
14. Results
One benchmark run against a real-world corpus:
| Metric | Value |
|---|---|
| Credentials | 314,717 |
| Cookies | 14,900,000 |
| Autofill records | 28,409 |
| Credit card records | 1,204 |
| Parser throughput | 10.5 MB/s |
| Messages indexed | 234,757 |
| Message indexing rate | 1,958.7 msg/s |
| IOCs extracted | 277,965 |
| IOC type breakdown | 68.1% domains, 18.4% URLs, 10.4% IPs, 2.2% wallets, 0.8% CVEs |
The grammar router correctly identified the family for 85% of archives. The remaining 15% fell through to GenericGrammar, which still extracted credentials from most of them, just with lower field completeness.
What's Next
The parser backlog has a few open items: email extraction (deferred because regexes produce too many false positives without a verification step), browser history parsing (the files exist in most archives but are not yet structured), and MLflow integration for tracking grammar performance across corpus versions.
On the collection side, I want to add reaction-weighted scrape priority: channels where file posts get heavy reactions (indicating active buyers) should be scraped more aggressively than quiet channels.
The frontend's /explorer view is partially built; it can display sessions and credentials but does not yet surface the HDBSCAN cluster view or the actor-pivot graph inline. That is the next frontend sprint.
If you are working on something in the same space (authorized Telegram CTI, stealer-log parsing, or threat actor graph analysis), I am happy to talk through the grammar design or the Kafka topology in more detail.
References & Further Reading
- Telethon documentation: MTProto client for Python
- OWASP ZIP Slip Vulnerability: archive traversal attack surface
- IntelOwl: open-source threat intelligence orchestration
- Bianco, D. The Pyramid of Pain (referenced in CTI Foundations)
- CyNER: A Python Library for Cybersecurity Named Entity Recognition, Alam et al., 2022
- HDBSCAN: Density-Based Clustering, Campello et al., 2013
- Grammar-Based Stealer Log Parsing: previous post on the parsing approach that Golliath's grammar system extends