From Shadows to Signals: Building a Cyber Threat Intelligence Pipeline Over Telegram at Scale

How I designed and built Golliath, a modular CTI platform that scrapes Telegram channels at scale, parses stealer-log archives with family-aware grammar routing, and feeds structured intelligence into OpenSearch, Neo4j, and MinIO.

June 22, 2026
20 min read
by Mohamed Habib Jaouadi
#cti
#telegram
#stealer-logs
#threat-intelligence
#go
#python
#kafka
#opensearch
#neo4j
#architecture

Authorized research only. Everything described here was conducted in a controlled environment against data I had lawful authorization to analyze. Golliath does not facilitate unauthorized access, credential use, or victim contact of any kind.

Every day, hundreds of Telegram channels publish stealer logs: archives packed with credentials, cookies, and system fingerprints from compromised machines. Most teams consume this data passively through commercial feeds, and by the time it reaches them the intelligence is already stale. I wanted to get closer to the source: build something that scraped channels in real time, parsed the archive formats properly, and gave me a structured, searchable view of what was being shared.

That project became Golliath, a six-subsystem platform written in Go and Python, backed by Kafka, OpenSearch, Neo4j, MinIO, and PostgreSQL, with a Next.js frontend for exploration. This post is a deep dive into how it works, why it's built the way it is, and what the numbers look like after running it against a real-world corpus.


1. Why Telegram? Why Now?

Telegram has become the dominant distribution channel for cybercrime activity and stealer-log marketplaces. The reasons are structural: channels support large file attachments (up to 2 GB per file, or 4 GB for Premium accounts), messages are persistent and indexed by Telegram's own servers, and invite-link-based access creates a discoverable social graph of threat actor infrastructure.

Unlike paste sites or dark-web forums, Telegram does not require Tor and responds well to programmatic access via MTProto. The channel topology itself is intelligence: who forwards whom, which channels share an admin, how a new log pack propagates across the ecosystem within hours of publication.

The challenge is volume. A single active channel might post dozens of archives per day, and a monitoring operation watching 50+ channels accumulates data faster than any manual triage workflow can handle. The only way to turn that volume into signal is automation, and automation requires understanding the archive formats.


2. Golliath at a Glance

Golliath is not a single application. It is six loosely-coupled subsystems that share storage and communicate through Kafka and HTTP:

| Subsystem | Language | Role |
|---|---|---|
| data_lake | Go + Python | Collection, scraping, file download, event streaming, indexing |
| parser_logs | Python | Archive parsing pipeline: stealer family grammars, field extractors, NLP enrichment |
| intelligence-analysis | Python | ML benchmarking, IOC extraction, threat classification, clustering |
| frontend | Next.js 14 | Dashboards, search, graph explorer, source management |
| infra | Docker Compose | Kafka KRaft, OpenSearch, MinIO, Neo4j, Redis, PostgreSQL, Prometheus, MLflow |
| sdk | Go | Shared HTTP client, job types, retry logic |

Figure: Golliath Platform Architecture. Legend: Source, Collection, Transport, Processing, Storage.

The collection subsystem (data_lake) is the entry point. It scrapes channels, downloads files, and publishes events. Everything downstream (parser, analysis, frontend) consumes from shared storage and search indices. This separation means I can run the parser against the same corpus multiple times as the grammar logic evolves, without re-scraping anything.


3. Source Coverage

Before building any parser, you need data, and that means maintaining a curated source registry. Golliath currently monitors 643 Telegram sources across ten categories:

| Category | Sources | Description |
|---|---|---|
| logs | 325 | Stealer-log channels, archive dumps, credential shares |
| chatter | 108 | General cybercrime discussion, threat actor communication |
| ULP | 55 | URL:login:password combo lists |
| combos | 35 | Pre-formatted credential stuffing lists |
| actors | 33 | Known threat actor personal channels |
| leaks | 26 | Database breach announcements and dumps |
| marketplaces | 26 | Credential sale listings |
| infrastructure | 17 | C2, bulletproof hosting, proxy sale channels |
| exploit-sharing | 13 | PoC and weaponized exploit distribution |
| malware | 5 | Malware distribution and update channels |

Regional breakdown: the majority are tagged global (502), with significant Russian-language coverage (RU: 44), Middle East and North Africa (MENA: 18), and Southeast Asia (SEA: 6). Each source carries a scrape priority (1-10), a configurable interval, and per-source file-download flags controlling whether audio, documents, images, or videos are fetched.


4. Scraping at Scale

4.1 The Session Pool

Telegram's MTProto API does not expose a simple REST endpoint; you authenticate as a user account (or bot) and maintain a session. The session-manager service (Python/Telethon) maintains a pool of authenticated sessions and exposes a small HTTP API that the Go services call.

Scraping a channel means calling iter_messages with a checkpoint cursor:

async for message in client.iter_messages(
    entity,
    min_id=checkpoint_id,
    reverse=True,
    limit=3000,
):
    await publish_to_worker(message)

reverse=True walks forward from the checkpoint, so resuming a scrape after a restart does not re-process old messages. The checkpoint is stored in PostgreSQL and updated atomically after each batch commits to Kafka.

File downloads use the same session pool. Telethon streams file data in chunks; the session-manager relays those chunks to the download-worker over HTTP. I settled on 512 KB chunks after testing: larger chunks caused session timeouts on slow connections; smaller chunks introduced too much HTTP overhead per file.

4.2 Job Dispatch with SKIP LOCKED

The download-worker claims scrape jobs from PostgreSQL using a pattern that avoids distributed locks:

SELECT id, source_id, channel_identifier
FROM scrape_jobs
WHERE status = 'pending'
  AND scheduled_at <= NOW()
ORDER BY priority DESC, scheduled_at ASC
FOR UPDATE SKIP LOCKED
LIMIT 1;

SKIP LOCKED means multiple workers can race for jobs without blocking each other. Jobs that are locked by another worker are transparently skipped. This gives horizontal scalability for free: spin up more download-worker replicas and the queue drains faster with zero coordination overhead.

4.3 Kafka and File Priority Tiers

Once a message is fetched, the download-worker publishes it to the messages.discovered Kafka topic. Files attached to that message get routed to one of two priority topics:

  • files.priority: text files (credential dumps, plain-text logs)
  • files.archives: ZIP, RAR, 7z, TAR

The file-worker consumes these topics in priority order. Text files land in MinIO's txt-dumps/ bucket for immediate parsing; archives go to archives/; executables and unknown formats are quarantined. SHA-256 is computed at download time and stored in PostgreSQL; duplicate files are detected before they are written, keeping the bucket clean.
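
The routing and deduplication rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Golliath's file-worker: the extension sets, the `route_file` and `is_duplicate` helpers, and the in-memory `seen` set (standing in for the PostgreSQL hash lookup) are all hypothetical; only the topic names come from the text.

```python
import hashlib
from pathlib import PurePosixPath

# Assumed extension sets; the post only names the broad groups.
TEXT_EXTS = {".txt", ".csv", ".log"}
ARCHIVE_EXTS = {".zip", ".rar", ".7z", ".tar"}


def route_file(filename: str) -> str:
    """Return the Kafka topic for a file, or 'quarantine' for anything else."""
    ext = PurePosixPath(filename.lower()).suffix
    if ext in TEXT_EXTS:
        return "files.priority"
    if ext in ARCHIVE_EXTS:
        return "files.archives"
    return "quarantine"


def sha256_of_chunks(chunks) -> str:
    """Hash a file incrementally as its chunks arrive during download."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(file_hash: str, seen: set) -> bool:
    """Record the hash and report whether it was already known."""
    if file_hash in seen:
        return True
    seen.add(file_hash)
    return False
```

Hashing chunk by chunk means the digest is ready the moment the download finishes, so the duplicate check can run before anything is written to the bucket.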

4.4 Hunting Invite Links

Telegram invite links (t.me/+HASH) appear constantly in channel messages: operators cross-post content, advertise related services, and link to downstream distribution channels. The message-indexer extracts every invite link from indexed messages using a regex that covers all common patterns:

var reInviteLink = regexp.MustCompile(
    `(?i)https?://(?:t\.me|telegram\.me|telegram\.dog)/(?:\+|joinchat/)([A-Za-z0-9_\-]{10,})`)

Matched hashes are stored in a PostgreSQL invite_links table with their source context. The frontend's /hunt view resolves these hashes via CheckChatInviteRequest, Telegram's RPC for probing private channel metadata without joining. This surfaces channel title, member count, description, and admin signatures from a hash alone.
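
For readers working in Python rather than Go, the extraction step might look roughly like this. The pattern is the one shown above, translated to Python's `re` module; the `extract_invite_hashes` helper and its order-preserving deduplication are illustrative assumptions, not part of the message-indexer.

```python
import re

# Same pattern as the Go version, expressed with Python's re module.
INVITE_LINK_RE = re.compile(
    r"https?://(?:t\.me|telegram\.me|telegram\.dog)/(?:\+|joinchat/)([A-Za-z0-9_\-]{10,})",
    re.IGNORECASE,
)


def extract_invite_hashes(text: str) -> list:
    """Return unique invite hashes in order of first appearance."""
    seen = {}
    for match in INVITE_LINK_RE.finditer(text):
        seen.setdefault(match.group(1), None)
    return list(seen)
```

Note that the pattern requires an explicit `http://` or `https://` scheme, so a bare `t.me/+HASH` written without one would not be captured by it.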

The result is a ranked expansion surface: a small channel under monitoring often contains invite links to a much larger aggregator or supply channel that was never manually added to the source registry. One invite-link pivot in the right direction can add 10 new monitored sources.


5. Message Indexing and LLM-Assisted Channel Categorization

5.1 Real-Time Indexing

The message-indexer service (Go) consumes the messages.discovered Kafka topic and writes every message to OpenSearch. At 1,958.7 messages/second, new content from any monitored channel is searchable within seconds of being published.

After 234,757 indexed messages across the monitored source set, the OpenSearch index becomes a live intelligence feed: analysts can query for a specific domain, an IP range, a malware family name, or a channel handle and get ranked results with full message context.

5.2 Sampling and LLM Pre-Labeling

Manually reviewing hundreds of channel messages to decide a source's correct category does not scale. Instead, Golliath uses a corpus-first approach:

  1. Sample a random subset of messages from each unclassified source (typically 20-50 messages)
  2. Pass the sample to Gemini 2.5 Flash for pre-labeling: the model extracts structured fields (URL, USERNAME, PASSWORD, DOMAIN, SUBDOMAIN) from each line
  3. Human analyst reviews the labels in Doccano, correcting errors
  4. The labeled dataset trains a Gradient Boosting Classifier (85/15 train/test split, confidence threshold 0.70) used to auto-categorize future sources

The prompt instructs Gemini to return a strict JSON array with no markdown, handling all four common credential formats: email:password, url:user:pass, url|user|pass, and variations where passwords contain colons (joined as trailing fields). Labels for the hardest 1,000 cases were fully reviewed before being committed to the training set.

This approach reduced the manual categorization burden by ~80% compared to reading raw channel content. The residual 20% (sources where the classifier confidence falls below threshold or the content is ambiguous) are flagged for human review.
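
The confidence gate itself is simple to express. The sketch below assumes the classifier exposes per-category probabilities as a plain dictionary; the `assign_category` helper is hypothetical, and only the 0.70 threshold comes from the text.

```python
from typing import Optional, Tuple

CONFIDENCE_THRESHOLD = 0.70  # threshold quoted in the text


def assign_category(probabilities: dict) -> Tuple[Optional[str], float]:
    """Return (label, confidence); label is None when the source needs human review."""
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence >= CONFIDENCE_THRESHOLD:
        return label, confidence
    return None, confidence
```

Returning the confidence even when the label is withheld lets the review queue be sorted so that the most ambiguous sources surface first.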


6. The Archive Problem

Open any stealer-log archive and you will find a surprisingly consistent internal structure. The stealer client runs on the victim machine, collects credentials from browsers, writes everything to a local directory, and then compresses it before exfiltrating to the C2 server or dropping it directly to a Telegram channel.

Each stealer family has evolved its own layout convention. A Lumma archive looks like this:

LummaC2_2024_01_15_SESSION_12345/
├── domain_detect.txt          <- family fingerprint
├── Passwords/
│   ├── Chrome_Default.txt
│   └── Firefox.txt
├── Cookies/
│   ├── Chrome_Default.txt
│   └── network.txt
├── Autofill/
│   └── Chrome_Default.txt
└── System Info.txt

A WhiteSnake archive looks entirely different:

System_HOSTNAME_WIN10PRO/
├── Browsers/
│   ├── Passwords/
│   └── Cookies/
├── Cookies/                   <- root-level cookies too
├── BankCards.txt
└── System.txt

These are not cosmetic differences. A parser that assumes Lumma layout on a WhiteSnake archive will misroute the password files and produce nothing. Before I could write a single extractor, I needed a way to identify the family and select the right parsing logic.


7. The Parser Pipeline

The parser_logs subsystem implements a six-stage pipeline. Archives go in, structured JSON records come out.

Figure: parser_logs Processing Pipeline. Six-stage pipeline from raw archive to enriched structured intelligence. The stage shown here is the Walker.

Walker (Recursive Traversal)

Recursively walk the extracted archive tree. Enforce safety limits before touching anything.

Output: file path + metadata stream

Implementation:
  • Extracts ZIP/RAR/7z/TAR with format-specific decompressors
  • Archive bomb guard: ratio > 20x above 1 MB, abort
  • Entry count cap: > 100,000 entries, abort
  • Nested archive depth: max 3 levels
  • Zip-slip protection: resolves all symlinks before extraction
  • Emits { path, size, mtime } records for classifier

7.1 Safety Before Parsing

The walker enforces hard limits before extracting a single byte of content. This matters because attackers have started distributing archive bombs (intentionally malformed archives that expand to hundreds of gigabytes) and zip-slip payloads that try to write outside the extraction directory.

ARCHIVE_BOMB_RATIO = 20        # uncompressed:compressed > 20x -> abort
ARCHIVE_BOMB_MIN_SIZE = 1_048_576  # only apply ratio check above 1 MB
MAX_ENTRY_COUNT = 100_000
MAX_NESTED_DEPTH = 3
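
As a rough sketch of how limits like these could be checked for ZIP files before extraction, using only the standard library. The `check_zip_limits` and `build_zip` helpers and the `ArchiveBombError` type are hypothetical, and the sketch assumes the 1 MB floor applies to the declared uncompressed size.

```python
import io
import zipfile

ARCHIVE_BOMB_RATIO = 20
ARCHIVE_BOMB_MIN_SIZE = 1_048_576
MAX_ENTRY_COUNT = 100_000


class ArchiveBombError(Exception):
    """Raised when an archive exceeds the walker's safety limits."""


def check_zip_limits(zf: zipfile.ZipFile) -> None:
    """Inspect central-directory metadata only; nothing is decompressed here."""
    entries = zf.infolist()
    if len(entries) > MAX_ENTRY_COUNT:
        raise ArchiveBombError(f"too many entries: {len(entries)}")
    compressed = sum(e.compress_size for e in entries)
    uncompressed = sum(e.file_size for e in entries)
    # Assumption: the ratio check applies once the declared output exceeds 1 MB.
    too_big = uncompressed > ARCHIVE_BOMB_MIN_SIZE
    too_dense = uncompressed > ARCHIVE_BOMB_RATIO * max(compressed, 1)
    if too_big and too_dense:
        raise ArchiveBombError(f"expansion ratio too high: {uncompressed}/{compressed}")


def build_zip(name: str, payload: bytes) -> zipfile.ZipFile:
    """Demo helper: build an in-memory ZIP with one deflated entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, payload)
    return zipfile.ZipFile(buf)
```

Declared sizes in archive headers can be forged, so a metadata check like this is only a first gate; a production walker would also need to count bytes as they are actually written during extraction.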

Zip-slip protection resolves all paths relative to the extraction root and rejects any entry whose resolved path escapes it:

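from pathlib import Path

class ZipSlipError(Exception):
    """Raised when an archive entry resolves outside the extraction root."""

def _safe_extract_path(archive_root: Path, entry_name: str) -> Path:
    root = archive_root.resolve()
    target = (root / entry_name).resolve()
    # Compare path components, not string prefixes: "/out_evil" must not pass for root "/out".
    if target != root and root not in target.parents:
        raise ZipSlipError(f"Attempted path traversal: {entry_name}")
    return target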

These are not theoretical risks. Zip-slip payloads appear in the wild corpus.

7.2 Layout Classification and Grammar Selection

After safe extraction, the layout classifier surveys the directory tree without reading file contents. It looks for family-specific structural signatures:

  • domain_detect.txt or domaindetect.txt at root: Lumma signal
  • Network/ containing UserName_* subdirectories: RisePro signal
  • System_HOSTNAME/ directories at root: WhiteSnake signal
  • log.txt at root with a Passwords/ sibling: RedLine signal

These signals are passed to the GrammarRouter, which calls matches() on every registered grammar and picks the highest scorer above 0.60:

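CONFIDENCE_THRESHOLD = 0.60

class GrammarRouter:
    def __init__(self, grammars: list[FamilyGrammar]) -> None:
        # Highest priority first: max() keeps the first of equal scores,
        # so ties go to the higher-priority (more specific) grammar.
        self._grammars = sorted(grammars, key=lambda g: g.priority, reverse=True)
        self._generic = GenericGrammar()  # fallback when nothing clears the threshold

    def route(self, layout: ArchiveLayout) -> FamilyGrammar:
        scores = [(g, g.matches(layout)) for g in self._grammars]
        best_grammar, best_score = max(scores, key=lambda x: x[1])
        if best_score >= CONFIDENCE_THRESHOLD:
            return best_grammar
        return self._generic  # unconditional fallback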

Figure: Grammar Router, Confidence Competition. Each FamilyGrammar scores the layout evidence; the highest score above 0.60 wins. Example panel:

WhiteSnakeGrammar
  • Confidence score: 0.95 (above the 0.60 threshold, wins routing)
  • Primary trigger: System_* + Cookies/
  • Example: WhiteSnake Stealer aggregator repacks
  • Evidence scored by matches(): System_<HOSTNAME>/ directories in root, Cookies/ directory at root level, BankCard*.txt files present
  • Fields extracted: passwords (colon-separated), cookies, autofill, credit cards
The threshold of 0.60 was chosen empirically: below it, the grammar was wrong more often than the generic fallback would be. Tie-breaking favors more-specific grammars; GenericGrammar always returns 0.0, so it never wins a competition.


8. Family Grammars

Every stealer family gets its own FamilyGrammar subclass. The abstract base defines a simple contract:

FamilyGrammar ABC (Python)
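
A minimal sketch of what such a contract could look like, assuming a grammar only needs a `priority` and a `matches()` method that returns a confidence score. The `ArchiveLayout` fields, the Lumma score of 0.90, and the priority values are illustrative assumptions, not values taken from Golliath.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ArchiveLayout:
    """Structural summary of an extracted archive (hypothetical shape)."""
    root_files: set = field(default_factory=set)
    root_dirs: set = field(default_factory=set)


class FamilyGrammar(ABC):
    name: str = "abstract"
    priority: int = 0

    @abstractmethod
    def matches(self, layout: ArchiveLayout) -> float:
        """Return a confidence score in [0.0, 1.0] for this layout."""


class LummaGrammar(FamilyGrammar):
    name = "lumma"
    priority = 10

    def matches(self, layout: ArchiveLayout) -> float:
        fingerprints = {"domain_detect.txt", "domaindetect.txt"}
        # 0.90 is an illustrative score, not a value from the post.
        return 0.90 if layout.root_files & fingerprints else 0.0


class GenericGrammar(FamilyGrammar):
    name = "generic"

    def matches(self, layout: ArchiveLayout) -> float:
        return 0.0  # never wins a competition; used only as the fallback
```

Keeping `matches()` purely structural (file and directory names, no file contents) is what lets the router score every grammar cheaply before any extractor runs.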

8.1 LummaGrammar

Lumma is the dominant stealer-as-a-service family in the current corpus. Its archives are identified by domain_detect.txt, a list of domains that the stealer was configured to target, one per line.

LummaGrammar.matches() (Python)

The domain_detect.txt file is also intelligence in itself: it reveals which domains the affiliate configured the stealer to prioritize, often including specific banking, crypto exchange, and corporate SSO domains.

8.2 RiseProGrammar

RisePro organizes each victim session into a named subdirectory under Network/. The directory name encodes a username hash: Network/UserName_a3f9b2/. This layout is distinctive enough that the classifier scores it at ~0.88.

RiseProGrammar.matches() (Python)
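
As a rough illustration of the structural check described above, a standalone scoring function might look like this. The `risepro_score` helper and its path-list input are assumptions made for the sketch; the 0.88 score is the approximate figure quoted in the text.

```python
import re

# Matches "Network/UserName_<suffix>/..." at the archive root.
RISEPRO_SESSION_RE = re.compile(r"^Network/UserName_[^/]+/")


def risepro_score(paths: list) -> float:
    """Score a list of archive-relative paths for the RisePro layout."""
    has_session_dir = any(RISEPRO_SESSION_RE.match(p) for p in paths)
    return 0.88 if has_session_dir else 0.0
```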

8.3 WhiteSnakeGrammar

WhiteSnake is the highest-confidence detection at 0.95 when both signals co-occur: System_HOSTNAME/ directories (one per victim session in aggregator repacks) and a root-level Cookies/ directory.

8.4 GenericGrammar (The Fallback)

When no family grammar scores above 0.60, GenericGrammar takes over. It runs all four password format parsers in cascade and tries the best-effort cookie parser. Coverage is lower than a matched grammar, but it is still far better than discarding the archive entirely.

The fallback handles roughly 15% of the corpus in practice, mostly aggregator repacks that strip or rename the family-specific files before redistribution.


9. Field Extractors

Once the grammar routes the archive, per-role extractors run against the classified file list. The most complex is the password extractor, which handles four distinct formats across stealer families.

9.1 Password Format Cascade

Passwords are parsed in priority order, stopping at the first format that yields results:

1. Key:value blocks (Lumma, some RedLine variants):

URL: https://example.com/login
UserName: alice@corp.com
Password: hunter2

URL: https://bank.example/auth
...

2. TSV format (RisePro):

https://mail.google.com	alice@gmail.com	••••••••
https://github.com	alice	gh_token_abc123

3. Pipe-separated (RedLine v22+):

https://twitter.com/|alice|P@ssw0rd!

4. Colon-separated (WhiteSnake, generic):

https://corp.okta.com:alice@corp.com:S3cur3P@ss

Each format parser is implemented as a standalone function that returns list[PasswordRecord] | None. The cascade tries them in order and short-circuits on the first non-empty result. If all four fail, the file is logged as unparseable and added to the backlog.
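
The cascade described above can be sketched as follows. This is a simplified illustration rather than Golliath's extractor: the `PasswordRecord` fields, the regexes, and the helper names are assumptions, and the colon parser deliberately ignores URLs with explicit ports.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PasswordRecord:
    url: str
    username: str
    password: str


Records = Optional[List[PasswordRecord]]

KV_BLOCK_RE = re.compile(
    r"^URL:[ \t]*(?P<url>\S+)[ \t]*\r?\n"
    r"[ \t]*UserName:[ \t]*(?P<user>[^\r\n]*)\r?\n"
    r"[ \t]*Password:[ \t]*(?P<pw>[^\r\n]*)",
    re.IGNORECASE | re.MULTILINE,
)
# No port support in this sketch: the URL part must not contain ":" after the scheme.
COLON_RE = re.compile(r"^((?:[a-z][a-z0-9+.\-]*://)?[^:\s]+):([^:\s]+):(.+)$", re.IGNORECASE)


def parse_kv_blocks(text: str) -> Records:
    records = [
        PasswordRecord(m["url"], m["user"].strip(), m["pw"].strip())
        for m in KV_BLOCK_RE.finditer(text)
    ]
    return records or None


def _parse_delimited(text: str, sep: str) -> Records:
    records = []
    for line in text.splitlines():
        parts = line.strip().split(sep)
        if len(parts) >= 3 and all(parts[:2]):
            # Anything after the second separator belongs to the password.
            records.append(PasswordRecord(parts[0], parts[1], sep.join(parts[2:])))
    return records or None


def parse_tsv(text: str) -> Records:
    return _parse_delimited(text, "\t")


def parse_pipe(text: str) -> Records:
    return _parse_delimited(text, "|")


def parse_colon(text: str) -> Records:
    records = []
    for line in text.splitlines():
        m = COLON_RE.match(line.strip())
        if m:
            records.append(PasswordRecord(m.group(1), m.group(2), m.group(3)))
    return records or None


PARSERS: List[Callable[[str], Records]] = [parse_kv_blocks, parse_tsv, parse_pipe, parse_colon]


def parse_passwords(text: str) -> Records:
    """Try each format in priority order and stop at the first non-empty result."""
    for parser in PARSERS:
        records = parser(text)
        if records:
            return records
    return None  # caller logs the file as unparseable
```

Order matters here: a file is attributed to the first format that yields anything, so an ambiguous line (for example, a colon-separated record whose password contains pipes) is claimed by whichever parser comes earlier in the list.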

9.2 Cookie Parsing

Cookies follow the Netscape 7-column TSV format that browsers export. The parser enforces strict column counts; the most common corruption in wild archives is lines with missing tab separators, which a lenient parser would silently misparse into wrong fields.

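from dataclasses import dataclass

NETSCAPE_COLUMNS = 7
HTTPONLY_PREFIX = "#HttpOnly_"  # curl-style marker for HttpOnly cookies

@dataclass
class CookieRecord:
    domain: str
    path: str
    name: str
    value: str
    secure: bool
    http_only: bool
    expires: int | None

def parse_netscape_cookies(text: str) -> list[CookieRecord]:
    records = []
    for line in text.splitlines():
        # "#HttpOnly_" lines are real cookies, not comments; strip the marker first.
        http_only = line.startswith(HTTPONLY_PREFIX)
        if http_only:
            line = line[len(HTTPONLY_PREFIX):]
        if line.startswith("#") or not line.strip():
            continue
        cols = line.split("\t")
        if len(cols) != NETSCAPE_COLUMNS:
            continue  # log and skip, do not attempt partial parsing
        domain, _include_subdomains, path, secure, expiry, name, value = cols
        records.append(CookieRecord(
            domain=domain.lstrip("."),
            path=path,
            name=name,
            value=value,
            secure=secure.upper() == "TRUE",
            http_only=http_only,
            expires=int(expiry) if expiry.isdigit() else None,
        ))
    return records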

The 14.9 million cookies in the benchmark corpus made this the highest-volume extractor by a wide margin. Session cookies are the primary target: replaying a valid session cookie skips the login flow altogether, MFA included.

9.3 System Info and the Alias Map

Every stealer writes a system fingerprint file, but the field names are not standardized. Lumma calls it System Info.txt with IP: 1.2.3.4 lines; WhiteSnake uses System.txt with Ip = lines; RedLine uses log.txt with IP Address: lines.

The system info extractor maintains an alias map:

FIELD_ALIASES: dict[str, list[str]] = {
    "hostname": ["computer name", "hostname", "pc name", "computername"],
    "ip": ["ip", "ip address", "external ip", "wan ip"],
    "os": ["os", "operating system", "windows version"],
    "hardware": ["hardware", "cpu", "gpu", "ram", "processor"],
    "country": ["country", "location", "geo"],
    "hwid": ["hwid", "machine id", "hardware id", "uuid"],
}

Normalization means downstream queries (WHERE system_info.country = 'US') work regardless of which stealer family produced the record.
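
A minimal sketch of how the alias map could be applied, assuming system info files consist of `Key: value` or `Key = value` lines. The `normalize_system_info` helper and the line regex are illustrative assumptions; the alias map is copied from above.

```python
import re

FIELD_ALIASES = {
    "hostname": ["computer name", "hostname", "pc name", "computername"],
    "ip": ["ip", "ip address", "external ip", "wan ip"],
    "os": ["os", "operating system", "windows version"],
    "hardware": ["hardware", "cpu", "gpu", "ram", "processor"],
    "country": ["country", "location", "geo"],
    "hwid": ["hwid", "machine id", "hardware id", "uuid"],
}

# Reverse lookup: lowercase alias -> canonical field name.
ALIAS_TO_FIELD = {
    alias: canonical for canonical, aliases in FIELD_ALIASES.items() for alias in aliases
}

# Accepts both "Key: value" and "Key = value" lines.
LINE_RE = re.compile(r"^\s*([^:=]+?)\s*[:=]\s*(.*?)\s*$")


def normalize_system_info(text: str) -> dict:
    """Map family-specific field names onto the canonical schema."""
    record = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        canonical = ALIAS_TO_FIELD.get(m.group(1).lower())
        if canonical and m.group(2):
            record.setdefault(canonical, m.group(2))  # first occurrence wins
    return record
```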


10. NLP Enrichment

After extraction, every session passes through the NLP enrichment pipeline. Two things happen here: IOC extraction and named-entity recognition on the free-text fields.

10.1 IOC Extraction and IntelOwl Integration

IOC extraction uses compiled regex patterns with domain TLD validation:

IOC_PATTERNS = {
    "domain":  re.compile(r"\b(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}\b"),
    "ipv4":    re.compile(r"\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b"),
    "url":     re.compile(r"https?://[^\s\"'<>]+"),
    "wallet":  re.compile(r"\b(bc1|[13])[a-zA-HJ-NP-Z0-9]{25,62}\b|0x[a-fA-F0-9]{40}"),
    "cve":     re.compile(r"CVE-\d{4}-\d{4,7}", re.IGNORECASE),
}

The domain extractor runs every match through a TLD allowlist before accepting it. This eliminates the long tail of false positives from free-form text; words like example.log or victim.txt look like domains to a naive regex but are discarded after TLD rejection.
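
The allowlist step might look roughly like this. The regex is the domain pattern from above; the tiny `TLD_ALLOWLIST` and the `extract_domains` helper are placeholders for illustration, and a real deployment would presumably load a full TLD list instead.

```python
import re

DOMAIN_RE = re.compile(r"\b(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}\b")

# Tiny illustrative allowlist, not the list Golliath uses.
TLD_ALLOWLIST = {"com", "net", "org", "ru", "io", "br", "in", "id"}


def extract_domains(text: str) -> list:
    """Return unique, lowercased domains whose final label is an allowed TLD."""
    found = []
    for match in DOMAIN_RE.finditer(text):
        candidate = match.group(0).lower()
        tld = candidate.rsplit(".", 1)[-1]
        if tld in TLD_ALLOWLIST and candidate not in found:
            found.append(candidate)
    return found
```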

High-priority IOCs (IPs and domains that appear across multiple sessions or match watchlist entries) are optionally enriched via IntelOwl, an open-source threat intelligence orchestration platform that aggregates reputation data from VirusTotal, AbuseIPDB, Shodan, and dozens of other sources. The integration is fire-and-forget: IOCs above a configurable session-frequency threshold are submitted to IntelOwl's REST API, and results are written back to the sessions index as enrichment fields. This turns raw IOC extraction into enriched, context-bearing threat data without blocking the extraction pipeline.

10.2 Named Entity Recognition

The enrichment pipeline supports three NER configurations:

| Model | F1 | Speed | Best for |
|---|---|---|---|
| DistilBERT-NER | 0.81 | Fast | Default; English stealer logs |
| CyNER | 0.74 | Fast | CTI-specific entity types (TTP, malware) |
| XLM-RoBERTa | 0.76 | Slow | Non-English archives (Russian, Turkish) |

CyNER was developed specifically for cybersecurity text and handles entity types that general NER models miss, but its overall F1 on the benchmark (0.74) was below DistilBERT's (0.81). XLM-RoBERTa is the right choice for archives from Eastern European operator groups where the system info and file paths contain Cyrillic text.

10.3 Session Deduplication via HDBSCAN

Aggregators routinely repackage the same victim sessions under different branding. A session that appeared in three separate archive releases would count three times without deduplication.

HDBSCAN clustering on multilingual embeddings catches near-duplicate sessions: if two sessions have the same hostname, IP, and browser profile fingerprint, they cluster tightly regardless of the archive they came from. The benchmark showed ~12% of sessions in the corpus were duplicates by this measure.
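
To make the idea concrete without pulling in a clustering library, here is a deliberately simplified stand-in: exact-match deduplication on a normalized fingerprint. It is not HDBSCAN and will miss the near-duplicates that embedding-based clustering catches; the session dictionary keys and both helper names are assumptions.

```python
import hashlib


def session_fingerprint(hostname: str, ip: str, browser_profile: str) -> str:
    """Hash the normalized identity fields of a session."""
    normalized = "|".join(part.strip().lower() for part in (hostname, ip, browser_profile))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(sessions: list) -> list:
    """Keep the first session seen for each fingerprint."""
    unique = {}
    for session in sessions:
        key = session_fingerprint(
            session["hostname"], session["ip"], session["browser_profile"]
        )
        unique.setdefault(key, session)
    return list(unique.values())
```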

Golliath Benchmark Results
Numbers from a single pipeline run on a real-world corpus.

| Metric | Value |
|---|---|
| Credentials (URL + username + password) | 314,717 |
| Cookies (Netscape-format records) | 14,900,000 |
| Autofill records (form field name/value pairs) | 28,409 |
| Credit cards | 1,204 |
| Throughput (end-to-end archive parsing) | 10.5 MB/s |
| Message indexing | 1,958.7 msg/s |
| IOCs extracted (from 234,757 messages) | 277,965 |
| Family match rate | 85% |
| Generic fallback | 15% |
| Session dedup | ~12% reduction |

Families covered: Lumma C2, RisePro, WhiteSnake, RedLine, plus 8 aggregator brands handled by GenericGrammar.


11. The Aggregator Problem

A significant fraction of Telegram stealer-log channels are not operators; they are aggregators. They buy log packs from multiple stealer networks, strip identifying metadata, rebrand the archives, and resell them to credential stuffers.

This matters for two reasons. First, the rebranding often removes or renames the family-specific files (like domain_detect.txt), pushing more archives into the generic fallback. Second, aggregators sometimes use the password field itself as advertising space:

URL: https://mail.google.com
UserName: alice@gmail.com
Password: JOIN @BESTLOGS FOR MORE  <- channel promotion, not a credential

The extractor detects and discards these entries using a heuristic: if the password field matches a @ handle, a t.me/ URL, or common promotional phrases, the record is flagged and excluded from the credential index.
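
A heuristic along these lines could be sketched as below. The specific patterns, in particular the promotional phrase list, are assumptions for illustration; the text only names the three broad signals.

```python
import re

PROMO_PATTERNS = [
    # Telegram-style @handle that starts a token (so "P@ssw0rd" does not match).
    re.compile(r"(?<!\S)@[A-Za-z][A-Za-z0-9_]{4,31}\b"),
    # Telegram link, with or without a scheme.
    re.compile(r"(?:t\.me|telegram\.me)/", re.IGNORECASE),
    # Assumed phrase list; tune against real watermark samples.
    re.compile(r"\b(?:join|subscribe|for more|free logs)\b", re.IGNORECASE),
]


def is_promotional(password: str) -> bool:
    """Flag password fields that look like channel advertising."""
    return any(pattern.search(password) for pattern in PROMO_PATTERNS)
```

A rule like this will occasionally flag a real password (one that happens to start with an `@` followed by five or more word characters, for instance), which is a reason to flag and exclude such records rather than delete them outright.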

Known aggregator brands encountered in the corpus: Exodus, AzoriX, MOAB Stealer, PureLog, Shadow Logs, UnknownStealer, Meduza, StealC repacks, and several unnamed ones identified only by watermark patterns.

The watermark removal is a separate post's worth of content. In short: most watermarks are added as directory name prefixes ([EXODUS] SESSION_123/) or injected into System Info.txt headers. Both are stripped during layout classification before the grammar sees them.
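
For the directory-prefix case only, the stripping step could be as small as this. The regex and the `strip_directory_watermark` helper are assumptions based on the `[EXODUS] SESSION_123/` example; header-injected watermarks are not covered.

```python
import re

# Matches a leading "[BRAND] " tag such as "[EXODUS] ".
WATERMARK_PREFIX_RE = re.compile(r"^\[[^\]]+\]\s*")


def strip_directory_watermark(dirname: str) -> str:
    """Remove a bracketed brand prefix from a directory name, if present."""
    return WATERMARK_PREFIX_RE.sub("", dirname, count=1)
```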


12. Intelligence at the Top

With parsed data in OpenSearch and the relationship graph in Neo4j, meaningful intelligence queries become fast.

12.1 Domain Exposure Reports

The most common operational query is "Is domain X in the credential set?" It is answered with a two-pass OpenSearch aggregation:

  1. First pass: count sessions containing credentials for the target domain
  2. Second pass: enumerate unique username/password pairs, deduplicated by credential hash

Results feed directly into the /hunt API endpoint and surface in the frontend's domain search.

12.2 Channel Topology via Neo4j

The graph model captures how threat actor infrastructure connects:

MATCH (a:Source)-[:FORWARDED_FROM_SOURCE]->(b:Source)
WHERE b.identifier = '@target_channel'
RETURN a.title, a.member_count, a.created_at
ORDER BY a.member_count DESC

This surfaces every channel that has forwarded content from a target channel, revealing distribution networks and shared admin infrastructure that are invisible from message-level analysis.

12.3 Actor Pivoting

MATCH (u:TelegramUser)-[:ACTIVE_IN]->(s:Source)
WHERE s.identifier IN ['@chan_a', '@chan_b']
WITH u, collect(s.identifier) AS channels
WHERE size(channels) > 1
RETURN u.username, channels

A user active in multiple monitored channels is a pivot point, potentially an operator, reseller, or admin cross-posting content.


13. TLD Analytics and Geographic Density

315M+ credentials across 60 TLDs reveal a clear geographic picture of who is being targeted. The .com namespace dominates at 148M credentials, unsurprising given that most global services use .com domains. Among country-code TLDs, Brazil leads by a wide margin (7.1M), followed by India (4.8M) and Indonesia (3.6M).

The per-capita view is more revealing. When normalized against population, smaller countries can surface as disproportionately targeted. Countries in Latin America (Peru, Chile) and Southeast Asia (Vietnam, Indonesia) rank higher per-capita than their absolute numbers suggest, consistent with the known geographic distribution of stealer-malware campaigns that favor regions with high smartphone penetration but lower security awareness.

Figure: Credential Density by Country. 315M+ credentials across 60 TLDs, country-level aggregation via ccTLD mapping. Color scale: 0 to 7.1M (log scale, ccTLD only).

The TLD analytics layer is powered by a full OpenSearch aggregation over the credential URL domain field, grouped by ccTLD suffix. The query is expensive (full index scan across tens of millions of documents) and is cached for one hour. The result feeds the world heatmap and ranked table in the frontend's /tld view.
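
The grouping logic itself is easy to show outside OpenSearch. The sketch below does the same ccTLD bucketing client-side over a list of credential URLs; it illustrates the mapping, not the aggregation query Golliath runs, and both helper names are assumptions.

```python
from collections import Counter
from typing import Iterable, Optional
from urllib.parse import urlsplit


def cctld_of(url: str) -> Optional[str]:
    """Return the two-letter country-code TLD of a URL's host, if it has one."""
    try:
        host = urlsplit(url if "://" in url else "//" + url).hostname or ""
    except ValueError:
        return None  # malformed URL
    suffix = host.rsplit(".", 1)[-1]
    # ASCII country-code TLDs are exactly the two-letter TLDs.
    if len(suffix) == 2 and suffix.isascii() and suffix.isalpha():
        return suffix
    return None


def count_by_cctld(urls: Iterable[str]) -> Counter:
    """Count credentials per ccTLD, ignoring generic TLDs and bare IPs."""
    return Counter(tld for tld in map(cctld_of, urls) if tld)
```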


14. Results

One benchmark run against a real-world corpus:

| Metric | Value |
|---|---|
| Credentials | 314,717 |
| Cookies | 14,900,000 |
| Autofill records | 28,409 |
| Credit card records | 1,204 |
| Parser throughput | 10.5 MB/s |
| Messages indexed | 234,757 |
| Message indexing rate | 1,958.7 msg/s |
| IOCs extracted | 277,965 |
| IOC type breakdown | 68.1% domains, 18.4% URLs, 10.4% IPs, 2.2% wallets, 0.8% CVEs |

The grammar router correctly identified the family for 85% of archives. The remaining 15% fell through to GenericGrammar, which still extracted credentials from most of them, just with lower field completeness.


What's Next

The parser backlog has a few open items: email extraction (deferred because regexes produce too many false positives without a verification step), browser history parsing (the files exist in most archives but are not yet structured), and MLflow integration for tracking grammar performance across corpus versions.

On the collection side, I want to add reaction-weighted scrape priority: channels where file posts get heavy reactions (indicating active buyers) should be scraped more aggressively than quiet channels.

The frontend's /explorer view is partially built; it can display sessions and credentials but does not yet surface the HDBSCAN cluster view or the actor-pivot graph inline. That is the next frontend sprint.

If you are working on something in the same space (authorized Telegram CTI, stealer-log parsing, or threat actor graph analysis), I am happy to talk through the grammar design or the Kafka topology in more detail.


References & Further Reading

  • Telethon documentation: MTProto client for Python
  • OWASP ZIP Slip Vulnerability: archive traversal attack surface
  • IntelOwl: open-source threat intelligence orchestration
  • Bianco, D. The Pyramid of Pain (referenced in CTI Foundations)
  • CyNER: A Python Library for Cybersecurity Named Entity Recognition, Alam et al., 2022
  • HDBSCAN: Density-Based Clustering, Campello et al., 2013
  • Grammar-Based Stealer Log Parsing: previous post on the parsing approach that Golliath's grammar system extends
