If you want to pull structured event knowledge out of plain text, you usually start with two questions: what happened? and why did it happen? The first question is the job of event detection, the second is the job of causality identification. In this post I want to walk you through both tasks with a single running example, and then show you the recent papers that try to solve them with little or no task-specific training.
Contents
- The two tasks, on one example
- Event detection, approach 1: zero-shot reasoning with DiCoRe
- Event detection, approach 2: a context-aware encoder with LoRA
- Causality identification: can LLMs do it without training?
- Takeaways
- References
Here is the whole pipeline at a glance, with the two stages and their substeps:
+----------+
| Raw text |
+----------+
|
v
+------------------------------------------------------+
| STAGE 1 - EVENT EXTRACTION (Event Detection) |
+------------------------------------------------------+
| |
| 1a) Trigger Identification |
| -> mark the spans that evoke an event |
| | |
| v |
| 1b) Trigger Classification |
| -> label each trigger with an event type |
| |
+------------------------------------------------------+
|
v events = typed triggers
+------------------------------------------------------+
| STAGE 2 - EVENT RELATION EXTRACTION (ERE) |
+------------------------------------------------------+
| for each related pair of events, classify: |
| |
| 2a) Coreference -> are they the same event? |
| 2b) Temporal -> BEFORE / CONTAINS / OVERLAP |
| 2c) Causal -> CAUSE / PRECONDITION |
| 2d) Subevent -> is one event part of another? |
+------------------------------------------------------+
|
v
+------------------------+
| Structured event graph |
+------------------------+
The two stages feed each other: Stage 1 turns raw text into typed events, and Stage 2 wires those events together. Causality identification is the 2c branch, but notice it sits downstream of temporal ordering, since a cause has to precede its effect. Let's anchor everything to one sentence:
A powerful hurricane destroyed the harbor overnight, and thousands of residents fled the city the next morning.
The two tasks, on one example
Event detection (ED) means finding the words that evoke events and labeling each with a type from a fixed ontology. Following the MAVEN dataset, the task is split into two subtasks: trigger identification (which spans express an event) and trigger classification (which event type each trigger belongs to) (Wang et al., 2020). MAVEN is what made this a serious benchmark: 4,480 Wikipedia documents, ~118k event mentions, and 168 fine-grained event types, with a deliberately long-tailed type distribution so that rare events stay rare. On our sentence, an ED system should produce something like:
("hurricane", Catastrophe)
("destroyed", Destroying)
("fled", Escaping)
So you first have to spot the triggers (hurricane, destroyed, fled), and then map each one to its type. The long tail is the hard part: Catastrophe is common, but plenty of MAVEN's 168 types appear only a handful of times, and that is exactly where models fall over.
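To make the two subtasks concrete, here is a minimal sketch of how you might represent ED output and score it at both levels. The `EventMention` class, the token-index span convention, and the mistyped `Motion` prediction are assumptions made for illustration; they are not MAVEN's official format or scorer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventMention:
    trigger: str      # surface form of the trigger word
    start: int        # token index of the trigger in the sentence
    event_type: str   # label from the closed ontology


def f1(num_correct: int, num_pred: int, num_gold: int) -> float:
    """Standard F1 from counts; returns 0.0 when undefined."""
    if num_correct == 0 or num_pred == 0 or num_gold == 0:
        return 0.0
    precision = num_correct / num_pred
    recall = num_correct / num_gold
    return 2 * precision * recall / (precision + recall)


def score(pred: list[EventMention], gold: list[EventMention]) -> dict[str, float]:
    # Trigger identification: only the span has to match.
    pred_spans = {m.start for m in pred}
    gold_spans = {m.start for m in gold}
    # Trigger classification: span and type both have to match.
    pred_typed = {(m.start, m.event_type) for m in pred}
    gold_typed = {(m.start, m.event_type) for m in gold}
    return {
        "identification_f1": f1(len(pred_spans & gold_spans), len(pred_spans), len(gold_spans)),
        "classification_f1": f1(len(pred_typed & gold_typed), len(pred_typed), len(gold_typed)),
    }


tokens = ("A powerful hurricane destroyed the harbor overnight , and thousands "
          "of residents fled the city the next morning .").split()
gold = [
    EventMention("hurricane", tokens.index("hurricane"), "Catastrophe"),
    EventMention("destroyed", tokens.index("destroyed"), "Destroying"),
    EventMention("fled", tokens.index("fled"), "Escaping"),
]
# A hypothetical system output: all three triggers found, one mistyped.
pred = [
    EventMention("hurricane", tokens.index("hurricane"), "Catastrophe"),
    EventMention("destroyed", tokens.index("destroyed"), "Destroying"),
    EventMention("fled", tokens.index("fled"), "Motion"),
]
result = score(pred, gold)
```

In this toy case the system finds every trigger, so identification is perfect, but the one wrong label pulls classification down. That gap between the two scores is what the split into subtasks lets you see.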
Causality identification takes the detected events and asks which ones cause which. MAVEN-ERE extends MAVEN with four relation types (coreference, temporal, causal, and subevent) over the same documents (Wang et al., 2022). For causality it uses two labels: CAUSE, where the tail event is inevitable given the head, and PRECONDITION, where the tail would not have happened without the head. On our sentence, the residents fleeing is not inevitable given the destruction, but it would not have happened without it, so the right link is:
(destroyed) PRECONDITION (fled)
One detail worth internalizing: MAVEN-ERE only looks for causal links between events that are already temporally ordered (BEFORE or OVERLAP). Causes precede effects, so temporal structure prunes the search space. That coupling between temporal and causal reasoning will matter when we get to the LLM results.
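The pruning idea is easy to sketch: before running any causal classifier, keep only the ordered pairs whose temporal label is BEFORE or OVERLAP. The function name and the temporal labels assigned to the running example below are assumptions for illustration, not annotations taken from MAVEN-ERE.

```python
from itertools import permutations

# Temporal labels under which a causal link is considered, per the text above.
CAUSAL_ELIGIBLE = {"BEFORE", "OVERLAP"}


def causal_candidates(events, temporal):
    """Yield ordered (head, tail) pairs worth sending to a causal classifier.

    `temporal` maps an ordered pair (head, tail) to a temporal label;
    pairs without an eligible label are skipped.
    """
    for head, tail in permutations(events, 2):
        if temporal.get((head, tail)) in CAUSAL_ELIGIBLE:
            yield head, tail


events = ["hurricane", "destroyed", "fled"]
# Hypothetical temporal annotations for the running example.
temporal = {
    ("hurricane", "destroyed"): "OVERLAP",
    ("hurricane", "fled"): "BEFORE",
    ("destroyed", "fled"): "BEFORE",
}
candidates = list(causal_candidates(events, temporal))
```

Three events give six ordered pairs, and the temporal filter keeps three of them. On a document with hundreds of events, that kind of reduction is the difference between a tractable and an intractable candidate set.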
The classic recipe for both tasks is supervised: train a BERT- or RoBERTa-style classifier on a large labeled corpus. But labels are expensive, especially in specialized domains. So the interesting recent question is: how far can you get without training a model on thousands of annotated examples?
Event detection, approach 1: zero-shot reasoning with DiCoRe
The first approach throws out training entirely. DiCoRe is a zero-shot pipeline built on the observation that prompting an LLM to do ED in a single prompt overloads it: you are simultaneously asking it to study a closed ontology of up to 168 types, find domain-specific triggers, and emit strict JSON. Cram all of that into one prompt and the model misses events, invents irrelevant ones, and breaks format (Parekh et al., 2025).
DiCoRe decouples the work into three stages:
- Dreamer reasons divergently. You strip away the ontology and format constraints and just ask the model to name any events it sees, free-form. On our sentence it might emit ("Storm", "hurricane"), ("Destruction", "destroyed"), ("Evacuation", "fled"). This is deliberately liberal: the goal is recall, not precision.
- Grounder reasons convergently. Now you bring back the closed ontology and map the free-form names onto it: Storm → Catastrophe, Destruction → Destroying, Evacuation → Escaping. Crucially, the format is enforced with a finite-state machine that guides constrained decoding, so the model can only emit valid event types and in-sentence triggers. The LLM never has to remember the format; the decoder won't let it stray.
- Judge verifies each surviving prediction one at a time with a cheap yes/no check, filtering anything spurious to recover precision.
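If constrained decoding is new to you, here is a toy character-level sketch of the idea behind the Grounder's format enforcement. It is an assumption-laden illustration, not DiCoRe's implementation: real systems work over tokenizer tokens and mask logits, whereas here a prefix trie plays the finite-state machine and a hypothetical `rank_next` callable stands in for the LLM.

```python
def build_trie(allowed: list[str]) -> dict:
    """Build a prefix trie; each node is an FSM state, '$' marks an accepting state."""
    root: dict = {}
    for label in allowed:
        node = root
        for ch in label:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root


def constrained_decode(trie: dict, rank_next) -> str:
    """Greedy decoding where the FSM, not the model, decides what is legal.

    `rank_next(prefix, options)` stands in for the LLM: it picks one of the
    legal `options` ('$' means stop) given the text generated so far.
    """
    node, out = trie, ""
    while True:
        options = sorted(node)  # transitions allowed from this state
        choice = rank_next(out, options)
        if choice not in options:
            raise ValueError("scorer must choose a legal transition")
        if choice == "$":
            return out
        out += choice
        node = node[choice]


def toy_model(target: str):
    """A fake model that tries to spell `target`, falling back to any legal option."""
    def rank_next(prefix: str, options: list[str]) -> str:
        want = target[len(prefix)] if len(prefix) < len(target) else "$"
        return want if want in options else options[0]
    return rank_next


ontology = ["Catastrophe", "Destroying", "Escaping"]
trie = build_trie(ontology)
# The fake model "wants" to say Destruction, which is not in the ontology.
decoded = constrained_decode(trie, toy_model("Destruction"))
```

The fake model heads toward an out-of-ontology name, but every step is restricted to transitions the trie allows, so the output is guaranteed to be a valid type. The same trick extends to restricting triggers to spans that occur in the sentence.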
The payoff is that no single stage ever has to juggle all the constraints at once, and the reported result is a 4 to 7% average F1 gain over the best prior zero-shot baselines across six datasets and nine LLMs, even beating transfer-learning models that were fine-tuned on tens of thousands of examples. The lesson for you: when an LLM struggles on a structured task, decomposing discovery from grounding (and letting a decoder enforce structure) often beats a single clever prompt.
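The three-stage decomposition can be sketched as plain function composition. Everything below is a hypothetical illustration rather than the authors' code: the prompts, the `name|trigger` line format, and the `fake_llm` with canned answers are assumptions, and the membership checks in the Grounder are only a crude stand-in for constrained decoding.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical text-in, text-out model interface


def dreamer(llm: LLM, sentence: str) -> list[tuple[str, str]]:
    """Divergent step: free-form (event name, trigger) pairs, no ontology shown."""
    raw = llm(f"List every event in the sentence as name|trigger lines.\n{sentence}")
    pairs = []
    for line in raw.splitlines():
        if "|" in line:
            name, trigger = (part.strip() for part in line.split("|", 1))
            pairs.append((name, trigger))
    return pairs


def grounder(llm: LLM, sentence: str, pairs, ontology: set[str]) -> list[tuple[str, str]]:
    """Convergent step: map free-form names onto the closed ontology."""
    grounded = []
    for name, trigger in pairs:
        event_type = llm(f"Map '{name}' to one of {sorted(ontology)}.").strip()
        # Keep only valid types and triggers that really occur in the sentence.
        if event_type in ontology and trigger in sentence:
            grounded.append((trigger, event_type))
    return grounded


def judge(llm: LLM, sentence: str, grounded) -> list[tuple[str, str]]:
    """Verification step: one yes/no check per surviving prediction."""
    kept = []
    for trigger, event_type in grounded:
        answer = llm(f"In '{sentence}', does '{trigger}' express a {event_type} event? yes/no")
        if answer.strip().lower().startswith("yes"):
            kept.append((trigger, event_type))
    return kept


def dicore_style_pipeline(llm: LLM, sentence: str, ontology: set[str]):
    return judge(llm, sentence, grounder(llm, sentence, dreamer(llm, sentence), ontology))


def fake_llm(prompt: str) -> str:
    """Canned responses so the sketch runs without a model."""
    if prompt.startswith("List every event"):
        return "Storm|hurricane\nDestruction|destroyed\nEvacuation|fled\nNight|overnight"
    if prompt.startswith("Map"):
        mapping = {"Storm": "Catastrophe", "Destruction": "Destroying", "Evacuation": "Escaping"}
        return mapping.get(prompt.split("'")[1], "None")
    return "yes"


sentence = ("A powerful hurricane destroyed the harbor overnight, and thousands "
            "of residents fled the city the next morning.")
ontology = {"Catastrophe", "Destroying", "Escaping"}
events = dicore_style_pipeline(fake_llm, sentence, ontology)
```

Note how the over-eager "Night" candidate from the Dreamer is harmless: it fails to ground to any ontology type and is dropped before the Judge ever sees it.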
Event detection, approach 2: a context-aware encoder with LoRA
The second approach keeps a small amount of training but stays lightweight. It starts from a different complaint: modern decoder-only LLMs (Llama, Qwen) read left-to-right, and that unidirectional attention is a real bottleneck for an understanding task like ED, where you want the whole sentence before deciding a token's event type (Al Monsur et al., 2026). On our example, a strictly left-to-right reader hits destroyed before it has seen the harbor overnight, which is exactly the context that pins down the event.
Two ideas address this:
- Inject sentence-level context. Instead of classifying each token from its own representation, the authors fold a pooled sentence embedding back into every token. Their best variant uses FiLM (feature-wise linear modulation) to scale and shift each token representation by the sentence context, giving a decoder-only model something closer to bidirectional awareness.
- Adapt with LoRA. Rather than full fine-tuning, they use Low-Rank Adaptation. Beyond saving compute, LoRA turns out to act as a regularizer that helps the model generalize to rare event types.
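Both ideas reduce to a few lines of linear algebra. The sketch below uses plain Python lists so it runs anywhere. It assumes mean pooling for the sentence embedding, bias-free projections for FiLM, and the standard LoRA scaling of alpha / rank. The paper's exact pooling, projection shapes, and hyperparameters are not given here, so treat all names and values as hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mean_pool(token_states):
    """Average token vectors into one sentence vector."""
    n = len(token_states)
    return [sum(column) / n for column in zip(*token_states)]


def film(token_states, w_gamma, w_beta):
    """FiLM: scale and shift every token by functions of the sentence vector."""
    sentence = mean_pool(token_states)
    gamma = matvec(w_gamma, sentence)  # per-feature scale
    beta = matvec(w_beta, sentence)    # per-feature shift
    return [[g * h + b for g, h, b in zip(gamma, token, beta)] for token in token_states]


def lora_linear(x, w_frozen, lora_a, lora_b, alpha, rank):
    """LoRA: frozen W x plus a scaled low-rank update B (A x)."""
    base = matvec(w_frozen, x)
    update = matvec(lora_b, matvec(lora_a, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]


# Three toy token states of dimension 2; their mean is [1.0, 1.0].
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
modulated = film(tokens, w_gamma=[[1.0, 0.0], [0.0, 2.0]], w_beta=[[0.5, 0.0], [0.0, 0.0]])

x = [1.0, 2.0]
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]             # rank-1 down-projection (1 x 2)
b_init = [[0.0], [0.0]]      # B starts at zero, so the adapter is a no-op at first
b_trained = [[0.5], [0.0]]   # hypothetical values after some training
before = lora_linear(x, w, a, b_init, alpha=2.0, rank=1)
after = lora_linear(x, w, a, b_trained, alpha=2.0, rank=1)
```

Two things are worth noticing. Every token in `modulated` has been touched by the same sentence-level vector, which is how context from later tokens reaches earlier ones. And with B at zero the LoRA layer reproduces the frozen model exactly, so training only ever moves the model within a low-rank subspace around its starting point.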
That rare-type angle is the whole point, and it is where evaluation matters. The paper argues that the field's habit of reporting Micro-F1 hides failures on the long tail, because Micro-F1 is dominated by frequent classes. Switch to Macro-F1, which weights every event type equally, and the LoRA gains on infrequent types become visible. This connects straight back to MAVEN's deliberately long-tailed design: if your Catastrophe-heavy test set is full of common types, you will never notice that the model can't tell apart the rare ones.
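A tiny worked example shows how far the two metrics can diverge. This is a simplified sketch: it assumes every item is a trigger with exactly one gold type and ignores the "no event" class that real ED scorers have to handle. The label counts are made up for illustration.

```python
from collections import Counter


def micro_macro_f1(gold: list[str], pred: list[str]) -> tuple[float, float]:
    """Single-label micro-F1 and macro-F1 over the types present in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    def f1(t: int, f_pos: int, f_neg: int) -> float:
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    labels = set(gold) | set(pred)
    # Micro: pool the counts over all types, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per type, then average with equal weight.
    macro = sum(f1(tp[label], fp[label], fn[label]) for label in labels) / len(labels)
    return micro, macro


# A model that always predicts the frequent type and never the rare one.
gold = ["Catastrophe"] * 8 + ["Escaping"] * 2
pred = ["Catastrophe"] * 10
micro, macro = micro_macro_f1(gold, pred)
```

Here Micro-F1 comes out at 0.8 while Macro-F1 is about 0.44, because the rare type contributes an F1 of zero that Micro-F1 barely registers.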
Note the contrast with DiCoRe: this is not training-free, it's light training. The same paper also benchmarks pure zero-shot and few-shot prompting as baselines, and those prompting results sit well below the LoRA-adapted models on Macro-F1, a useful reminder that a few labeled examples plus a cheap adapter still buys you a lot when you care about the tail.
Causality identification: can LLMs do it without training?
Now the harder task. If event detection is starting to look tractable without much training, does causality identification follow? The honest answer, from a careful study of LLMs as annotators on MAVEN-ERE, is: not yet (Wei et al., 2024).
The authors probe GPT-3.5 and LLaMA-2 on all four MAVEN-ERE relation types using several prompt designs (and compare against supervised fine-tuning). Across the board, the prompted LLMs fall well short of a plain supervised RoBERTa baseline. For causal relations the gap is stark: the supervised baseline reaches roughly 31 F1, while GPT-3.5's best prompt lands around 5. Three failure modes explain it:
- Hallucinated events. The models invent event mentions and relations that aren't in the text.
- Broken transitivity. If the model predicts A before B and B before C, it fails to infer A before C, and sometimes predicts the reverse. Since a large share of MAVEN-ERE's relations are recoverable by transitivity, this hurts badly.
- Distance and density. Within one sentence the models do okay; across a long document with many events packed together, they miss the long-range and inter-sentence links.
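The transitivity failure is the easiest of the three to check mechanically. Here is a minimal sketch of an audit you could run over a model's predicted BEFORE pairs. The function names are hypothetical, and the closure uses a simple fixed-point loop that is fine for small event sets but not tuned for large documents.

```python
def transitive_closure(pairs: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """All (a, c) implied by chaining BEFORE pairs, including the originals."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def audit_before(predicted: set[tuple[str, str]]):
    """Return (implied-but-missing pairs, pairs ordered in both directions)."""
    closure = transitive_closure(predicted)
    missing = closure - predicted
    # Report each contradictory pair once, in alphabetical order.
    contradictions = {(a, b) for a, b in closure if (b, a) in closure and a < b}
    return missing, contradictions


# The model predicted two links but skipped the one they imply.
incomplete = {("hurricane", "destroyed"), ("destroyed", "fled")}
missing, contradictions = audit_before(incomplete)

# The model predicted the implied link in reverse.
reversed_link = incomplete | {("fled", "hurricane")}
_, bad = audit_before(reversed_link)
```

The first audit surfaces the missing hurricane-before-fled link, and the second flags the cycle created by the reversed prediction. Neither check needs gold labels, which makes this a cheap sanity filter on LLM annotations.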
On our toy sentence, an LLM would probably link destroyed → fled correctly, because both events sit close together in one clean sentence. The trouble starts when that causal chain stretches across paragraphs of a dense document, exactly the discourse-level setting that matters in practice. Supervised fine-tuning of the LLM narrows the gap, but the fine-tuned LLM still trails a much smaller supervised model trained on the same data, while costing far more.
So treat plain prompting as a useful baseline and annotation aid, not a solved problem. But the failures above are specific, which means they can be engineered against, and that is exactly what the next approach does.
Engineering around the failures: LLMERE
LLMERE is a fine-tuned LLM method that targets the three weak spots head-on (Hu et al., 2025). Three design choices, each mapped to a problem:
- Q&A instead of pairwise (efficiency). Rather than asking "is there a relation between event A and event B?" for every pair, you specify one event and ask the model to enumerate, in a single pass, all events causally related to it (split by subtype, CAUSE and PRECONDITION). That drops the number of inferences from O(n²) to O(n); in their timing, ~14s per document versus ~90s for the pairwise approach.
- Document partitioning (coverage). Fine-tuned LLMs have a bounded generation length, so on event-dense documents they simply run out of room and miss relations. LLMERE duplicates the document, highlights only a subset of candidate events in each copy, and merges the answers afterwards, keeping each output short enough to stay complete.
- Rationales (reasoning). This is the part that answers Wei et al.'s transitivity complaint directly. Alongside each answer, the model is trained to emit two kinds of justification: coreference information ("storm and hurricane are the same event") and transitive chains that obey logical rules (storm → flooding → damage, therefore storm → damage). Learning to produce these chains forces the kind of multi-hop reasoning that pure prompting skipped. Interestingly, the rationale has to come after the answer; putting it before (CoT-style) hurt results, because an early wrong rationale poisons the prediction.
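The first two design choices can be sketched in a few lines. This is an illustration under stated assumptions, not LLMERE's code: the query counts assume one call per ordered pair versus one call per anchor event, the event ids `e1` to `e6` and their gold links are made up, and `ask` is a hypothetical stand-in for the fine-tuned model.

```python
def pairwise_queries(n_events: int) -> int:
    """One model call per ordered event pair."""
    return n_events * (n_events - 1)


def qa_queries(n_events: int) -> int:
    """One model call per anchor event, which lists all related events at once."""
    return n_events


def partition(events: list[str], chunk_size: int) -> list[list[str]]:
    """Split candidates into groups; each group is highlighted in its own document copy."""
    return [events[i:i + chunk_size] for i in range(0, len(events), chunk_size)]


def ask_with_partitioning(anchor: str, events: list[str], chunk_size: int, ask) -> dict:
    """Query each partition separately and merge the answers.

    `ask(anchor, highlighted)` returns {"CAUSE": [...], "PRECONDITION": [...]}
    restricted to the highlighted candidates.
    """
    merged = {"CAUSE": set(), "PRECONDITION": set()}
    for highlighted in partition(events, chunk_size):
        answer = ask(anchor, highlighted)
        for label in merged:
            merged[label].update(answer.get(label, []))
    return merged


# Hypothetical gold links for anchor e1, used to fake the model's answers.
GOLD = {"CAUSE": {"e2"}, "PRECONDITION": {"e4", "e6"}}


def fake_ask(anchor: str, highlighted: list[str]) -> dict:
    return {label: [e for e in highlighted if e in tails] for label, tails in GOLD.items()}


candidates = ["e2", "e3", "e4", "e5", "e6"]
merged = ask_with_partitioning("e1", candidates, chunk_size=2, ask=fake_ask)
```

For a document with 50 events, the pairwise formulation needs 2,450 calls against 50 for Q&A. Partitioning multiplies the Q&A count by the number of copies, a modest price for keeping each generated answer short.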
Back to our running example: even if hurricane and a far-away fled never appear in the same sentence, the model can still recover the link by chaining: hurricane is coreferent with a later mention, that mention preconditions the evacuation order, and so on. The rationale is what carries the inference across the distance that defeated the prompted models.
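The chaining logic itself is ordinary graph reachability over coreference clusters. The sketch below is a symbolic illustration of that inference, not what the model does internally. The mentions "the storm" and "evacuation order" are hypothetical, and the code treats CAUSE and PRECONDITION edges alike instead of modeling the paper's specific logical rules.

```python
def canonicalize(coref_pairs):
    """Return a function mapping each mention to its cluster representative (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in coref_pairs:
        parent[find(a)] = find(b)
    return find


def linked(head, tail, causal_edges, coref_pairs) -> bool:
    """True if `tail` is reachable from `head` via causal edges, modulo coreference."""
    find = canonicalize(coref_pairs)
    graph = {}
    for a, b in causal_edges:
        graph.setdefault(find(a), set()).add(find(b))
    goal, frontier, seen = find(tail), [find(head)], set()
    while frontier:
        node = frontier.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            frontier.extend(graph.get(node, ()))
    return False


# Hypothetical document-level facts around the running example.
coref = [("hurricane", "the storm")]
edges = [("the storm", "evacuation order"), ("evacuation order", "fled")]

with_coref = linked("hurricane", "fled", edges, coref)
without_coref = linked("hurricane", "fled", edges, [])
```

Drop the coreference fact and the chain breaks at the first hop, which is why the rationale pairs coreference information with the transitive steps instead of relying on either alone.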
It's worth being clear-eyed about the cost: LLMERE is LoRA instruction-tuned on Llama2-7B/Llama3-8B, so this is fine-tuning, not a training-free trick. But the payoff is real: it edges past the supervised classification state of the art on MAVEN-ERE overall, with a notably larger jump on causal relations specifically, and it leaves few-shot GPT-4 far behind. The takeaway: prompting alone stalls on causality, but a modest amount of fine-tuning that bakes in structure (coreference + transitivity) is enough to overtake the classic supervised pipelines.
Takeaways
If you are building event pipelines:
- For event detection, you can go genuinely training-free by decomposing discovery from grounding and enforcing structure with constrained decoding (DiCoRe), or spend a little training on a parameter-efficient adapter and context injection to fix the long tail (LoRA + FiLM). Either way, watch Macro-F1, not just Micro-F1.
- For causality identification, prompting is a fast baseline, but expect it to fabricate relations, ignore transitivity, and fade over long distances. If you have some labels, light fine-tuning that bakes in structure (a Q&A formulation, document partitioning, and coreference/transitivity rationales, as in LLMERE) fixes those failures and can overtake the classic supervised classifiers.
References
- Wang et al., 2020. MAVEN: A Massive General Domain Event Detection Dataset. EMNLP. https://aclanthology.org/2020.emnlp-main.129.pdf
- Wang et al., 2022. MAVEN-ERE: A Unified Large-scale Dataset for Event Coreference, Temporal, Causal, and Subevent Relation Extraction. EMNLP. https://arxiv.org/pdf/2211.07342
- Parekh et al., 2025. DiCoRe: Enhancing Zero-shot Event Detection via Divergent-Convergent LLM Reasoning. EMNLP. https://aclanthology.org/2025.emnlp-main.1038.pdf
- Al Monsur et al., 2026. Event Detection with a Context-Aware Encoder and LoRA for Improved Performance on Long-Tailed Classes. Findings of EACL. https://aclanthology.org/2026.findings-eacl.314.pdf
- Wei et al., 2024. Are LLMs Good Annotators for Discourse-level Event Relation Extraction? https://arxiv.org/pdf/2407.19568
- Hu et al., 2025. Large Language Model-Based Event Relation Extraction with Rationales. COLING. https://aclanthology.org/2025.coling-main.500.pdf













