Architecture & Key Concepts

A reference doc for understanding the engineering behind this project.


1. The Core Idea

Polymarket hosts prediction markets: people bet on real-world outcomes (elections, economic events, etc.), and the market price reflects the crowd's estimated probability. For example, a YES share trading at $0.70 means the crowd puts the chance of YES at 70%.

The hypothesis this system tests: semantically similar markets tend to resolve the same way. If "Will X happen by March?" resolves YES, then "Will X happen by June?" probably will too. The system finds these relationships automatically using embeddings, builds a directed graph, and monitors for signals when one market resolves.
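As a rough illustration of that pipeline, the sketch below builds candidate edges between markets whose (normalized) embeddings exceed a similarity cutoff. The threshold value and the function name are illustrative assumptions, not taken from the codebase, and the real system's rule for directing edges (e.g., by resolution date) is not shown here.

import numpy as np

# Hypothetical sketch: SIM_THRESHOLD and build_edges are illustrative
# names, not from the project.
SIM_THRESHOLD = 0.80

def build_edges(embeddings: np.ndarray) -> list[tuple[int, int]]:
    # For unit-length vectors, the matrix product gives pairwise cosines.
    sims = embeddings @ embeddings.T
    n = len(embeddings)
    # Similarity is symmetric, so this yields edges in both directions;
    # the system's actual direction criterion is applied elsewhere.
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and sims[i, j] >= SIM_THRESHOLD]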


2. Sentence Embeddings (the core technique)

File: src/topic/utils/embeddings.py

What are embeddings?

An embedding is a fixed-size numerical vector (array of floats) that captures the meaning of text. The model used here is all-mpnet-base-v2 from the sentence-transformers library, which outputs 768-dimensional vectors.
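As a quick, self-contained illustration (the question string is made up, not a real market):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")
vector = model.encode("Will the Fed cut rates by March?")
print(vector.shape)  # (768,) -- one float per dimension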

Key properties:

- Fixed size: every text, long or short, maps to a vector of the same dimensionality (768 here).
- Semantic: texts with similar meanings produce nearby vectors, even when they share no words.
- Comparable: cosine similarity between two vectors gives a single score for how related the two texts are.

How it works in this project

# Each market has a question + description
texts = [f"{m.question} {m.description[:200]}" for m in markets]

# SentenceTransformer encodes all texts into 768-dim vectors
embeddings = model.encode(texts, normalize_embeddings=True)

The normalize_embeddings=True flag is important: it scales every vector to unit length (magnitude = 1.0), so the dot product of two vectors equals their cosine similarity directly. Without normalization, you would have to divide by the product of their magnitudes.

import numpy as np

# Because the vectors are normalized, dot product = cosine similarity
similarity = np.dot(embedding_a, embedding_b)  # range: -1.0 to 1.0
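For contrast, here is a minimal sketch of the unnormalized case, using toy vectors rather than real embeddings:

import numpy as np

raw_a = np.array([1.0, 2.0, 3.0])  # toy unnormalized vectors
raw_b = np.array([2.0, 4.0, 5.0])

# Cosine similarity = dot product divided by the product of magnitudes
cos_sim = np.dot(raw_a, raw_b) / (np.linalg.norm(raw_a) * np.linalg.norm(raw_b))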

Why not just use keyword matching?