
Building Semantica: Search for Meaning in Markdown

I’ve recently come across the idea of Retrieval-Augmented Generation (RAG) and wanted to experiment with it. Setting up a local LLM for my notes and documents has been on my backlog for a while, so I decided to start by implementing the retrieval process that will improve the responses of whatever model I end up using.

It was a fun project to build, and it only requires a few building blocks. I am also excited about it because there’s a lot of room for improvement, and therefore so much to learn.

This is how I assembled Semantica, a local, FAISS-based search engine for my markdown notes. It’s not very fast or efficient yet, but it works great and, most importantly, it is fully local.

Most search engines use keyword matching at the core of their algorithms. Semantic search, on the other hand, is a formal (mathematical) way of representing meaning. Instead of matching exact words, it looks at context and intent to find the most relevant results. It does this by converting text into vectors and measuring the similarity between them. This means you can search in natural language and get results that make sense, even if they don’t contain the exact words you typed.

This is a very simplistic look at semantics, from the perspective of a software engineer. Semantics is part of the vast and fascinating field of linguistics.
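To make the vector idea concrete, here is a toy sketch (not Semantica’s actual code) of how similarity between two embeddings is typically measured with cosine similarity. The three-dimensional vectors are made up for illustration; real embedding models produce hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: near 1 = same direction, near 0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional "embeddings"; real models use hundreds of dimensions.
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.8, 0.2, 0.1])
invoice = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(cat, kitten))   # high: related meaning
print(cosine_similarity(cat, invoice))  # low: unrelated meaning
```

The search engine ranks documents by exactly this kind of score between the query’s vector and each note’s vector.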

I have used Obsidian as a second brain 1 for the last three years. This means I have thousands of notes that span university classes, random ideas, and book notes.

The built-in fuzzy search of Obsidian is perfect for day-to-day productivity, and well-named notes are the building blocks of my collection.

Some notes, though, don’t get reviewed for a long time, or were created and never opened again. I loved the idea of being able to carry out a semantic search, mainly for two reasons:

  • finding something old whose title, or very existence, I had forgotten
  • analyzing the semantic relationships between notes
    • Obsidian lets you see the links you created between notes, but I’d love to have automatic semantic linking of the notes. For personal notes this is not THAT useful, but for bigger knowledge bases, where you are not the only author, it could be extremely useful for navigating the collection.

Tech Stack & Architecture

The pipeline is very simple:

  1. Convert text and PDF files into vector representations
  2. Build the index over the vector database
  3. Perform a similarity search over the index

To achieve this I used:

  • FAISS – Facebook’s vector search library for similarity search
  • SentenceTransformers – To generate the vectorized database
  • Rich & Typer – For a cooler CLI

Indexing Pipeline

  1. Read all Markdown & PDF files from a folder
  2. Extract text (and optionally YAML metadata, since Obsidian uses YAML frontmatter)
  3. Convert text to embeddings using SentenceTransformers from Hugging Face
  4. Store embeddings in FAISS for efficient nearest-neighbor search
  5. Save metadata (e.g., filenames, tags) for easy lookup

Search Pipeline

  1. Load the FAISS index and metadata
  2. Embed the search query into the same vector space
  3. Perform a nearest-neighbor search
  4. Return the top-K most similar notes

Simple CLI

I built a simple CLI with Python’s Rich and Typer. They look cool but are a bit heavy, although definitely not the performance bottleneck.
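For reference, a minimal Typer + Rich skeleton looks roughly like this; the command and option names are made up for illustration, not Semantica’s real interface.

```python
import typer
from rich.console import Console

app = typer.Typer(help="Search your markdown notes by meaning.")
console = Console()


@app.command()
def search(query: str, k: int = typer.Option(5, help="How many results to show")):
    """Hypothetical command: embed the query and print the top-k matches."""
    # ...load the FAISS index, embed the query, run the nearest-neighbor search...
    console.print(f"[bold]Top {k} results for:[/bold] {query}")


# To expose the CLI, the module entry point would call app():
# if __name__ == "__main__":
#     app()
```

Typer generates the `--help` output and argument parsing from the function signature, and Rich handles the colored output.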

Next steps

This was just the first step: my end goal is a personalized GPT to explore my own thoughts without having to trust the big corps.

Next steps:

  • Semantic Analysis / Semantic Notes Graph
  • Make it a plugin for Obsidian
  • Add local GPT

  1. I don’t like this expression, but since it is well known in the productivity and note-taking subcultures, I use it to give you an idea of the kind of note-taking system I use. Nonetheless, I think the term Second Brain is bullshit marketing. ↩︎