Vector databases provide new ways to store and query unstructured data by enabling powerful, AI-driven search capabilities. Unlike traditional database queries that rely on exact matches, vector databases implement similarity-based searches using vector distance functions. Vector databases are ideal for use cases like recommendation systems, image recognition, and natural language search.
In this article, we’ll explore the fundamentals of vector databases and their use cases. We’ll see how Oracle Database integrates vector search with relational data to provide a seamless, converged platform.

What is a Vector Database?
A vector database provides storage and query capabilities for high-dimensional vector data. Vector data is created using AI embedding models, which transform text, images, videos, and other unstructured data into arrays of floating-point numbers.
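To make the shape of that output concrete, here's a toy sketch in Python. Real embedding models are neural networks served by a library or API; the `toy_embed` function below is purely a hypothetical stand-in (a normalized letter-frequency vector) meant only to show what an embedding looks like: a fixed-length array of floats.

```python
import math
import string

def toy_embed(text: str, dims: int = 26) -> list[float]:
    """Toy stand-in for an embedding model: maps text to a fixed-length,
    L2-normalized vector of letter frequencies. A real model learns
    semantic features; this only illustrates the output shape."""
    counts = [0.0] * dims
    for ch in text.lower():
        if ch in string.ascii_lowercase:
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

vec = toy_embed("The cat sat on the mat")
print(len(vec))  # fixed dimensionality: 26
```

Whatever the model, the key property is the same: every input is mapped to a vector of the same fixed dimensionality, so any two inputs can be compared by distance.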

There are a variety of vector databases on the market, ranging from single-purpose to multi-model. Oracle Database falls into the multi-model bucket, as it supports relational, vector, and other data formats.
Why do we use Vector Databases?
Vector databases retrieve results based on semantic similarity. By indexing text, audio, or video embeddings, vector databases find content that shares meaning, tone, or sentiment with an input vector, even when the exact words or features don't match.
Common use cases include facial recognition, content moderation, and recommendation systems. For example, a video streaming service may recommend content to a user that’s similar to videos they’ve watched previously.
What is an embedding vector?
Embedding models take input data such as text, images, and audio, and "embed" that data into a vector representation.
Each embedding vector has many dimensions that encode features of the input data. If we embed an image of a cat, those dimensions may capture characteristics such as color, fur length, or pose, though individual dimensions are learned by the model and usually aren't human-interpretable labels.

Vector Similarity
The distance between two vectors is used to determine their "similarity". For example, if we embed two sentences about cats and one sentence about parrots, the two cat sentences would be closer together in vector space than either is to the parrot sentence.

Embedding these sentences using Oracle Cloud Infrastructure’s AI Playground gave me a simple plot showing their vector distance. The cat-related sentences appear closer in vector space, and therefore more similar than the sentence about parrots.

When calculating vector distance, there are a variety of distance functions available, such as cosine similarity, Euclidean distance, and Jaccard similarity. Note that some distance functions are better suited to certain use cases than others; understanding your data and the available functions will help you make the most of similarity search!
For example, cosine similarity is frequently used for tasks such as text mining, sentiment analysis, and document clustering, where vector magnitude is not important.
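As an illustration, cosine similarity can be computed directly from two vectors using the standard formula: the dot product divided by the product of the magnitudes. This plain-Python sketch is for intuition only; in practice you'd rely on the database's built-in distance functions or a numerics library.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|).
    Ranges from -1 (opposite) to 1 (same direction); the
    magnitudes of the vectors do not affect the result."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Scaling a vector leaves its cosine similarity unchanged:
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

The first pair points in the same direction despite different lengths, which is exactly why cosine works well when magnitude shouldn't matter.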
Indexing vector data requires specialized algorithms
Vector indexes typically use some form of Nearest Neighbor Search to group related vectors. Because exact Nearest Neighbor is extremely expensive to compute, efficient Approximate Nearest Neighbor versions have been implemented that sacrifice accuracy for reduced resource consumption.
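To see why exact search is expensive, here's a minimal brute-force k-nearest-neighbor sketch in Python: every query must compute a distance to every stored vector, so cost grows linearly with the dataset. ANN indexes exist precisely to avoid this full scan.

```python
import math

def euclidean(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn(query, vectors, k=3):
    """Exact k-nearest-neighbor search: computes the distance to
    every stored vector (O(N) per query), then keeps the k closest."""
    return sorted(vectors, key=lambda v: euclidean(query, v))[:k]

data = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.2], [5.0, 5.0]]
print(exact_knn([0.0, 0.1], data, k=2))  # [[0.0, 0.0], [0.1, 0.2]]
```

On four vectors this is instant; on a billion embeddings, one query means a billion distance computations, which is why approximate methods trade a little accuracy for a much smaller search space.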
Two of the most common algorithm families for indexing are IVF and HNSW:
- Inverted File (IVF) is an indexing technique for approximate nearest neighbor (ANN) search that clusters vectors and, at query time, scans only the most relevant clusters.
- Hierarchical Navigable Small World (HNSW) is a graph-based ANN search algorithm that builds a multi-layer small-world graph where vectors are connected based on their distances.
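The graph-walk idea behind HNSW can be sketched with a single-layer "navigable small world" toy: connect each vector to its nearest neighbors, then greedily hop toward the query. This is a simplification I've written for illustration; real HNSW adds a hierarchy of sparser layers on top so the walk starts near the right region in logarithmic time.

```python
import math

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_nsw(vectors, m=3):
    """Toy single-layer small-world graph: connect each vector
    to its m nearest neighbors by index."""
    graph = {}
    for i, v in enumerate(vectors):
        order = sorted(range(len(vectors)), key=lambda j: dist(v, vectors[j]))
        graph[i] = [j for j in order if j != i][:m]
    return graph

def greedy_search(query, vectors, graph, entry=0):
    """Greedy graph walk: hop to whichever neighbor is closer to
    the query, stopping at a local minimum."""
    current = entry
    while True:
        best = min(graph[current], key=lambda j: dist(query, vectors[j]))
        if dist(query, vectors[best]) >= dist(query, vectors[current]):
            return current
        current = best
```

The greedy walk can get stuck in a local minimum, which is one source of the "approximate" in ANN; HNSW mitigates this with its layered structure and wider candidate lists.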
The following Oracle Database snippet creates a vector index using IVF and cosine distance on a column named embedding:
create vector index if not exists vector_index on my_vector_table (embedding)
organization neighbor partitions
distance COSINE
with target accuracy 95
parameters (type IVF, neighbor partitions 10)

The general rule of thumb: if you have a very small dataset, you may use exact nearest neighbor search (no vector index) for the highest accuracy.
If you have a small-to-medium amount of data, HNSW indexes provide the best accuracy, at the cost of increased memory.
If you have a very large amount of data, disk-based IVF indexes will yield the best performance, at a lower accuracy than HNSW.
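The IVF partition-then-probe idea can be sketched in a few lines of Python: assign each vector to its nearest centroid at build time, then at query time scan only the `nprobe` partitions whose centroids are closest. This is a toy of my own (real implementations refine centroids with k-means and store partitions on disk), but the accuracy/cost trade-off is the same.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, n_partitions, seed=0):
    """Toy IVF index: sample centroids, then assign each vector to
    its nearest centroid (the 'inverted file' for that partition)."""
    random.seed(seed)
    centroids = random.sample(vectors, n_partitions)
    lists = [[] for _ in centroids]
    for v in vectors:
        i = min(range(len(centroids)), key=lambda c: dist(v, centroids[c]))
        lists[i].append(v)
    return centroids, lists

def ivf_search(query, centroids, lists, nprobe=1):
    """Approximate NN: scan only the nprobe partitions whose
    centroids are closest to the query, not the whole dataset."""
    order = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    candidates = [v for c in order[:nprobe] for v in lists[c]]
    return min(candidates, key=lambda v: dist(query, v))

# Example: index 200 random 2-D points, probe one partition only.
random.seed(1)
points = [[random.random(), random.random()] for _ in range(200)]
centroids, lists = build_ivf(points, n_partitions=8)
nearest = ivf_search([0.5, 0.5], centroids, lists, nprobe=1)
```

Raising `nprobe` scans more partitions, trading speed for accuracy; probing all partitions degenerates into an exact scan, which mirrors how `target accuracy` tuning works in practice.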
Try Oracle Database 23ai today and see how vector search can enhance your applications
Hands-on, practical experience is a great way to apply and understand the concepts around vector databases. To that end, I've created a couple of articles that implement vector similarity search using Oracle Database Free and Java 21+:
- How to use Oracle Database for fast Vector Search
- Create a RAG-powered chat service using Oracle Database and OCI GenAI
You can also try out vector search using frameworks like LangChain and Spring AI, which support Oracle Database 23ai! I recommend trying it all on Oracle Database Free.
Final Thoughts
There’s no “silver bullet” when it comes to effectively utilizing vector databases for similarity search. The effectiveness of similarity search depends heavily on input data quality, the vector database implementation, and the algorithms used. Additionally, similarity search may or may not be appropriate for a given query — sometimes it’s better to filter by exact results, or use hybrid query models!
Effectively leveraging vector databases requires experimentation with different embedding models, vector indexes, and similarity metrics. The best approach depends on your dataset and retrieval goals — whether that’s search, recommendations, or anomaly detection. To that end, Oracle has developed the 🔗 AI Microservices Sandbox to help developers experiment with vector databases and GenAI.
💡 Want to experiment hands-on? Oracle’s AI Optimizer and Toolkit Sandbox provides a powerful environment to test vector search with your own private data. My colleague Corrado De Bari has written an excellent guide on how to get started: 🔗 AI Optimizer and Toolkit Guide

