Brief Summary
This workshop focuses on using graph data science to improve Retrieval Augmented Generation (RAG) applications. It explores how graphs provide new ways of understanding data, emphasizing the importance of graph algorithms. The session aims to equip AI engineers with the knowledge to leverage graph algorithms for better data management, relationship understanding, and overall application performance.
- Using graph data science to improve RAG applications.
- Understanding graph algorithms for better data management.
- Connecting structured and unstructured data using knowledge graphs.
- Managing data volume and relationships effectively.
- Improving the quality and diversity of data in AI applications.
Introduction and Goals
Alison Cossette from the developer relations team introduces the workshop's focus on using graph data science to improve RAG applications. The goal is to familiarize attendees with graph algorithms and demonstrate their practical applications in enhancing data understanding and application performance. The session encourages participants to bring real-world scenarios to explore how graph technology can address their specific challenges.
Challenges in RAG Applications
The discussion addresses common challenges in RAG applications, including data volume, relationship understanding, and temporal relationships. The limitations of vector stores are highlighted, noting that they often lack the ability to provide context and connections between data points. Knowledge graphs are presented as a solution to bring structured and unstructured data together, find knowledge within unstructured documents, and create explicit and implicit relationships among data pieces.
Understanding Knowledge Graphs
A knowledge graph consists of nodes (entities or nouns), relationships (connections between entities), and properties (details or attributes of entities and relationships). Embeddings, which represent the semantics of text, are connected to other information within the graph, enabling quick traversal and immediate context retrieval. This structure allows for efficient data retrieval and a comprehensive understanding of the relationships between different pieces of information.
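A minimal sketch of these building blocks using the official Neo4j Python driver; the Product/Category labels, property names, and connection details are illustrative assumptions, not taken from the workshop:

```python
from neo4j import GraphDatabase

# Illustrative connection details; replace with your own instance.
driver = GraphDatabase.driver("neo4j://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # A node (entity) with properties, including a text embedding,
    # connected to another entity by a typed relationship.
    session.run(
        """
        MERGE (p:Product {name: $name})
        SET p.description = $description,
            p.embedding = $embedding
        MERGE (c:Category {name: $category})
        MERGE (p)-[:IN_CATEGORY]->(c)
        """,
        name="GraphWidget",
        description="A widget described in an unstructured document.",
        embedding=[0.12, -0.03, 0.98],  # normally produced by an embedding model
        category="Widgets",
    )
driver.close()
```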
Graph RAG Architecture
The architecture of a graph RAG system includes documents, chunks, and an application graph that tracks the activity within the system. This setup connects the memory graph or application to chunks and then to the domain, products, people, and other structured elements extracted from unstructured documents. The result is a fully connected network of data and system activity, enabling a cohesive approach to development that integrates data and application perspectives.
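As a rough illustration of how those neighborhoods connect, the query below walks from application activity (sessions and messages) through chunks back to source documents and any extracted entities; all labels and relationship types here are hypothetical stand-ins for the workshop's actual schema:

```python
# Traversal from the application graph to the domain graph (illustrative schema).
TRACE_QUERY = """
MATCH (s:Session)-[:HAS_MESSAGE]->(m:AssistantMessage)
      -[:USED_CONTEXT]->(c:Chunk)<-[:HAS_CHUNK]-(d:Document)
OPTIONAL MATCH (c)-[:MENTIONS]->(e)
RETURN s.id AS session, d.title AS document,
       collect(DISTINCT labels(e)) AS mentionedEntityTypes
"""
```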
Clarifying Graph Components and Addressing Scalability
The different components of the graph, such as the application graph, unstructured data, and structured data, are neighborhoods within the same system. The key is to connect the application to all data sources, allowing structured and unstructured data to work together. For large graphs, it's advised to start with generic terms and refine as needed, iterating on the schema with a subset of the data.
Expanding Answerable Questions with Graph RAG
Graph RAG expands the number of answerable questions for a given dataset by creating relationships between chunks of data. This approach moves from a linear relationship (one chunk answers one question) to a combinatorial one: n chunks yield up to n(n-1)/2 pairwise connections, each of which can support questions no single chunk answers. Connecting chunks also enables community detection, where subsets of connections surface new information and insights at the group level.
Setting Up Neo4j and Loading Data
Instructions are provided for setting up a Neo4j instance using console.neo4j.io, creating a Pro trial, and loading a provided data dump. The process involves creating a new instance, selecting the AuraDB Professional option, and using the backup-and-restore feature to upload the dump file. This setup prepares the environment for exploring graph algorithms and data analysis.
Exploring the Database with Cypher
The session demonstrates how to use Cypher, Neo4j's query language, to explore the database. Cypher focuses on pattern matching, allowing users to query relationships between nodes. The data model includes sessions, user prompts, assistant messages, and context documents, illustrating the flow of conversations and the retrieval of relevant information.
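A sketch of what such a pattern-matching query might look like for this model; the labels and relationship types are assumptions based on the description above, not the exact schema from the dump:

```python
# One conversation turn: session -> user prompt -> assistant message -> context.
CONVERSATION_QUERY = """
MATCH (s:Session)-[:HAS_PROMPT]->(p:UserPrompt)
      -[:ANSWERED_BY]->(a:AssistantMessage)
      -[:USED_CONTEXT]->(d:Document)
RETURN s.id AS session, p.text AS prompt, d.title AS contextDocument
LIMIT 10
"""
```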
Analyzing Conversation Patterns
Analysis of conversation patterns reveals how users navigate through data. By tracking the context documents used in different conversations, it's possible to understand the movement of information and identify frequently accessed knowledge areas. This insight is valuable for managing documents at scale and determining which content is most important.
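Building on the same (assumed) schema, an aggregation along these lines would surface the most frequently accessed context documents across conversations:

```python
# Which context documents are retrieved most often, and in how many sessions?
DOCUMENT_USAGE = """
MATCH (s:Session)-[:HAS_PROMPT]->(:UserPrompt)
      -[:ANSWERED_BY]->(:AssistantMessage)-[:USED_CONTEXT]->(d:Document)
RETURN d.title AS document,
       count(DISTINCT s) AS conversations,
       count(*) AS totalRetrievals
ORDER BY conversations DESC
LIMIT 20
"""
```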
Storing Embeddings and Addressing Performance
Embeddings are stored as properties of nodes, eliminating the need for a separate vector database. For large legal documents, chunks and vectors can be stored in Neo4j, with performance remaining stable for tens of millions of documents. Multiple vector indices can be used to manage data governance and filter data based on specific use cases.
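In Neo4j 5, a vector index over a node property can be created and queried as shown below; the index name, Chunk label, dimension (1536 matches OpenAI's ada-002 embeddings), and similarity function are assumptions for illustration:

```python
# Index the embedding property directly on the chunk nodes.
CREATE_INDEX = """
CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS
FOR (c:Chunk) ON (c.embedding)
OPTIONS {indexConfig: {
  `vector.dimensions`: 1536,
  `vector.similarity_function`: 'cosine'
}}
"""

# Retrieve the top 5 most similar chunks for a query embedding.
SIMILARITY_SEARCH = """
CALL db.index.vector.queryNodes('chunk_embeddings', 5, $queryEmbedding)
YIELD node, score
RETURN node.text AS text, score
"""
```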
Handling Multimodal Data and Filtering Vector Indices
For documents containing images, the images are stored separately in a service like S3, with the URL and embedding stored on the node. Multiple vector indices can address temporal issues by creating indices for different versions of the data. Metadata filtering is performed using the property graph itself, allowing for predicates on nodes before creating the index.
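One common way to combine vector search with graph predicates is to over-fetch from the index and then filter through the property graph; the version property and HAS_CHUNK relationship here are hypothetical:

```python
# Over-fetch 20 candidates, then keep only chunks from the desired version.
FILTERED_SEARCH = """
CALL db.index.vector.queryNodes('chunk_embeddings', 20, $queryEmbedding)
YIELD node, score
MATCH (doc:Document)-[:HAS_CHUNK]->(node)
WHERE doc.version = $version  // predicate evaluated on the property graph
RETURN node.text AS text, score
ORDER BY score DESC
LIMIT 5
"""
```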
Notebook 1: Connecting to the Graph
The first notebook focuses on connecting to the graph and ensuring the connection is working. It involves importing necessary libraries, setting up environment variables, and using the Python driver to connect to Neo4j. Basic summary statistics are obtained to verify that the data has been loaded correctly.
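A minimal sketch of that connection step, assuming the credentials live in a local .env file under these (illustrative) variable names:

```python
import os

from dotenv import load_dotenv
from neo4j import GraphDatabase

load_dotenv()  # expects NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD

driver = GraphDatabase.driver(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
)
driver.verify_connectivity()  # fail fast if the instance is unreachable

# Basic summary statistics: node counts per label.
records, _, _ = driver.execute_query(
    "MATCH (n) RETURN labels(n) AS labels, count(*) AS count ORDER BY count DESC"
)
for record in records:
    print(record["labels"], record["count"])

driver.close()
```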
Connect, Cluster, Curate: Managing Data at Scale
The "connect, cluster, curate" process is introduced as a method for managing data at scale. This involves running K-Nearest Neighbors (KNN) to create connections among similar documents, using community detection to group like items, and curating the grounding dataset based on these techniques. The goal is to improve the relevance, reliability, and efficiency of the data used in RAG applications.
Understanding Community Detection
Community detection looks for groups of nodes with high internal interconnectedness and few external connections; modularity optimization is one way to formalize this. Algorithms like Louvain and Label Propagation are used to identify these communities. Label Propagation is a fast algorithm that assigns labels to nodes and iteratively propagates them to neighbors, while modularity-based algorithms such as Louvain look for regions of the graph that are denser than expected by chance.
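For reference, the modularity score that these algorithms optimize is standardly defined as follows (this definition is supplied here for context, not quoted from the session):

```latex
% Q compares observed intra-community edges with what a random graph would give.
% A_{ij}: adjacency matrix, m: total edge count, k_i: degree of node i,
% c_i: community of node i, \delta = 1 when c_i = c_j and 0 otherwise.
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
```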
Implementing KNN and Community Detection in Code
The process begins with projecting a graph to run KNN on documents, creating a similarity graph of context documents connected to their source URLs. The projection involves specifying the nodes (documents) and including the embedding property. The KNN algorithm is then run to create relationships among similar documents.
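A sketch of that flow with the graphdatascience Python client; the graph names, Document label, and cutoff values are assumptions for illustration:

```python
from graphdatascience import GraphDataScience

gds = GraphDataScience("neo4j://localhost:7687", auth=("neo4j", "password"))

# Project document nodes with their embedding property into an in-memory graph.
G, _ = gds.graph.project(
    "doc-similarity",
    {"Document": {"properties": ["embedding"]}},
    "*",
)

# KNN over the embeddings; writes SIMILAR relationships back to the store.
gds.knn.write(
    G,
    nodeProperties=["embedding"],
    topK=5,
    similarityCutoff=0.8,
    writeRelationshipType="SIMILAR",
    writeProperty="score",
)

# Re-project over the new SIMILAR relationships and detect communities.
G2, _ = gds.graph.project(
    "doc-communities",
    "Document",
    {"SIMILAR": {"orientation": "UNDIRECTED"}},
)
gds.louvain.write(G2, writeProperty="community")
```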
Analyzing Community Clusters
Analyzing community clusters helps identify high-quality grounding datasets. A single community cluster indicates high similarity among documents, which may not be ideal for diversity. High-quality grounding datasets should be relevant, reliable, disparate, and efficient. Community-level statistics, such as median word length and average similarity, can reveal issues like irrelevant data or highly similar text chunks.
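Assuming Louvain wrote a community property and KNN wrote SIMILAR relationships as sketched earlier, per-community statistics can be pulled with a query along these lines (the text property and word-count proxy are illustrative):

```python
# Per community: document count, average word count, intra-community similarity.
COMMUNITY_STATS = """
MATCH (d:Document)
OPTIONAL MATCH (d)-[s:SIMILAR]->(other:Document)
WHERE other.community = d.community
WITH d.community AS community, d, avg(s.score) AS docAvgSim
RETURN community,
       count(d) AS documents,
       avg(size(split(d.text, ' '))) AS avgWordCount,
       avg(docAvgSim) AS avgIntraCommunitySimilarity
ORDER BY documents DESC
"""
```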
Managing Data Quality and Diversity
Techniques for improving data quality and diversity include removing irrelevant data, collapsing highly similar text chunks, and using reranking for diversity in responses. Collapsing nodes maintains lineage and connections while reducing redundancy. Reranking can prioritize diversity or PageRank scores to provide more varied and relevant results.
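One widely used diversity-reranking approach is maximal marginal relevance (MMR); the sketch below is a generic implementation, not necessarily the method used in the workshop:

```python
import numpy as np

def mmr_rerank(query_vec, doc_vecs, lambda_param=0.5, k=5):
    """Select k documents, trading off query relevance (cosine similarity)
    against redundancy with documents already selected."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    relevance = [cosine(d, query_vec) for d in doc_vecs]
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            if not selected:
                return relevance[i]
            redundancy = max(cosine(doc_vecs[i], doc_vecs[j]) for j in selected)
            return lambda_param * relevance[i] - (1 - lambda_param) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices into doc_vecs, relevant-yet-diverse first
```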
Creating Co-occurrence Relationships
Creating co-occurrence relationships involves identifying documents that are frequently retrieved together in response to a question. This helps understand which concepts are related and how they travel around the graph. Co-occurrence can be used to predict the next steps in a conversation and improve the relevance of responses.
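A sketch of materializing those relationships in Cypher, reusing the same hypothetical conversation schema as before: whenever one assistant message retrieved two documents together, a CO_OCCURS edge between them is created or its count incremented:

```python
# Count how often pairs of documents are retrieved by the same message.
CREATE_COOCCURRENCE = """
MATCH (m:AssistantMessage)-[:USED_CONTEXT]->(a:Document),
      (m)-[:USED_CONTEXT]->(b:Document)
WHERE elementId(a) < elementId(b)
MERGE (a)-[r:CO_OCCURS]-(b)
ON CREATE SET r.count = 1
ON MATCH SET r.count = r.count + 1
"""
```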
Accountability and Traceability in AI
The discussion emphasizes the importance of accountability and traceability in AI, particularly in sensitive domains like defense. It highlights the need to verify AI outputs and be thoughtful about the signals amplified in AI systems. The goal is to build smart applications that deliver good outcomes and avoid unintended consequences.
Building Knowledge Graphs at Scale
Building knowledge graphs at scale involves using tools like the KG Builder to automatically extract entities and relationships from documents. This organic approach allows the ontology to rise from the data, revealing valuable insights that may not be apparent otherwise. Running algorithms like PageRank and betweenness centrality can help understand the importance of different entities and relationships.
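A sketch of scoring entity importance with the graphdatascience client once extraction has run; the Entity label and RELATES_TO relationship type are placeholders for whatever the KG builder actually produced:

```python
from graphdatascience import GraphDataScience

gds = GraphDataScience("neo4j://localhost:7687", auth=("neo4j", "password"))

# Project the extracted entity graph (placeholder label and relationship type).
G, _ = gds.graph.project("entity-graph", "Entity", "RELATES_TO")

gds.pageRank.write(G, writeProperty="pagerank")        # global importance
gds.betweenness.write(G, writeProperty="betweenness")  # bridging entities
```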
Content Hygiene and Graph Analytics
Graph analytics can help identify potential issues with content hygiene, such as contradictory or inaccurate information. By examining the connections between different pieces of content, it's possible to detect anomalies and inconsistencies. The speaker offers to discuss specific scenarios and provide tailored recommendations.
Leveraging Embeddings and Graph Structure
Combining text embeddings with node embeddings (embeddings of the graph structure) can provide richer insights. Techniques like PageRank and vector reranking can be used to leverage both types of embeddings. The speaker offers to provide more specific guidance and resources on this topic.
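As one concrete (assumed) example, GDS's FastRP produces structural node embeddings that can be stored alongside the text embeddings and combined downstream, e.g. by concatenation before reranking:

```python
from graphdatascience import GraphDataScience

gds = GraphDataScience("neo4j://localhost:7687", auth=("neo4j", "password"))

# Placeholder projection; reuse whatever entity graph already exists.
G, _ = gds.graph.project("entity-structure", "Entity", "RELATES_TO")

# FastRP: fast random-projection embeddings of the graph structure.
gds.fastRP.write(G, embeddingDimension=128, writeProperty="structureEmbedding")
```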
Understanding Community Detection (Continued)
Community detection identifies interconnected groups of nodes, where nodes within a community have strong relationships with each other. This technique is used for production-scale curation and understanding, allowing for more diverse and relevant answers. The community itself may not be part of the answer, but it can inform how the retriever casts a wider net and brings back more varied results.