
How to Build a Resume Parser with Ragie's Entity Extraction
Manual resume screening is a time-consuming task for HR professionals, especially when processing hundreds of applications for a single position. Traditional parsing solutions often break on varied resume formats and layouts, which makes reliable automated resume processing difficult for most organizations.
This article highlights the entity extraction capabilities of Ragie, a fully managed RAG-as-a-service platform that includes powerful document processing features. The tutorial portion of this post will demonstrate how to use Ragie's entity extraction API to build a resume parser application that extracts structured data from unstructured resume documents.
Prerequisites
To follow along with this tutorial, you'll need the following:
- Prior experience with Python programming
- Python 3.7+ installed on your local development environment
- A Ragie account with API access
- Basic understanding of working with APIs and environment variables
The code for this tutorial is available on GitHub. Feel free to clone it to follow along.
Challenges in Conventional Resume Processing
Traditional resume parsing approaches face several limitations when processing diverse resume formats:
- Most conventional parsers rely on rigid templates that break when encountering non-standard layouts or creative resume designs.
- Traditional tools struggle to distinguish between different types of information when they appear in unexpected locations within the document.
- Even when extraction succeeds, the output often consists of unstructured text chunks rather than organized, structured data.
Ragie's entity extraction feature addresses these challenges by leveraging LLM-powered extraction that understands document semantics and produces consistently structured output according to predefined schemas.
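To make that concrete, the structured output for a single resume might look like the following (the field names match the schema we define later in this tutorial; the values are invented for illustration):
{
  "firstName": "Jane",
  "lastName": "Doe",
  "email": "jane.doe@example.com",
  "phone": "+1 555 010 1234",
  "location": "Austin, TX",
  "skills": ["Python", "SQL", "Data Analysis"],
  "experience": [
    {
      "company": "Acme Corp",
      "position": "Data Analyst",
      "duration": "2020 - 2023",
      "description": "Built reporting dashboards and automated data pipelines."
    }
  ]
}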
Building the Resume Parser
With those challenges in mind, we'll build a resume parser application using Ragie's built-in entity extraction feature. We'll create a complete solution that processes multiple resume formats and outputs structured data.
Step 1: Project Setup
First, we'll set up our development environment. Create a new directory for the project:
mkdir resume-parser-ragie
cd resume-parser-ragie
Create a requirements.txt file with the necessary dependencies:
ragie
streamlit
python-dotenv
Install the required packages:
pip install -r requirements.txt
Next, create a .env file to store your Ragie API credentials:
RAGIE_AUTH_TOKEN=your_ragie_auth_token_here
To obtain your Ragie authentication token, log into your Ragie dashboard and navigate to the API Keys section. Create a new key if you don't have one already.

Step 2: Create the Parser Class
We'll start with the core parser functionality. In the project directory, create a new file called resume_parser.py and add the following:
import os
from dotenv import load_dotenv
from ragie import Ragie
import json
import time
# Load environment variables
load_dotenv()
class ResumeParser:
def __init__(self):
self.client = Ragie(auth=os.getenv("RAGIE_AUTH_TOKEN"))
self.instruction_id = None
Here we initialize the ResumeParser class with the Ragie client using our API credentials. The instruction_id will store the reference to our extraction schema.
Step 3: Define the Entity Extraction Schema
Next, we'll define the entity extraction instruction that tells Ragie exactly what information to extract. Add this method to the ResumeParser class:
def create_extraction_instruction(self):
"""Create the entity extraction instruction for resume parsing"""
# Define the entity schema using JSON Schema format
entity_schema = {
"type": "object",
"properties": {
"firstName": {
"type": "string",
"description": "First name of the candidate"
},
"lastName": {
"type": "string",
"description": "Last name of the candidate"
},
"email": {
"type": "string",
"description": "Email address"
},
"phone": {
"type": "string",
"description": "Phone number"
},
"location": {
"type": "string",
"description": "City, State or full address"
},
"summary": {
"type": "string",
"description": "Professional summary or objective statement"
},
"skills": {
"type": "array",
"items": {"type": "string"},
"description": "Array of technical and professional skills"
},
"experience": {
"type": "array",
"items": {
"type": "object",
"properties": {
"company": {"type": "string", "description": "Company name"},
"position": {"type": "string", "description": "Job title/position"},
"duration": {"type": "string", "description": "Employment duration"},
"description": {"type": "string", "description": "Brief job description"}
}
},
"description": "Array of work experience entries"
},
"education": {
"type": "array",
"items": {
"type": "object",
"properties": {
"institution": {"type": "string", "description": "School/University name"},
"degree": {"type": "string", "description": "Degree type and field"},
"graduationYear": {"type": "string", "description": "Year of graduation"}
}
},
"description": "Array of education entries"
},
"certifications": {
"type": "array",
"items": {"type": "string"},
"description": "Array of professional certifications"
}
}
}
try:
response = self.client.entities.create_instruction(request={
"name": f"Resume Parser {int(time.time())}",
"prompt": "Extract structured information from resume documents including personal details, skills, experience, education, and certifications. If any field is not found, set it to null or empty array as appropriate.",
"entity_schema": entity_schema
})
self.instruction_id = response.id
print(f"Instruction created successfully with ID: {self.instruction_id}")
return response
except Exception as e:
print(f"Error creating instruction: {str(e)}")
return None
The above method creates a comprehensive extraction schema using JSON Schema format. This schema defines all the essential resume components we want to extract: personal information, skills, work experience, education, and certifications. Ragie will use this schema to structure the extracted data consistently.
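If you want to verify the schema on its own before building the rest of the parser, a minimal sketch like the one below should work, assuming your .env file already contains a valid RAGIE_AUTH_TOKEN (the file name quick_check.py is just a suggestion):
# quick_check.py - create the extraction instruction and print its ID
from resume_parser import ResumeParser

parser = ResumeParser()
response = parser.create_extraction_instruction()
if response:
    print(f"Instruction ready: {parser.instruction_id}")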
Step 4: Upload Resumes & Parse Entities
Now we'll implement the document upload and parsing functionality. Add this method to the ResumeParser class:
def upload_and_parse_resume(self, file_path):
"""Upload a resume document and extract entities"""
try:
# Ensure we have an instruction
if not self.instruction_id:
print("No instruction found. Creating extraction instruction...")
self.create_extraction_instruction()
# Upload the document using the documents API
with open(file_path, 'rb') as file:
upload_response = self.client.documents.create(request={
"file": {
"file_name": os.path.basename(file_path),
"content": file,
}
})
document_id = upload_response.id
print(f"Document uploaded successfully with ID: {document_id}")
# Wait for document processing to finish (a fixed delay keeps the tutorial simple;
# polling the document's status would be more robust in production)
time.sleep(8)
# Get extracted entities using the entities API
extraction_response = self.client.entities.list_by_document(request={
"document_id": "15895021-4a85-439e-a65b-97e884f785c6"
})
print("extraction_response", extraction_response)
# Parse the response based on the actual API structure
if hasattr(extraction_response, 'result') and hasattr(extraction_response.result, 'entities'):
entities = extraction_response.result.entities
if entities and len(entities) > 0:
# Merge data from all entities to create complete profile
merged_data = {}
for entity in entities:
if hasattr(entity, 'data') and entity.data:
for key, value in entity.data.items():
if value: # Only use non-empty values
if key in merged_data:
if isinstance(value, list) and isinstance(merged_data[key], list):
merged_data[key].extend(value)
elif merged_data[key] is None:
merged_data[key] = value
else:
merged_data[key] = value
# Remove duplicates from arrays
for key in ["skills", "certifications"]:
if key in merged_data and isinstance(merged_data[key], list):
merged_data[key] = list(set(merged_data[key]))
print(f"Successfully extracted and merged data from {len(entities)} entities")
return merged_data
print("No entities extracted. Document may still be processing.")
return None
except Exception as e:
print(f"Error processing resume: {str(e)}")
return None
def parse_multiple_resumes(self, file_paths):
"""Parse multiple resume files"""
results = []
for file_path in file_paths:
print(f"Processing: {file_path}")
extracted_data = self.upload_and_parse_resume(file_path)
print("extracted_data", extracted_data)
if extracted_data:
results.append({
"file_name": os.path.basename(file_path),
"extracted_data": extracted_data
})
return results
The above methods handle the core document processing workflow. The upload_and_parse_resume method uploads documents to Ragie, waits for processing, and retrieves the extracted entities. The merging logic combines data from multiple entities to create a complete candidate profile, removing duplicates from the skills and certifications arrays.
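One note before moving on: the Streamlit interface in the next step also calls a list_available_instructions helper that we haven't defined yet. The sketch below adds it to the ResumeParser class; it assumes the Ragie SDK exposes a client.entities.list_instructions() method, so check the SDK documentation for the exact method name and response shape:
def list_available_instructions(self):
    """List the extraction instructions available in this Ragie account."""
    try:
        # Assumption: the SDK exposes entities.list_instructions(); adjust if the
        # method name or the shape of the returned value differs in your SDK version.
        return self.client.entities.list_instructions()
    except Exception as e:
        print(f"Error listing instructions: {str(e)}")
        return []
With this helper in place, the List Available Instructions button in the sidebar will work as written (it iterates over the returned instructions and reads each one's name attribute).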
Step 5: Building the Streamlit Interface
Next, we'll create a web interface for the resume parser with a sidebar-based layout. Create a new file called app.py and add the following code snippets:
import streamlit as st
import pandas as pd
from resume_parser import ResumeParser
import json
import tempfile
import os
def main():
st.set_page_config(
page_title="Resume Parser with Ragie",
page_icon="📄",
layout="wide"
)
st.title("📄 Resume Parser with Ragie")
st.markdown("Upload resume files and extract structured information automatically!")
# Initialize the resume parser
if 'parser' not in st.session_state:
st.session_state.parser = ResumeParser()
# Sidebar for configuration and upload
with st.sidebar:
st.header("Configuration")
if st.button("Create Extraction Schema"):
with st.spinner("Creating extraction instruction..."):
result = st.session_state.parser.create_extraction_instruction()
if result:
st.success(f"Schema created! ID: {result.id}")
else:
st.error("Failed to create schema")
if st.button("List Available Instructions"):
with st.spinner("Loading instructions..."):
instructions = st.session_state.parser.list_available_instructions()
if instructions:
st.write("Available Instructions:")
for instruction in instructions:
st.write(f"- {instruction.name}")
st.divider()
st.header("Upload Resumes")
uploaded_files = st.file_uploader(
"Choose resume files",
type=['pdf', 'docx', 'txt'],
accept_multiple_files=True,
help="Upload PDF, DOCX, or TXT resume files"
)
if uploaded_files and st.button("Parse Resumes", type="primary", use_container_width=True):
with st.spinner("Processing resumes..."):
results = []
progress_bar = st.progress(0)
for i, uploaded_file in enumerate(uploaded_files):
st.write(f"Processing: {uploaded_file.name}")
# Save uploaded file temporarily
with tempfile.NamedTemporaryFile(delete=False, suffix=f".{uploaded_file.name.split('.')[-1]}") as tmp_file:
tmp_file.write(uploaded_file.getvalue())
tmp_file_path = tmp_file.name
# Parse the resume
extracted_data = st.session_state.parser.upload_and_parse_resume(tmp_file_path)
if extracted_data:
results.append({
"file_name": uploaded_file.name,
"extracted_data": extracted_data
})
else:
st.warning(f"Could not extract data from {uploaded_file.name}")
# Clean up temporary file
os.unlink(tmp_file_path)
# Update progress
progress_bar.progress((i + 1) / len(uploaded_files))
st.session_state.parsing_results = results
st.success(f"Successfully processed {len(results)} resume(s)")
Here we set up the basic Streamlit application structure with sidebar-based configuration and file upload functionality. The interface organizes all controls in the left sidebar and allows users to upload multiple resume files and track processing progress.
Next, update the code in the app.py file to add the results display section:
# Main content area - centered results
if 'parsing_results' in st.session_state and st.session_state.parsing_results:
st.header("Parsing Results")
for i, result in enumerate(st.session_state.parsing_results):
with st.expander(f"📄 {result['file_name']}", expanded=i==0):
data = result['extracted_data']
# Personal Information
st.subheader("Personal Information")
col_a, col_b = st.columns(2)
with col_a:
first_name = data.get('firstName', 'N/A')
last_name = data.get('lastName', 'N/A')
full_name = f"{first_name} {last_name}".strip()
st.write(f"**Name:** {full_name}")
st.write(f"**Email:** {data.get('email', 'N/A')}")
with col_b:
st.write(f"**Phone:** {data.get('phone', 'N/A')}")
st.write(f"**Location:** {data.get('location', 'N/A')}")
# Summary
if data.get('summary'):
st.subheader("Summary")
st.write(data['summary'])
# Skills
if data.get('skills') and len(data['skills']) > 0:
st.subheader("Skills")
skills_text = ", ".join(data['skills'])
st.write(skills_text)
# Experience
if data.get('experience') and len(data['experience']) > 0:
st.subheader("Experience")
for exp in data['experience']:
st.write(f"**{exp.get('position', 'N/A')}** at {exp.get('company', 'N/A')}")
if exp.get('duration'):
st.write(f"Duration: {exp['duration']}")
if exp.get('description'):
st.write(f"Description: {exp['description']}")
st.write("---")
# Education
if data.get('education') and len(data['education']) > 0:
st.subheader("Education")
for edu in data['education']:
degree = edu.get('degree', 'N/A')
institution = edu.get('institution', 'N/A')
st.write(f"**{degree}** from {institution}")
if edu.get('graduationYear'):
st.write(f"Graduated: {edu['graduationYear']}")
st.write("---")
# Certifications
if data.get('certifications') and len(data['certifications']) > 0:
st.subheader("Certifications")
for cert in data['certifications']:
st.write(f"• {cert}")
# Raw JSON data
if st.checkbox("Show Raw JSON Data", key=f"json_{result['file_name']}"):
st.json(data)
The above code creates the results display interface with organized sections for each extracted data type. The main content area now utilizes the full width to display structured information in an expandable format.

Finally, complete the app.py file by adding the export functionality and welcome section:
# Export functionality
st.subheader("Export Results")
col_export1, col_export2 = st.columns(2)
with col_export1:
if st.button("Export as JSON"):
json_data = json.dumps(st.session_state.parsing_results, indent=2)
st.download_button(
label="Download JSON",
data=json_data,
file_name="resume_parsing_results.json",
mime="application/json"
)
with col_export2:
if st.button("Export as CSV"):
# Flatten data for CSV export
flattened_data = []
for result in st.session_state.parsing_results:
data = result['extracted_data']
row = {
'file_name': result['file_name'],
'first_name': data.get('firstName', ''),
'last_name': data.get('lastName', ''),
'email': data.get('email', ''),
'phone': data.get('phone', ''),
'location': data.get('location', ''),
'summary': data.get('summary', ''),
'skills': ', '.join(data.get('skills', [])),
'certifications': ', '.join(data.get('certifications', []))
}
flattened_data.append(row)
df = pd.DataFrame(flattened_data)
csv_data = df.to_csv(index=False)
st.download_button(
label="Download CSV",
data=csv_data,
file_name="resume_parsing_results.csv",
mime="text/csv"
)
else:
st.info("Upload and parse resumes to see results here.")
st.markdown("""
**How to get started:**
1. Click 'Create Extraction Schema' in the sidebar
2. Upload one or more resume files
3. Click 'Parse Resumes' to extract structured data
4. View and export the results
""")
if __name__ == "__main__":
main()
The above code completes the interface with export functionality and welcome instructions. Users can now export results in JSON or CSV formats and receive clear guidance on how to get started with the application.

Step 6: Running the Application
To run the application, use the following command:
streamlit run app.py
The application will start and open in your browser. To test the parser, first click Create Extraction Schema in the sidebar, then upload resume files and click Parse Resumes to see the structured output.

Final Thoughts
Throughout this tutorial, we built a comprehensive resume parser that transforms unstructured resume documents into organized, structured data. We demonstrated how to define custom extraction schemas using Ragie's entity extraction API, process multiple document formats simultaneously, and create a Streamlit interface for viewing and exporting results.
The power of Ragie's entity extraction feature lies in its flexibility and intelligence. Unlike traditional parsing tools that rely on rigid templates, Ragie's LLM-powered extraction understands document context and can adapt to different resume formats while maintaining a consistent output structure. This approach eliminates the complexity of pattern matching and template configuration that often breaks with non-standard document layouts.
Consider extending this application with additional features. You could expand the extraction schema to capture more detailed information such as specific job responsibilities, salary expectations, or career progression patterns.
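As a small example of how such an extension might look, you could add a hypothetical languages property to entity_schema in create_extraction_instruction before the instruction is created:
"languages": {
    "type": "array",
    "items": {"type": "string"},
    "description": "Languages the candidate speaks, e.g. 'English (fluent)'"
}
Because the extraction is schema-driven, the new field simply shows up in the merged data and the raw JSON output; only the Streamlit display sections would need a corresponding update to render it.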