
How to Build a Resume Parser with Ragie's Entity Extraction
Manual resume screening is a time-consuming task for HR professionals, especially when processing hundreds of applications for a single position. Traditional parsing solutions often break on varied resume formats and layouts, which makes reliable automated resume processing difficult for most organizations.
This article highlights the entity extraction capabilities of Ragie, a fully managed RAG-as-a-service platform that includes powerful document processing features. The tutorial portion of this post will demonstrate how to use Ragie's entity extraction API to build a resume parser application that extracts structured data from unstructured resume documents.
Prerequisites
To follow along with this tutorial, you'll need the following:
- Prior experience with Python programming
- Python 3.7+ installed on your local development environment
- A Ragie account with API access
- Basic understanding of working with APIs and environment variables
The code for this tutorial is available on GitHub. Feel free to clone it to follow along.
Challenges in Conventional Resume Processing
Traditional resume parsing approaches face several limitations when processing diverse resume formats:
- Most conventional parsers rely on rigid templates that break when encountering non-standard layouts or creative resume designs.
- Traditional tools struggle to distinguish between different types of information when they appear in unexpected locations within the document.
- Even when extraction succeeds, the output often consists of unstructured text chunks rather than organized, structured data.
Ragie's entity extraction feature addresses these challenges by leveraging LLM-powered extraction that understands document semantics and produces consistently structured output according to predefined schemas.
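To make that concrete, the structured output for a single resume might look like the following (the field names match the schema we define later in this tutorial; the values are invented for illustration):
{
  "firstName": "Jane",
  "lastName": "Doe",
  "email": "jane.doe@example.com",
  "phone": "+1 555 010 1234",
  "location": "Austin, TX",
  "skills": ["Python", "SQL", "Data Analysis"],
  "experience": [
    {
      "company": "Acme Corp",
      "position": "Data Analyst",
      "duration": "2020 - 2023",
      "description": "Built reporting dashboards and automated data pipelines."
    }
  ]
}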
Building the Resume Parser
With those challenges in mind, we'll build a resume parser application using Ragie's built-in entity extraction feature. We'll create a complete solution that processes multiple resume formats and outputs structured data.
Step 1: Project Setup
First, we'll set up our development environment. Create a new directory for the project:
mkdir resume-parser-ragie
cd resume-parser-ragie
Create a requirements.txt file with the necessary dependencies:
ragie
streamlit
python-dotenv
Install the required packages:
pip install -r requirements.txt
Next, create a .env file to store your Ragie API credentials:
RAGIE_AUTH_TOKEN=your_ragie_auth_token_here
To obtain your Ragie authentication token, log into your Ragie dashboard and navigate to the API Keys section. Create a new key if you don't have one already.

Step 2: Create the Parser Class
We'll start with the core parser functionality. In the project directory, create a new file called resume_parser.py and add the following:
import os
from dotenv import load_dotenv
from ragie import Ragie
import json
import time
# Load environment variables
load_dotenv()
class ResumeParser:
def __init__(self):
self.client = Ragie(auth=os.getenv("RAGIE_AUTH_TOKEN"))
self.instruction_id = None
Here we initialize the ResumeParser class with the Ragie client using our API credentials. The instruction_id will store the reference to our extraction schema.
Step 3: Define the Entity Extraction Schema
Next, we'll define the entity extraction instruction that tells Ragie exactly what information to extract. Add this method to the ResumeParser class:
def create_extraction_instruction(self):
"""Create the entity extraction instruction for resume parsing"""
# Define the entity schema using JSON Schema format
entity_schema = {
"type": "object",
"properties": {
"firstName": {
"type": "string",
"description": "First name of the candidate"
},
"lastName": {
"type": "string",
"description": "Last name of the candidate"
},
"email": {
"type": "string",
"description": "Email address"
},
"phone": {
"type": "string",
"description": "Phone number"
},
"location": {
"type": "string",
"description": "City, State or full address"
},
"summary": {
"type": "string",
"description": "Professional summary or objective statement"
},
"skills": {
"type": "array",
"items": {"type": "string"},
"description": "Array of technical and professional skills"
},
"experience": {
"type": "array",
"items": {
"type": "object",
"properties": {
"company": {"type": "string", "description": "Company name"},
"position": {"type": "string", "description": "Job title/position"},
"duration": {"type": "string", "description": "Employment duration"},
"description": {"type": "string", "description": "Brief job description"}
}
},
"description": "Array of work experience entries"
},
"education": {
"type": "array",
"items": {
"type": "object",
"properties": {
"institution": {"type": "string", "description": "School/University name"},
"degree": {"type": "string", "description": "Degree type and field"},
"graduationYear": {"type": "string", "description": "Year of graduation"}
}
},
"description": "Array of education entries"
},
"certifications": {
"type": "array",
"items": {"type": "string"},
"description": "Array of professional certifications"
}
}
}
try:
response = self.client.entities.create_instruction(request={
"name": f"Resume Parser {int(time.time())}",
"prompt": "Extract structured information from resume documents including personal details, skills, experience, education, and certifications. If any field is not found, set it to null or empty array as appropriate.",
"entity_schema": entity_schema
})
self.instruction_id = response.id
print(f"Instruction created successfully with ID: {self.instruction_id}")
return response
except Exception as e:
print(f"Error creating instruction: {str(e)}")
return None
The above method creates a comprehensive extraction schema using JSON Schema format. This schema defines all the essential resume components we want to extract: personal information, skills, work experience, education, and certifications. Ragie will use this schema to structure the extracted data consistently.
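If you want to verify the schema on its own before building the rest of the parser, a minimal sketch like the one below should work, assuming your .env file already contains a valid RAGIE_AUTH_TOKEN (the file name quick_check.py is just a suggestion):
# quick_check.py - create the extraction instruction and print its ID
from resume_parser import ResumeParser

parser = ResumeParser()
response = parser.create_extraction_instruction()
if response:
    print(f"Instruction ready: {parser.instruction_id}")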
Step 4: Upload Resumes & Parse Entities
Now we'll implement the document upload and parsing functionality. Add this method to the ResumeParser class:
def upload_and_parse_resume(self, file_path):
"""Upload a resume document and extract entities"""
try:
# Ensure we have an instruction
if not self.instruction_id:
print("No instruction found. Creating extraction instruction...")
self.create_extraction_instruction()
# Upload the document using the documents API
with open(file_path, 'rb') as file:
upload_response = self.client.documents.create(request={
"file": {
"file_name": os.path.basename(file_path),
"content": file,
}
})
document_id = upload_response.id
print(f"Document uploaded successfully with ID: {document_id}")
# Wait for document processing to finish (a fixed delay keeps the tutorial simple;
# polling the document's status would be more robust in production)
time.sleep(8)
# Get extracted entities using the entities API
extraction_response = self.client.entities.list_by_document(request={
"document_id": "15895021-4a85-439e-a65b-97e884f785c6"
})
print("extraction_response", extraction_response)
# Parse the response based on the actual API structure
if hasattr(extraction_response, 'result') and hasattr(extraction_response.result, 'entities'):
entities = extraction_response.result.entities
if entities and len(entities) > 0:
# Merge data from all entities to create complete profile
merged_data = {}
for entity in entities:
if hasattr(entity, 'data') and entity.data:
for key, value in entity.data.items():
if value: # Only use non-empty values
if key in merged_data:
if isinstance(value, list) and isinstance(merged_data[key], list):
merged_data[key].extend(value)
elif merged_data[key] is None:
merged_data[key] = value
else:
merged_data[key] = value
# Remove duplicates from arrays
for key in ["skills", "certifications"]:
if key in merged_data and isinstance(merged_data[key], list):
merged_data[key] = list(set(merged_data[key]))
print(f"Successfully extracted and merged data from {len(entities)} entities")
return merged_data
print("No entities extracted. Document may still be processing.")
return None
except Exception as e:
print(f"Error processing resume: {str(e)}")
return None
def parse_multiple_resumes(self, file_paths):
"""Parse multiple resume files"""
results = []
for file_path in file_paths:
print(f"Processing: {file_path}")
extracted_data = self.upload_and_parse_resume(file_path)
print("extracted_data", extracted_data)
if extracted_data:
results.append({
"file_name": os.path.basename(file_path),
"extracted_data": extracted_data
})
return results
The above methods handle the core document processing workflow. The upload_and_parse_resume method uploads documents to Ragie, waits for processing, and retrieves the extracted entities. The merging logic combines data from multiple entities to create a complete candidate profile, removing duplicates from the skills and certifications arrays.
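One note before moving on: the Streamlit interface in the next step also calls a list_available_instructions helper that we haven't defined yet. The sketch below adds it to the ResumeParser class; it assumes the Ragie SDK exposes a client.entities.list_instructions() method, so check the SDK documentation for the exact method name and response shape:
def list_available_instructions(self):
    """List the extraction instructions available in this Ragie account."""
    try:
        # Assumption: the SDK exposes entities.list_instructions(); adjust if the
        # method name or the shape of the returned value differs in your SDK version.
        return self.client.entities.list_instructions()
    except Exception as e:
        print(f"Error listing instructions: {str(e)}")
        return []
With this helper in place, the List Available Instructions button in the sidebar will work as written (it iterates over the returned instructions and reads each one's name attribute).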
Step 5: Building the Streamlit Interface
Next, we'll create a web interface for the resume parser with a sidebar-based layout. Create a new file called app.py and add the following code snippets:
import streamlit as st
import pandas as pd
from resume_parser import ResumeParser
import json
import tempfile
import os
def main():
st.set_page_config(
page_title="Resume Parser with Ragie",
page_icon="📄",
layout="wide"
)
st.title("📄 Resume Parser with Ragie")
st.markdown("Upload resume files and extract structured information automatically!")
# Initialize the resume parser
if 'parser' not in st.session_state:
st.session_state.parser = ResumeParser()
# Sidebar for configuration and upload
with st.sidebar:
st.header("Configuration")
if st.button("Create Extraction Schema"):
with st.spinner("Creating extraction instruction..."):
result = st.session_state.parser.create_extraction_instruction()
if result:
st.success(f"Schema created! ID: {result.id}")
else:
st.error("Failed to create schema")
if st.button("List Available Instructions"):
with st.spinner("Loading instructions..."):
instructions = st.session_state.parser.list_available_instructions()
if instructions:
st.write("Available Instructions:")
for instruction in instructions:
st.write(f"- {instruction.name}")
st.divider()
st.header("Upload Resumes")
uploaded_files = st.file_uploader(
"Choose resume files",
type=['pdf', 'docx', 'txt'],
accept_multiple_files=True,
help="Upload PDF, DOCX, or TXT resume files"
)
if uploaded_files and st.button("Parse Resumes", type="primary", use_container_width=True):
with st.spinner("Processing resumes..."):
results = []
progress_bar = st.progress(0)
for i, uploaded_file in enumerate(uploaded_files):
st.write(f"Processing: {uploaded_file.name}")
# Save uploaded file temporarily
with tempfile.NamedTemporaryFile(delete=False, suffix=f".{uploaded_file.name.split('.')[-1]}") as tmp_file:
tmp_file.write(uploaded_file.getvalue())
tmp_file_path = tmp_file.name
# Parse the resume
extracted_data = st.session_state.parser.upload_and_parse_resume(tmp_file_path)
if extracted_data:
results.append({
"file_name": uploaded_file.name,
"extracted_data": extracted_data
})
else:
st.warning(f"Could not extract data from {uploaded_file.name}")
# Clean up temporary file
os.unlink(tmp_file_path)
# Update progress
progress_bar.progress((i + 1) / len(uploaded_files))
st.session_state.parsing_results = results
st.success(f"Successfully processed {len(results)} resume(s)")
Here we set up the basic Streamlit application structure with sidebar-based configuration and file upload functionality. The interface organizes all controls in the left sidebar and allows users to upload multiple resume files and track processing progress.
Next, update the code in the app.py file to add the results display section:
# Main content area - centered results
if 'parsing_results' in st.session_state and st.session_state.parsing_results:
st.header("Parsing Results")
for i, result in enumerate(st.session_state.parsing_results):
with st.expander(f"📄 {result['file_name']}", expanded=i==0):
data = result['extracted_data']
# Personal Information
st.subheader("Personal Information")
col_a, col_b = st.columns(2)
with col_a:
first_name = data.get('firstName', 'N/A')
last_name = data.get('lastName', 'N/A')
full_name = f"{first_name} {last_name}".strip()
st.write(f"**Name:** {full_name}")
st.write(f"**Email:** {data.get('email', 'N/A')}")
with col_b:
st.write(f"**Phone:** {data.get('phone', 'N/A')}")
st.write(f"**Location:** {data.get('location', 'N/A')}")
# Summary
if data.get('summary'):
st.subheader("Summary")
st.write(data['summary'])
# Skills
if data.get('skills') and len(data['skills']) > 0:
st.subheader("Skills")
skills_text = ", ".join(data['skills'])
st.write(skills_text)
# Experience
if data.get('experience') and len(data['experience']) > 0:
st.subheader("Experience")
for exp in data['experience']:
st.write(f"**{exp.get('position', 'N/A')}** at {exp.get('company', 'N/A')}")
if exp.get('duration'):
st.write(f"Duration: {exp['duration']}")
if exp.get('description'):
st.write(f"Description: {exp['description']}")
st.write("---")
# Education
if data.get('education') and len(data['education']) > 0:
st.subheader("Education")
for edu in data['education']:
degree = edu.get('degree', 'N/A')
institution = edu.get('institution', 'N/A')
st.write(f"**{degree}** from {institution}")
if edu.get('graduationYear'):
st.write(f"Graduated: {edu['graduationYear']}")
st.write("---")
# Certifications
if data.get('certifications') and len(data['certifications']) > 0:
st.subheader("Certifications")
for cert in data['certifications']:
st.write(f"• {cert}")
# Raw JSON data
if st.checkbox("Show Raw JSON Data", key=f"json_{result['file_name']}"):
st.json(data)
The above code creates the results display interface with organized sections for each extracted data type. The main content area now utilizes the full width to display structured information in an expandable format.

Finally, complete the app.py file by adding the export functionality and welcome section:
# Export functionality
st.subheader("Export Results")
col_export1, col_export2 = st.columns(2)
with col_export1:
if st.button("Export as JSON"):
json_data = json.dumps(st.session_state.parsing_results, indent=2)
st.download_button(
label="Download JSON",
data=json_data,
file_name="resume_parsing_results.json",
mime="application/json"
)
with col_export2:
if st.button("Export as CSV"):
# Flatten data for CSV export
flattened_data = []
for result in st.session_state.parsing_results:
data = result['extracted_data']
row = {
'file_name': result['file_name'],
'first_name': data.get('firstName', ''),
'last_name': data.get('lastName', ''),
'email': data.get('email', ''),
'phone': data.get('phone', ''),
'location': data.get('location', ''),
'summary': data.get('summary', ''),
'skills': ', '.join(data.get('skills', [])),
'certifications': ', '.join(data.get('certifications', []))
}
flattened_data.append(row)
df = pd.DataFrame(flattened_data)
csv_data = df.to_csv(index=False)
st.download_button(
label="Download CSV",
data=csv_data,
file_name="resume_parsing_results.csv",
mime="text/csv"
)
else:
st.info("Upload and parse resumes to see results here.")
st.markdown("""
**How to get started:**
1. Click 'Create Extraction Schema' in the sidebar
2. Upload one or more resume files
3. Click 'Parse Resumes' to extract structured data
4. View and export the results
""")
if __name__ == "__main__":
main()
The above code completes the interface with export functionality and welcome instructions. Users can now export results in JSON or CSV formats and receive clear guidance on how to get started with the application.

Step 6: Running the Application
To run the application, use the following command:
streamlit run app.py
The application will start and open in your browser. To test the parser, first click Create Extraction Schema in the sidebar, then upload resume files and click Parse Resumes to see the structured output.

Final Thoughts
Throughout this tutorial, we built a comprehensive resume parser that transforms unstructured resume documents into organized, structured data. We demonstrated how to define custom extraction schemas using Ragie's entity extraction API, process multiple document formats simultaneously, and create a Streamlit interface for viewing and exporting results.
The power of Ragie's entity extraction feature lies in its flexibility and intelligence. Unlike traditional parsing tools that rely on rigid templates, Ragie's LLM-powered extraction understands document context and can adapt to different resume formats while maintaining a consistent output structure. This approach eliminates the complexity of pattern matching and template configuration that often breaks with non-standard document layouts.
Consider extending this application with additional features. You could expand the extraction schema to capture more detailed information such as specific job responsibilities, salary expectations, or career progression patterns.
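As a small example of how such an extension might look, you could add a hypothetical languages property to entity_schema in create_extraction_instruction before the instruction is created:
"languages": {
    "type": "array",
    "items": {"type": "string"},
    "description": "Languages the candidate speaks, e.g. 'English (fluent)'"
}
Because the extraction is schema-driven, the new field simply shows up in the merged data and the raw JSON output; only the Streamlit display sections would need a corresponding update to render it.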