Kishan,
- As far as 3P is concerned, Prashant tells me that Ezekia (to which 3P subscribes) has built-in resume-parsing software, with which 3P has already parsed its entire collection of 100,000+ resume texts and converted them into a structured, searchable database. Hence, 3P has no need for any parsing software.
- However, I suppose you would want to integrate such software into your SaaS project. That reminded me of some 29-year-old notes that I had sent to my colleagues regarding the logic to be used for creating such software, which we built and implemented in (A) IndiaRecruiter.net, (B) GlobalRecruiter.net, (C) RecruitGuru.com and (D) CustomizeResume.com.
I uploaded some of these notes and asked Grok to suggest suitable parsing software. The result is shown below.
Hcp
PS: I could not upload all the logic notes to Grok, since some are images. You can find these at https://hcpnotes.blogspot.com/
I asked :
Good Morning, Grok bhai:
During our chat yesterday, you suggested some software for parsing a text resume in order to extract fields such as (1) Name of Executive, (2) Date of Birth / Age, (3) Education qualification, (4) Name of Employer company, (5) Years of Service, etc.
Here is the text of a note I sent to my colleagues some 29 years ago, describing the logic for extracting "Employer Company Name". Questions: (A) Does this logic still hold valid? (B) If yes, using that LOGIC, could you write a Software Code for parsing text resumes? If you can, then I can submit logics for other parameters.
Hey Grok bhai:
Thanks for the code, but how come there are no comments on my questions? (A) Does this logic still hold valid? (B) If yes, using that LOGIC, could you write a Software Code for parsing text resumes?
If you can, then I can submit logics for other parameters.
Grok replied :
Regarding your questions:
A) Does this logic still hold valid?
Yes, the logic you outlined 29 years ago remains valid and effective for extracting company names from text resumes.
Your approach, which combines structured pattern matching, contextual phrases, tell-tale signs, and a master directory, is robust and adaptable to various resume formats.
While modern techniques like NLP or machine learning could enhance accuracy, your rule-based methodology is still practical, especially for structured or semi-structured text.
The logic accounts for diverse ways candidates present employer information, making it resilient across different resume styles.
B) If yes, using that LOGIC, could you write a Software Code for parsing text resumes?
The provided Python code implements your logic exactly as described, prioritizing the order you suggested: structured formats (Alt #4), preceding phrases (Alt #2), tell-tale signs (Alt #3), and master directory (Alt #1).
It uses regular expressions to match patterns and includes a cleaning function to refine extracted names. The code is modular, allowing you to extend it for other parameters (e.g., Name, Date of Birth, Education) if you provide similar logic.
Feel free to share those additional logics, and I can update the code to handle them!
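As a minimal, self-contained sketch of the priority order described above (structured formats, then preceding phrases, then tell-tale signs, then the master directory), the following may help. The pattern lists are a small hypothetical subset chosen for illustration, not the full lists from your note, and the names `clean` and `extract_companies` are assumptions of this sketch.

```python
import re
from typing import Iterable, List

# Hypothetical, much-shortened pattern lists (assumptions for illustration only).
STRUCTURED = [r'^(?:name\s+of\s+company|employer|organization|company)\s*:\s*(.+)$']  # Alt #4
PRECEDING = [                                                                          # Alt #2
    r'\b(?:worked|working)\s+with\s+(.+?)(?=\s+as\b|[.,\n]|$)',
    r'\bjoined\s+(.+?)(?=\s+as\b|[.,\n]|$)',
]
TELL_TALE = [r'\bM/S\.?\s+(.+?)(?=\s+as\b|[.,\n]|$)']                                  # Alt #3


def clean(name: str) -> str:
    """Drop a leading "M/S" marker and stray trailing punctuation."""
    name = re.sub(r'^M/S\.?\s+', '', name.strip(), flags=re.I)
    return name.strip(' .,;')


def extract_companies(text: str, master: Iterable[str] = ()) -> List[str]:
    """Apply the alternatives in priority order and de-duplicate the results."""
    found: List[str] = []
    for patterns in (STRUCTURED, PRECEDING, TELL_TALE):
        for pattern in patterns:
            for match in re.finditer(pattern, text, re.I | re.M):
                found.append(clean(match.group(1)))
    # Alt #1: plain substring lookup against the master directory.
    lowered = text.lower()
    found.extend(name for name in master if name.lower() in lowered)
    unique: List[str] = []
    seen = set()
    for name in found:
        if name and name.lower() not in seen:
            seen.add(name.lower())
            unique.append(name)
    return unique


sample = "\n".join([
    "Employer: ABC Pvt Ltd",
    "Joined XYZ Limited as Manager in 2015",
    "Presently working with M/S DEF Enterprises as Senior Consultant",
    "Consultant to GHI Industries during 2012",
])
print(extract_companies(sample, master=["GHI Industries"]))
```

In this sketch the lookahead stops a name at " as", a comma, a full stop or the end of the line, so names containing full stops (for example "Pvt. Ltd.") would need a more careful pattern.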
I asked :
Here are other notes > Biodatas contain a lot of junk. Apart from patently irrelevant information, there are some pieces of information which are of "no immediate interest" to a client company (although these could be of some use at a later date if that candidate gets employed with that client).
Grok replied :
Thank you for sharing the detailed notes on parsing resumes, and no worries at all—I'm happy to continue our collaboration!
Your notes provide a comprehensive approach to handling resume data, including filtering irrelevant information, extracting educational qualifications, work experience, and addresses.
Below, I address your requirements by updating the existing resume parser code to incorporate the new logic for educational qualifications, work experience, and addresses, while also handling the filtering of irrelevant information.
The code retains the company name extraction logic from before and adds new methods to extract the additional fields based on your provided logic.
Response to Your Notes
1. Filtering Irrelevant Information:
Your list of "useless" information (e.g., marital status, hobbies, references, etc.) is well-considered.
The code includes a method to identify and exclude these items from the output sent to clients, while retaining them in a separate data structure for potential future use (e.g., for global placements or antecedent checks). This aligns with your foresight about retaining nationality and career objectives for international expansion.
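A minimal sketch of this split, assuming each "no immediate interest" item sits on its own "Label: value" line. The label set below is a small hypothetical subset of your list, and `split_irrelevant` is a name made up for this illustration.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical subset of the labels listed in the notes (assumption for illustration).
IRRELEVANT_LABELS = {"marital status", "hobbies", "references", "nationality", "passport no"}


def split_irrelevant(text: str) -> Tuple[str, Dict[str, List[str]]]:
    """Return (client-facing text, retained data keyed by normalized label)."""
    kept: List[str] = []
    retained: Dict[str, List[str]] = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        # Normalize "Passport No." and "passport  no" to the same key.
        key = re.sub(r"[\s.]+", " ", label).strip().lower()
        if sep and key in IRRELEVANT_LABELS:
            retained.setdefault(key, []).append(value.strip())
        else:
            kept.append(line)
    return "\n".join(kept), retained


client_text, retained = split_irrelevant(
    "Name: John Doe\nMarital Status: Married\nHobbies: Reading, Chess\nEmployer: ABC Pvt Ltd"
)
print(client_text)
print(retained)
```

Keeping the retained items in a separate dictionary, rather than discarding them, is what allows them to be used later (for example nationality for global placements).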
2. Educational Qualifications:
Your logic for handling the variability in how qualifications are written (e.g., B.E. (Mech), B.E. (Mechanical Engineering)) is implemented using a master list approach.
The code uses regular expressions to capture common patterns and a predefined equivalence dictionary to standardize qualifications. It also accounts for your suggestion to have an expert validate similar-sounding qualifications after scanning.
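One way to make the equivalence lookup tolerant of spacing and full stops is to normalize the text before looking it up. The sketch below assumes that approach; the `CANONICAL` table is a small hypothetical sample, and the returned flag marks entries that would still need an expert to review.

```python
import re
from typing import Tuple

# Hypothetical sample of the master list, keyed by a normalized spelling.
CANONICAL = {
    "be(m)": "B.E. (Mechanical Engineering)",
    "be(mech)": "B.E. (Mechanical Engineering)",
    "be(mecheng)": "B.E. (Mechanical Engineering)",
    "be(mechanicalengineering)": "B.E. (Mechanical Engineering)",
    "bachelorofengineering-mechanical": "B.E. (Mechanical Engineering)",
    "bsc": "B.Sc.",
    "bachelorofscience": "B.Sc.",
    "mba": "M.B.A.",
}


def normalize(raw: str) -> str:
    """Lower-case the text and drop whitespace and full stops."""
    return re.sub(r"[\s.]+", "", raw.lower())


def standardize(raw: str) -> Tuple[str, bool]:
    """Return (qualification, recognized); unrecognized entries need expert review."""
    key = normalize(raw)
    if key in CANONICAL:
        return CANONICAL[key], True
    return raw.strip(), False


for written in ["B.E. (Mech.)", "BE (Mech)", "Bachelor of Engineering - Mechanical", "B. Tech (Civil)"]:
    print(written, "->", standardize(written))
```

With normalized keys, one dictionary entry covers several written variants, so the master list grows more slowly than a list of exact spellings would.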
3. Work Experience:
The logic for experience is complex due to the variety of headings and sub-topics.
The code uses your 29 headings (from Annex A) as triggers to identify experience sections and applies keyword matching for the 16 sub-topics (from Annex B).
Since you mentioned the need for directories of words for each sub-topic, the code includes a placeholder for these directories, which can be populated as you complete the marathon task of creating them.
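To illustrate how headings and sub-topic directories could work together, here is a minimal sketch. It assumes the experience section starts at a heading line and ends at the next blank line; the three headings and the two word directories are hypothetical samples, not the contents of Annex A or Annex B.

```python
import re
from typing import Dict, List

# Hypothetical samples standing in for Annex A (headings) and Annex B (directories).
HEADINGS = ["work experience", "professional experience", "employment history"]
SUBTOPIC_WORDS = {
    "responsibilities": {"responsible", "managed", "supervised"},
    "achievements": {"achieved", "awarded", "increased", "reduced"},
}

HEADING_RE = re.compile(
    r"^\s*(" + "|".join(re.escape(h) for h in HEADINGS) + r")\s*:?\s*$", re.I
)


def experience_lines(text: str) -> List[str]:
    """Collect the lines between an experience heading and the next blank line."""
    lines: List[str] = []
    inside = False
    for line in text.splitlines():
        if not inside:
            inside = bool(HEADING_RE.match(line))
            continue
        if not line.strip():
            break
        lines.append(line.strip(" -\t"))
    return lines


def classify(lines: List[str]) -> Dict[str, List[str]]:
    """Assign each line to every sub-topic whose directory shares a word with it."""
    result: Dict[str, List[str]] = {topic: [] for topic in SUBTOPIC_WORDS}
    for line in lines:
        words = set(re.findall(r"[a-z]+", line.lower()))
        for topic, vocabulary in SUBTOPIC_WORDS.items():
            if words & vocabulary:
                result[topic].append(line)
    return result


resume = "Work Experience:\n- Managed a team of 12 engineers\n- Reduced downtime by 30 percent\n\nHobbies: Chess"
print(classify(experience_lines(resume)))
```

Because each directory is a plain set of words, it can be filled in gradually as the directories are compiled, without changing the matching code.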
4. Address:
The address extraction logic uses your observation about the last word (e.g., "India" or a PIN code) and the last-but-one word (e.g., city name or PIN code). The code implements this with regular expressions to identify address blocks preceded by common labels like "Address," "Permanent Address," etc.
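The last-word and last-but-one-word checks can be sketched as follows. This is a simplified illustration that assumes the whole address sits on one line after its label; the label list and the function name `parse_address` are assumptions of the sketch.

```python
import re
from typing import Dict, Optional

# Hypothetical label variants; a single-line address is assumed.
ADDRESS_RE = re.compile(
    r"^\s*(?:permanent\s+|present\s+|residential\s+)?address\s*:\s*(.+)$",
    re.I | re.M,
)
PIN_RE = re.compile(r"^\d{6}$")  # Indian PIN codes have six digits


def parse_address(text: str) -> Optional[Dict[str, Optional[str]]]:
    """Find a labelled address and read country, PIN and city from its last words."""
    match = ADDRESS_RE.search(text)
    if match is None:
        return None
    address = match.group(1).strip().rstrip(".")
    words = [w for w in re.split(r"[\s,\-]+", address) if w]
    result: Dict[str, Optional[str]] = {
        "address": address, "country": None, "pin": None, "city": None,
    }
    if words and words[-1].lower() == "india":   # last word: country
        result["country"] = words.pop()
    if words and PIN_RE.match(words[-1]):        # last (or last-but-one) word: PIN
        result["pin"] = words.pop()
    if words and (result["country"] or result["pin"]):
        result["city"] = words[-1]               # word just before country/PIN
    return result


print(parse_address("Name: John Doe\nAddress: 123 Main Street, Mumbai, India\n"))
print(parse_address("Permanent Address: 4 MG Road, Pune - 411001"))
```

Multi-word city names such as "New Delhi" would only yield the final word here, so a city directory would be a natural extension.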
Updated Code
The code below extends the previous ResumeParser class to handle the new requirements.
It maintains the same artifact ID since it’s an update to the existing parser, as per your request to build on the company name extraction logic. The code is written in Python, using libraries like re for pattern matching and pandas for potential data storage, and it follows your structured approach to parsing.
import re
import pandas as pd  # imported for potential data storage; not used below
from typing import List, Optional, Dict, Set


class ResumeParser:
    def __init__(self, company_master_file: Optional[str] = None):
        # Initialize company master list
        self.company_master = set()
        if company_master_file:
            self.load_company_master(company_master_file)
        # Patterns for company names (Alt #2: Preceding phrases)
        self.preceding_patterns = [
            r'worked\s+with\s+(.+?)(?:\s+
            r'presently\s+working\s+with\
            r'joined\s+(.+?)(?:\s+as|$)',
            r'posted\s+for\s+\d+\s+years\
            r'transferred\s+to\s+(.+?)(?:\
            r'working\s+as\s+.+?\s+in\s+(.
            r'worked\s+as\s+.+?\s+with\s+(
            r'since\s+\d{4}\s+i\s+am\s+
            r'for\s+\d+\s+years\s+in\s+
            r'at\s+present\s+i\s+am\s+
            r'i\s+am\s+holding\s+a\s+key\
            r'i\s+started\s+as\s+a\s+
            r'i\s+started\s+my\s+career\s+
            r'currently\s+working\s+as\s+.
            r'working\s+as\s+profit\s+
            r'at\s+present\s+with\s+(.+?)(
            r'joined\s+as\s+.+?\s+in\s+(.+
            r'associated\s+with\s+(.+?)(?:
            r'since\s+\d{4}\s+i\s+am\s+
            r'from\s+\d{4}\s+to\s+\d{4}\s+
            r'was\s+deputed\s+to\s+(.+?)(?
            r'was\s+a\s+consultant\s+to\s+
            r'also\s+worked\s+with\s+(.+?)
            r'work\s+experience\s+at\s+(.+
            r'i\s+have\s+worked\s+for\s+
            r'i\s+have\s+been\s+working\s+
        ]
        # Patterns for company names (Alt #3: Tell-tale signs)
        self.tell_tale_patterns = [
            r'M/S\s+(.+?)(?:\s|$)',
            r'(.+?)\s+(?:LTD|
        ]
        # Patterns for company names (Alt #4: Structured formats)
        self.structured_patterns = [
            r'name\s+of\s+company\s*:\s*(.
            r'employer\s*:\s*(.+?)(?:\n|$)
            r'organization\s*:\s*(.+?)(?:\
            r'company\s*:\s*(.+?)(?:\n|$)'
        ]
        # Patterns for irrelevant information to filter
        self.irrelevant_patterns = [
            r'marital\s+status\s*:\s*.+?(?
            r'hobbies\s*:\s*.+?(?:\n|$)',
            r'extra\s*-?\s*curricular\s*
            r'references\s*:\s*.+?(?:\n|$)
            r'nationality\s*:\s*.+?(?:\n|$
            r'seminars?\s*(?:performed|
            r'training\s*programs?\s*
            r'courses?\s*attended\s*:\
            r'conferences?\s*attended\s*:\
            r'(?:ssc|hsc|puc)\s*:\s*.+?(?:
            r'scholarships?\s*:\s*.+?(?:\
            r'sports?\s*(?:manship)?\s*:\
            r'competitions?\s*(?:
            r'(?:career|professional)\s+
            r'place\s+of\s+birth\s*:\s*.+?
            r'(?:height|weight|
            r'passport\s+no\.?\s*:\s*.+?(?
            r'visas?\s*:\s*.+?(?:\n|$)',
            r'(?:father\'s\s+name|family\
            r'(?:scout\s+master|head\s+
            r'(?:dramatics|public\s+
            r'religious\s+affiliations?\s*
            r'social\s+affiliations?\s*:\
            r'copies\s+of\s*(?:
        ]
        # Patterns for educational qualifications
        self.edu_patterns = [
            r'(?:(?:bachelor|master|
            r'(?:ssc|sslc|hsc|puc\s*-?\s*(
            r'ph\.?d\.?\s*\(?\s*([^\)]+?)\
        ]
        # Equivalence dictionary for standardizing qualifications
        self.edu_equivalence = {
            'b.e. (m)': 'B.E. (Mechanical Engineering)',
            'b.e. (mech.)': 'B.E. (Mechanical Engineering)',
            'b.e. (mech. eng.)': 'B.E. (Mechanical Engineering)',
            'b.e. (mechanical engineering)': 'B.E. (Mechanical Engineering)',
            'bachelor of engineering - mechanical': 'B.E. (Mechanical Engineering)',
            'b.e. (elect.)': 'B.E. (Electrical Engineering)',
            'b.e. (electrical)': 'B.E. (Electrical Engineering)',
            'b.sc.': 'B.Sc.',
            'bachelor of science': 'B.Sc.',
            'm.b.a.': 'M.B.A.',
            'master of business administration': 'M.B.A.',
            # Add more equivalences as needed
        }
        # Patterns for experience headings (Annex A)
        self.experience_headings = [
            r'work\s+experience',
            r'professional\s+experience',
            r'employment\s+history',
            r'career\s+summary',
            r'job\s+history',
            r'experience',
            r'work\s+history',
            r'professional\s+background',
            r'employment\s+details',
            r'career\s+details',
            r'work\s+profile',
            r'professional\s+profile',
            r'job\s+experience',
            r'career\s+experience',
            r'employment\s+record',
            r'work\s+record',
            r'professional\s+record',
            r'job\s+profile',
            r'career\s+profile',
            r'work\s+details',
            r'professional\s+
            r'employment\s+summary',
            r'job\s+summary',
            r'career\s+record',
            r'work\s+summary',
            r'professional\s+summary',
            r'employment\s+profile',
            r'job\s+details',
            r'career\s+history'
        ]
        # Placeholder for experience sub-topics (Annex B)
        self.experience_subtopics = {
            'responsibilities': set(),
            'achievements': set(),
            'projects': set(),
            'skills': set(),
            'roles': set(),
            'technologies': set(),
            'management': set(),
            'leadership': set(),
            'teamwork': set(),
            'training': set(),
            'certifications': set(),
            'awards': set(),
            'clients': set(),
            'industries': set(),
            'tools': set(),
            'processes': set()
            # Populate with specific words as directories are created
        }
        # Patterns for address
        self.address_patterns = [
            r'(?:(?:address|resi\.?\s+
            r'(.+?)(?:(?:\bIndia\b|\b\d{6}
        ]

    def load_company_master(self, file_path: str) -> None:
        """Load company names from a master directory file."""
        try:
            with open(file_path, 'r') as f:
                self.company_master = {line.strip().lower() for line in f if line.strip()}
        except FileNotFoundError:
            print(f"Company master file {file_path} not found.")

    def clean_company_name(self, name: str) -> str:
        """Clean extracted company name."""
        name = name.strip()
        name = re.sub(r'\s+as\s+.+$', '', name, flags=re.I)
        name = re.sub(r'\s+in\s+.+$', '', name, flags=re.I)
        name = re.sub(r'[.,;]+$', '', name)
        return name.strip()

    def extract_company_structured(self, text: str) -> List[str]:
        """Extract company names from structured formats (Alt #4)."""
        companies = []
        for pattern in self.structured_patterns:
            matches = re.finditer(pattern, text, re.I | re.M)
            for match in matches:
                company = self.clean_company_name(match.group(1))
                if company:
                    companies.append(company)
        return companies

    def extract_company_preceding(self, text: str) -> List[str]:
        """Extract company names using preceding phrases (Alt #2)."""
        companies = []
        for pattern in self.preceding_patterns:
            matches = re.finditer(pattern, text, re.I)
            for match in matches:
                company = self.clean_company_name(match.group(1))
                if company:
                    companies.append(company)
        return companies

    def extract_company_tell_tale(self, text: str) -> List[str]:
        """Extract company names using tell-tale signs (Alt #3)."""
        companies = []
        for pattern in self.tell_tale_patterns:
            matches = re.finditer(pattern, text, re.I)
            for match in matches:
                company = self.clean_company_name(match.group(1))
                if company:
                    companies.append(company)
        return companies

    def extract_company_master(self, text: str) -> List[str]:
        """Extract company names using master directory (Alt #1)."""
        companies = []
        if not self.company_master:
            return companies
        text_lower = text.lower()
        for company in self.company_master:
            if company in text_lower:
                companies.append(company)
        return companies

    def extract_companies(self, text: str) -> List[str]:
        """Extract company names using all alternatives in specified order."""
        companies = []
        companies.extend(self.extract_company_structured(text))
        companies.extend(self.extract_company_preceding(text))
        companies.extend(self.extract_company_tell_tale(text))
        companies.extend(self.extract_company_master(text))
        # De-duplicate case-insensitively while preserving the priority order
        seen = set()
        unique_companies = []
        for c in companies:
            if c.lower() not in seen:
                seen.add(c.lower())
                unique_companies.append(c)
        return unique_companies

    def filter_irrelevant(self, text: str) -> tuple:
        """Identify and store irrelevant information; return (irrelevant_data, filtered_text)."""
        irrelevant_data = {}
        filtered_text = text
        for pattern in self.irrelevant_patterns:
            matches = re.finditer(pattern, text, re.I | re.M)
            # Use the leading part of the pattern as a rough dictionary key
            key = pattern.split(r'\s*')[0].lower()
            irrelevant_data[key] = [match.group(0).strip() for match in matches]
            filtered_text = re.sub(pattern, '', filtered_text, flags=re.I | re.M)
        return irrelevant_data, filtered_text

    def standardize_qualification(self, qual: str) -> str:
        """Standardize educational qualification using equivalence dictionary."""
        qual_lower = qual.lower().strip()
        return self.edu_equivalence.get(qual_lower, qual)

    def extract_education(self, text: str) -> List[Dict[str, str]]:
        """Extract educational qualifications with thesis topics for Ph.D."""
        qualifications = []
        for pattern in self.edu_patterns:
            matches = re.finditer(pattern, text, re.I | re.M)
            for match in matches:
                qual = match.group(0).strip()
                thesis = None
                if re.search(r'ph\.?d', qual, re.I):
                    # The thesis topic is the first captured group, when present
                    thesis = match.group(1).strip() if match.lastindex and match.group(1) else None
                standardized_qual = self.standardize_qualification(qual)
                if not re.search(r'ssc|sslc|hsc|puc', standardized_qual, re.I):  # Exclude under-graduate
                    qualifications.append({
                        'qualification': standardized_qual,
                        'thesis': thesis
                    })
        return qualifications

    def extract_experience(self, text: str) -> List[Dict[str, str]]:
        """Extract experience sections based on headings and sub-topics."""
        experiences = []
        for heading in self.experience_headings:
            pattern = rf'(?:{heading}\s*:\s*)(.+?)(?
            matches = re.finditer(pattern, text, re.I | re.M | re.S)
            for match in matches:
                exp_text = match.group(1).strip()
                exp_details = {'heading': heading, 'details': exp_text, 'subtopics': {}}
                for subtopic in self.experience_subtopics:
                    # Placeholder: Match subtopic keywords when directories are available
                    exp_details['subtopics'][subtopic] = []
                experiences.append(exp_details)
        return experiences

    def extract_address(self, text: str) -> List[str]:
        """Extract addresses based on last word criteria."""
        addresses = []
        for pattern in self.address_patterns:
            matches = re.finditer(pattern, text, re.I | re.M)
            for match in matches:
                address = match.group(1).strip()
                if address:
                    addresses.append(address)
        return addresses

    def parse_resume(self, text: str) -> Dict[str, object]:
        """Parse resume for all required fields."""
        # Filter irrelevant information
        irrelevant_data, filtered_text = self.filter_irrelevant(text)
        # Extract fields
        result = {
            'companies': self.extract_companies(filtered_text),
            'education': self.extract_education(filtered_text),
            'experience': self.extract_experience(filtered_text),
            'address': self.extract_address(filtered_text),
            'irrelevant_data': irrelevant_data  # Stored for internal use
        }
        return result


def main():
    # Example usage
    sample_resume = """
    Name: John Doe
    Address: 123 Main Street, Mumbai, India
    Employer: ABC Pvt Ltd
    Education:
    - B.E. (Mech)
    - Ph.D (Robotics)
    Work Experience:
    - Joined XYZ Limited as Manager in 2015
    - Presently working with M/S DEF Enterprises as Senior Consultant
    Marital Status: Married
    Hobbies: Reading, Chess
    """
    parser = ResumeParser(company_master_file=None)
    parsed_data = parser.parse_resume(sample_resume)
    print("Parsed Resume Data:")
    print("Companies:", parsed_data['companies'])
    print("Education:", parsed_data['education'])
    print("Experience:", parsed_data['experience'])
    print("Address:", parsed_data['address'])
    print("Irrelevant Data (Stored Internally):", parsed_data['irrelevant_data'])


if __name__ == "__main__":
    main()
Notes on Implementation
- Irrelevant Information: The filter_irrelevant method captures items like marital status and hobbies, storing them in a dictionary for internal use (e.g., for references or nationality in global placements) but excludes them from the client-facing output.
- Educational Qualifications: The code uses a regex-based approach to capture qualifications and standardizes them using an equivalence dictionary. It excludes under-graduate qualifications (SSC, HSC, PUC) and retains Ph.D. thesis topics, as per your instructions.
- Experience: The experience extraction identifies sections using the 29 headings and sets up a structure for sub-topic analysis. Since the sub-topic word directories are pending, the code includes a placeholder for future integration.
- Address: The address extraction follows your logic, focusing on patterns ending with "India" or a PIN code, and captures preceding labels like "Permanent Address."
- Extensibility: The parse_resume method integrates all extractions into a single output, making it easy to add more fields (e.g., Name, Date of Birth) when you provide their logic.
Next Steps
Please share the logic for extracting other fields like Name of Executive, Date of Birth/Age, or Years of Service, and I can further update the parser. Additionally, if you complete the directories for the 16 experience sub-topics or have the computer print-out of qualification codes, I can refine the code to incorporate them. Let me know how you'd like to proceed!