We are seeking a Senior Data Scientist to join our team in Portugal. This is a remote position, meaning you can work from anywhere within the country.
About the Role
This is a challenging opportunity to work with cutting-edge tools and technologies in the field of Generative AI. As a Senior Data Scientist, you will use Large Language Models (LLMs) and multimodal tools to turn unstructured content into structured, reliable data. Your responsibilities will include:
* Multimodal Extraction: Apply state-of-the-art tools (OCR, vision-language models, document understanding frameworks) to interpret diverse input types;
* Prompt Engineering: Develop and refine strategies for using LLMs to extract, summarize, and transform unstructured content into structured formats;
* Data Quality & Structuring: Clean, validate, and transform messy, unstructured data into well-defined schemas ready for use in training or analytics pipelines;
* Content Filtering: Define standards and build systems for cleaning, validating, and filtering data to ensure accuracy, reduce bias, and align with ethical/safety guidelines;
* Human-in-the-Loop Feedback: Design feedback loops where experts validate or enrich data, improving LLM-based extraction reliability;
* Scalability & Optimization: Architect cost-efficient, high-throughput data pipelines that are robust to noisy or incomplete sources;
* Research & Prototyping: Experiment with emerging tools and methods in the LLM + multimodal space, exploring new ways to enhance information coverage and extraction reliability;
* Collaboration: Partner with data engineers and other data scientists to integrate collected data into larger AI and analytics systems.
Main Requirements
* Master's degree (or PhD) in Computer Science, Data Science, Machine Learning, Statistics, or a related field;
* Proficiency in Python and experience with libraries for web scraping, OCR (e.g., Tesseract, EasyOCR), and NLP (e.g., Hugging Face Transformers);
* Deep understanding of LLM capabilities in multimodal and extraction contexts, including prompt engineering and few-shot learning;
* Strong background in unstructured data processing: APIs, web scraping, HTML parsing, OCR, image/document analysis;
* Strong analytical problem-solving skills, with a track record of turning noisy data into high-quality datasets for ML;
* Excellent communication and documentation skills, with the ability to influence across technical and product teams.