Proper - presentation

Goal

The goal is to showcase different skills by sharing links to existing open source projects. Each link points to a specific part of a project, chosen to fit the criteria. The #1 criterion is to showcase work that fits Proper's upcoming Data Science position and the challenges Proper currently faces on the way there. The examples are therefore divided into Data Engineering and Data Science projects. There is also a Software Engineering project, developed over a period of 3 years, that could be shown, but its repositories are private at the moment.

Part of succeeding in a data science position is the ability to explain terms. Some of the terms used here might be unfamiliar, so the session will include explanations.

Data Engineering

A standard data engineering pipeline is normally divided into three parts: Extraction, Transformation and Load (ETL). Below, I provide examples that partially fit into an ETL pipeline (partially, since this is an aggregation of different open source projects).
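To make the three stages concrete, here is a minimal sketch of such a pipeline. It is not taken from any of the linked projects: the sample CSV, the function names and the in-memory "warehouse" list are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical raw input; a real pipeline would read from a site, API or file.
RAW_CSV = """title,company,posted
 Data Engineer ,Acme,2020-01-15
data scientist,Beta ApS,2020-02-03
,Gamma,2020-02-10
"""


def extract(raw: str) -> list[dict]:
    """Extract: read raw rows from the source (here an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows: list[dict]) -> list[dict]:
    """Transform: drop rows without a title and normalize the title text."""
    cleaned = []
    for row in rows:
        title = row["title"].strip()
        if not title:
            continue  # skip rows where the title is missing
        cleaned.append({**row, "title": title.lower()})
    return cleaned


def load(rows: list[dict], warehouse: list[dict]) -> int:
    """Load: append the cleaned rows to the target store and return the count."""
    warehouse.extend(rows)
    return len(rows)


warehouse: list[dict] = []  # stand-in for a real data warehouse
loaded = load(transform(extract(RAW_CSV)), warehouse)
print(loaded, [row["title"] for row in warehouse])
```

Keeping each stage as its own function means any one of them can be swapped out (for example, a different source or a different target store) without touching the other two.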

Data Extraction

Example 1

Context: The goal of this project was to extract data from jobindex.dk and store it in a Data Warehouse running PostgreSQL, so it also covers the Data Load part of an ETL pipeline. It runs a Flask API.

Link:

pmadruga/jobindex-scraper
https://github.com/pmadruga/jobindex-scraper
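
As a rough illustration of the extract-and-store idea described in the context above, the sketch below parses job records out of HTML and writes them to a SQL table. It makes several assumptions: the markup is invented (the real jobindex.dk pages are structured differently), the class and function names are hypothetical, and SQLite stands in for PostgreSQL so the example runs without a database server. The Flask API part is not covered.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical markup, invented for this sketch.
SAMPLE_HTML = """
<div class="job"><h2>Data Engineer</h2><span class="company">Acme</span></div>
<div class="job"><h2>Dataanalytiker</h2><span class="company">Beta ApS</span></div>
"""


class JobParser(HTMLParser):
    """Collects title/company pairs from the sample markup above."""

    def __init__(self):
        super().__init__()
        self.jobs = []
        self._field = None  # which field the next text node belongs to

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            # A new heading starts a new job record.
            self._field = "title"
            self.jobs.append({"title": "", "company": ""})
        elif tag == "span" and ("class", "company") in attrs:
            self._field = "company"

    def handle_data(self, data):
        if self._field and self.jobs:
            self.jobs[-1][self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("h2", "span"):
            self._field = None


def load_jobs(jobs, conn):
    """Create the table if needed and insert one row per job record."""
    conn.execute("CREATE TABLE IF NOT EXISTS jobs (title TEXT, company TEXT)")
    conn.executemany(
        "INSERT INTO jobs (title, company) VALUES (:title, :company)", jobs
    )
    conn.commit()


parser = JobParser()
parser.feed(SAMPLE_HTML)
conn = sqlite3.connect(":memory:")  # SQLite as a stand-in for PostgreSQL
load_jobs(parser.jobs, conn)
rows = conn.execute("SELECT title, company FROM jobs ORDER BY title").fetchall()
print(rows)
```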

Data Transformation

Example 1

Context: Inside process_text.py, I'm pre-processing Danish text using several well-known techniques (like stemming and lemmatization). This example is relevant for cases where the original data set mixes several different formats, and for handling Danish text properly.

Link:

pmadruga/ds-jobindex
Machine learning techniques applied to the jobindex.dk dataset - pmadruga/ds-jobindex
https://github.com/pmadruga/ds-jobindex/tree/master/src/features
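
To give a feel for what this kind of pre-processing does, here is a toy sketch: lowercasing, tokenizing with the Danish letters æ, ø and å kept intact, removing stop words, and a naive suffix-stripping stemmer. It is not the code in process_text.py. The stop-word list, the suffix list and all function names are assumptions made for illustration, and the suffix stripping is a deliberately crude stand-in for a real stemmer (such as the Danish Snowball stemmer available in NLTK). Lemmatization, which needs a dictionary or a trained model, is not shown.

```python
import re

# Tiny illustrative lists; both are assumptions made for this sketch.
STOP_WORDS = {"og", "i", "til", "en", "et", "der", "som", "vi"}
SUFFIXES = ("erne", "ere", "ene", "er", "en", "et", "e")  # checked in this order


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters, including æ, ø and å."""
    return re.findall(r"[a-zæøå]+", text.lower())


def naive_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Tokenize, drop stop words, then stem what is left."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


print(preprocess("Vi søger udviklere og en erfaren udvikler til projekterne."))
```

The point of the stemming step is that different inflections of a word ("udviklere", "udvikler") end up as the same token, so later steps treat them as one term.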

Containerization

Example 1

Context: This is a simple docker-compose file that deploys a Ghost CMS blog with a MySQL DB onto a Raspberry Pi. I use it for Docker Swarm.

Link:

pmadruga/ghost-raspberry-docker-compose
https://github.com/pmadruga/ghost-raspberry-docker-compose/blob/master/docker-compose.yml

Data Science

Data Analysis

Example 1

Context: DCB is a project dedicated to analysing Danish data. Here, an exploratory data analysis is performed on a dataset. Exploratory analysis is at the core of understanding a dataset before any changes are made to it. In the case below, it's possible to see that the dataset has a flaw (missing data, in this case).

Link:

pmadruga/dcb
https://github.com/pmadruga/dcb/blob/master/notebooks/jobindex%20-%20exploratory%20data%20analysis.ipynb
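
One of the first checks in an exploratory analysis is how much data is missing per column. The sketch below shows that check on a small invented sample. The column names, the rows and the function name are assumptions, not the notebook's actual data or code.

```python
import csv
import io
from collections import Counter

# Invented sample with some empty cells.
SAMPLE = """title,company,location
Data Engineer,Acme,København
Sygeplejerske,,Aarhus
Lærer,Beta ApS,
Pædagog,,
"""


def missing_share(raw: str) -> dict[str, float]:
    """Return, per column, the share of rows where the cell is empty."""
    rows = list(csv.DictReader(io.StringIO(raw)))
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or not value.strip():
                missing[column] += 1
    return {column: missing[column] / len(rows) for column in rows[0]}


shares = missing_share(SAMPLE)
print(shares)
```

A column with a high share of missing values is a flag to investigate before any modelling: the values may need to be imputed, or the column dropped.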

Visualization

Example 1

Context: The goal of this project is to analyze different professions, by quarter, on a dataset with 4 million job postings.

Link:

pmadruga/dcb
https://github.com/pmadruga/dcb/blob/master/notebooks/Job%20analysis%20by%20quarter.ipynb
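
The core of a by-quarter analysis is mapping each posting date to a quarter and counting per profession. A minimal sketch, assuming postings arrive as (ISO date, profession) pairs; the sample rows and function names are hypothetical and not taken from the notebook:

```python
from collections import Counter
from datetime import date

# Invented sample postings: (posting date, profession).
postings = [
    ("2019-01-15", "sygeplejerske"),
    ("2019-03-31", "sygeplejerske"),
    ("2019-04-01", "lærer"),
    ("2019-05-20", "sygeplejerske"),
    ("2019-12-31", "lærer"),
]


def quarter_of(iso_date: str) -> str:
    """Map an ISO date such as 2019-04-01 to a label such as 2019-Q2."""
    d = date.fromisoformat(iso_date)
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def count_by_quarter(rows) -> Counter:
    """Count postings per (quarter, profession) pair."""
    counts = Counter()
    for iso_date, profession in rows:
        counts[(quarter_of(iso_date), profession)] += 1
    return counts


counts = count_by_quarter(postings)
for (quarter, profession), n in sorted(counts.items()):
    print(quarter, profession, n)
```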

Example 2

Context: This is part of a machine learning project regarding Heart Rate Variability. The purpose is to show different types of data visualization to convey information in a simple way. For example, inside "Variable Correlation" there's a correlation heatmap that shows at a glance how each pair of variables correlates.

Link:

https://github.com/pmadruga/ml_project/blob/master/report/report.md
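
Behind a correlation heatmap sits a matrix of pairwise correlation coefficients. The sketch below computes such a matrix with the Pearson coefficient on invented numbers. The variable names and values are hypothetical, not the project's data, and the report may well use a library routine instead. A plotting library can then colour each cell by its value to produce the heatmap.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Correlate every column with every other column (and with itself)."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical variables with invented values.
data = {
    "var_a": [1.0, 2.0, 3.0, 4.0],
    "var_b": [2.0, 4.0, 6.0, 8.0],  # rises with var_a
    "var_c": [4.0, 3.0, 2.0, 1.0],  # falls as var_a rises
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Values close to 1 mean two variables rise together, values close to -1 mean one falls as the other rises, and values near 0 mean no linear relationship.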

Natural Language Processing

Example 1

Context: Here you'll find how to create embeddings (in other words, spatial representations) of words using BERT and TF-IDF, and how to compare them via cosine similarity.

Link:

pmadruga/ds-jobindex
Machine learning techniques applied to the jobindex.dk dataset - pmadruga/ds-jobindex
https://github.com/pmadruga/ds-jobindex/blob/master/src/features/build_features.py
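
To show how the pieces fit together, here is a small sketch of TF-IDF vectors compared with cosine similarity. It is not the code in build_features.py: the sample texts and function names are assumptions, each short text (rather than each word) gets a vector, and the IDF formula used, log(N / df) + 1, is just one common variant. BERT embeddings require a pretrained model and are not shown.

```python
import math
from collections import Counter

# Invented sample texts, already tokenized by whitespace.
docs = [
    "data engineer python sql",
    "data scientist python statistik",
    "sygeplejerske hospital aarhus",
]


def tfidf_vectors(documents: list[str]):
    """Return the vocabulary and one TF-IDF vector per document."""
    tokenized = [doc.split() for doc in documents]
    vocab = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for doc in tokenized for token in set(doc))
    idf = {token: math.log(n_docs / df[token]) + 1.0 for token in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([tf[token] / len(doc) * idf[token] for token in vocab])
    return vocab, vectors


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vocab, vectors = tfidf_vectors(docs)
print(round(cosine(vectors[0], vectors[1]), 3))  # two data-related texts
print(round(cosine(vectors[0], vectors[2]), 3))  # unrelated texts
```

Texts that share terms get a similarity above zero, while texts with no terms in common score exactly zero. That is one limitation of TF-IDF that contextual models like BERT aim to address, since they can place related words close together even when the surface forms differ.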

Example 2

Context: This is the final blog post on applying different techniques to determine word similarity. It's the final part of a 4-month-long side project.

Link:

Deep Learning: Applying Google's Latest Search algorithm on 4.2million Danish job postings
Abstract: Search engine results are a common challenge, especially for non-English languages. In this project, different models - including Google Search's underlying technique (BERT) - are applied to improve the search engine of Denmark's most significant job portal: jobindex.dk. The goal is to determine whether modern Artificial Intelligence (AI) and
https://johnconnor.ai/deep-learning-applying-googles-latest-search-algorithm-on-biggest-danish-job-site/