My work experience (pinned post)

 

Details of my previous projects

Here you can find a brief summary of each of my earlier projects, with a description of the challenges that arose, how I tried to solve them, my interactions with other stakeholders, and the key insights and lessons I learned.

WORK IN PROGRESS ! ! ! (pun intended)

This section is aimed at people thinking of hiring or working with me, so I tried to gather the information needed to get a glimpse of where my career currently stands and to make it possible to dig deeper only into the articles you find interesting.

All of the work listed here was done by me; where my contribution was partial, I marked it accordingly.

Each summary follows the same format.

Missing:
1. Conferences and mentoring
2. Some of my ongoing projects

Listing site - Kubeflow / Vertex AI pipeline (2023)

  • Problem definition: Rewriting the existing pipeline in Kubeflow to enhance maintainability
  • Target/stakeholders: data team leader, C-level management
  • Summary: The company lost its data science team, and the running pipelines were left without supervision. Although the existing system had all the necessary elements (a notable exception was a data validation step to detect upstream changes and data drift), it was based on virtual machines that needed constant attention. This project aimed to lessen that burden by utilizing Kubeflow pipelines on the Google Cloud Platform (Vertex AI Pipelines). It was also a major step toward rewriting the code in SQL and Python (from R).
  • Results:
    As a first step, the complete (mainly R) codebase was analyzed, and the complete DAG of the steps was remodeled. As the data team had a data engineer in place, every step that was not clearly necessary at inference time was moved to the training dataset preprocessing part and later rewritten in SQL (partly done by the data team). Fortunately, BigQuery has rich functionality (including spatial functions).

    Since this was a working pipeline in production, the remodeled steps were wrapped in Container Components until they could be rewritten as native Python components (see the sketch after this list).

    The execution is managed by Vertex AI rather than by custom VMs.

    Data ingestion shrank to about 10% of its previous size, as BigQuery handles most of the preprocessing.
    The use of ParallelFor made the pipeline run much faster, even with the overhead of Vertex AI.
    This solution is cost-efficient, as less maintenance is needed.
    The data team can read the status dashboard without specific data science knowledge.

  • Used/learned tools: Python (Programming language), BigQuery, R, Vertex AI Pipelines, Kubeflow Pipelines, bash, SQL
  • Used/learned skills: pipeline design
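
    To give a flavor of the wrapping approach, here is a minimal, hypothetical sketch of a Kubeflow (kfp v2) pipeline that wraps a legacy R step as a Container Component and fans the work out with ParallelFor. It is not the production code: the image name, component names, and region list are invented placeholders.

```python
# Illustrative sketch only (kfp v2); image names and parameters are made up.
from typing import List

from kfp import compiler, dsl


@dsl.container_component
def legacy_r_step(region: str):
    """Wraps the existing R code in a container until it is rewritten in Python."""
    return dsl.ContainerSpec(
        image="europe-docker.pkg.dev/my-project/pipelines/legacy-r-step:latest",  # placeholder
        command=["Rscript", "/app/run_step.R"],
        args=["--region", region],
    )


@dsl.component(base_image="python:3.10")
def collect_results(regions: List[str]) -> str:
    """Native Python component; steps are gradually migrated to this form."""
    return f"processed {len(regions)} regions"


@dsl.pipeline(name="listing-pipeline-sketch")
def listing_pipeline(regions: List[str] = ["north", "south", "east"]):
    # Fan out the per-region work; Vertex AI schedules the tasks, no custom VMs needed.
    with dsl.ParallelFor(items=regions) as region:
        legacy_r_step(region=region)
    collect_results(regions=regions)


if __name__ == "__main__":
    # The compiled spec is what gets submitted to Vertex AI Pipelines.
    compiler.Compiler().compile(listing_pipeline, "listing_pipeline.json")
```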


Listing site - deep learning embeddings using search and clickstream data (2022)

  • Problem definition: A pioneering project for creating embeddings for most of the listings that could be useful for further modeling (distances between embeddings should be meaningful in several ways)
  • Target/stakeholders: data science team
  • Summary: This pioneering project aimed to demonstrate the possibility and usefulness of embedding vectors in several domains. As more than 4 years of complete clickstream and search data was available, having this kind of embedding would open a whole set of possibilities, like similarity-based price prediction, and later visitor clustering, real-time recommendations, and real-time webpage customization.
    It is partly a reproduction of Airbnb's work "Listing Embeddings for Similar Listing Recommendations and Real-time Personalization in Search".
    The training database was extracted from BigQuery with PySpark and Polars (triplets for the triplet loss). The initial database was about 2 terabytes, while the base of the triplet generation was around 1 gigabyte. The embeddings were then created with TensorFlow from the generated triplets, stored with Annoy, and analyzed with Plotly Express (see the sketch after this list).
    My vision was to build a whole project team to utilize the power of embeddings in all areas of the listing site, but I received a US Green Card, so I quit and moved to Seattle.
  • Results:
    Train dataset for triplet loss (and a pipeline behind it)
    The first generation of embeddings
  • Used/learned tools: Python (Programming language), BigQuery, PySpark, Polars, TensorFlow, Annoy, Plotly Express
  • Used/learned skills: reading/implementing papers and articles
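
    A hedged sketch of the core training idea (not the production code): a shared TensorFlow embedding table trained with a triplet loss on (anchor, positive, negative) listing IDs, with the resulting vectors stored in an Annoy index. The vocabulary size, embedding dimension, margin, and the random triplets stand in for the real search/clickstream-derived data.

```python
# Illustrative sketch only; sizes, margin, and the triplet source are assumptions.
import numpy as np
import tensorflow as tf
from annoy import AnnoyIndex

NUM_LISTINGS = 100_000  # placeholder number of listings
EMBED_DIM = 32
MARGIN = 0.2

# One shared embedding table; anchors, positives, and negatives index into it.
embedding = tf.keras.layers.Embedding(NUM_LISTINGS, EMBED_DIM)

anchor_in = tf.keras.Input(shape=(), dtype=tf.int32)
positive_in = tf.keras.Input(shape=(), dtype=tf.int32)
negative_in = tf.keras.Input(shape=(), dtype=tf.int32)

def embed(ids):
    return tf.math.l2_normalize(embedding(ids), axis=-1)

a, p, n = embed(anchor_in), embed(positive_in), embed(negative_in)
model = tf.keras.Model([anchor_in, positive_in, negative_in], [a, p, n])

def triplet_loss(a, p, n, margin=MARGIN):
    # Pull anchor-positive together, push anchor-negative apart by at least `margin`.
    d_ap = tf.reduce_sum(tf.square(a - p), axis=-1)
    d_an = tf.reduce_sum(tf.square(a - n), axis=-1)
    return tf.reduce_mean(tf.maximum(d_ap - d_an + margin, 0.0))

optimizer = tf.keras.optimizers.Adam(1e-3)

# Random triplets stand in for the ones generated from search/clickstream sessions.
triplets = np.random.randint(0, NUM_LISTINGS, size=(1024, 3)).astype(np.int32)

for step in range(100):
    batch = triplets[np.random.choice(len(triplets), 256)]
    with tf.GradientTape() as tape:
        a, p, n = model([batch[:, 0], batch[:, 1], batch[:, 2]])
        loss = triplet_loss(a, p, n)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

# Store the learned vectors in Annoy for fast approximate nearest-neighbour lookup.
index = AnnoyIndex(EMBED_DIM, "angular")
vectors = tf.math.l2_normalize(embedding.embeddings, axis=-1).numpy()
for listing_id, vec in enumerate(vectors):
    index.add_item(listing_id, vec)
index.build(10)  # 10 trees
print(index.get_nns_by_item(0, 5))  # five listings most similar to listing 0
```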
 

Listing site - duplicate listing detection with Locality Sensitive Hashing (2021)

  • Problem definition: Deduplicate the database based on textual descriptions 
  • Target/stakeholders: Buyers/sellers of the market, Market analysts of the company
  • Summary: Duplicate listings hurt both the training process (overfitting) and the validation process (unrealistic validation loss), and they made the market analysts' work harder. Multiple agents were allowed to list the same property on the site, even with slightly altered data. Deduplication was a key challenge, and this project aimed to address it using the textual descriptions and Locality Sensitive Hashing (LSH). LSH was needed to avoid the O(N²) pairwise comparison burden. After excluding a confounding case (when a new building project is listed on the site flat by flat), the accuracy was almost 100%.
    I built a complete, production-ready pipeline that could cluster the listings based on similarity - all duplicates in the same cluster - and recalculate only the new descriptions (see the sketch after this list).
  • Results: 
    🔧 Production-ready clustering pipeline
  • Used/learned tools: R packages (Reticulate, LSHR), Python (Programming language)
  • Used/learned skills: Calling Python from R
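
    The production pipeline was written in R (LSHR via reticulate). Purely as an illustration of the underlying idea, here is a small Python sketch using MinHash LSH; the datasketch library, the shingling scheme, and the threshold are my assumptions, not the original stack.

```python
# Illustrative MinHash-LSH sketch; the real pipeline used R's LSHR via reticulate.
from datasketch import MinHash, MinHashLSH

descriptions = {
    "listing_1": "bright two bedroom flat near the park, renovated kitchen",
    "listing_2": "bright 2-bedroom flat near the park with renovated kitchen",
    "listing_3": "downtown studio, great view, close to public transport",
}

def minhash(text: str, num_perm: int = 128) -> MinHash:
    """Hash the 3-word shingles of a description into a MinHash signature."""
    tokens = text.lower().split()
    shingles = {" ".join(tokens[i:i + 3]) for i in range(max(1, len(tokens) - 2))}
    m = MinHash(num_perm=num_perm)
    for s in shingles:
        m.update(s.encode("utf8"))
    return m

# Index every description once; querying is then sub-linear instead of O(N^2).
lsh = MinHashLSH(threshold=0.6, num_perm=128)
signatures = {key: minhash(text) for key, text in descriptions.items()}
for key, sig in signatures.items():
    lsh.insert(key, sig)

for key, sig in signatures.items():
    candidates = [c for c in lsh.query(sig) if c != key]
    print(key, "likely duplicates:", candidates)
```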
 

Listing site - extracting data from listing descriptions (NLP) (2021)

  • Problem definition: The listing descriptions contain important information that could be useful in the modeling. They tend to be more reliable than the parameters selected during the creation of the listing. 
  • Target/stakeholders: Buyers/sellers of the market, Market analysts of the company
  • Summary: Some of the variables defined as obligatory in the modeling caused significant data loss because users did not specify them in the listing (like the condition of the apartment). However, this information could usually be extracted from the description text. This pioneering project was meant to open up NLP capabilities in the company. As it ran before the LLM revolution, an iterative, active-labeling pipeline was used to fine-tune an existing BERT model (see the sketch after this list). Although the results were promising, the management decided to focus our efforts on other areas. After the LLM revolution, I used prompt engineering/fine-tuning for the same task - please find the corresponding part of my experience list. Having an acceptable model is only one part of the job; implementing a production-level pipeline that fits the existing ecosystem is much more difficult.
  • Results: 
    Active learning pipeline using Python, Huggingface, and Labelstudio
    Modified Named Entity Recognition (NER) for categorizing large parts of the descriptions
    Text classification for extracting the condition variable of the apartment
  • Used/learned tools: Python (programming language), Huggingface Transformers, Labelstudio
  • Used/learned skills: different types of NLP applications 
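
    As an illustration of the fine-tuning step only (the active-learning loop with Labelstudio is omitted), here is a hedged Hugging Face sketch for the apartment-condition text classification task. The checkpoint name, label set, and toy examples are assumptions, not the original code.

```python
# Hedged sketch of the fine-tuning idea; model name, labels, and data are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["new", "renovated", "needs_renovation"]  # assumed condition categories
model_name = "bert-base-multilingual-cased"        # placeholder checkpoint

data = Dataset.from_dict({
    "text": [
        "Newly built apartment, never lived in.",
        "Completely renovated last year, new wiring and plumbing.",
        "Old condition, needs full renovation.",
    ],
    "label": [0, 1, 2],
})

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(labels))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="condition-clf", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
)
trainer.train()
```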

Property listing value estimator model - Advanced modeling (2020-2022)

  • Problem definition: The first production model was limited in capabilities in terms of location and apartment type. In addition, more careful filtering methods and more sophisticated models were proposed. In close cooperation with the business stakeholders, areas where the targeted average precision was not achieved needed separate treatment.
  • Target/stakeholders: Business leaders of the company, buyers/sellers of the market
  • Summary: One key aspect of the modeling was to define the areas where the sought precision could be reached. For this purpose, I designed an interactive dashboard with which a business stakeholder could define the parameters that separated the acceptable and non-acceptable areas based on the previous 12 months. Categorical embeddings from deep learning models were tested and implemented (see the sketch after this list). The smoothness of the predictions was improved. The calibration of the model was analyzed and improved. A plain deep learning model (in Keras) was tested. A weighted model was introduced for a better Median Absolute Percentage Error loss. Kriging and weighted-neighbor models using whole listing embeddings were considered.
  • Results: 
    🚦 An interactive planning tool, written in Shiny
    🚝 Weighted, calibrated, smooth model. Still XGBoost, written mainly in R.
    🏀 Better training dataset filtering

  • Used/learned tools: R (programming language), Python (programming language), TensorFlow, Keras, R packages (Shiny, data.table)
  • Used/learned skills: reading/implementing papers and articles
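
    The production model was mainly R + XGBoost; the following is only a sketch of the "categorical embeddings" idea mentioned above: a small Keras model that learns an embedding for a high-cardinality categorical feature, whose learned vectors can then be reused as features in a tree model. The feature names, cardinality, and data are invented.

```python
# Sketch only: learn an embedding for a high-cardinality categorical feature
# (e.g. a location id), then reuse the learned vectors as features elsewhere.
import numpy as np
import tensorflow as tf

NUM_ZONES = 500   # placeholder cardinality of the categorical feature
EMBED_DIM = 8

zone_in = tf.keras.Input(shape=(1,), dtype=tf.int32, name="zone_id")
numeric_in = tf.keras.Input(shape=(3,), name="numeric_features")  # e.g. area, rooms, floor

zone_emb = tf.keras.layers.Embedding(NUM_ZONES, EMBED_DIM, name="zone_embedding")(zone_in)
zone_emb = tf.keras.layers.Flatten()(zone_emb)

x = tf.keras.layers.Concatenate()([zone_emb, numeric_in])
x = tf.keras.layers.Dense(32, activation="relu")(x)
price = tf.keras.layers.Dense(1, name="price")(x)

model = tf.keras.Model([zone_in, numeric_in], price)
model.compile(optimizer="adam", loss="mae")

# Toy data stand-in for the real training set.
n = 2048
zones = np.random.randint(0, NUM_ZONES, size=(n, 1))
numerics = np.random.rand(n, 3).astype("float32")
prices = np.random.rand(n, 1).astype("float32")
model.fit([zones, numerics], prices, epochs=2, batch_size=64, verbose=0)

# The learned per-zone vectors can be exported and used as features in XGBoost.
zone_vectors = model.get_layer("zone_embedding").get_weights()[0]
print(zone_vectors.shape)  # (NUM_ZONES, EMBED_DIM)
```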
 

Listing site - spatial project (2020-2022)

  • Problem definition: Over its 20 years of existence, the listing site had taken different approaches to spatial data. The goal was to get the most out of the available data, clean up the databases, eliminate the inconsistencies between physical addresses and lon/lat coordinates, and enable future analysis and modeling work.
  • Target/stakeholders: Data team, cartography team, front and back office software teams
  • Summary: The in-house spatial hierarchy was flawed (there were zone levels between the city level and the street level, sometimes without a clear parent-child relationship), so a more consistent version was proposed and used in data science modeling. Most of the zones or neighborhoods existed only as names, without clear boundaries. Some of the key information on the exact location was missing or inconsistent - that was mitigated through (reverse) geocoding. The geocoding cache contained imprecise data too. All of these were obstacles both to giving a better service to our customers and to the modeling.
  • Results: 
    🗺 A concise location hierarchy where the logical units have the same level.
    📍 Tens of thousands of training examples were restored to their exact locations.
    🌍 Existing spatial inconsistencies were discovered and visualized on interactive maps for other team leaders.
    🏃 The smallest enclosing region/zone was determined from coordinates and boundaries, causing an acceptable delay (20 ms) in real-time prediction (see the sketch after this list).
  • Used/learned tools: R packages (Google S2, Simple Features), BigQuery geography functions, Leaflet for interactive mapping, kriging, spatial relationships, spatial regression
  • Used/learned skills: cross-function project work
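
    The real implementation used R (Simple Features, S2) and BigQuery geography functions; as a hedged Python sketch of the smallest-enclosing-zone lookup, here is the core idea with shapely. The zone names and geometries are toy examples.

```python
# Sketch of the smallest-enclosing-zone lookup; geometries and names are made up,
# and the real implementation used R (sf / S2) and BigQuery geography functions.
from shapely.geometry import Point, Polygon

# Toy zone hierarchy: a city-sized polygon and a smaller neighbourhood inside it.
zones = {
    "city": Polygon([(0, 0), (0, 10), (10, 10), (10, 0)]),
    "neighbourhood": Polygon([(2, 2), (2, 5), (5, 5), (5, 2)]),
}

def smallest_enclosing_zone(lon: float, lat: float):
    """Return the name of the smallest zone whose boundary contains the point."""
    point = Point(lon, lat)
    containing = [(name, poly) for name, poly in zones.items() if poly.contains(point)]
    if not containing:
        return None
    # Pick the zone with the smallest area; in production a spatial index
    # (e.g. S2 cells or an STRtree) keeps this within the ~20 ms budget.
    return min(containing, key=lambda item: item[1].area)[0]

print(smallest_enclosing_zone(3, 3))    # -> "neighbourhood"
print(smallest_enclosing_zone(8, 8))    # -> "city"
print(smallest_enclosing_zone(20, 20))  # -> None
```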
 

Property listing value estimator model - Production pipeline (2020-2022)

  • Problem definition: After the MVP, a production-ready pipeline was needed, including the training, testing, and serving code. 
  • Target/stakeholders: Product manager
  • Summary: The previously existing code was subdivided into logical parts, and every parameter was passed via YAML files. The training and inference applications were dockerized, with automated build tests. Running on Google Compute Engine virtual machines, all the logs were sent to Google Cloud Logging in JSON format. Monitoring and live alerting were wired into a Slack channel. The service was deployed to Google Cloud Run to autoscale.
  • Results:
    🚚 Running pipeline in production. 
    🔥 Prediction intervals for XGBoost, via conformal regression (see the sketch after this list)
    📈 Weekly estimation for the previous two years
  • Used/learned tools: YAML, Gitlab, CI/CD, Plumber (R package)
  • Used/learned skills: conformal regression
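
    The prediction intervals mentioned above came from conformal regression. As a hedged illustration (the original was R-based), here is a minimal split-conformal sketch around an XGBoost regressor; the synthetic data and the 90% coverage target are assumptions.

```python
# Minimal split-conformal sketch in Python; the production code was R-based.
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + rng.normal(scale=1.0, size=5000)

# Proper train / calibration split: the calibration residuals must come from data
# the model has never seen, otherwise the intervals come out too narrow.
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.25, random_state=0)

model = XGBRegressor(n_estimators=200, max_depth=4)
model.fit(X_train, y_train)

alpha = 0.1  # target 90% coverage
residuals = np.abs(y_cal - model.predict(X_cal))
q = np.quantile(residuals, np.ceil((1 - alpha) * (len(residuals) + 1)) / len(residuals))

X_new = rng.normal(size=(5, 5))
pred = model.predict(X_new)
for p, lo, hi in zip(pred, pred - q, pred + q):
    print(f"prediction {p:6.2f}  interval [{lo:6.2f}, {hi:6.2f}]")
```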
 

Property listing value estimator model - MVP (2019)

  • Problem definition: Investigating the precision of real estate property estimation through analysis of listing data (like Zillow's Zestimate), leveraging the spatial and temporal nature of the data. This was the first data science task at the company, with a predecessor project that had caused unrealistic expectations about the achievable performance due to bad train/test separation.
  • Target/stakeholders: Product managers (in-house), real estate agents, buyers and sellers of the market
  • Summary: This was the first step of several - an MVP to show that the targeted precision was achievable on listing data before investing in the production pipeline and the advanced modeling described above. Avoiding the bad train/test separation of the predecessor project was a key part of the setup (see the sketch after this list).
  • Results: 
    💪 A working XGBoost model for one subcategory of apartments in Budapest.
    🔆 An interactive Shiny application for testing the model and seeing how predictions change when altering the input parameters and the location on the map.
    💹 A daily prediction process to check the performance of the different model versions.
  • Used/learned tools: R (programming language), PostgreSQL, BigQuery, XGBoost, Keras, spatial libraries (Simple Features, S2), Google Cloud Platform, Linux server, Python (Programming language), Shiny
  • Used/learned skills: effective communication with different teams, advising about data science without taking the leadership from the project manager, importance of parametrizable containerized code, interactive programming, data cleansing, hedonic regression
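
    Because the predecessor project suffered from bad train/test separation, a leakage-free split was a central concern. The following is only a sketch of a time-based split with invented column names, not the original code.

```python
# Sketch of a leakage-free temporal split; column names are placeholders.
import pandas as pd

listings = pd.DataFrame({
    "listing_id": range(6),
    "listed_at": pd.to_datetime([
        "2019-01-05", "2019-02-10", "2019-03-15",
        "2019-04-20", "2019-05-25", "2019-06-30",
    ]),
    "price": [31.0, 29.5, 33.2, 35.1, 34.0, 36.8],
})

# Everything listed before the cutoff is training data; everything after is test.
# A random row-wise split would leak near-duplicate future listings into training
# and produce the kind of unrealistically good metrics the MVP had to correct.
cutoff = pd.Timestamp("2019-05-01")
train = listings[listings["listed_at"] < cutoff]
test = listings[listings["listed_at"] >= cutoff]

print(len(train), "training rows,", len(test), "test rows")
```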
 

Bestcoupons.it - a Groupon-clone site in Italy (2011-2012)

  • Problem definition: As founder and manager of a Groupon-clone website in Italy, I needed to establish the venture both legally and technically.
  • Target/stakeholders: Me as a founder, local businesses and people as customers
  • Summary: I founded and managed a Groupon-clone website in Italy, setting up the business legally, building and deploying the site, and preparing the marketing material for launch.
  • Results:
    ✅ Business was set up, and legal documents (like terms and conditions) were ready.
    ✅ Site was functional (code was finalized, design was ready, and remote infrastructure was operational)
    ✅ Marketing material was designed and printed
  • Used/learned tools: Ruby on Rails, HTML, CSS, Linux server, GIMP
  • Used/learned skills: working in a foreign environment, website management, and planning. 
 

Building a training database from 150k+ Excel files (2006)

  • Problem definition: A major Hungarian bank wanted to develop a model and a tool to help appraisers estimate the worth of real estate properties. As a first step, a (tabular) training dataset had to be built from the existing 150k+ Excel files. The Excel files lacked versioning and had 'creative' selection methods for categorical variables.
  • Target/stakeholders: Management of the Mortgage Bank, appraisers
  • Summary: The information scattered across the 150k+ appraisal Excel files was extracted, cleaned, and consolidated into a single tabular dataset, dealing along the way with the inconsistent, unversioned schemas and the free-form categorical values (see the sketch below).
  • Results:
    ✅ A training database in CSV file format, containing all the extracted and cleaned information
    😟 Company knowledge about the mishandled Excel schemas.
  • Used/learned tools: Excel, SPSS
  • Used/learned skills: data mining, data discovery, linear models, communicating with different levels of stakeholders
Link:
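
    The original extraction was done with Excel and SPSS back in 2006. Purely as a modern, hypothetical sketch of the same idea, this is roughly how the files could be flattened into one table with pandas today; the directory layout, sheet structure, and cell positions are invented.

```python
# Modern sketch of the extraction idea (the 2006 original used Excel and SPSS);
# the directory layout, sheet name, and cell locations are invented.
from pathlib import Path

import pandas as pd

records = []
for path in Path("appraisals").glob("**/*.xls*"):
    try:
        sheet = pd.read_excel(path, sheet_name=0, header=None)
    except Exception as exc:  # corrupt or locked workbook
        print(f"skipping {path}: {exc}")
        continue
    records.append({
        "source_file": path.name,
        # Fixed cell positions stand in for the real, version-dependent schemas.
        "address": sheet.iat[2, 1],
        "area_sqm": sheet.iat[5, 1],
        "estimated_value": sheet.iat[10, 1],
    })

training = pd.DataFrame(records, columns=["source_file", "address", "area_sqm", "estimated_value"])
# Free-text values ("creative" selections) get normalised before export.
training["area_sqm"] = pd.to_numeric(training["area_sqm"], errors="coerce")
training.to_csv("training_database.csv", index=False)
```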


