My work experience

My work experience - details of my previous projects

WORK IN PROGRESS ! ! ! (pun intended)

Here you can find a brief summary of each of my earlier projects: the challenges that arose, how I tried to solve them, my interactions with other stakeholders, and the key insights and lessons I learned.

This section is aimed at people thinking of hiring or working with me, so I have tried to list the information needed to get a glimpse of the current state of my career, with the option to dig deeper into the more interesting projects in separate articles (which are not available at the moment).

All of the work listed here was done by me; where my contribution was only partial, I have marked it accordingly.

Each summary follows the same format.

Missing:
1. Conferences and mentoring
2. Ongoing projects

Listing site - Kubeflow on Vertex AI (2023)

  • Problem definition: Rewriting the ML pipeline in Kubeflow
  • Target/stakeholders: data team leader
  • Summary: The ML API I wrote earlier needed a more sustainable pipeline. Previously, it ran as two distinct containers that handled the training, building, testing, and inference tasks. I merged the two repos into one and used Kubeflow to manage the pipeline steps and to give the data team leader additional oversight. The functions written in R had to be called through ContainerOp (see the sketch after this list). 
  • Results:
    One repo instead of two
    A Kubeflow pipeline for all major steps
  • Used/learned tools: Kubeflow, Vertex AI, Python
  • Used/learned skills: cloud architecting
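
  To give an idea of what such a pipeline looks like, here is a minimal sketch using the KFP v1 SDK (which provided dsl.ContainerOp). The image names, scripts, and step structure are hypothetical placeholders, not the actual project's containers.

    # Minimal sketch, KFP v1 SDK (kfp.dsl.ContainerOp); images and scripts are placeholders.
    import kfp
    from kfp import dsl


    def r_step(name: str, image: str, script: str) -> dsl.ContainerOp:
        """Wrap an R script packaged in a container image as one pipeline step."""
        return dsl.ContainerOp(
            name=name,
            image=image,
            command=["Rscript", script],
        )


    @dsl.pipeline(name="listing-ml-pipeline",
                  description="Training, testing, and inference steps in one place")
    def listing_pipeline():
        train = r_step("train", "gcr.io/my-project/ml-train:latest", "train.R")
        test = r_step("test", "gcr.io/my-project/ml-train:latest", "test.R")
        test.after(train)  # evaluate only after training has finished


    if __name__ == "__main__":
        # Compile the pipeline definition; the resulting spec is what gets submitted for runs.
        kfp.compiler.Compiler().compile(listing_pipeline, "listing_pipeline.yaml")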

Listing site - deep learning embeddings using search and clickstream data (2022)

  • Problem definition: A pioneering project to create embeddings for most of the listings that could be useful for further modeling (distances between embeddings should be meaningful in several ways)
  • Target/stakeholders: data science team
  • Summary: This pioneering project aimed to demonstrate the potential and usefulness of embedding vectors in various domains. Since the complete clickstream and search data were available for more than 4 years, having this kind of embedding opened a whole set of possibilities. These included similarity-based price prediction, later visitor clustering, real-time recommendations, and real-time webpage customization.

    This work was partly a reproduction of the following project at Airbnb:
    Listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search

    The training database was extracted from BigQuery using PySpark and Polars (triplets for the triplet loss). The initial database size was about 2 terabytes, while the base for the triplet generation was around 1 gigabyte. The embeddings were then created with TensorFlow from the generated triplets (see the sketch after this list), stored with Annoy, and analyzed with Plotly Express.

    My vision was to build a whole project team to utilize the power of embeddings in all areas of the listing site. However, I received a Green Card for the US and decided to quit and move to Seattle.
  • Results:
    Train dataset for triplet loss (and a pipeline behind it)
    The first generation of embeddings
  • Used/learned tools: Python (programming language), BigQuery, PySpark, Polars, TensorFlow, Annoy, Plotly Express
  • Used/learned skills: reading/implementing papers and articles
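
  Below is a minimal sketch of the triplet-loss setup in TensorFlow. The vocabulary size, embedding dimension, margin, and layer names are illustrative assumptions, not the values used in the project.

    # Minimal sketch; listing counts, sizes, and the margin are illustrative assumptions.
    import tensorflow as tf

    NUM_LISTINGS = 100_000   # hypothetical number of distinct listings
    EMBEDDING_DIM = 32
    MARGIN = 0.5

    # One shared embedding table: anchor, positive, and negative use the same weights.
    embedding = tf.keras.layers.Embedding(NUM_LISTINGS, EMBEDDING_DIM)

    anchor_in = tf.keras.Input(shape=(1,), dtype=tf.int32, name="anchor")
    positive_in = tf.keras.Input(shape=(1,), dtype=tf.int32, name="positive")
    negative_in = tf.keras.Input(shape=(1,), dtype=tf.int32, name="negative")

    anchor = tf.keras.layers.Flatten()(embedding(anchor_in))
    positive = tf.keras.layers.Flatten()(embedding(positive_in))
    negative = tf.keras.layers.Flatten()(embedding(negative_in))

    # Stack the three embeddings so the custom loss can slice them apart again.
    stacked = tf.keras.layers.Lambda(lambda t: tf.stack(t, axis=1))(
        [anchor, positive, negative])
    model = tf.keras.Model([anchor_in, positive_in, negative_in], stacked)


    def triplet_loss(_, y_pred):
        """Pull the anchor toward the positive and push it away from the negative."""
        a, p, n = y_pred[:, 0], y_pred[:, 1], y_pred[:, 2]
        pos_dist = tf.reduce_sum(tf.square(a - p), axis=-1)
        neg_dist = tf.reduce_sum(tf.square(a - n), axis=-1)
        return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + MARGIN, 0.0))


    model.compile(optimizer="adam", loss=triplet_loss)
    # model.fit({"anchor": a_ids, "positive": p_ids, "negative": n_ids},
    #           tf.zeros(len(a_ids)))  # dummy targets; the loss ignores y_true
    # After training, embedding.get_weights()[0] can be indexed with Annoy.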
 

Listing site - duplicate listing detection with Locality Sensitive Hashing (2021)

  • Problem definition: Deduplicate the database based on textual descriptions 
  • Target/stakeholders: Buyers/sellers of the market, Market analysts of the company
  • Summary: Duplicate listings hurt both the training process (overfitting) and the validation process (unrealistically good validation loss), and they made the work of the market analysts harder. Multiple agents were allowed to list the same property on the site, even with slightly altered data. Deduplication was therefore a key challenge, and this project aimed to address it using the textual descriptions and Locality Sensitive Hashing (LSH). LSH was needed to avoid the N*N comparison burden (see the sketch after this list). After excluding one confounding case (when each flat of a new building project is listed separately), the accuracy was almost 100%. 
    I built a complete, production-ready pipeline that could cluster the listings based on similarity - all duplicates in the same cluster - and recalculate only the new descriptions. 
  • Results: 
    🔧 Production-ready clustering pipeline
  • Used/learned tools: R packages (Reticulate, LSHR), Python (Programming language)
  • Used/learned skills: Calling Python from R
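
  The project itself ran in R (LSHR called through Reticulate); the following rough Python equivalent, using the datasketch library, only illustrates the idea. The example descriptions and the similarity threshold are made up.

    # Rough Python equivalent using datasketch; texts and threshold are made up.
    from datasketch import MinHash, MinHashLSH


    def minhash(text: str, num_perm: int = 128) -> MinHash:
        """Build a MinHash signature from word 3-shingles of a listing description."""
        words = text.lower().split()
        sig = MinHash(num_perm=num_perm)
        for i in range(max(len(words) - 2, 1)):
            sig.update(" ".join(words[i:i + 3]).encode("utf8"))
        return sig


    descriptions = {
        "listing_1": "bright two bedroom apartment close to the city centre",
        "listing_2": "bright two bedroom apartment close to the city centre with balcony",
        "listing_3": "spacious family house with a large garden in the suburbs",
    }

    # Index every signature once; candidate duplicates are then found in roughly
    # linear time instead of comparing all N*N pairs.
    lsh = MinHashLSH(threshold=0.5, num_perm=128)
    signatures = {key: minhash(text) for key, text in descriptions.items()}
    for key, sig in signatures.items():
        lsh.insert(key, sig)

    for key, sig in signatures.items():
        candidates = [k for k in lsh.query(sig) if k != key]
        print(key, "->", candidates)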
 

Listing site - extracting data from listing descriptions (NLP) (2021)

  • Problem definition: The listing descriptions contain important information that could be useful in the modeling. They tend to be more reliable than the parameters selected during the creation of the listing. 
  • Target/stakeholders: Buyers/sellers of the market, Market analysts of the company
  • Summary: Some of the variables that were obligatory in the modeling caused significant data loss because users did not specify them in the listing (for example, the condition of the apartment). However, this information could usually be extracted from the description text. This pioneering project was meant to open up NLP capabilities in the company. Since it ran before the LLM revolution, an iterative, active-labeling pipeline was used to fine-tune an existing BERT model (see the sketch after this list). Although the results were promising, management decided to focus our efforts on other areas. After the LLM revolution, I used prompt engineering and fine-tuning for the same task - see the corresponding entry in this list. Having an acceptable model is only one part of the job; implementing a production-level pipeline that fits the existing ecosystem is much more difficult.
  • Results: 
    Active learning pipeline using Python, Huggingface, and Labelstudio
    Modified Named Entity Recognition (NER) for categorizing large parts of the descriptions
    Text classification for extracting the condition variable of the apartment
  • Used/learned tools: Python (programming language), Huggingface Transformers, Labelstudio
  • Used/learned skills: different types of NLP applications 
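
  Here is a minimal sketch of the text-classification part (predicting the condition variable from a description) with Huggingface Transformers. The base checkpoint, label set, and example text are placeholders; before fine-tuning on the actively collected labels, the predictions are of course random.

    # Minimal sketch; the base checkpoint, label set, and example text are placeholders.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    LABELS = ["new", "renovated", "average", "needs renovation"]  # hypothetical classes

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=len(LABELS)
    )

    texts = ["Freshly renovated flat with new wiring and a modern kitchen."]
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    # With an untrained classification head these predictions are random; the point of
    # the active-learning loop was to collect enough labels to fine-tune this head.
    print([LABELS[i] for i in logits.argmax(dim=-1).tolist()])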

Property listing value estimator model - Advanced modeling (2020-2022)

  • Problem definition: The first production model was limited in its capabilities in terms of location and apartment type. In addition, more careful filtering methods and more sophisticated models had been proposed. In close cooperation with the business stakeholders, the areas where the targeted average precision was not achieved needed separate treatment.
  • Target/stakeholders: Business leaders of the company, buyers/sellers of the market
  • Summary: One key aspect of the modeling was to define the areas where the sought precision could be reached. For this purpose, I designed an interactive dashboard with which a business stakeholder could define the parameters that separated acceptable from non-acceptable areas based on the previous 12 months. Categorical embeddings from deep learning models were tested and implemented, the smoothness of the predictions was improved, and the calibration of the model was analyzed and improved. A plain deep learning model (in Keras) was tested, and the model was reweighted for a better median absolute percentage error (see the sketch after this list). Kriging and weighted-neighbor models using whole-listing embeddings were also considered.
  • Results: 
    🚦 An interactive planning tool, written in Shiny
    🚝 Weighted, calibrated, smooth model. Still XGBoost, written mainly in R.
    🏀 Better training dataset filtering

  • Used/learned tools: R (programming language), Python (programming language), TensorFlow, Keras, R packages (Shiny, data.table)
  • Used/learned skills: reading/implementing papers and articles
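
  A minimal sketch of the weighting idea for a percentage-error metric: samples are weighted by the inverse of their price so that expensive listings do not dominate the squared-error objective. The data is synthetic and the features are made up; the real model used far richer features and lived mostly in R.

    # Minimal sketch on synthetic data; features and parameters are made up.
    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(42)
    n = 5_000
    area = rng.uniform(25, 150, n)              # m2, hypothetical feature
    rooms = rng.integers(1, 6, n)
    price = area * 2_000 + rooms * 5_000 + rng.normal(0, 10_000, n)

    X = np.column_stack([area, rooms])
    weights = 1.0 / price                       # down-weight expensive listings

    dtrain = xgb.DMatrix(X[: n // 2], label=price[: n // 2], weight=weights[: n // 2])
    dtest = xgb.DMatrix(X[n // 2:], label=price[n // 2:])

    model = xgb.train({"objective": "reg:squarederror", "max_depth": 4, "eta": 0.1},
                      dtrain, num_boost_round=200)

    pred = model.predict(dtest)
    mdape = np.median(np.abs(pred - price[n // 2:]) / price[n // 2:])
    print(f"Median absolute percentage error: {mdape:.2%}")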
 

Listing site - spatial project (2020-2022)

  • Problem definition: Over its 20 years of existence, the listing site had used several different approaches to spatial data. The goal was to get the most out of the available data, clean up the databases, eliminate the inconsistencies between physical addresses and lon/lat coordinates, and enable future analysis and modeling work.
  • Target/stakeholders: Data team, cartography team, front and back office software teams
  • Summary: The in-house spatial hierarchy was flawed (there were zone levels between the city level and the street level, sometimes without a clear parent-child relationship), so a more consistent version was proposed and used in data science modeling. Most of the zones or neighborhoods existed only as names, without clear boundaries. Some of the key information on exact locations was missing or inconsistent; this was mitigated through (reverse) geocoding. The geocoding cache contained imprecise data too. All of these were obstacles both to giving a better service to our customers and to the modeling.
  • Results: 
    🗺 A concise location hierarchy where the logical units have the same level.
    📍 Tens of thousands of training examples were restored to their exact locations.
    🌍 Existing spatial inconsistencies were discovered and visualized on interactive maps for other team leaders.
    🏃 The smallest enclosing region/zone was determined from coordinates and boundaries, adding only an acceptable delay (20 ms) to real-time prediction (see the sketch after this list).
  • Used/learned tools: R packages (Google S2, Simple Features), BigQuery geography functions, Leaflet for interactive mapping, kriging, spatial relationships, spatial regression
  • Used/learned skills: cross-function project work
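
  A minimal sketch of the smallest-enclosing-zone lookup using shapely point-in-polygon tests. The zones are toy rectangles; the real solution worked with proper polygon boundaries and S2/BigQuery geography functions.

    # Minimal sketch with toy rectangles standing in for real zone polygons.
    from typing import Optional
    from shapely.geometry import Point, Polygon

    zones = {
        "city": Polygon([(0, 0), (0, 10), (10, 10), (10, 0)]),
        "district_5": Polygon([(2, 2), (2, 6), (6, 6), (6, 2)]),
        "riverside": Polygon([(3, 3), (3, 5), (5, 5), (5, 3)]),
    }


    def smallest_enclosing_zone(lon: float, lat: float) -> Optional[str]:
        """Return the name of the smallest zone whose polygon contains the point."""
        point = Point(lon, lat)
        hits = [(poly.area, name) for name, poly in zones.items() if poly.contains(point)]
        return min(hits)[1] if hits else None


    print(smallest_enclosing_zone(4.0, 4.0))   # inside all three -> "riverside"
    print(smallest_enclosing_zone(1.0, 9.0))   # only inside the city polygon -> "city"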
 

Property listing value estimator model - Production pipeline (2020-2022)

  • Problem definition: After the MVP, a production-ready pipeline was needed, including the training, testing, and serving code. 
  • Target/stakeholders: Product manager
  • Summary: The previously existing code was subdivided into logical parts, and every parameter was passed via YAML files. The training and inference applications were dockerized, with automated build tests. The applications ran on Google Compute Engine virtual machines, and all logs were sent to Google Cloud Logging in JSON format. Monitoring and live alerts were sent to a Slack channel, and the service was deployed to Google Cloud Run for autoscaling.
  • Results:
    🚚 Running pipeline in production
    🔥 Prediction intervals for XGBoost (see the conformal sketch after this list)
    📈 Weekly estimation for the previous two years
  • Used/learned tools: YAML, Gitlab, CI/CD, Plumber (R package)
  • Used/learned skills: conformal regression
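
  A minimal sketch of split-conformal prediction intervals around an XGBoost point prediction, on synthetic data. The production pipeline was written in R and served through Plumber; this Python version only shows the conformal idea.

    # Minimal sketch on synthetic data; the production code was R served via Plumber.
    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    n = 6_000
    X = rng.uniform(0, 1, (n, 3))
    y = 100 + 50 * X[:, 0] + 20 * X[:, 1] ** 2 + rng.normal(0, 5, n)

    train, calib, test = slice(0, 4_000), slice(4_000, 5_000), slice(5_000, n)

    model = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.1)
    model.fit(X[train], y[train])

    # Split-conformal: the (1 - alpha) quantile of absolute calibration residuals
    # gives the half-width of the interval around every point prediction.
    alpha = 0.1
    residuals = np.abs(y[calib] - model.predict(X[calib]))
    half_width = np.quantile(residuals, 1 - alpha)

    pred = model.predict(X[test])
    lower, upper = pred - half_width, pred + half_width
    coverage = np.mean((y[test] >= lower) & (y[test] <= upper))
    print(f"Empirical coverage at {1 - alpha:.0%} nominal level: {coverage:.2%}")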
 

Property listing value estimator model - MVP (2019)

  • Problem definition: Investigating the achievable precision of real estate property value estimation through analysis of listing data (like Zillow's Zestimate), leveraging the spatial and temporal nature of the data. This was the first data science task at the company, and a predecessor project had created unrealistic expectations about the performance due to bad train/test separation (see the small sketch after this list).
  • Target/stakeholders: Product managers (in-house), real estate agents, buyers and sellers of the market
  • Summary: This was the first step of several toward a production-grade property value estimator: build a working model for a well-defined segment, make it easy to try out interactively, and measure its performance continuously.
  • Results: 
    💪 A working XGBoost model for one subcategory of apartments in Budapest.
    🔆 An interactive Shiny application for testing the model and seeing how the prediction changes when altering the input parameters and the location on the map.
    💹 A daily prediction process to check the performance of the different model versions.
  • Used/learned tools: R (programming language), PostgreSQL, BigQuery, XGBoost, Keras, spatial libraries (Simple Features, S2), Google Cloud Platform, Linux server, Python (programming language), Shiny
  • Used/learned skills: effective communication with different teams, advising about data science without taking the leadership from the project manager, importance of parametrizable containerized code, interactive programming, data cleansing, hedonic regression
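
  A small sketch of the train/test separation issue mentioned above: a naive random split leaks near-duplicate listings across the split and makes validation look unrealistically good, while a time-based split avoids this. The data and column names are made up.

    # Minimal sketch with made-up data and column names.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    listings = pd.DataFrame({
        "listing_id": range(8),
        "price_m_huf": [30, 31, 45, 46, 60, 61, 80, 82],
        "listed_at": pd.to_datetime(
            ["2019-01-05", "2019-01-06", "2019-03-01", "2019-03-02",
             "2019-05-10", "2019-05-11", "2019-07-20", "2019-07-21"]),
    })

    # Naive random split: relistings of the same property from consecutive days can
    # land on both sides, which makes the validation loss look unrealistically good.
    naive_train, naive_test = train_test_split(listings, test_size=0.25, random_state=1)

    # Temporal split: everything listed after the cutoff goes to the test set only.
    cutoff = pd.Timestamp("2019-06-01")
    train = listings[listings["listed_at"] < cutoff]
    test = listings[listings["listed_at"] >= cutoff]
    print(len(naive_train), len(naive_test), "|", len(train), len(test))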
 

Bestcoupons.it - a Groupon-clone site in Italy (2011-2012)

  • Problem definition: Establishing a Groupon-clone website in Italy, both legally and technically
  • Target/stakeholders: Me as a founder, local businesses and people as customers
  • Summary: I was the founder and manager of a Groupon-clone website in Italy. I had to set up the venture both legally (company formation, terms and conditions) and technically (code, design, and remote infrastructure).
  • Results:
    ✅ Business was set up, and legal documents (like terms and conditions) were ready
    ✅ Site was functional (code was finalized, design was ready, and remote infrastructure was operational)
    ✅ Marketing material was designed and printed
  • Used/learned tools: Ruby on Rails, HTML, CSS, Linux server, GIMP
  • Used/learned skills: working in a foreign environment, website management, and planning. 
 

Building a training database from 150k+ Excel files (2006)

  • Problem definition: A major Hungarian bank wanted to develop a model and a tool to help appraisers estimate the worth of real estate properties. As a first step, a (tabular) training dataset had to be built from the existing 150k+ Excel files. The Excel files lacked versioning and had 'creative' selection methods for categorical variables.
  • Target/stakeholders: Management of the Mortgage Bank, appraisers
  • Summary: I extracted the relevant fields from the 150k+ Excel files, reconciled the many undocumented schema variants, cleaned the 'creative' categorical values, and consolidated everything into a single tabular dataset (a rough modern-day sketch of this kind of bulk extraction follows after this list).
  • Results:
    ✅ A training database in CSV file format, containing all the extracted and cleaned information
    😟 Company knowledge about the mishandled Excel schemas
  • Used/learned tools: Excel, SPSS
  • Used/learned skills: data mining, data discovery, linear models, communicating with different levels of stakeholders
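
  The extraction was done with Excel and SPSS at the time; the sketch below only shows how a similar bulk extraction could look today with pandas. The paths, cell positions, and category fixes are hypothetical.

    # Rough modern-day sketch with pandas; paths, cell positions, and fixes are hypothetical.
    import glob
    import pandas as pd

    rows = []
    for path in glob.glob("appraisals/*.xls"):
        try:
            sheet = pd.read_excel(path, sheet_name=0, header=None)
        except Exception:
            continue  # skip unreadable or corrupted workbooks
        rows.append({
            "source_file": path,
            "address": sheet.iat[3, 2],          # hypothetical cell positions
            "area_m2": sheet.iat[7, 2],
            "condition": str(sheet.iat[9, 2]).strip().lower(),
            "estimated_value": sheet.iat[15, 2],
        })

    training = pd.DataFrame(rows)
    if not training.empty:
        # Normalize the 'creative' categorical spellings before writing the final CSV.
        training["condition"] = training["condition"].replace({"goood": "good"})
        training.to_csv("training_database.csv", index=False)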


