My first data analyst job
Previous work experience: estimating housing prices
To place this experience in time: MS Excel still had a limit of 64k rows in 2005…
My first paid job as an economist was on a project where the leading Hungarian bank wanted a real estate value estimation model built on their own data. The bank outsourced the data cleaning and experimentation phase to a smaller company.
The goal of the project was to help the in-house appraisers make better appraisals and to reduce the chances of collusion. In those times, it was pretty common that the appraised value 'miraculously' matched the value the buyers needed to cover the missing money for the property... Due to Hungarian mortgage laws, the major threat of losing money for the bank was a VERY bad appraisal, as the borrower still owed money if the proceeds from the sold property did not cover the debt.
One of the key challenges of the project was building a training dataset, as the appraisers used Excel files to store the complete data. The schema of the Excel files varied slightly from county to county, existed in different versions without any versioning metadata, and demonstrated extremely creative techniques for storing information.
My all-time favorite of these techniques was that they drew lines to cross out the unwanted options of a categorical variable. So they would draw a line or two to cancel 'fair' and 'good' for the condition of the property, leaving 'excellent' as the real answer. If you think about their workflow, it's easy to see how this evolved: they only ever used the printed versions of the files, so on paper the choice was visible and acceptable. Similar designs were bolding or underlining the choice.
Obviously, it's hell to process files like these. One of the engineers I worked with found a solution: the documents were imported with an OpenOffice API (or its predecessor, I can't recall), and all cells under the drawn lines got erased. As you might have guessed, people sometimes drew their lines just a little bit longer than needed (look at the picture above), so useful information (like the name of the field) was deleted too.
Recovering the proper schema of those Excel files was another challenge, as the fields were often renamed and slightly shuffled, which made it hard to write a regex-like program. In the end, processing a complete import took a whole night (down from a complete weekend earlier); don't forget there were no SSDs, no cloud computing, and parallelizing a job was much harder than it is now.
As a data analyst, I had the challenging task of exploring the different schemas and the possible textual and visual anchors that enabled the programmers to tackle the data import part. The bank had more than 100k of these files, so it was not ten minutes of work. I can tell you, I enjoyed this type of work; it's like being the Sherlock Holmes of data.
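Just to illustrate the anchoring idea (this is not the original code; the canonical field names and the 0.6 similarity threshold below are made up for the example), here is a minimal Python sketch that maps the slightly renamed headers of a file onto one canonical schema with fuzzy string matching:

```python
import difflib

# Canonical field names every imported file should map onto
# (invented for this illustration).
CANONICAL_FIELDS = ["address", "floor_area_m2", "condition", "year_built"]

def match_headers(raw_headers, cutoff=0.6):
    """Map each raw header found in a file to the closest canonical field.

    Returns {raw_header: canonical_field or None}; the cutoff is an
    arbitrary similarity threshold chosen for this sketch.
    """
    mapping = {}
    for raw in raw_headers:
        best = difflib.get_close_matches(
            raw.strip().lower(), CANONICAL_FIELDS, n=1, cutoff=cutoff
        )
        mapping[raw] = best[0] if best else None
    return mapping

# Slightly renamed headers, as one county's template might have contained them.
print(match_headers(["Floor area (m2)", "Condition ", "Year built", "Addres"]))
```

In practice, the visual anchors (the drawn lines, bolding, and underlining) still needed separate handling, but a mapping like this covers the renamed and shuffled fields.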
What amuses me now is that the bank had very extensive data about those properties, because the appraisers meticulously registered every small detail of a house: the floor types and sizes for every room on every level, a textual description, really everything. And that is a big thing, because in Hungary basically no governmental institution or private company possesses such a detailed database of properties.
As you may know, I worked for a listing company until recently, building a similar model, and although it was the market-leading listing company in the country, the quantity and quality of its property data were inferior. That said, this was only about building a register of properties; representing the demand or the supply side of that market is another matter.
About my proposal
During the exploration phase, having seen the aforementioned data problems, I proposed that the project should have a subproject: a Windows application that would help the appraisers register the data and eliminate the data quality problems we faced. I am kind of proud of this, even though it was not accepted, and I still have the same mindset: once we realize that data quality issues arise from a bad data-generating or data-registering process, we have the opportunity to intervene and make things better for the future.
This is naturally only worthwhile if the scripts written for the exploration are not enough to tackle the problem. If we could be sure that no further changes to the data would arise, then productionizing the exploratory scripts could suffice. In my head, this is a bit like sunk costs: just because we have already sunk costs into the scripts, we should not insist on carrying on without change.
My other suggestion was to design a very clear form in Excel and make its use compulsory. It would have been what the expression SSOT describes: a single source of truth. If there is a schema, and it is obeyed, then the possible challenges are halved.
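To show what I mean by a schema that is obeyed, here is a minimal sketch of validating a submitted record against a single declared schema; the field names, types, and allowed values are invented for the example:

```python
# One declared schema that every submitted record must satisfy
# (fields and allowed values are invented for this illustration).
REQUIRED_SCHEMA = {
    "address": str,
    "floor_area_m2": float,
    "condition": {"poor", "fair", "good", "excellent"},
}

def validate_record(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, rule in REQUIRED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif isinstance(rule, set):
            if record[field] not in rule:
                problems.append(f"invalid value for {field}: {record[field]!r}")
        elif not isinstance(record[field], rule):
            problems.append(f"wrong type for {field}: expected {rule.__name__}")
    return problems

# A record with a typo in a categorical field gets flagged immediately.
print(validate_record({"address": "example address",
                       "floor_area_m2": 54.0,
                       "condition": "excelent"}))
```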
About the modeling
I have to say first that I was only involved in the first part of the modeling job, so I can only share my thoughts, not the final result. There were two main reasons for this: first, at that time I only knew about linear regression (and some variations of it) in machine learning, and my boss, who was a mathematician, wanted to come up with something new; secondly, there was a disagreement about the approach, which accepted every appraisal as valid because the appraisers were 'professionals'. This concerned the target variable. I mentioned earlier that as the lower-level managers of the bank were interested in lending, they influenced the appraised numbers. We checked the average prices against the official statistical records, and there was a clear misalignment.
This is what I meant when I wrote earlier that the dataset we created represented only a register of properties, not the demand and not the supply. In some cases, the accepted transaction price was available, so it could have served as a sanity check, but I felt much more work was needed to obtain a good target variable. In the case of housing prices, the target could reflect the supply side ('ask price'), the demand side (much harder to get if you are not a listing site), the 'value' (maybe that's what appraisers try to predict), or the market price ('transaction price'). The problem is that most of these are not available at the property level, and there is a time factor (at which point or period of time is it a true price?).
Note that in Hungary there is no public, complete dataset about housing prices and transactions, or about housing in general. The Statistical Office only publishes aggregated transaction data with a significant delay.
Having more than 300 columns of data, of which around 200 were regularly empty (like 'area of the second room on the third level', which exists in only a fraction of the cases) and a large proportion of which were categorical, meant a real feature selection problem. I used SPSS in the modeling phase because of the row limit of Excel. I built some models, but the majority of the work rested with my boss.
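The modeling itself was done in SPSS, but to illustrate the kind of filtering I mean, here is a small pandas sketch (the column names and the 60% threshold are invented for the example) that drops mostly-empty columns and one-hot encodes a categorical one before fitting any regression:

```python
import pandas as pd

# Toy data in the spirit of the appraisal sheets: a few always-present columns
# and one that is rarely filled in (all values are invented).
df = pd.DataFrame({
    "floor_area_m2": [54.0, 120.0, None, 75.0],
    "room2_level3_area": [None, None, None, 12.0],   # rarely filled in
    "condition": ["good", "excellent", "fair", None],
})

# Drop columns that are empty in more than 60% of the rows.
missing_share = df.isna().mean()
df = df.loc[:, missing_share <= 0.6]

# One-hot encode the remaining categorical column for a regression model.
df = pd.get_dummies(df, columns=["condition"], dummy_na=True)
print(df.columns.tolist())
```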
Conclusion
This was an exceptional experience for me, for several reasons.
This was the first time that both my programming and economics knowledge were useful. There are jokes about the programming abilities of data scientists, but one could come up with similar jokes about the business knowledge of other data scientists. This is an intersection of disciplines, and I am happy to say that I studied both at the university level.
I had to deal with different kinds of stakeholders: C-level bank managers, my boss (the CEO of the IT company), and several programmers (most of whom showed a clear lack of interest in anything data-related). Sometimes I failed miserably, but that is exactly why this part of the job is important.
Unfortunately, at that time I was not aware of R; it would have been a great plus.