This project was made for Nestle Polska S.A. in Torun. After an internship at Inter-tel, I was able to apply the experience I had gained with MS Access and VBA in a bigger company.
Nestle Polska S.A. collaborates with a large number of companies. Every day sales representatives add dozens of new companies to the database. Unfortunately, while doing so they sometimes make mistakes: adding a company that already exists in the database under a slightly different name, misspelling the name of a city, street, or company (e.g. the Polish city Białystok was written in more than 40 different ways!), or omitting required information.
I created a script for automatic correction of contractors' data using SAP, MS Access, MS Excel and VBA. I needed to analyze a large existing dataset (over 300 thousand rows), find and correct details, and remove duplicated companies from it.
The idea was to create an algorithm that corrects company data and finds duplicates. The corrected data was then used to summarize cargo for each company.
The final solution was made in two steps:
Step #1 - automatic correction of address data
Step #2 - finding duplicated company names
Because of misspellings in the companies' data, it was hard to identify duplicated companies. Misspellings and missing crucial data, such as the VAT identification number or the name of a city or street, were the main obstacles to finding duplicates.
To solve those problems I created a script that corrects and fills in data automatically. Its main part was a set of tables containing known misspellings and the valid data for them:
- a table of city names, e.g. Bialystok → Białystok, or (two cities in one field) Gdansk(Sopot) → Sopot,
- a table of street names, e.g. ul. (the Polish abbreviation for 'street') → empty string, Currie Sklodowskiej → Currie-Skłodowskiej,
- a table of wrong postcodes, e.g. Grębocin with the postcode 87-162 is corrected to 87-122,
- a table of correct postcodes: if a blank postcode is found, the correct postcode is filled in based on other information where possible.
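The table-driven correction described above can be sketched as follows. The original logic lived in VBA over MS Access tables; this is a minimal Python illustration with hypothetical field names (`city`, `street`, `postcode`), using the examples from the list.

```python
# Each correction table maps a known bad value to its valid form
# (in the real system these were MS Access tables, not dicts).
CITY_FIXES = {
    "Bialystok": "Białystok",
    "Gdansk(Sopot)": "Sopot",
}
STREET_FIXES = {
    "ul. ": "",  # strip the Polish street abbreviation
    "Currie Sklodowskiej": "Currie-Skłodowskiej",
}
POSTCODE_FIXES = {
    ("Grębocin", "87-162"): "87-122",  # known wrong postcode for this city
}
CITY_TO_POSTCODE = {
    "Grębocin": "87-122",  # used to fill in blank postcodes
}

def correct_record(record):
    """Apply the correction tables to one contractor record (a dict)."""
    rec = dict(record)
    rec["city"] = CITY_FIXES.get(rec["city"], rec["city"])
    for bad, good in STREET_FIXES.items():
        rec["street"] = rec["street"].replace(bad, good)
    rec["postcode"] = POSTCODE_FIXES.get(
        (rec["city"], rec["postcode"]), rec["postcode"])
    if not rec["postcode"]:
        rec["postcode"] = CITY_TO_POSTCODE.get(rec["city"], "")
    return rec
```

Each time a user spotted a new misspelling, it became one more row in these tables, so the script covered more and more cases without code changes.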
After correcting the data, I tried to identify duplicated companies in the database.
This part was based on one-to-one database relationships over a variety of columns. The hardest part was choosing the best columns to join on. The easiest approach was matching companies by VAT identification number. Another possible approach was a relationship based on the company's address and name.
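The matching just described can be sketched like this: group records by VAT number where it exists, and fall back to a normalized name-plus-address key otherwise. This is an illustrative Python sketch with assumed field names; the real matching was done through Access relationships.

```python
from collections import defaultdict

def find_duplicates(records):
    """Group contractor records that likely refer to the same company.

    Primary key: VAT identification number (most reliable).
    Fallback for records without a VAT number: normalized name + address.
    """
    groups = defaultdict(list)
    for rec in records:
        if rec.get("vat"):
            key = ("vat", rec["vat"])
        else:
            key = ("addr",
                   rec["name"].strip().lower(),
                   rec["city"].strip().lower(),
                   rec["street"].strip().lower())
        groups[key].append(rec)
    # Duplicates are groups holding more than one record.
    return [g for g in groups.values() if len(g) > 1]
```

Matching by address and name only works well after the correction step, which is why the two steps had to run in this order.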
Once I found the best combination of columns, the results were more than satisfying!
Sometimes the system required user attention to correct data or to create new rules in the tables mentioned above. The longer the system worked, the more rules it contained and the less user attention was required.
Using the application, it turned out that the ~300,000 companies written into the database were in fact only around 110,000 unique companies.
Within the first month, more than 5,000 correction rules were added to the system. Even with dozens (up to 100) of new companies added to the database each day, the script was unsure about no more than 10% of them; only those required user intervention: reviewing the displayed clients and adding new rules.