Need help parsing an address string in any language

Status
Not open for further replies.

innovator_

New Member
For example, passing "A. P. Croll & Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947" to the parseAddress function returns:

2299 Lewes-Georgetown Hwy
A. P. Croll & Son
Georgetown
DE
19947

I have thought of the following algorithm, but I am having trouble implementing it. Can anyone help me write the code? I know C and a bit of C++, so I would prefer code in those languages.

My algorithm:

1) Work backward. Start from the zip code, which will be near the end and in one of two known formats: XXXXX or XXXXX-XXXX. If it doesn't appear, you can assume you're already in the city/state portion, below.

2) The next thing, before the zip, is going to be the state, either in two-letter format or as words. You know what these will be, too: there are only 50 of them. You could also Soundex the words to help compensate for spelling errors.

3) Before that is the city, and it's probably on the same line as the state. You could use a zip-code database to check the city and state against the zip, or at least use it as a BS detector.

4) The street address will generally be one or two lines. The second line will generally be the suite number if there is one, but it could also be a PO box.

5) It's going to be near-impossible to detect a name on the first or second line. However, if a line doesn't start with a number, or if it's prefixed with "attn:" or "attention to:", that is a hint that it's a name rather than an address line.
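A minimal sketch of the work-backward idea, written in Python for brevity (the same logic should port to C++ with std::regex). It assumes the whole address arrives as one line, uses a tiny sample state table rather than the full list, and the names parse_backward, STATE_CODES and STATE_NAMES are made up for this example. It peels off zip, state and city, and leaves the name/street part untouched in "rest".

```python
import re

# Assumption: a small illustrative sample, not the full table of states.
STATE_CODES = {"DE", "CA", "NY", "TX", "PA"}
STATE_NAMES = {"DELAWARE": "DE", "CALIFORNIA": "CA", "NEW YORK": "NY",
               "TEXAS": "TX", "PENNSYLVANIA": "PA"}

# XXXXX or XXXXX-XXXX at the very end of the string.
ZIP_AT_END = re.compile(r"\b(\d{5}(?:-\d{4})?)\s*$")


def _strip_state(remaining):
    """Return (state_code, text_without_state); state_code is None if not found."""
    upper = remaining.upper()
    # Spelled-out names first, longest first, so multi-word names win.
    for name in sorted(STATE_NAMES, key=len, reverse=True):
        if re.search(r"(?:^|[\s,])" + re.escape(name) + r"$", upper):
            cut = len(remaining) - len(name)
            return STATE_NAMES[name], remaining[:cut].rstrip(" ,")
    # Otherwise a two-letter code at the end.
    m = re.search(r"(?:^|[\s,])([A-Za-z]{2})$", remaining)
    if m and m.group(1).upper() in STATE_CODES:
        return m.group(1).upper(), remaining[:m.start(1)].rstrip(" ,")
    return None, remaining


def parse_backward(text):
    """Peel zip, state and city off the end of a one-line address."""
    result = {"zip": None, "state": None, "city": None, "rest": None}
    remaining = text.strip()

    # Step 1: zip code.
    m = ZIP_AT_END.search(remaining)
    if m:
        result["zip"] = m.group(1)
        remaining = remaining[:m.start()].rstrip(" ,")

    # Step 2: state, as a code or as words.
    result["state"], remaining = _strip_state(remaining)

    # Step 3: city is whatever follows the last remaining comma.
    head, sep, tail = remaining.rpartition(",")
    if sep:
        result["city"] = tail.strip()
        remaining = head.strip()

    # Whatever is left is the name and street part.
    result["rest"] = remaining
    return result
```

For the example address this gives zip "19947", state "DE", city "Georgetown", and "A. P. Croll & Son 2299 Lewes-Georgetown Hwy" as the rest. The city step relies on a comma before the city, which is an assumption about the input, not something every address will satisfy.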

Any help would be appreciated.
 
If the user is entering this information, why not have them enter it into the appropriate fields?

You are on the right track, but I expect you will never find a 100% solution. Note that words like Colorado and Montana can be personal names or states.
 
This question comes up all the time on a database forum that I frequent, and unfortunately there is no way to account for every possible way that someone may enter an address. Remember that city names can be two words long: St. Louis, New Orleans. State names may be abbreviated or spelled out, and when spelled out can be one or two words. If you also need to allow for international addresses, you open up another can of worms. As 3v0 says, the best solution is to have the user enter the data into the correct fields in the first place. If you are dealing with an existing address list, then you may be able to come up with some rudimentary function to parse the address, but then you will have to check it manually for all of the exceptions. This is the route I've taken many times trying to clean up badly formatted addresses. I've concocted all kinds of parsing functions, depending on what the input data looks like, but the output still has to be checked by hand and manually reformatted. There is not going to be a fully automated solution.
 
Hello,


Yes, that is a tough one. I don't think it's possible (as others have already mentioned here) to parse every single possibility and come out with 100 percent accuracy. Some things still require human intervention, at least on some level, after some amount of pre-processing. I noticed this when using OCR too.

One suggestion would be to keep a file of possible names that the program can search when it sees a candidate name. Of course, that only works if you already have such a name file available.
Another idea would be to have the program identify problem entries and send them to a separate file to be double-checked by a human.
In any case, I would bet a human should check them ALL over at least once.
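A rough sketch of the "send problem entries to a human" idea, in Python. It assumes each parsed address is a dict with "street", "city", "state" and "zip" keys; that layout, the specific checks, and the names review_reasons and split_for_review are all assumptions made for illustration.

```python
import re

ZIP_RE = re.compile(r"\d{5}(-\d{4})?")


def review_reasons(record):
    """Return the reasons a parsed record looks suspicious (empty list = looks fine)."""
    reasons = []
    if not ZIP_RE.fullmatch(record.get("zip") or ""):
        reasons.append("zip missing or malformed")
    if not re.fullmatch(r"[A-Z]{2}", record.get("state") or ""):
        reasons.append("state is not a two-letter code")
    if not record.get("city"):
        reasons.append("city missing")
    if not re.match(r"\d", record.get("street") or ""):
        reasons.append("street does not start with a number")
    return reasons


def split_for_review(records):
    """Partition records into (accepted, flagged); flagged items carry their reasons."""
    accepted, flagged = [], []
    for record in records:
        reasons = review_reasons(record)
        if reasons:
            flagged.append((record, reasons))
        else:
            accepted.append(record)
    return accepted, flagged
```

The flagged list could then be written to a separate file for manual checking. Passing these checks only means a record looks plausible, not that it is correct.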

Other problematic entries might be:

James, James, James, and Clien, 222-225 Alverest Ave, Suites A2,A3,A4, 99999-2345
Fred Flintstone, One Twenty Three E. Rocky Avenue, Bedrock CA 98904
 
Well, yes, I understand it's almost impossible to write code that would be 100% accurate. Luckily, though, we've been told that we can make any assumptions we like, so maybe the code could be restricted to some standard format and written just for that. How would one write the code then?
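If the input really can be restricted to one format, a single regular expression may be enough. The sketch below (Python; std::regex in C++ could be used in much the same way) assumes the format "name street, city, ST zip" on one line, where the street starts at the first digit and the state is a two-letter code. The pattern and the name parse_one_line are illustrative assumptions, not a general solution.

```python
import re

# Assumed format: "<name> <street>, <city>, <ST> <zip>"
ONE_LINE = re.compile(
    r"^(?P<name>\D*?)\s*"            # optional name: everything before the first digit
    r"(?P<street>\d[^,]*),\s*"       # street: starts with the house number, ends at a comma
    r"(?P<city>[^,]+),\s*"           # city: up to the next comma
    r"(?P<state>[A-Za-z]{2})\s+"     # two-letter state code
    r"(?P<zip>\d{5}(?:-\d{4})?)$"    # XXXXX or XXXXX-XXXX
)


def parse_one_line(text):
    """Return a dict of fields, or None if the text does not fit the assumed format."""
    m = ONE_LINE.match(text.strip())
    return m.groupdict() if m else None


if __name__ == "__main__":
    print(parse_one_line(
        "A. P. Croll & Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947"))
```

Anything that does not fit, such as a street number spelled out in words, returns None and would need separate handling.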
 
If I had to do this I would look at using a weighted rule based system.

Use rules to assign points toward the field that each line or word might belong in.

For example, if a line contains "AVE", give several points to the possibility that it is a street address.

Make up as many rules as you can. When you find a difficult address, try to write one or more rules that would help parse it.

When all the rules have been run, add up the points and assign the text to fields based on which field type got the most points.

The correctness will depend on the quality of the rules.
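A small sketch of such a weighted rule system, in Python. The rules, the point values, the sample state set and the names RULES, score_chunk and classify are all illustrative assumptions; a real rule set would be much larger and tuned against actual data.

```python
import re

STREET_WORDS = re.compile(r"\b(ave|avenue|st|street|rd|road|hwy|blvd)\b", re.IGNORECASE)
STATE_CODES = {"DE", "CA", "NY", "TX"}  # sample only, not the full list

# Each rule is (field, points, test). Weights are made up for illustration.
RULES = [
    ("street", 3, lambda s: STREET_WORDS.search(s) is not None),
    ("street", 2, lambda s: re.match(r"\d+\s", s) is not None),
    ("zip",    5, lambda s: re.fullmatch(r"\d{5}(-\d{4})?", s) is not None),
    ("state",  4, lambda s: s.upper() in STATE_CODES),
    ("name",   2, lambda s: re.match(r"(attn|attention to)\s*:", s, re.IGNORECASE) is not None),
    ("name",   1, lambda s: not any(ch.isdigit() for ch in s)),
]


def score_chunk(chunk):
    """Run every rule against one line or word and total the points per field."""
    chunk = chunk.strip()
    scores = {}
    for field, points, matches in RULES:
        if matches(chunk):
            scores[field] = scores.get(field, 0) + points
    return scores


def classify(chunk):
    """Pick the field with the most points, or "unknown" if no rule fired."""
    scores = score_chunk(chunk)
    return max(scores, key=scores.get) if scores else "unknown"
```

With these sample rules, "2299 Lewes-Georgetown Hwy" scores as a street and "DE" as a state, but "St. Louis" also scores as a street because of the "St" rule. That is the kind of difficult case where another rule, such as a city lookup, would have to be added.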
 
Well, thanks a lot, but I was looking for the code. I have lots of ideas to make it better, but I go blank when I have to put them into code. I thought that if I had some basic code to start with, I could work on it.
 
Need help parsing an address string.

The structure of the address should be as below:

public class Address
{
    public string Street { get; set; }      // Lunkad Tower, 6th floor
    public string Locality { get; set; }    // Viman Nagar
    public string City { get; set; }        // Pune
    public string State { get; set; }       // MH, Maharashtra
    public string PostalCode { get; set; }  // 60611
    public string Country { get; set; }     // e.g. India, IN
}

Can anyone help me write the code?
 
Hello,


In short, if your data uses delimiters, you search for each delimiter and copy the string up to that point into a buffer for that field. You then search for the next delimiter, and so on. That's about the simplest example.
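A minimal sketch of that delimiter search, in Python. It assumes a comma as the delimiter, and the function name split_fields is made up for illustration; in C the same loop could use strchr and copy each piece into its own buffer.

```python
def split_fields(text, delimiter=","):
    """Cut text at each delimiter and return the trimmed pieces in order."""
    fields = []
    start = 0
    while True:
        pos = text.find(delimiter, start)   # search for the next delimiter
        if pos == -1:
            # No more delimiters: the remainder is the last field.
            fields.append(text[start:].strip())
            break
        # Copy the text up to the delimiter, then continue after it.
        fields.append(text[start:pos].strip())
        start = pos + len(delimiter)
    return fields


if __name__ == "__main__":
    print(split_fields("A. P. Croll & Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947"))
    # ['A. P. Croll & Son 2299 Lewes-Georgetown Hwy', 'Georgetown', 'DE 19947']
```

Note that the delimiters alone do not separate the name from the street or the state from the zip in this example; those pieces still need further splitting.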
 
For example, passing "A. P. Croll & Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947" to the parseAddress function returns:

2299 Lewes-Georgetown Hwy
A. P. Croll & Son
Georgetown
DE
19947

The example that you give has the entire address on one line. Then later in the same post you say:
...before that is the city, and it's probably on the same line as the state.
which implies that the input is not one single line. Your explanation is not consistent. You need to provide a clear description of the problem before you can hope to find a solution. As has already been mentioned, this is a messy problem at the best of times, but if you can't even explain what your input looks like, then you're doomed to failure before you ever start.
 