Old 01-01-2013, 12:58 PM   #25
geckon
Hall Of Famer
Join Date: Nov 2012
Location: Czech Republic
Posts: 2,077
Time for a little update and a few more questions.

I searched for Czech name databases and found some. I have three possible plans. Please tell me which one you think is best for the game.

1) Use the database provided by the Czech Ministry of the Interior. It contains all names used in the Czech Republic and can be sorted by frequency among newborns in any year since 1890. So I could pick one year and take, let's say, the 200 most common names given to children born that year.
Advantages:
- Old names that were common, let's say, fifty years ago won't be included (or will appear less often), even though people with such names are still alive.
Disadvantages:
- Statistical error. Let's say that in the picked year unusually few Jans were born, although the name is normally very common. The resulting file would be skewed by that anomaly.
- I would have to filter out female names by hand, since the database lists male and female names together.
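
For what it's worth, here is a rough Python sketch of how the top-N list for plan 1 could be built, and how summing over a few neighbouring years instead of a single one might soften the "few Jans" problem. The data layout (a dict mapping year to name counts) and the function name `top_names` are my own assumptions, not the ministry's actual format, and the numbers are made up purely for illustration.

```python
from collections import Counter

def top_names(counts_by_year, years, n=200):
    """Sum newborn counts over the given years and return the n most common names."""
    total = Counter()
    for year in years:
        # Years missing from the data simply contribute nothing.
        total.update(counts_by_year.get(year, {}))
    return total.most_common(n)

# Made-up illustrative numbers, NOT real statistics.
sample = {
    1994: {"Jan": 900, "Jakub": 600, "Tomáš": 500},
    1995: {"Jan": 100, "Jakub": 620, "Tomáš": 520},  # an unusually "low Jan" year
    1996: {"Jan": 950, "Jakub": 610, "Tomáš": 490},
}

# A single unlucky year drops Jan out of the top 2 ...
print(top_names(sample, [1995], n=2))
# ... while summing 1994-1996 brings him back to the top.
print(top_names(sample, range(1994, 1997), n=2))
```

Whether a multi-year window is worth the extra export work depends on how the database lets me download the data, so treat this only as an idea.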

2) Use another database which is probably based on the same or similar data but shows only overall numbers of living persons with given names.
Advantages:
- It doesn't have the statistical problem described above.
- I can filter for male names automatically, so less work for me.
Disadvantages:
- It includes older names that are not used much nowadays.
- It includes some Slovak names, which I think mostly reflects Slovaks who stayed in the Czech Republic after the split of Czechoslovakia (so not people born after 1993).

3) Just take the file given by JeffR and edited by no way and edit it further: make some corrections, add missing names, delete a few names, etc.
Advantages:
- Probably the fastest option. (Maybe the 2nd one could be faster, hard to say.)
Disadvantages:
- Certainly not as accurate as the others.

Questions:
- (Only for the first plan) Which year should I pick? Since I assume the first newgens will be generated for the 2013/2014 season, it should probably be somewhere in 1990-2000. Do you agree? Which year would you pick and why?
- (For developers only) What range does the name generator use for name frequency? I need to know that to be able to normalize the real frequency numbers.
- How many names would you include in that file?
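
To show what I mean by normalizing, here is a minimal Python sketch. The target range of 1 to 100 is only a placeholder I made up until the developers confirm the real one, the function name `normalize` is hypothetical, and the counts are invented for illustration.

```python
def normalize(counts, lo=1, hi=100):
    """Linearly rescale raw counts so the most common name maps to hi.

    Rare names are floored at lo so they are never rounded down to zero.
    The (lo, hi) range is an assumed placeholder, not the game's real range.
    """
    max_count = max(counts.values())
    return {
        name: max(lo, round(count / max_count * hi))
        for name, count in counts.items()
    }

# Made-up counts, NOT real statistics.
raw = {"Jan": 2000, "Jakub": 1000, "Radim": 10}
print(normalize(raw))
```

Once the real range is known, only the `lo` and `hi` arguments should need to change.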

I personally lean toward the second option, since it should be both fast and precise enough.

Last edited by geckon; 01-01-2013 at 01:26 PM.