Does the OOTP ratings team ever validate the ratings?

Hoover36 · 07-02-2020, 10:34 PM

Does the OOTP ratings team ever validate the ratings? What I mean is that if you calculate the ratings for each player in a specific season and sim that season 100x (or 1000x, etc., to get an appropriate sample size), the simulated stats for each player should average out pretty close to their performance for that specific season. Do this for every season and you would have a realistic rating for each player in baseball history. Then you could assign appropriate ratings to players, making it possible to pit a player from the deadball era against a player in the modern era without having to add these "league totals", and get a realistic representation of what would happen if Babe Ruth faced Clayton Kershaw.

Has this ever happened or even been discussed?

Lemandria · 07-03-2020, 03:35 AM

They don't have to "sim" anything, they have server-level access to every game played in any league anywhere. Tens of thousands of games is a pretty good sample size.

Number-cruncher's dream.

Aaannnnnddd they've all signed NDA's, so if the sims were wildly inaccurate or perfect to the seventh decimal place, they aren't gonna discuss it. But is that a sensible question anyway? Statisticians spend a lot of time wrangling over comparisons between players from entirely different eras, whose word are you going to accept for what 'accurate' means in a case like that?

When you can create cards that easily outplay any year their 'real' exemplars ever had, that goes without saying. They do create better-than-historical-best cards, every single year.

So yes, within the implicit compromises they've accepted in the name of monetary feasibility, they are 100.0000% accurate.

Hoover36 · 07-03-2020, 01:25 PM

The data collected from Perfect Team sims is not relevant if the ratings aren't correct going in.

What I am saying is this: create a specific-year solo game with the ratings generated for all players (which I assume are generated by their algorithm). Using historical lineups and transactions, sim that season 100 times. When you add up all the stats per player from those 100 sims and find the average season totals for each player, they should come within a close approximation of that player's performance in that specific season. Do that for all players, all seasons, and you could calculate more accurate ratings for players.

What it feels like is happening right now is a close enough rating is applied to players. Something that "feels about right". However if you used the ratings for those players in a solo season for that specific year, you get nothing close to actual performance.
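For anyone who wants to try the check described above, here is a minimal sketch of the averaging step in Python. It is not OOTP's engine: `simulate_season` is a hypothetical stand-in that just draws plate-appearance outcomes from fixed per-PA probabilities, and the 600 PA figure is an assumption (only the 46 HR and 75 K come from numbers quoted in this thread). In a real test you would replace the stand-in with exported results from actual sims.

```python
import random

def simulate_season(rates, plate_appearances, rng):
    """Hypothetical stand-in for one simmed season: draw each PA outcome
    from fixed per-PA probabilities (NOT how the OOTP engine works)."""
    outcomes = list(rates)
    weights = [rates[o] for o in outcomes]
    draws = rng.choices(outcomes, weights=weights, k=plate_appearances)
    return {o: draws.count(o) for o in outcomes}

def average_seasons(rates, plate_appearances, n_seasons, seed=0):
    """Sim the same season n_seasons times and average the per-player totals."""
    rng = random.Random(seed)
    totals = {o: 0 for o in rates}
    for _ in range(n_seasons):
        season = simulate_season(rates, plate_appearances, rng)
        for outcome, count in season.items():
            totals[outcome] += count
    return {o: totals[o] / n_seasons for o in totals}

# "Actual" line to validate against. PA = 600 is an assumed, illustrative value.
PA = 600
actual = {"HR": 46, "K": 75, "other": PA - 46 - 75}
rates = {o: n / PA for o, n in actual.items()}

avg = average_seasons(rates, PA, n_seasons=100, seed=1)
for outcome in actual:
    print(outcome, actual[outcome], round(avg[outcome], 1))
```

Because the toy rates are taken straight from the "actual" line, the averages land close to it by construction. The interesting version of this test is the one where the sampled seasons come from the game itself, so any gap between the averages and the real stats points at the ratings or the engine.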

mcdog512 · 07-03-2020, 09:12 PM

Quote:

Originally Posted by Hoover36

The data collected from Perfect Team sims is not relevant if the ratings aren't correct going in.

What I am saying is this: create a specific-year solo game with the ratings generated for all players (which I assume are generated by their algorithm). Using historical lineups and transactions, sim that season 100 times. When you add up all the stats per player from those 100 sims and find the average season totals for each player, they should come within a close approximation of that player's performance in that specific season. Do that for all players, all seasons, and you could calculate more accurate ratings for players.

What it feels like is happening right now is a close enough rating is applied to players. Something that "feels about right". However if you used the ratings for those players in a solo season for that specific year, you get nothing close to actual performance.

I hear ya, although this is an online card pack mode, not OOTP base game. Accurate ratings are important for sure for immersion but strictly not super important in the overall game.

Hoover36 · 07-04-2020, 01:22 AM

Quote:

Originally Posted by mcdog512

I hear ya, although this is an online card pack mode, not OOTP base game. Accurate ratings are important for sure for immersion but strictly not super important in the overall game.

I think that depends on who you ask. I would venture to guess that MOST would argue accurate ratings are super important.

I see a thread every day in regards to folks asking that the devs look into ratings of some sort.

Syd Thrift · 07-04-2020, 10:10 AM

Quote:

Originally Posted by Hoover36

The data collected from Perfect Team sims is not relevant if the ratings aren't correct going in.

What I am saying is this: create a specific-year solo game with the ratings generated for all players (which I assume are generated by their algorithm). Using historical lineups and transactions, sim that season 100 times. When you add up all the stats per player from those 100 sims and find the average season totals for each player, they should come within a close approximation of that player's performance in that specific season. Do that for all players, all seasons, and you could calculate more accurate ratings for players.

What it feels like is happening right now is a close enough rating is applied to players. Something that "feels about right". However if you used the ratings for those players in a solo season for that specific year, you get nothing close to actual performance.

They use an algorithm to create ratings out of the stats. They’ve been using the same basic algorithm for nearly 20 years and it comes pretty close for pretty much any season at this point. They don’t spend that kind of time beta testing individual historical seasons because that kind of curated season content is not what they sell (that’s more Strat-o-Matic’s thing). Which, besides, if the model works, all running simulations against the model does is prove that the model works. And if it doesn’t, you don’t change individual ratings, you figure out what caused the model to churn out bad stats and fix it. I am *sure* this kind of generalized testing has been done.

PT is not of course running against some kind of historical baseline. Your 1930 Hack Wilson doesn’t get to play against 1930 Claude “Weeping” Willoughby or Leo Strickland (who still holds iirc the record for most IP with more runs allowed than IP). Playing nothing but stars vs stars is going to do screwy things with the numbers, even if the ratings were originally “right” for the era/season/etc.

Lemandria · 07-04-2020, 10:26 AM

The sort of empirical testing he describes is unlikely to much resemble whatever OOTP uses to benchmark their accuracy.

But it does sound like the sort of thing an end-user would attempt; perhaps this query should be redirected to the ootp base game forums? It's about the base engine accuracy vs historical stats, right?

And the dev team is unlikely to be able to discuss much about their internal testing.

They're satisfied, certainly.

mcdog512 · 07-04-2020, 10:26 AM

Quote:

Originally Posted by Hoover36

I think that depends on who you ask. I would venture to guess that MOST would argue accurate ratings are super important.

I see a thread every day in regards to folks asking that the devs look into ratings of some sort.

I wouldn't say they are unimportant. Who would want to play a game where Mario Mendoza is far better than Ted Williams? That said, as long as they are within the ballpark, so to speak, I can factor them into my playing and purchasing decisions.

Hoover36 · 07-04-2020, 11:27 AM

Quote:

Originally Posted by Syd Thrift

They use an algorithm to create ratings out of the stats. They’ve been using the same basic algorithm for nearly 20 years and it comes pretty close for pretty much any season at this point.

I would disagree, it does not come pretty close. It is "ballpark" at best. It is the sole reason that I stopped buying OOTP regularly. Perfect team (for a while) has brought me back and while I find enjoyment in PT I will continue to purchase OOTP. However, if the ratings accuracy doesn't improve, I will slowly lose interest once again and return to 6.5 solely.

Quote:

Originally Posted by Syd Thrift

Playing nothing but stars vs stars is going to do screwy things with the numbers, even if the ratings were originally “right” for the era/season/etc.

Agreed. But even when playing stars vs stars, something is "off" when Ichiro still "feels" like Ichiro, while Babe feels like Adam Dunn.

RonCo · 07-04-2020, 01:30 PM

Quote:

Originally Posted by Hoover36

But even when playing stars vs stars, something is "off" when Ichiro still "feels" like Ichiro, while Babe feels like Adam Dunn.

That's an interesting comparison. By raw stats alone, it seems farcical that these two could be similar players. But when you think about it and realize that the Babe led the league in strikeouts many years, you begin to scratch your head. The Babe, relative to his peers, had a very high strikeout rate--but his peers' rates were not so high (Ks in the 1920s and 1930s were quite low, relatively). Put the Babe into a modern setting where he still leads the league in Ks, and his batting average and doubles will plummet because fewer balls get put into play, and his HR will drop for similar reasons. If his Ks double and his batting average plummets ... well ...

Likewise, put Adam Dunn and his well above average power into an era where Ks don't happen much and his Ks fall off the table, while his average, doubles and HR rise.

I admit I'm now speaking outside my general experience, but one of the "issues" with the PT environment as I understand it is that it uses a fairly modern era as its baseline (adjusting everyone to that modern era). That era happens to be very much Ichiro's, not so much the Babe's. I wouldn't have thought about it until you said it, but when I think about it closely it doesn't strain credibility to say the Babe would be like Adam Dunn in today's high-K world. That's kind of interesting, really.

That doesn't make it more fun to own Babe Ruth in these games, but it's an interesting conversation piece.

Hoover36 · 07-04-2020, 03:36 PM

Ichiro was an anomaly during his playing time...Same way Ruth was. To allow one to thrive and the other to be relegated to reserve rosters still doesn't "feel" right. I agree it is an interesting conversation. However, ignore Ruth for now and swap in Trout, who also plays in this era. Trout, similar to Ruth, can't hold his own...yet he is the best player in baseball in this era. That sounds like a ratings issue in general or an engine issue with treatment of Avoid K's.

RonCo · 07-04-2020, 03:46 PM

Quote:

Originally Posted by Hoover36

Ichiro was an anomaly during his playing time...Same way Ruth was. To allow one to thrive and the other to be relegated to reserve rosters still doesn't "feel" right. I agree it is an interesting conversation. However, ignore Ruth for now and swap in Trout, who also plays in this era. Trout, similar to Ruth, can't hold his own...yet he is the best player in baseball in this era. That sounds like a ratings issue in general or an engine issue with treatment of Avoid K's.

You're probably right. Bottom line is that there will be stats warping due to era differences, and stats warping due to ratings within the population varying from "norm." There will also be natural random variation--which can be quite severe to our eyes. Add that to the engine warp, and one can get some occasionally weird looking results, too.

Like I said, though, I'm out of my experience bubble with deep PT stuff.

ubernoob · 07-04-2020, 04:03 PM

Quote:

Originally Posted by Hoover36

Ichiro was an anomaly during his playing time...Same way Ruth was. To allow one to thrive and the other to be relegated to reserve rosters still doesn't "feel" right. I agree it is an interesting conversation. However, ignore Ruth for now and swap in Trout, who also plays in this era. Trout, similar to Ruth, can't hold his own...yet he is the best player in baseball in this era. That sounds like a ratings issue in general or an engine issue with treatment of Avoid K's.

It's an issue of league normalization.

High BABIP/Low Power players translate to PT better due to this.

Hoover36 · 07-04-2020, 04:06 PM

I actually like the stat warping due to the era difference. When I used to run leagues, this setup of pitting players from all eras against one another was my longest running league, going some 70+ seasons. I am familiar with a Ruth who from 1920-1932 faced pitchers with an average of 30 Stuff, 70 Control and 87 Movement, which would result in his .356 average, 46 HR and 75 K's. Pitting him up against (I'll assume all SE's) 100 Stuff, 85 Control, 81 Movement, his K's would go up, average would go down...but not from .356 to .200. His power shouldn't be all that affected; he'd be getting fewer hits, but he is hitting against pitchers with lower movement than he faced.

Anyway, I would really love to see a validation of the ratings for the players against the players they actually faced in those specific years. It would make the stat output from facing the star vs star set up much more enjoyable and realistic.

Hoover36 · 07-04-2020, 04:13 PM

Quote:

Originally Posted by ubernoob

It's an issue of league normalization.

High BABIP/Low Power players translate to PT better due to this.

It's somewhat of a cop out to just say that without any data or anecdotal evidence to support the statement.

Saying it's just league normalization doesn't explain why Mike Trout gets hammered similarly to Ruth. Either Trout's ratings are wrong or the engine's handling of Avoid K's has an issue.

There are hundreds of different ways to build a team, whether real life or simulated. Forcing everyone to use only High Avoid K, High Contact players shows a flaw in the system.

Quite frankly this game is far and away too exceptional in every other regard to simply ignore this one flaw.

RonCo · 07-04-2020, 04:19 PM

Quote:

Originally Posted by Hoover36

I actually like the stat warping due to the era difference. When I used to run leagues, this setup of pitting players from all eras against one another was my longest running league, going some 70+ seasons. I am familiar with a Ruth who from 1920-1932 faced pitchers with an average of 30 Stuff, 70 Control and 87 Movement, which would result in his .356 average, 46 HR and 75 K's. Pitting him up against (I'll assume all SE's) 100 Stuff, 85 Control, 81 Movement, his K's would go up, average would go down...but not from .356 to .200. His power shouldn't be all that affected; he'd be getting fewer hits, but he is hitting against pitchers with lower movement than he faced.

You might be surprised by how far Ruth's average would fall. If you take his numbers, and double his K-rate (which isn't unreasonable given the K-rate difference in eras), keep his HR, BABIP, and walk rates essentially the same, his career batting average falls to something like .237. I did that exercise some time back, so I could be a little off on that .237, but it's close. The number .244 comes to mind, too. Whatever, the difference was so startling I went back and triple checked my numbers at the time.

The impact of the Stuff/AvK match-up is a bigger driver than the MOV/Power match-up because (unless the warp is HHHHUUUUUGGGGEEE) there are a lot fewer HR than Ks in an average game. In other words, a 20% warp in HR affects one event every two or three games, whereas a 20% warp in K (and BABIP, for that matter) affects 2-3 plays per game.
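The kind of exercise described above (scale the K rate, hold HR and BABIP fixed, recompute the average) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 500 AB per season is a made-up round number, the .356 / 46 HR / 75 K line is the one quoted earlier in the thread, sacrifice flies are ignored, and the function names are hypothetical. It is not a reconstruction of the .237 figure, which came from different inputs; career totals or a different K multiplier will give different answers.

```python
def babip(ab, hits, hr, k):
    """Batting average on balls in play, ignoring sacrifice flies."""
    return (hits - hr) / (ab - k - hr)

def avg_with_scaled_k(ab, hits, hr, k, k_multiplier):
    """Batting average after scaling strikeouts, holding HR and BABIP fixed."""
    rate = babip(ab, hits, hr, k)
    new_k = k * k_multiplier
    balls_in_play = ab - new_k - hr      # fewer balls in play as Ks go up
    new_hits = hr + rate * balls_in_play
    return new_hits / ab

# Illustrative season line: AB is an assumption, the rest is quoted in the thread.
ab, hr, k = 500, 46, 75
hits = round(0.356 * ab)

new_avg = avg_with_scaled_k(ab, hits, hr, k, k_multiplier=2.0)
print(f"{hits / ab:.3f} -> {new_avg:.3f}")
```

Changing `k_multiplier` or the assumed at-bats shows how sensitive the result is to the inputs, which is most of the disagreement in this part of the thread.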

CBeisbol · 07-04-2020, 04:37 PM

Ruth, in 1927, had a K%+ of 214. He K'd over twice as often as the league average hitter.

Move him to 2011; I think that's the year that people have said the PT stats are based on. No player in 2011 had a K%+ higher than 200. The highest was Mark Reynolds at 176. That was a 31% K rate. If Ruth K'd at greater than twice the league rate in 2011, that would be something like a 40% strikeout rate.

Hard to hit for a high average when you're doing that.
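The arithmetic behind that estimate, as a short Python sketch. It assumes K%+ is simply 100 times the player's K rate divided by the league K rate (no park or other adjustments), and the helper names are made up for illustration. The 31%, 176 and 214 inputs are the ones given above.

```python
def league_rate_from_index(player_rate, player_index):
    """Back out the league K rate from one player's K rate and K%+.
    Assumes K%+ = 100 * player K% / league K%."""
    return player_rate / (player_index / 100)

def rate_from_index(index, league_rate):
    """K rate implied by a K%+ index in a league with the given K rate."""
    return league_rate * index / 100

# Inputs quoted in the post: Reynolds at a 31% K rate and 176 K%+, Ruth at 214 K%+.
league_k = league_rate_from_index(0.31, 176)
ruth_k = rate_from_index(214, league_k)
print(f"implied 2011 league K rate: {league_k:.1%}")
print(f"Ruth at 214 K%+ in that league: {ruth_k:.1%}")
```

Under that assumption the implied league rate is a bit under 18% and Ruth comes out a bit under 38%, which is in line with the "something like 40%" estimate.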

RonCo · 07-04-2020, 04:50 PM

Quote:

Originally Posted by ubernoob

It's an issue of league normalization.

High BABIP/Low Power players translate to PT better due to this.

Just doing mental gymnastics (and noting again that I have ZERO real familiarity with PT), I'd guess Power is a small factor, but that AvK and BABIP have higher impact in any translation from era to era. Mike Trout is a fairly high K-rate guy over his career, and translating him back even ten years will warp him in ways that could be big enough to affect his resulting batting average in noticeable ways.

This isn't a design "flaw," as Hoover36 is calling it, so much as it is an indication of the real world issue that moving players from era to era will naturally result in a squeezing of the results in some ways that might feel unnatural. At the end of the day, there are only so many plate appearances. The designer has to find the least offensive way to put all that toothpaste back into the tube, and there won't be anything that's "right."

ubernoob · 07-04-2020, 05:17 PM

Quote:

Originally Posted by RonCo

Just doing mental gymnastics (and noting again that I have ZERO real familiarity with PT), I'd guess Power is a small factor, but that AvK and BABIP have higher impact in any translation from era to era. Mike Trout is a fairly high K-rate guy over his career, and translating him back even ten years will warp him in ways that could be big enough to affect his resulting batting average in noticeable ways.

This isn't a design "flaw," as Hoover36 is calling it, so much as it is an indication of the real world issue that moving players from era to era will naturally result in a squeezing of the results in some ways that might feel unnatural. At the end of the day, there are only so many plate appearances. The designer has to find the least offensive way to put all that toothpaste back into the tube, and there won't be anything that's "right."

No, it's the fact that there can only be so many HRs in any given league (+/- a small amount) and everyone with power is fighting it out for those HRs. So when they hit way less than normal due to this, their average plummets because they aren't hitting singles or doubles to make up for the lost HRs.

Dropping from 50-60 HR to 20-30 is 30 lost hits a year, that's a big chunk of average for sluggers with eye.

RonCo · 07-04-2020, 05:28 PM

Quote:

Originally Posted by ubernoob

No, it's the fact that there can only be so many HRs in any given league (+/- a small amount) and everyone with power is fighting it out for those HRs. So when they hit way less than normal due to this, their average plummets because they aren't hitting singles or doubles to make up for the lost HRs.

Again, from the other thread, your assumption of distributing a set number of HR is not actually how the base engine works. I know it can appear that way, but I'm about as sure as a non-developer can be that--as long as the PT game engine is like the base game in its base function--you're not correct in that assessment.
