How does the NCAA Volleyball Player of the Year Get Chosen?

Introduction

For the last 3 years, I’ve been obsessed with volleyball. Growing up, I played tennis and basketball and dropped them in college as life got in the way. During my Master’s program, I found an intramural group that played volleyball and it was like combining the best parts of tennis and basketball. The non-contact aspect of tennis (who’s trying to get hurt at age 23?!) and the jumping aspect of basketball. I fell in love immediately and began to consume volleyball content whatever way I can. One of those ways involved watching a lot of NCAA Men’s Volleyball.

If you’re obsessed with volleyball like me, you know that Moni Nikolov took the NCAA by storm last year. As a 17 year old 6’9 starter for the Bulgarian national team, he quickly made his presence known in America. It just seemed like he was leagues above everyone else, eventually leading to a chip and a NCAA Player of the Year award to wrap it all up. (I actually had the chance to go watch Moni Nikolov last year at Lewis University set the collegiate record for serve speed!)

Now if you watched any of his games, you’d be like, yeah, its super obvious but does the data back it up? That’s what I’m going to try to answer today. A quick Google search didn’t give a satisfying answer, a lot of it was for women’s volleyball or men’s basketball. And for men’s basketball, it seems its determined by a panel of media members or a selection committee.

I’m sort of surprised the SEO for using data science to pick the player of the year (atleast for basketball) isn’t first page worthy. (Google fix your algo)

So let’s do it ourselves! I found this R package called ‘naccvolleyballr’ which is basically an api wrapper for volleyball stats from the NCAA website and fortunately for us, it has all player statistics for the 2024 season. Let’s take a quick look at what variables we have with the skimr package.

Data summary
Name	mvb_playerseason_div1_202…
Number of rows	1077
Number of columns	23
_______________________
Column type frequency:
character	5
numeric	18
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
team	1	3	19	66
conference	1	3	20	9
player	1	7	24	1074
yr	1	2	3	5
pos	1	1	4	9

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
height_cm	25	0.98	191.93	8.36	165.1	187.96	193.04	198.12	213.36	▁▃▇▇▁
gp	0	1.00	16.62	8.96	1.0	9.00	18.00	24.00	33.00	▆▃▅▇▃
gs	0	1.00	9.64	10.34	0.0	0.00	5.00	20.00	33.00	▇▂▂▃▁
s	0	1.00	50.33	33.90	1.0	18.00	50.00	81.00	121.00	▇▅▅▆▂
kills	0	1.00	67.36	92.42	0.0	2.00	24.00	100.00	525.00	▇▂▁▁▁
errors	0	1.00	27.55	37.69	0.0	1.00	10.00	36.00	173.00	▇▁▁▁▁
total_attacks	0	1.00	156.74	208.17	0.0	7.00	62.00	231.00	1051.00	▇▂▁▁▁
hit_pct	93	0.91	0.21	0.26	-1.0	0.10	0.23	0.33	1.00	▁▁▆▇▁
assists	0	1.00	62.82	190.00	0.0	1.00	6.00	20.00	1229.00	▇▁▁▁▁
aces	0	1.00	8.09	10.70	0.0	0.00	4.00	12.00	106.00	▇▁▁▁▁
s_err	0	1.00	23.71	26.29	0.0	3.00	14.00	39.00	135.00	▇▂▁▁▁
digs	0	1.00	49.97	54.99	0.0	7.00	27.00	85.00	287.00	▇▂▂▁▁
r_err	0	1.00	6.02	10.01	0.0	0.00	1.00	7.00	53.00	▇▁▁▁▁
block_solos	0	1.00	3.19	4.88	0.0	0.00	1.00	5.00	38.00	▇▁▁▁▁
block_assists	0	1.00	17.31	21.51	0.0	0.00	8.00	27.00	104.00	▇▂▁▁▁
b_err	0	1.00	2.03	2.99	0.0	0.00	1.00	3.00	21.00	▇▁▁▁▁
pts	0	1.00	87.30	110.78	0.0	3.00	34.50	140.50	600.50	▇▂▁▁▁
bhe	0	1.00	0.32	0.86	0.0	0.00	0.00	0.00	11.00	▇▁▁▁▁

Each player has a variety of statistics such as hit_pct, kills, assists, etc. My first task in trying to answer the question, “did Moni deserve to win the Player of the Year” is coming up with a way to fairly judge a player’s contribution and impact on a team. Since player’s have different roles, they also will have different distributions when it comes to kills, digs, blocks, etc. A setter for instance will touch the ball way more than an attacker but in return, attack less and receive less. See the follow graph that shows the distribution of kills/set depending on which position you play.

Pin hitters (OH, OPPS, RS) tend to have a higher median when it comes to kills/set where setters, and liberos, and defensive specialists have lower kills/set. How do we fairly assess players?

Moni’s Season Statistics

During the 2024-2025 season, Moni had a hitting percentage of 0.398 ((Kills – Errors) / Total Attempts)) and averaged nearly an ace a set while bringing in on average 10 assists per set.

# A tibble: 16 × 3
   stat             value     rate
   <chr>            <dbl>    <dbl>
 1 gp              33      0.3    
 2 s              110      1      
 3 kills          173      1.57   
 4 errors          48      0.436  
 5 total_attacks  314      2.85   
 6 hit_pct          0.398  0.00362
 7 assists       1109     10.1    
 8 aces           106      0.964  
 9 s_err          128      1.16   
10 digs           168      1.53   
11 r_err            0      0      
12 block_solos      9      0.0818 
13 block_assists   72      0.655  
14 b_err            5      0.0455 
15 pts            324      2.95   
16 bhe              0      0

Comparing Moni to Other Players

I first turned to Google to try to see if I could find some commonly used metrics or composite scores but nothing easily popped out. Then I went on Google Scholar to see if anyone’s made any papers. My cursory search led to nothing sadly. Then I went on ChatGPT and it was able to generate some suggestions. Let’s give those suggestions a try and see where we go from there.

ChatGPT first gave me some suggested metrics to assess players:

We can compare using the normalized z-score composite:

z = (player_stat - position_mean)/position_sd and generate composite score based off these z-scores:

PIS = 0.35 * Offensive_Value + 0.35 * Defensive_Value + 0.20 * Serve_Value + 0.10 * Playmaking_Value

where off = 0.40z_kill_rate + 0.40z_hitting_pct + 0.20z_assist_rate def = 0.55z_block_rate + 0.45z_dig_rate + 0.10z_rerr_rate srv = 0.60z_ace_rate + 0.40z_serr_rate pm = 0.70z_assist_rate + 0.30z_bhe_rate

Contribution per touch (CPT) value per touch = points contributed / touches

Win Probability Added (WPA) Player WPA = total win probability added over match

These last two seem like they require more in-depth data such as play by play data which ncaavolleyballr does have but for now lets save this for later and focus on the first one. To begin, let’s also start a ground rule where you must have played in minimum 30 sets to be considered. I do this so that players who might have played extraordinarily well for 5 sets then never stepped foot on the court again aren’t considered and mess up the distribution. I think 30 sets or ranging from 6-10 matches allows things to even out a little.

rank	team	player	pos	PIS	z_kill_rate	z_hitting_pct	z_block_rate	z_dig_rate	z_assist_rate	z_ace_rate	z_serr_rate	z_rerr_rate	z_error_rate	z_bhe_rate	off	def	srv	pm
1	Fort Valley St.	Oshane Farquharson	OH	2.2876167	0.04223372	0.5460240	-0.1248540	0.398857825	14.9584002	-0.06562512	0.51533964	1.4734563	1.19911003	-0.4208078	3.2269831	0.2581620	0.16676079	10.3446378
2	Charleston (WV)	Joonatan Salpakari	L	1.6858209	0.46489347	0.2329830	5.5009689	0.395221217	1.5513440	4.47626352	-3.58716221	-0.2331110	-0.15929137	0.2575454	0.5894194	3.1800714	1.25089322	1.1632044
3	Lincoln Memorial	Matthew Gentry	MB	1.6836298	2.21719835	1.9373855	1.8092538	1.500345732	1.1291423	3.83282952	-1.57707428	0.3494200	0.96361025	0.4439664	1.8876620	1.7051872	1.66886800	0.9235896
4	Long Beach St.	Moni Nikolov	S	1.4190775	1.04406146	0.8095261	1.8778209	0.718102238	0.9987032	5.71071019	-2.14367255	0.3450180	-0.09982329	0.6436025	0.9411757	1.3904493	2.56895709	0.8921730
5	Pepperdine	George Dyer	L	1.2923005	0.56548640	0.3435073	4.9716384	-0.242441577	1.1782507	4.54124810	-6.07339789	0.4121764	-0.08789357	0.2575454	0.5992476	2.6665200	0.29538971	0.9020391
6	Daemen	Kyle Zelasko	DB	1.1338106	1.67357090	1.1552035	2.1055685	1.759245424	0.6107251	0.06078953	-0.07767272	-1.2154939	0.42649828	0.3779645	1.2536548	1.8281737	0.00540463	0.5408969
7	UC Irvine	Hilir Henno	OH	1.0790196	0.90544813	1.0494117	0.6095088	0.706870816	0.6294195	4.66231184	-0.91715945	-0.2350351	0.77864949	0.3559813	0.9078278	0.6298182	2.43052332	0.5473880
8	Loyola Chicago	Nicodemus Meyer	MB	0.9732250	1.60362392	1.6110706	1.8633624	0.009975925	1.5639833	-0.65347260	0.02715742	0.1961918	1.20901295	0.4439664	1.5986744	1.0489577	-0.38122059	1.2279782
9	Mount Olive	Jackson Lahey	S	0.9666382	0.49616855	0.6160337	2.4327592	0.264407285	1.2108698	1.56662533	-1.09901114	0.3450180	0.54309281	0.6436025	0.6870548	1.4915026	0.50037074	1.0406896
10	Loyola Chicago	Ryan McElligott	S	0.9642723	1.07447506	0.9008975	1.7462550	1.142528109	1.2004207	1.14936006	-1.99028778	0.1585909	0.06788611	0.6436025	1.0302332	1.4904370	-0.10649908	1.0333752

Using this composite score, Moni Nikolov is 4! His kill rate is better than about 84% of players. His ace rate is absolutely insane with a z-score of 5.7. This composite argues in favor of Moni Nikolov being the player of the year. Who are the other three that are rated higher than Moni?

Oshane Farquharson is a 6-4 outside hitter for the Fort Valley St. Wildcats where they have won back to back conference (SIAC) champions and the first HBCU to earn consecutive appearances at Nationals. Joonatan Salpakari is a 6-3 outside hitter for the Charleston Golden Eagles where they went 12-17 and lost in the Quarterfinals of the EIVA tournament. Matthew Gentry is a 6-6 middle blocker for the Lincoln Memorial Railsplitters and they went 24-1 last season and won their IVA tournament.

Sidenote: Oshane has an insane assist rate z-score of 13.54 for an OH. He averages 7.04 assists per set. Is this really true for an OH? Turns out…no! After some ad hoc diving and some footage review, I think he was misreported as an outside but he’s actually a setter.

This score seems to argue in favor of Nikolov’s award but what about winning? At the end of the day, its all about winning. How did the top 10 players end up win wise?

rank	team	player	PIS	Overall.record
1	Fort Valley St.	Oshane Farquharson	2.2876167	16-10 (0.615)
2	Charleston (WV)	Joonatan Salpakari	1.6858209	12-17 (0.414)
3	Lincoln Memorial	Matthew Gentry	1.6836298	NA
4	Long Beach St.	Moni Nikolov	1.4190775	30-3 (0.909)
5	Pepperdine	George Dyer	1.2923005	21-10 (0.677)
6	Daemen	Kyle Zelasko	1.1338106	NA
7	UC Irvine	Hilir Henno	1.0790196	21-7 (0.750)
8	Loyola Chicago	Nicodemus Meyer	0.9732250	25-4 (0.862)
9	Mount Olive	Jackson Lahey	0.9666382	20-4 (0.833)
10	Loyola Chicago	Ryan McElligott	0.9642723	25-4 (0.862)

Had to manually look it up because it didn’t get scraped but LMU had a record of 24-1 and Daemen had a record of 15-13 however, if you look at their opponents, their opponents are hardly against top ranked teams. For instance, LMU’s only lost came at the hands of a nationally ranked team where they proceeded to lose 3-0. Outside of LMU, LBSU is the only team that had above a 0.900 win rate.

On top of Moni being a top 5 player according to this metric, his team also had one of the highest win rate and they won the national championship. This kid’s pretty good I suppose!

Comparing Moni to Previous Player of the Years

Moni’s pretty good but how does he stack up against other Player of the Years. Where does he rank? What do player of the years have in common?

There have been 35 Players of the Year from 1991-2025. Sadly the R package does not go beyond 2020 and https://stats.ncaa.org/rankings/change_sport_year_div goes back to 2008. After some time spent trying to scrape the data from stats.ncaa.org, I just ended up grabbing the data manually. The juice was not worth the squeeze especially for 18 rows.

Data summary
Name	ncaa_poy_2009_2025
Number of rows	17
Number of columns	21
_______________________
Column type frequency:
character	5
numeric	16
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
name	1	10	21	16
pos	1	1	3	3
year	1	7	7	17
team	1	3	19	9
class	1	3	3	4

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
s	0	1.00	99.29	16.23	61.00	96.00	102.00	105.00	120.0	▂▁▃▇▃
ms	7	0.59	24.00	12.21	1.00	28.00	29.00	30.00	33.0	▂▁▁▁▇
kills	0	1.00	356.12	181.10	46.00	244.00	388.00	499.00	661.0	▅▅▆▇▃
errors	0	1.00	91.94	43.75	11.00	77.00	101.00	121.00	157.0	▃▁▂▇▃
total_attacks	0	1.00	711.06	372.72	97.00	487.00	755.00	982.00	1287.0	▇▃▇▆▇
hit_pct	0	1.00	0.37	0.05	0.28	0.35	0.38	0.39	0.5	▂▃▇▁▁
assists	0	1.00	285.18	492.82	2.00	20.00	31.00	62.00	1455.0	▇▁▁▂▁
aces	0	1.00	44.71	21.98	19.00	29.00	38.00	57.00	106.0	▇▂▆▁▁
s_err	3	0.82	72.79	31.76	22.00	53.00	74.50	95.00	128.0	▆▃▇▆▃
digs	0	1.00	171.24	43.23	93.00	146.00	168.00	204.00	258.0	▃▆▇▇▂
r_err	6	0.65	18.64	16.30	1.00	7.50	15.00	22.00	55.0	▃▇▁▁▁
block_solos	0	1.00	7.94	3.34	1.00	6.00	8.00	10.00	13.0	▂▂▇▃▅
block_assists	0	1.00	57.71	14.52	38.00	51.00	54.00	64.00	101.0	▃▇▃▁▁
b_err	4	0.76	5.69	3.95	1.00	2.00	5.00	7.00	14.0	▇▇▆▁▃
pts	3	0.82	433.93	172.09	96.50	335.38	474.75	559.25	646.0	▃▁▅▆▇
bhe	7	0.59	2.20	1.14	0.00	2.00	2.00	3.00	4.0	▂▂▇▆▂

What do the best players have in common? Their hitting percentages are quite high, with only one player below 0.300 (0.275) with most hovering between 0.350 and 0.400.

rank	team	name	pos	year	PIS	z_kill_rate	z_hitting_pct	z_block_rate	z_dig_rate	z_assist_rate	z_ace_rate	z_serr_rate	z_rerr_rate	z_error_rate	z_bhe_rate	off	def	srv	pm
1	Stanford	Kawika Shoji	S	2009-10	0.9526433786	0.47614078	0.8612867	-0.16575457	2.45262926	2.2593258	-0.7237880	1.5843333	0.79489130	0.83950785	0.9001090	0.98683615	1.09200729	0.19946050	1.8515608
2	Pepperdine	Paul Carroll	OPP	2008-09	0.6548768259	0.11127327	0.3710930	1.95022403	1.22242132	-0.5487738	-0.7112227	1.5843333	0.79489130	0.50565211	0.9001090	0.08319173	1.70220194	0.20699969	-0.1141090
3	Long Beach St.	Moni Nikolov	S	2024-25	0.5935562139	1.03154555	0.4989696	0.33198844	-0.63404115	1.4851375	2.2346965	-1.4458687	0.79489130	-0.72849153	0.9001090	0.90923356	-0.02323574	0.76247042	1.3096289
4	Long Beach St.	Josh Tuaniga	S	2017-18	0.4682620409	2.44287346	2.5663084	-1.63919565	-0.03720823	1.5574282	-0.7340073	-0.3884545	0.79489130	0.84887278	-1.2773958	2.31515840	-0.83881218	-0.59578618	0.7069810
5	BYU	Gabi Garcia Fernandez	OPP	2019-20	0.2671344466	-0.19839555	-0.6732328	1.72325838	-0.64294582	-0.5554003	2.0328446	-0.6782279	0.56482283	-0.93811731	0.9001090	-0.45973141	0.71494878	0.94841560	-0.1187475
6	Long Beach St.	TJ DeFalco	OH	2018-19	0.0451325015	1.05816955	0.4776568	0.07380840	-0.40539275	-0.4600436	0.3953348	-1.0447858	-0.14971674	-0.82794399	-0.4817690	0.52232182	-0.15680379	-0.18071343	-0.4665612
7	Hawaii	Rado Parapunov	OPP	2020-21	0.0192907452	-0.07054534	-0.9716116	1.61551205	-0.08621352	-0.5759844	-0.4643753	-0.5389933	0.57898089	-1.71788308	0.9001090	-0.53205968	0.90763363	-0.49422251	-0.1331564
8	Loyola Chicago	Thomas Jaeschke	OH	2014-15	0.0007796494	0.13569402	0.1366525	-0.63196054	1.03889326	-0.5096668	0.5052812	-0.5586074	-0.81319143	0.02749145	0.1515917	0.00700524	0.03860453	0.07972577	-0.3112892
9	Southern California	Murphy Troy	OPP	2010-11	-0.0463428840	-0.27262867	-0.3535413	0.28098023	-0.54800475	-0.5648417	-0.2412527	1.5843333	0.79489130	-0.23011595	0.9001090	-0.36343632	-0.01257388	0.48898170	-0.1253565
10	Long Beach St.	TJ DeFalco	OH	2016-17	-0.0894557102	0.28033217	0.4137185	0.04994302	0.02393460	-0.4958450	-0.7287474	-0.3049011	-0.09944348	0.32141887	-0.5088647	0.17845127	0.02829489	-0.55920888	-0.4997509

Moni’s ranked third on this list, right behind Kawika Shoji (Erik Shoji’s older brother) and Paul Carroll. Do you agree with this assessment?

Player Profiles

To wrap this up, let’s take a closer look by performing a Principle Component Analysis using “hit_pct”, “kill_rate”, “error_rate”, “ace_rate”, “serr_rate”, “dig_rate”, “rerr_rate”, “block_rate”, “block_error_rate”, “assist_rate”, and “bhe_rate” to generate player profiles. As alwasy make sure to scale and center variables since the center and scales are different across the variables.

Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7
Standard deviation     1.7633 1.6509 1.4065 1.1862 0.80125 0.70061 0.50223
Proportion of Variance 0.2826 0.2478 0.1798 0.1279 0.05836 0.04462 0.02293
Cumulative Proportion  0.2826 0.5304 0.7103 0.8382 0.89653 0.94115 0.96408
                           PC8     PC9    PC10     PC11
Standard deviation     0.42532 0.38700 0.25391 0.002035
Proportion of Variance 0.01644 0.01362 0.00586 0.000000
Cumulative Proportion  0.98052 0.99414 1.00000 1.000000

                         PC1         PC2         PC3
hit_pct          -0.23988215  0.44084055 -0.26601060
kill_rate        -0.35039457  0.22872112 -0.41914811
error_rate       -0.11745159 -0.45712062 -0.18337249
ace_rate         -0.23476064 -0.37701756 -0.29889921
serr_rate        -0.46631724 -0.23712738 -0.03771777
dig_rate          0.33019383  0.38276045 -0.17681749
rerr_rate        -0.19077279  0.02111884  0.50554668
block_rate        0.23890069 -0.32318354 -0.30502767
block_error_rate -0.45163448  0.10242376 -0.04026420
assist_rate      -0.06833274  0.25493365 -0.32246529
bhe_rate         -0.34954566  0.13832714  0.37904559

3 Principal Components explain roughly 71% of the variance and if we take a look at those three loadings, we can pick out different characteristics. For PC1, the only positives are dig rate and block rate while the rest are negative so high PC1 seems to align with defense ability while low values correspond to perhaps more offensive players. PC2 has high values in hit_pct, dig_rate, assist_rate so it seems like this loading is describing a fundamentally sound or high variance player. PC3 seems to be be about ball control seeing the high values for receiving error and ball handling error.

From plotting PC1 and PC2, we can see Opposites then to have high PC 1 scores but low PC 2 scores suggesting they are quite controlled and high variance when playing offense. Setters are all over the place. Josh Tuaniga has very low PC1 but very high PC 2 pointing to his efficiency but risky plays. Outsides are more concentrated in PC2 but are spread out along PC 1. Taylor Sander and Taylor Crabb who now play beach volleyball have the two highest PC1 loadings for outsides which reinforces their ability to control the ball, whereas someone like Alex Nikolov (Moni’s brother) and TJ DeFalco are more risky and offensive/aggressive players.

Here, I’ve plotted them on a radar plot and we can see how these positions differ and how Moni’s skill distribution compares to theirs. Moni’s profile is actually quite different than the average Setter profile, his dig rate is lower but his service ace and error rate are all very high. This is probably attributed to his 6-9 frame which allows him to be more aggressive at the baseline and in offensive opportunities.

In conclusion, Moni ranked highly on this AI generated metric we calculated, led his team to a chip, and when compared to previous player of the years, he was ranked third. His long frame has enabled him to become one of the more aggressive Setters we’ve seen, earning him a huge Ace count (but also a huge serving error rate) in addition to block rate and kills. (There’s this clip of Andrew Rowan pushing the ball over and Moni hammering it down like it was a perfect set to him). The data backs up his award but at the end of the day, if you’ve watched his game, it doesn’t take much to see that this guy’s good, really good.

Appendix: All code for this report

library(ncaavolleyballr)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(gt)
# grabs player information for 2024 season
# system.time({
#   ncaavolleyballr::division_stats(
#     year = 2024,
#     sport = "MVB",
#     level = "teamseason",
#     save = TRUE,
#     path = "."
#   )
# })
  # ncaavolleyballr::division_stats(
  #   year = 2024,
  #   sport = "MVB",
  #   level = "teamseason",
  #   save = TRUE,
  #   path = "."
  # )
mvb_playerseason_div1_2024 <- read_csv("~/Documents/upskilling/Volleyball Analytics/data/mvb_playerseason_div1_2024.csv")

# ncaa_mvb_2024_player_statistics <- readRDS("~/Documents/upskilling/Volleyball Analytics/data/ncaa_mvb_2024_player_statistics.RDS")
mvb_playerseason_div1_2024 <- mvb_playerseason_div1_2024 |>
  janitor::clean_names() |> 
  select(team, conference, player, yr, pos, ht, gp:bhe, -c(tripl_dbl, ms)) |> 
    mutate(
    ht_cm = str_match(ht, "^(\\d+)-(\\d+)$"),      # extract feet + inches
    feet  = as.numeric(ht_cm[,2]),
    inches = as.numeric(ht_cm[,3]),
    height_cm = (feet * 12 + inches) * 2.54        # convert total inches → cm
    , .before = ht
  ) %>%
  filter(pos != "") |> 
  replace_na(list(total_attacks = 0,
                  kills = 0,
                  errors = 0,
                  r_err = 0,
                  block_solos = 0,
                  block_assists = 0,
                  b_err = 0,
                  pts = 0,
                  bhe = 0,
                  aces = 0,
                  assists = 0,
                  s_err = 0,
                  digs = 0 )) |> 
  select(-ht, -ht_cm, -feet, -inches)        

skimr::skim(mvb_playerseason_div1_2024)
# example of uneven distribution
ggplot(mvb_playerseason_div1_2024, aes(x = kills/s, fill = pos)) +
  geom_boxplot()
mvb_playerseason_div1_2024 |> 
  filter(player == "Moni Nikolov") |> 
  select(gp, s, kills: bhe) |> 
   pivot_longer(
    cols = everything(),
    names_to = "stat",
    values_to = "value"
  ) |> 
  mutate(rate = value / value[stat == "s"])
curr_data <- mvb_playerseason_div1_2024 |> 
  filter(s >= 30) |> 
mutate(
  kill_rate        = kills / total_attacks,
  error_rate       = errors / total_attacks,
  ace_rate         = aces / s,
  serr_rate        = s_err / s,
  dig_rate         = digs / s,
  rerr_rate        = r_err / s,
  block_rate       = (block_solos + block_assists/2) / s,
  block_error_rate = b_err / s,
  assist_rate      = assists / s,
  bhe_rate         = bhe / s
) |> 
  group_by(pos) %>% 
mutate(
  z_kill_rate        = scale(kill_rate)[,1],
  z_hitting_pct      = scale(hit_pct)[,1],
  z_block_rate       = scale(block_rate)[,1],
  z_dig_rate         = scale(dig_rate)[,1],
  z_assist_rate      = scale(assist_rate)[,1],
  z_ace_rate         = scale(ace_rate)[,1],
  z_serr_rate        = -scale(serr_rate)[,1],   # negative stat → invert
  z_rerr_rate        = -scale(rerr_rate)[,1],
  z_error_rate       = -scale(error_rate)[,1],
  z_bhe_rate         = -scale(bhe_rate)[,1]
) |> 
  mutate(
    off = 0.40*z_kill_rate     +
      0.40*z_hitting_pct   +
      0.20*z_assist_rate,
    def = 0.55*z_block_rate +
      0.45*z_dig_rate   +
      0.10*z_rerr_rate,
    srv = 0.60*z_ace_rate +
      0.40*z_serr_rate,
    pm = 0.70*z_assist_rate +
     0.30*z_bhe_rate,
    PIS = 0.35*off + 0.35*def + 0.20*srv + 0.10*pm
  ) |> ungroup()
curr_data |> 
  select(team, player, pos, PIS, z_kill_rate:pm) |> 
  arrange(desc(PIS)) |> 
    mutate(rank = row_number(), .before = 1) |>
  head(10) |> 
  gt::gt()
temp_data <- curr_data |> 
  select(team, player, PIS, z_kill_rate:pm) |> 
  arrange(desc(PIS)) |> 
    mutate(rank = row_number(), .before = 1) |>
  head(10)

 
team_standings <- map(
 mvb_teams |> filter(team_name %in% temp_data$team, yr == 2024) |> pull(team_id), 
 ~ {
    Sys.sleep(15)
    temp <- team_season_info(team_id = .x)
    
    c(temp$team_info$team_name, temp$record)
  },
  .progress = TRUE
)

team_results <- do.call(rbind,team_standings) |> data.frame() |> 
  rename(team_name = V1)

saveRDS(team_results, "ncaa_mvb_2024_team_results.RDS")

ncaa_mvb_2024_team_results <- readRDS("~/Documents/upskilling/Volleyball Analytics/data/ncaa_mvb_2024_team_results.RDS")

curr_data |> 
  select(team, player, PIS, z_kill_rate:pm) |> 
  arrange(desc(PIS)) |> 
    mutate(rank = row_number(), .before = 1) |>
  head(10) |> 
  left_join(ncaa_mvb_2024_team_results, by = join_by(team == team_name)) |> 
  select(rank, team, player, PIS, Overall.record
) |> 
  gt()

ncaa_poy_2009_2025 <- readxl::read_excel("~/Documents/upskilling/Volleyball Analytics/data/ncaa_poy_2009_2025.xlsx") |>
  janitor::clean_names() |>
  rename(hit_pct = pct)


skimr::skim(ncaa_poy_2009_2025)
curr_data <- ncaa_poy_2009_2025 |>
  filter(s >= 30) |>
  replace_na(list(total_attacks = 0,
                  kills = 0,
                  errors = 0,
                  r_err = 0,
                  block_solos = 0,
                  block_assists = 0,
                  b_err = 0,
                  pts = 0,
                  bhe = 0,
                  aces = 0,
                  assists = 0,
                  s_err = 0,
                  digs = 0 )) |>
mutate(
  kill_rate        = kills / total_attacks,
  error_rate       = errors / total_attacks,
  ace_rate         = aces / s,
  serr_rate        = s_err / s,
  dig_rate         = digs / s,
  rerr_rate        = r_err / s,
  block_rate       = (block_solos + block_assists/2) / s,
  block_error_rate = b_err / s,
  assist_rate      = assists / s,
  bhe_rate         = bhe / s
)
curr_data |> 
mutate(
  z_kill_rate        = scale(kill_rate)[,1],
  z_hitting_pct      = scale(hit_pct)[,1],
  z_block_rate       = scale(block_rate)[,1],
  z_dig_rate         = scale(dig_rate)[,1],
  z_assist_rate      = scale(assist_rate)[,1],
  z_ace_rate         = scale(ace_rate)[,1],
  z_serr_rate        = -scale(serr_rate)[,1],   # negative stat → invert
  z_rerr_rate        = -scale(rerr_rate)[,1],
  z_error_rate       = -scale(error_rate)[,1],
  z_bhe_rate         = -scale(bhe_rate)[,1]
) |> 
  mutate(
    off = 0.40*z_kill_rate     +
      0.40*z_hitting_pct   +
      0.20*z_assist_rate,
    def = 0.55*z_block_rate +
      0.45*z_dig_rate   +
      0.10*z_rerr_rate,
    srv = 0.60*z_ace_rate +
      0.40*z_serr_rate,
    pm = 0.70*z_assist_rate +
     0.30*z_bhe_rate,
    PIS = 0.35*off + 0.35*def + 0.20*srv + 0.10*pm
  ) |> 
  ungroup() |> 
  select(team, name, pos, year, PIS, z_kill_rate:pm) |> 
  arrange(desc(PIS)) |> 
    mutate(rank = row_number(), .before = 1) |>
  head(10) |> 
  gt::gt()
# ---- Select the variables for PCA ----
rate_vars <- c(
  "hit_pct", "kill_rate", "error_rate", "ace_rate", "serr_rate",
  "dig_rate", "rerr_rate", "block_rate", "block_error_rate",
  "assist_rate", "bhe_rate"
)

pca_df <- curr_data |>
  select(all_of(rate_vars)) |>
  drop_na()  # should be fine already, but just in case

# ---- Run PCA on scaled data ----
pca_res <- prcomp(pca_df, center = TRUE, scale. = TRUE)

# PCA scores (PC1, PC2, PC3, ...)
scores <- as.data.frame(pca_res$x)

summary(pca_res)

# # ---- K-means on the first 3 PCs ----
# set.seed(123)
# k <- 3  # you can try 2,3,4 and compare
# km <- kmeans(scores[, 1:3], centers = k, nstart = 50)
# 
# ---- Attach cluster + PCs back to curr_data ----
clustered <- curr_data |>
  mutate(
    PC1     = scores$PC1,
    PC2     = scores$PC2,
    PC3     = scores$PC3
    # cluster = factor(km$cluster)
  )


ggplot(clustered, aes(PC1, PC2, #color = cluster,
                      label = name, color = pos)) +
  geom_point(size = 3) +
  ggrepel::geom_text_repel(show.legend = FALSE) +
  theme_minimal() +
  labs(
    title = "PCA + k-means Clustering of POY Players",
    subtitle = "Clustering on first 3 PCs",
    color = "Cluster",
    shape = "Position"
  )


# cluster_profiles <- clustered |>
#   group_by(cluster) |>
#   summarise(
#     n_players = n(),
#     across(all_of(rate_vars), mean, .names = "mean_{.col}")
#   )
# 
# cluster_profiles


pca_res$rotation[,1:3]
library(fmsb)

# ---- z-score the rate vars ----
scaled_rates <- curr_data |>
  select(all_of(rate_vars)) |>
  mutate(across(everything(), scale))  # each column -> mean 0, sd 1

# Attach back to clusters
clustered_scaled <- curr_data |>
  select(name, pos) |>
  bind_cols(
    scaled_rates
  )

cluster_means <- clustered_scaled |>
  group_by(pos) |>
  summarise(across(all_of(rate_vars), mean), .groups = "drop")

# cluster_means

# Set reasonable radar bounds for z-scores
radar_max <-  2   # +2 SD
radar_min <- -1   # -2 SD

# Make a named vector of colors for clusters (optional)
cluster_cols <- c("OH" = "red", "OPP" = "blue", "S" = "darkgreen")

par(mfrow = c(1, 3))  # multiple panels side by side; or comment this out and do one by one

for (cl in unique(clustered$pos)) {
  cl_row <- cluster_means |>
    filter(pos == cl) |>
    select(all_of(rate_vars))

  # Build radar data frame
  radar_df <- rbind(
    rep(radar_max, length(rate_vars)),
    rep(radar_min, length(rate_vars)),
    cl_row
  )
  colnames(radar_df) <- rate_vars
  rownames(radar_df) <- c("max", "min", paste("Cluster", cl))

  radarchart(
    radar_df,
    axistype = 1,
    pcol = cluster_cols[cl],
    pfcol = scales::alpha(cluster_cols[cl], 0.4),
    plwd = 2,
    cglty = 1,
    cglcol = "grey80",
    caxislabels = c("-2", "-1", "0", "1", "2"),
    # caxislabels = c("-1", "0", "1"),
    title = paste("Cluster", cl, "Profile (z-scores)")
  )
}
player_name <- "Moni Nikolov"

player_row <- clustered_scaled |>
  filter(name == player_name) |>
  select(all_of(rate_vars))

radar_df_player <- rbind(
  rep(radar_max, length(rate_vars)),
  rep(radar_min, length(rate_vars)),
  player_row
)
colnames(radar_df_player) <- rate_vars
rownames(radar_df_player) <- c("max", "min", player_name)

radarchart(
  radar_df_player,
  axistype = 1,
  pcol = "purple",
  pfcol = scales::alpha("purple", 0.4),
  plwd = 2,
  cglty = 1,
  cglcol = "grey80",
  caxislabels = c("-2", "-1", "0", "1", "2"),
  title = paste(player_name, "Profile (z-scores vs POY group)")
)

Did Moni Nikolove Deserve to be NCAA Volleyball Player of the Year?