DARPA's Cyber Grand Challenge took place this year right before DEFCON 24 and was a watershed moment for automated security analysis. All of the raw data from the competition has been out for a while now, but I haven't seen anyone analyze it yet. So, I took a stab at it myself.
I was specifically interested in looking at how well teams actually performed. The fact is that the winning Cyber Reasoning System (CRS), ForAllSecure's Mayhem, didn't do anything for the last 30 or so rounds of the game because it was taking too long to receive data. So, I set out to answer what I felt was the most pressing question: Was the only winning move not to play?
A strange game @DARPACGC it seems the only winning move is not to play.
— Mayhem CRS (@MayhemCRS) August 5, 2016
Scoring Formula
To figure this out, we need to first understand exactly how the scores were calculated. This was described during the CGC stream, but they displayed two different, contradicting formulas (check out 20:35 and 22:35)! Since I clearly couldn't trust them, I wanted to find it in print. Turns out it's harder to find than you'd think.
It's not in the Finalist Event Information Sheet, it's not in the latest version of the CGC Rules (section 3.2.3 at least outlines some of the components of the scoring algorithm, though), it's not in the Event FAQ or the overall CGC FAQ, and I don't see it posted on either their website or GitHub. There is, however, a little graphic on the bottom of page 4 of the CFE Brochure that states the overall formula:
score = availability * security * evaluation * 100
...where score is the score for an individual challenge set (CS). Each of the scores (one per CS) is then added together to calculate the final score for that round. Each of the scores for each round is then added together to calculate the final total score. Hilariously, all of this contradicts everything shown during the event.
How each of the component scores is calculated is never explained. I'm guessing they kept the same overall framework from the CQE Scoring Document. The full explanation is sorta lengthy (see the document), but here's the gist (with a small sketch of the arithmetic after the list):
- The availability metric ensures that the service is available (not disconnected or using too many resources) and functional (not missing required features). It is scored on a scale of 0 to 1. If your service isn't available at all, you receive no points for anything. If your service is perfectly available, you get 1 point. If there are problems, you receive something in between (see the formula in the document).
- The security metric ensures that the service is not vulnerable. It is scored on a scale of 0 to 1. If your service is vulnerable to all reported proofs of vulnerability for that service, you receive no points. If your service is no longer vulnerable after patching, you receive 1 point. If your service is partially vulnerable, you receive something in between (see the formula in the document).
- The evaluation metric grants bonus points for finding vulnerabilities. It is scored on a scale of 1 to 2. If you don't discover any vulnerabilities, you simply get 1 point. If you prove that a vulnerability exists, you get 2 points.
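To make the arithmetic concrete, here's a minimal Ruby sketch of how these components appear to combine for a single challenge set, using the multiply-and-scale formula from the CFE brochure with the CQE-style ranges above. The sample inputs are made up, and the partial-credit sub-formulas for availability and security are omitted (they're in the CQE Scoring Document):
def cqe_cs_score(availability, security, evaluation)
  # availability: 0.0 to 1.0, security: 0.0 to 1.0, evaluation: 1 or 2 (CQE-style ranges)
  availability * security * evaluation * 100
end
cqe_cs_score(1.0, 1.0, 2) # => 200.0 - fully available, fully patched, vulnerability proven
cqe_cs_score(1.0, 1.0, 1) # => 100.0 - fully available, fully patched, nothing proven
cqe_cs_score(0.7, 1.0, 1) # => 70.0  - partially available, fully patched, nothing proven
cqe_cs_score(1.0, 0.0, 2) # => 0.0   - still vulnerable to every reference PoV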
I believe there are only two marked differences between this explanation and what's published for the CFE (aside from the CQE using challenge binaries and the CFE using challenge sets). The first is that the security metric is now scored entirely based on PoVs from other teams and doesn't check whether you've patched the vulnerabilities the organizers intended (it's also now scored between 1 and 2). The second is that the evaluation metric is no longer simply 1 or 2 - it can be in between if your PoV only works on some of the teams, rather than all of them.
Assuming the above formulas and explanations are correct, we can say that, for each challenge set in a round, a CRS doing everything right would receive 400 points (1 * 2 * 2 * 100 = 400). A CRS doing nothing, by contrast, would receive either 100 points (1 * 1 * 1 * 100 = 100) if an exploit exists, or 200 points (1 * 2 * 1 * 100 = 200) if no exploit exists. Multiplying each of these by the number of challenge sets in a given round will give us the score for that round, and adding all those together will give us theoretical total scores.
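Here's a minimal Ruby sketch of what a single round would be worth to a do-nothing CRS versus a perfect one, using the per-challenge-set values above; the 15 challenge sets and the split of 4 exploited versus 11 unexploited are just illustrative numbers:
PERFECT_CS             = 400 # 1 * 2 * 2 * 100
DO_NOTHING_EXPLOITED   = 100 # 1 * 1 * 1 * 100 - someone lands a PoV against the unpatched service
DO_NOTHING_UNEXPLOITED = 200 # 1 * 2 * 1 * 100 - nobody lands a PoV, so the security score stays at 2
cs_count  = 15 # most CFE rounds fielded 15 challenge sets
exploited = 4  # hypothetical number of challenge sets with at least one working PoV that round
do_nothing_round = exploited * DO_NOTHING_EXPLOITED + (cs_count - exploited) * DO_NOTHING_UNEXPLOITED
perfect_round    = cs_count * PERFECT_CS
puts do_nothing_round # => 2600
puts perfect_round    # => 6000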
UPDATE (2017-02-19): While I'd been waiting for confirmation from my employer that I was clear to post this (since I had a small involvement in our CGC effort prior to CQE), Shellphish released a write-up of their CGC experiences in Phrack. They confirmed everything I said above (and also provided some insights into why the availability score was so difficult for teams during the competition). It's a great paper and I recommend reading it after this post if you're interested in more information.
Scoring Data
Alright, so, let's move on to the actual scoring data. The data for each scoring round is provided as a 13GB tarball by DARPA. This tarball contains folders named cfe-export-bundle-N-T, where N appears to be the round number and T appears to be a timestamp in epoch time. There are 96 folders in total, one for each round (not 95, like some of these articles would have you expect - the first round is round 0).
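Once that 13GB tarball is unpacked, a quick Ruby sketch like this can sanity-check that reading of the folder names. I'm assuming the bundle folders sit under a cfe-submissions directory, matching the glob in the analysis script later in this post; adjust it to wherever you extracted things:
Dir["cfe-submissions/cfe-export-bundle-*"].sort.each do |dir|
  # folder names look like cfe-export-bundle-N-T
  round, timestamp = File.basename(dir).split("-")[3, 2]
  puts "round #{round} exported at #{Time.at(timestamp.to_i).utc}"
end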
Inside each folder are two files: a blank file called flag and another tarball with the name cfe-export-bundle-N (where N is the round number again). Inside the tarball are two folders: files, which contains a number of .ids and .rcb files along with .zips named csid_X-round_N (where N is the round and I have no idea what X is), and N (where N is the round number yet again), which contains a crs_data.csv and a score_data.json.
I wasn't sure what to do with the stuff inside of files, so I started with the other folder. Each crs_data.csv file contains a bunch of lines that look like this:
2016-08-04 15:56:02,lcp-red,lcp-cooling-power,8129.0
2016-08-04 15:56:02,lcp-red,lcp-water-temp-in,57.0
2016-08-04 15:56:02,lcp-red,lcp-water-temp-out,63.0
2016-08-04 15:56:02,lcp-red,lcp-water-valve,52.0
These appear to be the logs of what was happening with each system's hardware as the game was running. Makes sense, given that a hardware failure would obviously have ruined the game. I skipped over these since they didn't contain any scoring data.
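If you did want to poke at the telemetry anyway, a minimal Ruby sketch along these lines would do; the column meanings (timestamp, host, sensor, value) are just my read of the samples above, and the path assumes you're inside one of the per-round folders:
require "csv"
# each row appears to be: timestamp, host, sensor name, sensor reading
CSV.foreach("crs_data.csv") do |timestamp, host, sensor, value|
  puts "#{timestamp} #{host} #{sensor}=#{value.to_f}"
end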
The score_data.json files are really what we want to look at. They contain a huge dump of data from each scoring round with the following top-level keys (a quick sketch of loading one of these files follows the list):
- round - The round number.
- challenges - The IDs of the challenge binaries (CBs) scored during this round.
- csid_map - The full mapping of CB IDs to CB names for the entire competition.
- rank - The score for each team during this round.
- teams - A full dump of all actions from each team.
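Here's the quick Ruby sketch mentioned above for loading one of these files and confirming that structure, assuming everything has already been extracted into the same cfe-submissions layout the main script uses:
require "json"
# grab the first extracted bundle and read its score_data.json
dir   = Dir["cfe-submissions/cfe-export-bundle-*"].sort.first
round = File.basename(dir).split("-")[3]
json  = JSON.parse(File.read("#{dir}/#{round}/score_data.json"))
p json.keys                                        # round, challenges, csid_map, rank, teams
p json["round"]                                    # the round number
p json["challenges"].length                        # challenge sets scored this round
p json["rank"].map { |r| [r["team"], r["score"]] } # team IDs and their (cumulative) scores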
I unpacked all the files with this Bash one-liner (it's 37GB of data when decompressed, if you were wondering):
cd cgc-submissions; for d in *; do cd $d; tar xf *.tar; cd ..; done
I then used Ruby to load the JSON data and do some analysis. By inspecting the challenges data, we can see that every round (with exceptions) had 15 challenge sets that were to be scored. The exceptions to this rule were:
- Round 0 - 10 challenge sets
- Round 1 - 13 challenge sets
- Round 12 - 14 challenge sets
- Round 92 - 14 challenge sets
- Round 93 - 13 challenge sets
- Round 94 - 12 challenge sets
- Round 95 - 11 challenge sets
By inspecting the rank key, we can see what each team's score was in a given round. The mapping of team ID to team appears to be:
- Team 1 - Galactica
- Team 2 - Jima
- Team 3 - Rubeus
- Team 4 - Crspy
- Team 5 - Mayhem
- Team 6 - Mechaphish
- Team 7 - Xandra
At first, I thought this data was what each team scored in that round, but that doesn't make any sense when you look at all the data. Instead, it appears to be each team's total score as of that round. To get a complete list of scores per round, you have to subtract out the score from the previous round.
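As a quick illustration of that subtraction, here's what it looks like in Ruby for Galactica's first three rounds; the cumulative totals (2000, 4579, 7058) are reconstructed from the per-round table below:
# cumulative totals as reported in the rank key (team 1, rounds 0 through 2)
totals = [2000, 4579, 7058]
# subtract each previous total to recover the per-round scores
per_round = totals.each_cons(2).map { |prev, cur| cur - prev }.unshift(totals.first)
p per_round # => [2000, 2579, 2479]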
Now that we know how many challenge sets there were and what each team's score was per round, the last piece of data we need is how many exploits existed in each round. For simplicity's sake, we'll assume that any successful proof of vulnerability (PoV) against any other CRS for a particular challenge set would work against a CRS doing nothing. To get this data, we need to loop through each team's pov_results. A successful PoV will be marked with a result of success (unsuccessful PoVs say failure).
Final Result
I implemented everything above in this Ruby script:
#!/usr/bin/env ruby
require "json"
scores = []
challenges = []
Dir["cfe-submissions/*"].each do |dir|
File.open("#{dir}/#{dir.split("-")[4]}/score_data.json", "r") do |f|
round = []
json = JSON.load(f.read())
i = json["round"]
# get challenges
challenges[i] = {}
json["challenges"].each do |cs|
challenges[i][cs] = false
end
# get scores
json["rank"].each do |rank|
round[rank["team"]-1] = rank["score"]
end
scores[i] = round
# get successful exploits
json["teams"].each do |team, data|
data["pov_results"].each do |_, pov|
if pov.keys.include?("result") and pov["result"] == "success"
challenges[i][pov["csid"]] = true
end
end
end
end
end
last = nil
wopr = []
perfect = []
puts "| Round | Galactica | Jima | Rubeus | Crspy | Mayhem | Mechaphish | Xandra | WOPR | Perfection |"
puts "| -------------------------------------------------------------------------------------------- |"
scores.each_with_index do |round, i|
  # the JSON holds running totals, so subtract the previous round's totals to get per-round scores
  this = round.dup
  this.length.times { |j| this[j] -= last[j] } if last
  # WOPR gets 100 per exploited challenge set and 200 per unexploited one; perfection is 400 each
  wopr[i] = 0
  perfect[i] = 0
  challenges[i].each do |id, pwned|
    if pwned then wopr[i] += 100 else wopr[i] += 200 end
    perfect[i] += 400
  end
  puts "| #{i} | #{this[0]} | #{this[1]} | #{this[2]} | #{this[3]} | #{this[4]} | #{this[5]} | #{this[6]} | #{wopr[i]} | #{perfect[i]} |"
  last = round
end
puts "| TOTAL | #{last[0]} | #{last[1]} | #{last[2]} | #{last[3]} | #{last[4]} | #{last[5]} | #{last[6]} | #{wopr.reduce(0, :+)} | #{perfect.reduce(0, :+)} |"
If you run it, it'll give you the following table (the "WOPR" column is our do-nothing CRS, while the "Perfection" column shows the theoretical maximum score for that round):
Round | Galactica | Jima | Rubeus | Crspy | Mayhem | Mechaphish | Xandra | WOPR | Perfection |
---|---|---|---|---|---|---|---|---|---|
0 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 | 4000 |
1 | 2579 | 2583 | 2600 | 2577 | 2581 | 2575 | 2583 | 2600 | 5200 |
2 | 2479 | 2091 | 2900 | 1650 | 2631 | 850 | 1700 | 2600 | 5600 |
3 | 2492 | 2617 | 2333 | 2119 | 2733 | 1742 | 2353 | 2600 | 6000 |
4 | 2196 | 2621 | 2535 | 2275 | 2242 | 2557 | 2357 | 2600 | 6000 |
5 | 2273 | 2617 | 2714 | 2266 | 2765 | 2173 | 2415 | 2600 | 6000 |
6 | 2279 | 2431 | 2466 | 2373 | 2556 | 2982 | 2781 | 2600 | 6000 |
7 | 2460 | 2480 | 2662 | 2162 | 3003 | 2979 | 2792 | 2600 | 6000 |
8 | 2433 | 2298 | 2258 | 2104 | 2998 | 2984 | 2583 | 2600 | 6000 |
9 | 1994 | 2232 | 2219 | 2324 | 3099 | 2979 | 2637 | 2500 | 6000 |
10 | 2344 | 2226 | 2623 | 2527 | 3066 | 2989 | 2948 | 2500 | 6000 |
11 | 2259 | 2120 | 2739 | 2515 | 2863 | 2979 | 2948 | 2400 | 6000 |
12 | 2372 | 2228 | 3058 | 2516 | 3065 | 2992 | 2959 | 2500 | 6000 |
13 | 2339 | 2217 | 3052 | 2512 | 3053 | 2979 | 2949 | 2500 | 6000 |
14 | 2148 | 2216 | 3089 | 2521 | 3065 | 2979 | 2953 | 2500 | 6000 |
15 | 2418 | 2135 | 3087 | 2534 | 3128 | 2986 | 2952 | 2400 | 6000 |
16 | 2718 | 2220 | 3095 | 2526 | 3063 | 2980 | 2956 | 2500 | 6000 |
17 | 2431 | 2131 | 3103 | 2537 | 3126 | 2779 | 2951 | 2400 | 6000 |
18 | 2412 | 2109 | 2937 | 2395 | 3104 | 2788 | 2742 | 2400 | 6000 |
19 | 2649 | 2119 | 2938 | 2387 | 3120 | 2787 | 2973 | 2400 | 6000 |
20 | 2310 | 2123 | 2643 | 2413 | 3118 | 2987 | 2971 | 2400 | 6000 |
21 | 2388 | 2363 | 2706 | 2672 | 3104 | 2699 | 2975 | 2400 | 6000 |
22 | 2190 | 2264 | 2571 | 2552 | 3219 | 2399 | 2879 | 2400 | 6000 |
23 | 2115 | 1967 | 2409 | 2331 | 2832 | 2428 | 2677 | 2300 | 6000 |
24 | 2133 | 2318 | 2142 | 2661 | 2964 | 2536 | 2624 | 2500 | 6000 |
25 | 2096 | 2359 | 2193 | 2733 | 2999 | 2388 | 2389 | 2600 | 6000 |
26 | 2429 | 2257 | 1687 | 2831 | 2963 | 2560 | 2588 | 2700 | 6000 |
27 | 2194 | 2503 | 744 | 2104 | 2925 | 2898 | 2323 | 2800 | 6000 |
28 | 2101 | 2600 | 1002 | 2564 | 2901 | 2605 | 2544 | 2800 | 6000 |
29 | 2404 | 2594 | 1353 | 2404 | 2915 | 2816 | 2741 | 2800 | 6000 |
30 | 2363 | 2621 | 2281 | 2430 | 2879 | 2627 | 2546 | 2900 | 6000 |
31 | 2373 | 2610 | 2281 | 2405 | 2871 | 2590 | 2557 | 2900 | 6000 |
32 | 2602 | 2420 | 2046 | 2591 | 2885 | 2978 | 2560 | 2900 | 6000 |
33 | 2415 | 2459 | 2628 | 2307 | 2823 | 2833 | 2597 | 2800 | 6000 |
34 | 2977 | 2468 | 2826 | 2252 | 2913 | 2685 | 2782 | 2800 | 6000 |
35 | 2798 | 2464 | 2629 | 2682 | 2956 | 2944 | 2740 | 2800 | 6000 |
36 | 2863 | 2631 | 2766 | 2581 | 2800 | 2612 | 2588 | 2900 | 6000 |
37 | 2937 | 2733 | 2769 | 2466 | 2999 | 2833 | 2885 | 2900 | 6000 |
38 | 2798 | 2799 | 2974 | 2644 | 3000 | 2800 | 2912 | 3000 | 6000 |
39 | 2995 | 2800 | 2972 | 2759 | 3000 | 2792 | 2914 | 3000 | 6000 |
40 | 2894 | 2900 | 2698 | 2764 | 2897 | 1048 | 2898 | 2900 | 6000 |
41 | 2898 | 2898 | 2874 | 2573 | 2900 | 1160 | 2696 | 2900 | 6000 |
42 | 2696 | 2900 | 2897 | 2774 | 2900 | 1600 | 2898 | 2900 | 6000 |
43 | 2851 | 2890 | 2897 | 2767 | 2896 | 2597 | 2889 | 2900 | 6000 |
44 | 2853 | 2897 | 2900 | 2599 | 2894 | 2931 | 2896 | 2900 | 6000 |
45 | 2857 | 2900 | 2900 | 2785 | 2900 | 2933 | 2898 | 2900 | 6000 |
46 | 2872 | 2899 | 2800 | 2092 | 2800 | 3069 | 2790 | 2800 | 6000 |
47 | 2564 | 2600 | 2800 | 2568 | 2800 | 3301 | 2397 | 2800 | 6000 |
48 | 2825 | 2749 | 2900 | 2570 | 2900 | 3120 | 2899 | 2900 | 6000 |
49 | 2786 | 2748 | 2900 | 2534 | 2800 | 2716 | 2885 | 2900 | 6000 |
50 | 2581 | 2748 | 2900 | 2834 | 2892 | 2830 | 2889 | 2900 | 6000 |
51 | 3068 | 2747 | 2900 | 2643 | 2891 | 2339 | 2698 | 2900 | 6000 |
52 | 2987 | 2744 | 2800 | 2749 | 2894 | 2738 | 2776 | 2900 | 6000 |
53 | 3006 | 2746 | 2998 | 2784 | 2900 | 2933 | 2974 | 2900 | 6000 |
54 | 3043 | 2743 | 2899 | 2769 | 2894 | 2541 | 2766 | 2900 | 6000 |
55 | 3030 | 2748 | 2800 | 2589 | 2900 | 2611 | 2986 | 2900 | 6000 |
56 | 3006 | 2735 | 2998 | 2536 | 2880 | 2787 | 2970 | 2900 | 6000 |
57 | 2817 | 2637 | 2898 | 2479 | 2788 | 2491 | 2774 | 2800 | 6000 |
58 | 2566 | 2693 | 2798 | 2661 | 2698 | 2698 | 2766 | 2700 | 6000 |
59 | 2629 | 2696 | 2698 | 2484 | 2696 | 2864 | 2548 | 2700 | 6000 |
60 | 2494 | 2685 | 2598 | 2363 | 2691 | 2797 | 2622 | 2700 | 6000 |
61 | 2645 | 2684 | 2798 | 2359 | 2678 | 2553 | 2782 | 2700 | 6000 |
62 | 2661 | 2679 | 2798 | 2362 | 2689 | 2751 | 2781 | 2700 | 6000 |
63 | 2571 | 2796 | 2900 | 2531 | 2786 | 2661 | 2823 | 2800 | 6000 |
64 | 2085 | 2599 | 2700 | 2267 | 2592 | 2666 | 2648 | 2600 | 6000 |
65 | 2742 | 2695 | 2800 | 2575 | 2692 | 2851 | 2556 | 2700 | 6000 |
66 | 2604 | 2597 | 2500 | 2257 | 2693 | 2855 | 2458 | 2700 | 6000 |
67 | 2620 | 2899 | 2700 | 2462 | 2796 | 2789 | 2636 | 2800 | 6000 |
68 | 2608 | 2891 | 2703 | 2578 | 2791 | 2579 | 2825 | 2800 | 6000 |
69 | 2811 | 2887 | 2701 | 2535 | 2794 | 2782 | 2833 | 2800 | 6000 |
70 | 2605 | 2875 | 2698 | 2572 | 2793 | 2982 | 2821 | 2800 | 6000 |
71 | 2622 | 2995 | 2800 | 2597 | 2900 | 2400 | 2841 | 2900 | 6000 |
72 | 2619 | 2992 | 2803 | 2469 | 2900 | 2956 | 2978 | 2900 | 6000 |
73 | 2726 | 2794 | 2203 | 1729 | 2700 | 2663 | 2877 | 2700 | 6000 |
74 | 2500 | 2800 | 2600 | 2124 | 2800 | 2231 | 2779 | 2800 | 6000 |
75 | 2533 | 2700 | 2600 | 1953 | 2800 | 2702 | 2979 | 2800 | 6000 |
76 | 2582 | 2762 | 2700 | 2362 | 2800 | 2736 | 2786 | 2800 | 6000 |
77 | 2588 | 2768 | 2700 | 2171 | 2800 | 2637 | 2791 | 2800 | 6000 |
78 | 2900 | 2900 | 2800 | 2630 | 2900 | 2772 | 2800 | 2900 | 6000 |
79 | 2900 | 2800 | 2700 | 2114 | 2800 | 2671 | 2999 | 2800 | 6000 |
80 | 2700 | 2800 | 2700 | 2538 | 2800 | 2672 | 2800 | 2800 | 6000 |
81 | 2518 | 2600 | 2500 | 2494 | 2700 | 2200 | 2518 | 2700 | 6000 |
82 | 2517 | 2613 | 2505 | 2386 | 2700 | 2800 | 2419 | 2700 | 6000 |
83 | 2620 | 2615 | 2506 | 2190 | 2700 | 2933 | 2569 | 2700 | 6000 |
84 | 2543 | 2612 | 2573 | 2421 | 2700 | 2533 | 2939 | 2700 | 6000 |
85 | 2509 | 2611 | 2422 | 2420 | 2700 | 2900 | 2931 | 2700 | 6000 |
86 | 2433 | 2612 | 2557 | 2388 | 2700 | 2333 | 2724 | 2700 | 6000 |
87 | 2833 | 2611 | 2706 | 2609 | 2700 | 2461 | 2933 | 2700 | 6000 |
88 | 2833 | 2613 | 2700 | 2602 | 2700 | 2449 | 2724 | 2700 | 6000 |
89 | 2832 | 2613 | 2716 | 2621 | 2700 | 2533 | 2939 | 2700 | 6000 |
90 | 2866 | 2614 | 2771 | 2666 | 2700 | 2933 | 2953 | 2700 | 6000 |
91 | 2866 | 2612 | 2769 | 2678 | 2700 | 3032 | 2956 | 2700 | 6000 |
92 | 2666 | 2413 | 2533 | 2478 | 2500 | 2866 | 2750 | 2500 | 5600 |
93 | 2482 | 2213 | 2499 | 2503 | 2300 | 2682 | 2550 | 2300 | 5200 |
94 | 2266 | 2012 | 2133 | 2377 | 2100 | 2466 | 2145 | 2100 | 4800 |
95 | 2049 | 1813 | 2082 | 2111 | 1800 | 2249 | 1954 | 1900 | 4400 |
TOTAL | 247534 | 246437 | 251759 | 236248 | 270042 | 254452 | 262036 | 258500 | 568800 |
Yes. You are reading that correctly. A CRS that literally did nothing could have placed 3rd. Setting up absolutely no software in the CFE could have won you $750,000. It would seem that, while not playing wasn't a winning move, it could still have been worth some serious money.
I do have to mention two caveats, though:
- Although I tried to figure out what the scoring algorithm was, I haven't been able to completely audit all the scores for the entire event. All the data above simply trusts that each round was scored appropriately by the organizers.
- I believe it's possible that a given CRS may have submitted a PoV after all other CRSs had patched. As a result, there may be PoVs that didn't work on anyone (and, thus, weren't scored), but that would have worked on a system doing absolutely no patching. One could probably determine definitively if this is or isn't the case by taking all submitted PoVs and re-throwing them. Unfortunately, that would take a fair amount of effort and I don't have time for it right now.
Regardless, I think it's still pretty obvious just how far these systems have to go before they're taking my job or making the world a safer place.
Design Thoughts
To wrap up, I thought a little bit about the design of the competition and how this might be avoided. I think the problem here is that, unlike the CQE, there was no "oracle" whose score counted. The CQE had teams losing points if any of the "known" vulnerabilities weren't patched in addition to losing points for vulnerabilities other teams uncovered. Here, the only way to lose points is if other teams find vulnerabilities.
Given that the stated purpose of CGC was to "create automatic defensive systems capable of reasoning about flaws, formulating patches, and deploying them on a network in real time", this might seem odd. How can you appropriately test if a system was capable of these things if you don't at least score them on flaws you already know about?
I think the answer lies in what the CGC organizers intended the competition to be about. We already have generic mitigations. CGC was about building systems that could, in the most ideal case, fix every flaw with a targeted patch instead. In order to develop those targeted patches, you have to find and understand the vulnerability. Demonstrating true understanding requires not only a targeted patch that works, but also a proof of that vulnerability. Not having an oracle means the only teams that consistently score are those with both a patch and a proof.
I still feel the design of the game was solid. The fact that a do-nothing system would have scored well speaks to the difficulty of the problem space and the incompleteness of each system, not to poor game design. I also have a feeling that teams didn't place as much emphasis on offense and availability (where it seems most teams missed or lost points) given the uncertainty about the CFE rules.
I'm not sure if there will be another Cyber Grand Challenge. If there is, I would expect to see this become less and less of a problem as time goes on and systems improve.