The Only Winning Move

DARPA's Cyber Grand Challenge took place this year right before DEFCON 24 and was a watershed moment for automated security analysis. All of the raw data from the competition has been out for a while now, but I haven't seen anyone analyze it yet. So, I took a stab at it myself.

I was specifically interested in looking at how well teams actually performed. The fact is that the winning Cyber Reasoning System (CRS), ForAllSecure's Mayhem, didn't do anything for the last 30 or so rounds of the game because it was taking too long to receive data. So, I set out to try to answer what I felt was the most pressing question: Was the only winning move not to play?

Scoring Formula

To figure this out, we first need to understand exactly how the scores were calculated. This was described during the CGC stream, but they displayed two different, contradictory formulas (check out 20:35 and 22:35)! Since I clearly couldn't trust them, I wanted to find it in print. Turns out it's harder to find than you'd think.

It's not in the Finalist Event Information Sheet, it's not in the latest version of the CGC Rules (section 3.2.3 at least outlines some of the components of the scoring algorithm, though), it's not in the Event FAQ or the overall CGC FAQ, and I don't see it posted on either their website or GitHub. There is, however, a little graphic on the bottom of page 4 of the CFE Brochure that states the overall formula:

score = availability * security * evaluation * 100

...where score is the score for an individual challenge set (CS). The per-CS scores are then summed to get a team's score for that round, and the round scores are summed to get the final total score. Hilariously, all of this contradicts everything shown during the event.

How each of the component scores is calculated is never explained. I'm guessing they kept the same overall framework from the CQE Scoring Document. The full explanation is sorta lengthy (see the document), but here's the gist:

  • The availability metric ensures that the service is available (not disconnected or using too many resources) and functional (not missing required features). It is scored on a scale of 0 to 1. If your service isn't available at all, you receive no points for anything. If your service is perfectly available, you get 1 point. If there are problems, you receive something in-between (see formula in document).
  • The security metric ensures that the service is not vulnerable. It is scored on a scale of 0 to 1. If your service is vulnerable to all reported proofs of vulnerability for that service, you receive no points. If your service is no longer vulnerable after patching you receive 1 point. If your service is partially vulnerable, you receive something in-between (see formula in document).
  • The evaluation metric grants bonus points for finding vulnerabilities. It is scored on a scale of 1 to 2. If you don't discover any vulnerabilities, you simply get 1 point. If you prove that a vulnerability exists, you get 2 points.

I believe there are only two marked differences between this explanation and what's published for the CFE (aside from the CQE using challenge binaries and the CFE using challenge sets). The first is that the security metric is now scored entirely based on PoVs from other teams and doesn't check if you've patched vulnerabilities the organizers intended (it's also now scored between 1 and 2). The second is that the evaluation metric is no longer simply 1 or 2 - it can be in-between if your PoV only works on some of the teams, rather than all of them.

Assuming the above formulas and explanations are correct, we can say that, for each challenge set in a round, a CRS doing everything right would receive 400 points (1 * 2 * 2 * 100 = 400). A CRS doing nothing, by contrast, would receive either 100 points (1 * 1 * 1 * 100 = 100) if an exploit exists, or 200 points (1 * 2 * 1 * 100 = 200) if no exploit exists. Multiplying each of these by the number of challenge sets in a given round will give us the score for that round, and adding all those together will give us theoretical total scores.
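
To make that arithmetic concrete, here's a minimal Ruby sketch of the per-challenge-set math, assuming the brochure's formula and the CFE ranges described above (the helper and scenario comments are mine, not DARPA's):

# Per-challenge-set score, per the formula from the CFE Brochure:
#   score = availability * security * evaluation * 100
# In the CFE, availability ranges 0..1 while security and evaluation range 1..2.
def cs_score(availability, security, evaluation)
  availability * security * evaluation * 100
end

puts cs_score(1.0, 2.0, 2.0)       # doing everything right => 400.0
puts cs_score(1.0, 1.0, 1.0)       # doing nothing, CS was exploited => 100.0
puts cs_score(1.0, 2.0, 1.0)       # doing nothing, CS wasn't exploited => 200.0
puts cs_score(1.0, 2.0, 2.0) * 15  # a perfect round with 15 challenge sets => 6000.0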

UPDATE (2017-02-19): While I was waiting for confirmation from my employer that I was clear to post this (since I had a small involvement in our CGC effort prior to CQE), Shellphish released a write-up of their CGC experiences in Phrack. They confirmed everything I said above (and also provided some insights into why the availability score was so difficult for teams during the competition). It's a great paper and I recommend reading it after this post if you're interested in more information.

Scoring Data

Alright, so, let's move on to the actual scoring data. The data for each scoring round is provided as a 13GB tarball by DARPA. This tarball contains folders named cfe-export-bundle-N-T, where N appears to be the round number and T appears to be a timestamp in epoch time. There are 96 folders in total, one for each round (not 95, as some articles would have you expect - the first round is round 0).

Inside each folder are two files: A blank file called flag and another tarball with the name cfe-export-bundle-N (where N is the round number again). Inside the tarball are two folders:

  • files, which contains a number of .ids and .rcb files along with .zips named csid_X-round_N (where N is the round and I have no idea what X is)
  • N (where N is the round number yet again), which contains a crs_data.csv and a score_data.json.

I wasn't sure what to do with the stuff inside of files, so I started with the other folder. Each crs_data.csv file contains a bunch of lines that look like this:

2016-08-04 15:56:02,lcp-red,lcp-cooling-power,8129.0
2016-08-04 15:56:02,lcp-red,lcp-water-temp-in,57.0
2016-08-04 15:56:02,lcp-red,lcp-water-temp-out,63.0
2016-08-04 15:56:02,lcp-red,lcp-water-valve,52.0

These appear to be the logs of what was happening with each system's hardware as the game was running. Makes sense, given that a hardware failure would obviously have ruined the game. I skipped over these since they didn't contain any scoring data.
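
If you do want to poke at them anyway, they parse trivially with Ruby's standard CSV library. Here's a quick sketch (the column meanings are my guess from the sample lines above, and the path is a placeholder):

require "csv"

# Each row appears to be: timestamp, host, sensor name, sensor value.
# "path/to/crs_data.csv" is a placeholder - point it at any round's file.
CSV.foreach("path/to/crs_data.csv") do |timestamp, host, sensor, value|
  puts "#{timestamp} #{host} #{sensor} = #{value.to_f}"
end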

The score_data.json files are really what we want to look at. They contain a huge dump of data from each scoring round with the following top-level keys:

  • round - The round number.
  • challenges - The IDs of the challenge binaries (CBs) scored during this round.
  • csid_map - The full mapping of CB IDs to CB name for the entire competition.
  • rank - The score for each team during this round.
  • teams - A full dump of all actions from each team.

I unpacked all the files with this Bash one-liner (it's 37GB of data when decompressed, if you were wondering):

cd cfe-submissions; for d in *; do cd "$d"; tar xf *.tar; cd ..; done

I then used Ruby to load the JSON data and do some analysis. By inspecting the challenges data, we can see that every round (with exceptions) had 15 challenge sets that were to be scored. The exceptions to this rule were:

  • Round 0 - 10 challenge sets
  • Round 1 - 13 challenge sets
  • Round 12 - 14 challenge sets
  • Round 92 - 14 challenge sets
  • Round 93 - 13 challenge sets
  • Round 94 - 12 challenge sets
  • Round 95 - 11 challenge sets
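
For reference, here's roughly how to pull that count out - a minimal sketch that assumes the same directory layout the full script below uses:

require "json"

# Print the number of challenge sets scored in each round.
Dir["cfe-submissions/*"].each do |dir|
  # index 4 of the split is N from the cfe-export-bundle-N-T folder name
  json = JSON.load(File.read("#{dir}/#{dir.split("-")[4]}/score_data.json"))
  puts "Round #{json["round"]}: #{json["challenges"].length} challenge sets"
end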

By inspecting the rank key, we can see what each team's score was in a given round. The mapping of team ID to team appears to be:

  • Team 1 - Galactica
  • Team 2 - Jima
  • Team 3 - Rubeus
  • Team 4 - Crspy
  • Team 5 - Mayhem
  • Team 6 - Mechaphish
  • Team 7 - Xandra

At first, I thought this data was what each team scored that round, but that doesn't make sense when you look at all the data. Instead, it appears to be each team's total score as of that round. To get per-round scores, you have to subtract out the totals from the previous round.
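
Concretely, the conversion is just a pairwise subtraction. Here's a tiny sketch with made-up cumulative totals for three hypothetical teams (not real CGC numbers):

# Cumulative totals per round (one array per round), made up for illustration.
cumulative = [
  [100, 200, 150],
  [350, 400, 300],
  [700, 550, 600]
]

previous = nil
cumulative.each_with_index do |totals, round|
  per_round = previous ? totals.zip(previous).map { |now, before| now - before } : totals
  puts "Round #{round}: #{per_round.inspect}"
  previous = totals
end
# => Round 0: [100, 200, 150]
#    Round 1: [250, 200, 150]
#    Round 2: [350, 150, 300]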

Now that we know how many challenge sets there were and what each team's score was per round, the last piece of data we need is how many exploits existed in each round. For simplicity's sake, we'll assume that any successful proof of vulnerability (PoV) against any other CRS for a particular challenge set would work against a CRS doing nothing. To get this data, we need to loop through each team's pov_results. A successful PoV will be marked with a result of success (unsuccessful PoVs say failure).

Final Result

I implemented everything above in this Ruby script:

#!/usr/bin/env ruby

require "json"

scores = []
challenges = []
Dir["cfe-submissions/*"].each do |dir|
    File.open("#{dir}/#{dir.split("-")[4]}/score_data.json", "r") do |f|
        round = []
        json = JSON.load(f.read())
        i = json["round"]

        # get challenges
        challenges[i] = {}
        json["challenges"].each do |cs|
            challenges[i][cs] = false
        end

        # get scores
        json["rank"].each do |rank|
            round[rank["team"]-1] = rank["score"]
        end
        scores[i] = round

        # get successful exploits
        json["teams"].each do |team, data|
            data["pov_results"].each do |_, pov|
                if pov.keys.include?("result") and pov["result"] == "success"
                    challenges[i][pov["csid"]] = true
                end
            end
        end
    end
end

last = nil
wopr = []
perfect = []
puts "| Round | Galactica | Jima | Rubeus | Crspy | Mayhem | Mechaphish | Xandra | WOPR | Perfection |"
puts "| -------------------------------------------------------------------------------------------- |"
scores.each_with_index do |round, i|
    this = round.dup
    this.length.times { |j| this[j] -= last[j] } if last
    wopr[i] = 0
    perfect[i] = 0
    challenges[i].each do |id, pwned|
        if pwned then wopr[i] += 100 else wopr[i] += 200 end
        perfect[i] += 400
    end
    puts "| #{i} | #{this[0]} | #{this[1]} | #{this[2]} | #{this[3]} | #{this[4]} | #{this[5]} | #{this[6]} | #{wopr[i]} | #{perfect[i]} |"
    last = round
end
puts "| TOTAL | #{last[0]} | #{last[1]} | #{last[2]} | #{last[3]} | #{last[4]} | #{last[5]} | #{last[6]} | #{wopr.reduce(0, :+)} | #{perfect.reduce(0, :+)} |"

If you run it, it'll give you the following table (the "WOPR" column is our do-nothing CRS, while the "Perfection" column shows the theoretical maximum score for that round):

Round Galactica Jima Rubeus Crspy Mayhem Mechaphish Xandra WOPR Perfection
0 2000 2000 2000 2000 2000 2000 2000 2000 4000
1 2579 2583 2600 2577 2581 2575 2583 2600 5200
2 2479 2091 2900 1650 2631 850 1700 2600 5600
3 2492 2617 2333 2119 2733 1742 2353 2600 6000
4 2196 2621 2535 2275 2242 2557 2357 2600 6000
5 2273 2617 2714 2266 2765 2173 2415 2600 6000
6 2279 2431 2466 2373 2556 2982 2781 2600 6000
7 2460 2480 2662 2162 3003 2979 2792 2600 6000
8 2433 2298 2258 2104 2998 2984 2583 2600 6000
9 1994 2232 2219 2324 3099 2979 2637 2500 6000
10 2344 2226 2623 2527 3066 2989 2948 2500 6000
11 2259 2120 2739 2515 2863 2979 2948 2400 6000
12 2372 2228 3058 2516 3065 2992 2959 2500 6000
13 2339 2217 3052 2512 3053 2979 2949 2500 6000
14 2148 2216 3089 2521 3065 2979 2953 2500 6000
15 2418 2135 3087 2534 3128 2986 2952 2400 6000
16 2718 2220 3095 2526 3063 2980 2956 2500 6000
17 2431 2131 3103 2537 3126 2779 2951 2400 6000
18 2412 2109 2937 2395 3104 2788 2742 2400 6000
19 2649 2119 2938 2387 3120 2787 2973 2400 6000
20 2310 2123 2643 2413 3118 2987 2971 2400 6000
21 2388 2363 2706 2672 3104 2699 2975 2400 6000
22 2190 2264 2571 2552 3219 2399 2879 2400 6000
23 2115 1967 2409 2331 2832 2428 2677 2300 6000
24 2133 2318 2142 2661 2964 2536 2624 2500 6000
25 2096 2359 2193 2733 2999 2388 2389 2600 6000
26 2429 2257 1687 2831 2963 2560 2588 2700 6000
27 2194 2503 744 2104 2925 2898 2323 2800 6000
28 2101 2600 1002 2564 2901 2605 2544 2800 6000
29 2404 2594 1353 2404 2915 2816 2741 2800 6000
30 2363 2621 2281 2430 2879 2627 2546 2900 6000
31 2373 2610 2281 2405 2871 2590 2557 2900 6000
32 2602 2420 2046 2591 2885 2978 2560 2900 6000
33 2415 2459 2628 2307 2823 2833 2597 2800 6000
34 2977 2468 2826 2252 2913 2685 2782 2800 6000
35 2798 2464 2629 2682 2956 2944 2740 2800 6000
36 2863 2631 2766 2581 2800 2612 2588 2900 6000
37 2937 2733 2769 2466 2999 2833 2885 2900 6000
38 2798 2799 2974 2644 3000 2800 2912 3000 6000
39 2995 2800 2972 2759 3000 2792 2914 3000 6000
40 2894 2900 2698 2764 2897 1048 2898 2900 6000
41 2898 2898 2874 2573 2900 1160 2696 2900 6000
42 2696 2900 2897 2774 2900 1600 2898 2900 6000
43 2851 2890 2897 2767 2896 2597 2889 2900 6000
44 2853 2897 2900 2599 2894 2931 2896 2900 6000
45 2857 2900 2900 2785 2900 2933 2898 2900 6000
46 2872 2899 2800 2092 2800 3069 2790 2800 6000
47 2564 2600 2800 2568 2800 3301 2397 2800 6000
48 2825 2749 2900 2570 2900 3120 2899 2900 6000
49 2786 2748 2900 2534 2800 2716 2885 2900 6000
50 2581 2748 2900 2834 2892 2830 2889 2900 6000
51 3068 2747 2900 2643 2891 2339 2698 2900 6000
52 2987 2744 2800 2749 2894 2738 2776 2900 6000
53 3006 2746 2998 2784 2900 2933 2974 2900 6000
54 3043 2743 2899 2769 2894 2541 2766 2900 6000
55 3030 2748 2800 2589 2900 2611 2986 2900 6000
56 3006 2735 2998 2536 2880 2787 2970 2900 6000
57 2817 2637 2898 2479 2788 2491 2774 2800 6000
58 2566 2693 2798 2661 2698 2698 2766 2700 6000
59 2629 2696 2698 2484 2696 2864 2548 2700 6000
60 2494 2685 2598 2363 2691 2797 2622 2700 6000
61 2645 2684 2798 2359 2678 2553 2782 2700 6000
62 2661 2679 2798 2362 2689 2751 2781 2700 6000
63 2571 2796 2900 2531 2786 2661 2823 2800 6000
64 2085 2599 2700 2267 2592 2666 2648 2600 6000
65 2742 2695 2800 2575 2692 2851 2556 2700 6000
66 2604 2597 2500 2257 2693 2855 2458 2700 6000
67 2620 2899 2700 2462 2796 2789 2636 2800 6000
68 2608 2891 2703 2578 2791 2579 2825 2800 6000
69 2811 2887 2701 2535 2794 2782 2833 2800 6000
70 2605 2875 2698 2572 2793 2982 2821 2800 6000
71 2622 2995 2800 2597 2900 2400 2841 2900 6000
72 2619 2992 2803 2469 2900 2956 2978 2900 6000
73 2726 2794 2203 1729 2700 2663 2877 2700 6000
74 2500 2800 2600 2124 2800 2231 2779 2800 6000
75 2533 2700 2600 1953 2800 2702 2979 2800 6000
76 2582 2762 2700 2362 2800 2736 2786 2800 6000
77 2588 2768 2700 2171 2800 2637 2791 2800 6000
78 2900 2900 2800 2630 2900 2772 2800 2900 6000
79 2900 2800 2700 2114 2800 2671 2999 2800 6000
80 2700 2800 2700 2538 2800 2672 2800 2800 6000
81 2518 2600 2500 2494 2700 2200 2518 2700 6000
82 2517 2613 2505 2386 2700 2800 2419 2700 6000
83 2620 2615 2506 2190 2700 2933 2569 2700 6000
84 2543 2612 2573 2421 2700 2533 2939 2700 6000
85 2509 2611 2422 2420 2700 2900 2931 2700 6000
86 2433 2612 2557 2388 2700 2333 2724 2700 6000
87 2833 2611 2706 2609 2700 2461 2933 2700 6000
88 2833 2613 2700 2602 2700 2449 2724 2700 6000
89 2832 2613 2716 2621 2700 2533 2939 2700 6000
90 2866 2614 2771 2666 2700 2933 2953 2700 6000
91 2866 2612 2769 2678 2700 3032 2956 2700 6000
92 2666 2413 2533 2478 2500 2866 2750 2500 5600
93 2482 2213 2499 2503 2300 2682 2550 2300 5200
94 2266 2012 2133 2377 2100 2466 2145 2100 4800
95 2049 1813 2082 2111 1800 2249 1954 1900 4400
TOTAL 247534 246437 251759 236248 270042 254452 262036 258500 568800

Yes. You are reading that correctly. A CRS that literally did nothing could have placed 3rd. Setting up absolutely no software in the CFE could have won you $750,000. It would seem that, while not playing wasn't a winning move, it could still have been worth some serious money.

I do have to mention two caveats, though:

  1. Although I tried to figure out what the scoring algorithm was, I haven't been able to completely audit all the scores for the entire event. All the data above simply trusts that each round was scored appropriately by the organizers.
  2. I believe it's possible that a given CRS may have submitted a PoV after all other CRSs had patched. As a result, there may be PoVs that didn't work on anyone (and, thus, weren't scored), but that would have worked on a system doing absolutely no patching. One could probably determine definitively if this is or isn't the case by taking all submitted PoVs and re-throwing them. Unfortunately, that would take a fair amount of effort and I don't have time for it right now.

Regardless, I think it's still pretty obvious just how far these systems have to go before they're taking my job or making the world a safer place.

[Image: CGC final scores]
These scores look a little less impressive to me now.

Design Thoughts

To wrap up, I thought a little bit about the design of the competition and how this might be avoided. I think the problem here is that, unlike the CQE, there was no "oracle" whose score counted. The CQE had teams losing points if any of the "known" vulnerabilities weren't patched, in addition to losing points for vulnerabilities other teams uncovered. In the CFE, the only way to lose points was if other teams found vulnerabilities.

Given that the stated purpose of CGC was to "create automatic defensive systems capable of reasoning about flaws, formulating patches, and deploying them on a network in real time", this might seem odd. How can you appropriately test whether a system is capable of these things if you don't at least score it on flaws you already know about?

I think the answer lies in what the CGC organizers intended the competition to be about. We already have generic mitigations. CGC was about building systems that could, in the most ideal case, fix every flaw with a targeted patch instead. In order to develop those targeted patches, you have to find and understand the vulnerability. Demonstrating true understanding requires not only a targeted patch that works, but also a proof of that vulnerability. Not having an oracle means the only teams that consistently score are those with both a patch and a proof.

I still feel the design of the game was solid. The fact that a do-nothing system would have scored well should speak to the difficulty of the problem space and the incompleteness of each system, not to poor game design. I also suspect teams didn't place as much emphasis on offense and availability (where it seems most teams missed or lost points) given the uncertainty around the CFE rules.

I'm not sure if there will be another Cyber Grand Challenge. If there is, I would expect to see this become less and less of a problem as time goes on and systems improve.