Things ChatGPT Struggles To Do — New Year's Edition
TL;DR: always use the correct tool for the job.
As one does in the 2020s, my wife and I spent part of our holidays entertaining ourselves by asking ChatGPT random questions. One seemingly simple question turned into a rabbit hole deeper than I expected, and there's a lesson (or maybe a reminder) in here for us all.
> Can you please tell me the day of the week for every January 1st for the past 100 years.
You can watch ChatGPT struggle yourself in the shared chat: https://chat.openai.com/share/d57d3f17-84ff-4b36-a625-707fd7f65164
First, it couldn't even list the past 100 years; instead, it skipped every 11 years and went way into the future.
Then it did, but the data was wrong. Way wrong. When we spot-checked random years, ChatGPT admitted it was wrong and gave us the correct days for a subset of the years in the list. But when we asked for the updated list, back to broken it went.
We asked ChatGPT which day of the week was most common for the past 100 years. It was WAY off, saying Sunday occurred 26 times (it actually only occurred 14 times).
I couldn't wrap my head around the absurdity of the distribution of days, despite asking ChatGPT to explain it (and then a second time, in greater detail), so I simply asked it to write some Python to generate the answer.
The Python was flawless, producing correct data and a much more even distribution of January 1st weekdays, as common sense would expect.
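The exact script isn't preserved here, but a minimal sketch along these lines reproduces it with the standard library (the 1923–2022 range and the label wording are assumptions inferred from the output below):

```python
import calendar
import datetime
from collections import Counter

# Print the weekday of January 1st for the 100 years ending in 2022,
# flagging leap years, then summarize how often each weekday occurs.
counts = Counter()
for year in range(1923, 2023):
    weekday = datetime.date(year, 1, 1).strftime("%A")
    kind = "Leap Year" if calendar.isleap(year) else "Common Year"
    counts[weekday] += 1
    print(f"{year}: {weekday}, {kind}")

print("\nDay Counts Summary:")
for name in ("Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"):
    print(f"{name}: {counts[name]}")
```

The output: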
```
1923: Monday, Common Year
1924: Tuesday, Leap Year
… snip …
2022: Saturday, Common Year

Day Counts Summary:
Sunday: 14
Monday: 14
Tuesday: 15
Wednesday: 14
Thursday: 14
Friday: 15
Saturday: 14
```
Ultimately, ChatGPT struggles with this because it has to retrieve these answers from a precomputed model, and the answers clearly are not in there. It cannot simply compute them; discrete algorithms and well-understood libraries will continue to rule there.
Tying this back to Cybersecurity …
ChatGPT failed gloriously to answer a similar question in a different chat (which I unfortunately cannot share here): Given an endpoint security alert indicating that the following was executed, is this safe or dangerous? `rundll32.exe msrating.dll,RatingSetupUI`
ChatGPT wrote amazing prose explaining that `rundll32.exe` is a Microsoft-signed executable, which is by its nature safe, and in this case is calling a Microsoft DLL (`msrating.dll`) with the `RatingSetupUI` entry point, which is known to belong to the Microsoft Content Advisor.
ChatGPT: All is safe in the world, please go back to sleep.
The astute will instantly recognize the problem: we don't know the full path of `rundll32.exe`, nor do we know the full path, hashes, or digital signature of `msrating.dll`, so we cannot state for certain whether these are in fact the authentic Microsoft binaries or fakes. They could be sitting in `%appdata%` for all we know, rather than in `C:\Windows\system32`. Bottom line: there is investigation work to be done to determine whether this is a live event worth engaging incident response over or a benign false positive.
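To make that concrete, here is a hedged Python sketch of a first triage pass. The `triage_binary` helper and the sample paths are hypothetical, and real verification would also check the Authenticode signature with platform tooling (e.g., PowerShell's `Get-AuthenticodeSignature`), which plain Python can't do:

```python
import hashlib
from pathlib import Path, PureWindowsPath

# Hypothetical first-pass triage: a rundll32.exe or msrating.dll loaded
# from anywhere other than System32 deserves a closer look, and the
# SHA-256 can be compared against known-good hashes or a reputation service.
EXPECTED_DIR = r"C:\Windows\System32"

def triage_binary(reported_path: str) -> dict:
    parsed = PureWindowsPath(reported_path)  # parses Windows paths on any host
    in_expected = str(parsed.parent).lower() == EXPECTED_DIR.lower()
    on_disk = Path(reported_path)
    digest = (hashlib.sha256(on_disk.read_bytes()).hexdigest()
              if on_disk.is_file() else None)
    return {"path": str(parsed), "in_expected_dir": in_expected, "sha256": digest}

# Check both the host process and the DLL the alert says it loaded;
# the %appdata% copy here is an invented example of what a fake would look like.
for binary in (r"C:\Windows\System32\rundll32.exe",
               r"C:\Users\someuser\AppData\Roaming\msrating.dll"):
    print(triage_binary(binary))
```

A location or hash mismatch doesn't prove compromise on its own, but either one is enough to keep the alert open rather than closing it as benign.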
Why can’t ChatGPT make a better decision?
First, it's limited to the data in its model. ChatGPT is basically a reflection of something between a random person on the street and the troves of documents it has slurped up into its neural network. Decisions that require deep domain expertise require a deeply tuned model, and that model doesn't exist.
Second, if you ask 10 people off the street whether the above endpoint security alert is dangerous or benign, it's a coin toss whether they'll be correct. If you ask them to justify their answer, you'll likely get zero correct participants in your survey (unless maybe you're at a security conference).
Takeaways
- This is yet another blog post about AI and security. Sorry about that, but this one is mine. I'll refund your purchase price if you don't like it.
- LLMs will absolutely be a line of demarcation: there is before LLMs and after LLMs, and both enterprise and consumer-grade applications will never be the same. Those who don't keep up will be left behind.
- But at the same time, discrete algorithms and COMPETENT human intuition are still superior for making important decisions — especially security decisions.