Ep 301: Anthropic Claude 3.5 Sonnet – How it compares to ChatGPT’s GPT-4o

Comparative Review of Anthropic's Claude 3.5 Sonnet and OpenAI's ChatGPT GPT-4o

In the rapidly evolving tech landscape, two AI tools, Anthropic's Claude 3.5 Sonnet and OpenAI's ChatGPT, were evaluated as a means of transforming podcast transcriptions into engaging newsletter content. Both achieved the necessary turnaround time, but they differed in tone and in how well they mimicked the example style. These differences call for a closer look at the role such tools play in augmenting business capabilities.

Claude 3.5 Sonnet vs GPT-4o: Format vs Content

Both AI models showed some prowess in mimicking the desired newsletter style, but formatting problems surfaced. Neither tool adhered to the example formats, notably in Claude's bullet-pointing and ChatGPT's creation of subpoints. Engaging, compelling content, a staple of powerful newsletters, also remained elusive for both.

Claude 3.5 Sonnet Artifact Feature

Perhaps one of the most exciting developments is the Artifacts feature. It lets users perform a function on one side of the screen and see the outcome rendered on the other, which is a boon for coding and visualization. Because changes can be requested iteratively, producing code for a website layout becomes significantly faster.

Anthropic's Claude 3.5 Sonnet vs OpenAI's ChatGPT GPT-4o

Side-by-side comparisons reveal insightful details about the capabilities of the two models. Although Claude's increased speed and powerful vision model are commendable, ChatGPT's ability to handle large spreadsheets gives it an edge. Notably, Claude's lack of real-time internet access is a significant obstacle, given the importance of dynamic data in contemporary business scenarios.

AI Handling of Common Problems

The evaluation of the models on commonplace problems yielded mixed results. Although both models answered several trick questions correctly, both tested poorly on creative prompts and on adhering to given instructions, raising questions about their scope in complex and ambiguous scenarios.

Brainstorming and Vision Prompts

Both models were engaged in a brainstorming activity to build a new company brand for a futuristic device. Following this, vision prompts for identifying food items and a photograph's location were conducted, further highlighting the shortcomings and strengths of the AI tools.

Advancements In AI

Recent AI advancements have made it increasingly important to evaluate and invest in the right AI tools, understanding their capacities and shortcomings. While Claude 3.5 Sonnet shows promise with its powerful vision model and speed, OpenAI's GPT-4o retains its edge in data processing and interactivity.

These AI assistive tools can transform the way businesses function, making them essential for productivity and growth. With constant evolution and advancements, understanding the right tool to employ plays a crucial role in ensuring business success in the competitive digital landscape.

Topics Covered in This Episode

1. Overview of Anthropic Claude 3.5 Sonnet
2. Claude 3.5 Sonnet Artifacts Feature
3. Comparison of Claude 3.5 Sonnet to GPT-4o
4. Real World Examples and Use Cases

Podcast Transcript

Jordan Wilson [00:00:16]:
I'm actually going to be using Claude now. Yes. Yes. Yes. The same guy that loves ChatGPT. I'm actually taking a look at Anthropic's new Claude 3.5 Sonnet. It's really good. So, yeah, Anthropic just released its new model, and I can't lie.

Jordan Wilson [00:00:36]:
It is super impressive. But there's one missing feature that keeps it from being an actual ChatGPT killer, and there's one new Claude feature that I think is so good, I recommend that everyone uses it. It's going to, I think, change how we interface with large language models. Alright. We'll be going over that today and a lot more on Everyday AI. What's going on y'all? My name is Jordan Wilson, and I'm the host of Everyday AI. And we're a daily livestream podcast and free daily newsletter, helping everyday people like you and me learn and leverage generative AI.

Jordan Wilson [00:01:16]:
So, today's episode, I'm extremely excited about. I'm so excited. I'm actually gonna be in the comments with you live. So, yeah, this is technically a prerecorded one. I'm actually gonna be on the road, going to chair a panel at the AI, Chicago Summit. So if you're there, hit me up. Let me know. But let's just get straight into it.

Jordan Wilson [00:01:40]:
And quick reminder, yeah, we normally go over the AI news. I don't wanna bring you stale news. Our newsletter is still gonna be fresh to death. So make sure you go to youreverydayai.com. Sign up for that free daily newsletter. We'll still be bringing you the AI news for today and everything else that's happening around the web that matters to generative AI. But let's get into it.

Jordan Wilson [00:02:04]:
Let's get into it right now and talk about the new Claude 3.5 Sonnet, and we're going to see live. Yes. We're gonna do some live head to head comparisons that I'm super stoked about. I haven't even done these. So I'm gonna be doing these live with you all. So, yes, if you are listening on the podcast, make sure to check out the show notes because this is one of those ones. I'm gonna do my very best to explain this on the podcast with my words, but it's probably gonna be one that you're gonna wanna watch because we are gonna be doing head to head, side by side. We're gonna let these two new models duke it out.

Jordan Wilson [00:02:42]:
You know, Anthropic's model, we're gonna talk about that here in a second, but it is now the most powerful model that there is according to their benchmarks. So, we gotta see how it does in a little head to head battle. Alright. Let's get straight into it, shall we? Alright. So I'm gonna go ahead and, kind of first, we're gonna do a little bit of a walk through, and I'm gonna tell you what's new in 3.5. Then I'm gonna tell you what I like and what I don't like. Then I'm gonna give you kind of my thoughts on this large language model race that we have going on right now. And then last but not least, we're going to do the head to head.

Jordan Wilson [00:03:24]:
Alright. So with that, let's start at the top, and let's just go ahead and talk a little bit about what's new inside Claude. So, Claude 3.5 Sonnet. Here we go. So I have the announcement page here up for our livestream audience, but this is pretty straightforward stuff. So, we're gonna talk about this here in a second, but, you know, Claude 3 launched with 3 different variations. So it was Haiku, which was their least capable model, Sonnet, which was kind of their middle model, and then Opus, which is their largest or most capable model. So that was until a couple days ago when they released 3.5 Sonnet.

Jordan Wilson [00:04:07]:
So right now, it is just Claude 3 Haiku, small one. Claude 3 Opus, powerful one. But now the middle one is actually the most capable or intelligent one that's getting the highest benchmark. So more on that here in a second. So keep that in mind. It's actually, Anthropic did this to their middle model. They came out with this iterative 3.5. And the other ones, you know, Claude Haiku, the small one, still 3.

Jordan Wilson [00:04:34]:
And then the previous most powerful one, Claude Opus, is still on 3. So keep that in mind. It's actually their middle model that is the most capable. So, you know, just kind of reading off the press releases right here. So we haven't done all of these things, you know, to give them kind of a fair shake, see if this is actually what's happening. But, Claude says that 3.5 Sonnet is twice the speed. Alright. And here's the important part.

Jordan Wilson [00:05:03]:
Here's what people are going to be talking about, and you're gonna be seeing this chart that's on my screen probably a lot. If you follow AI news, if you follow large language models, generative AI, you're gonna be seeing this chart a lot. So, essentially, this is a benchmarking chart. And you see, kind of at the top, we have Claude 3.5 Sonnet. Their previously strongest model, which is Claude 3 Opus. You have GPT-4o, and then you have Gemini 1.5 Pro, and then you have Llama 400B. Alright. So couple of things here.

Jordan Wilson [00:05:40]:
Couple of things to keep in mind. These are Anthropic's benchmarks, but usually if a company puts out benchmarks like this, they are fairly accurate. I just wanted to put that out there that, you know, these are Claude's benchmarks. These aren't a third party's benchmarks. But, essentially, what these benchmarks show is that Claude 3 Son... sorry. Claude 3.5 Sonnet is blowing every other model out of the water. Both Claude's previous most powerful model, Claude 3 Opus, as well as GPT-4o from OpenAI, Gemini 1.5 from Google, and Llama 400B, which is an early snapshot from Meta. That's important to keep in mind, that production model was actually not out yet.

Jordan Wilson [00:06:24]:
So this is just what Meta kind of shared previously. So the Llama 400B is not publicly out. The other models all are. So, essentially, what we are seeing in this chart is Sonnet is much higher than everyone. Right? Probably the one that we talk about the most often and I think is the most telling is the MMLU. And it's actually interesting, because that is the only metric where it is still tied with GPT-4o. But what is interesting here is for Claude 3.5 Sonnet, essentially, they have the exact same score. So this is the only benchmark that Claude 3.5 Sonnet was not quote, unquote winning by itself.

Jordan Wilson [00:07:14]:
So it's in a tie for first according to Anthropic here, with an 88.7 on this MMLU versus GPT-4o's same score, 88.7. However, kind of 2 different testing methodologies. So, Claude 3.5 Sonnet had a 5 shot, whereas GPT-4o had a zero shot but chain of thought. So kind of 2 different prompting methodologies here. So that's why I think we have to wait until we have a third party kind of go through and do this to see when we're comparing apples and apples. But I mean, aside from that, the new Claude 3.5 Sonnet, pretty far ahead. Alright. A couple other kind of updates.
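
The difference between the two prompting setups mentioned here can be sketched in a few lines of Python. This is only an illustration: the sample questions, the function names, and the exact prompt wording are made up for this note and are not taken from MMLU or from either company's evaluation setup. A 5-shot prompt shows the model five solved examples before the real question, while a zero-shot chain-of-thought prompt shows none and asks the model to reason step by step.

```python
# Hypothetical sketch: the sample questions and prompt wording below are
# invented for illustration and are not real benchmark items.
SOLVED_EXAMPLES = [
    ("Which planet is closest to the sun? (A) Venus (B) Mercury", "B"),
    ("What is 7 times 8? (A) 56 (B) 54", "A"),
    ("Which gas do plants absorb? (A) Carbon dioxide (B) Helium", "A"),
    ("How many sides does a hexagon have? (A) 5 (B) 6", "B"),
    ("What is the chemical formula for water? (A) H2O (B) CO2", "A"),
]


def five_shot_prompt(question: str) -> str:
    """Show five solved examples, then ask the new question."""
    shots = [f"Question: {q}\nAnswer: {a}" for q, a in SOLVED_EXAMPLES]
    return "\n\n".join(shots + [f"Question: {question}\nAnswer:"])


def zero_shot_cot_prompt(question: str) -> str:
    """Show no examples; ask the model to reason before answering."""
    return f"Question: {question}\nLet's think step by step."


if __name__ == "__main__":
    new_question = "What is the capital of France? (A) Paris (B) Rome"
    print(five_shot_prompt(new_question))
    print("-----")
    print(zero_shot_cot_prompt(new_question))
```

Because the two prompts give the model different amounts of help, two identical scores obtained this way are not a like-for-like comparison.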

Jordan Wilson [00:08:00]:
So, Anthropic is saying that the new vision model inside of Claude 3.5 Sonnet is much more powerful. And then there's some of these testing scores that use vision, essentially showing 3.5 Sonnet way far ahead of everyone else, aside from one, again, on MMMU, which is kind of the multimodal or multimedia version of that other test, where GPT-4o is still ahead. But everything else, I mean, it's pretty significant here. You know, even just Sonnet over Opus. I mean, one that really stood out to me, which is, you know, Claude 3 was very far behind in this, was visual math reasoning where the previous version of Claude 3 was a 50.5, whereas GPT-4o was a 63.8. Gemini was 63.9. And now, I mean, it just shot up in that one. Claude 3.5 Sonnet went up to 67.7.

Jordan Wilson [00:09:04]:
So, you know, we'll we'll leave this in the in the show notes so you can go, check it out. But I'd say this right here is the biggest feature, which I love. It is the, artifacts. So, we'll probably I'll probably try and show you an example of that one here live, so you can check that out. But essentially, artifacts is, you know, where you can perform a function on the left. This really works for more coding and visualization. So you can do something on the left. You can generate code or something on the left, and it will render it on the right, which I think is really really cool.

Jordan Wilson [00:09:47]:
Alright. So that is essentially I'm gonna go ahead and swap my screen a little bit here. So we're gonna go ahead and do a little bit of this live. I wanna show you at least that one feature. I think it is very impressive. Alright. Let's see if we have, the right one here. Alright.

Jordan Wilson [00:10:11]:
We do. Perfect. There we go. Alright. So let's go ahead. Nope. That's not the right one. Sorry y'all.

Jordan Wilson [00:10:18]:
Give me a second. Hey. When we do this live when we do this live, you know, this is unedited, unscripted. Sometimes it takes a hot second. Alright. Cool. So let's just go ahead. We'll do something fun.

Jordan Wilson [00:10:29]:
Here's what I'm gonna do. So I'm just gonna go to the Everyday AI website here. Alright. So I'm gonna take a little screenshot of this website, our home page. Alright? I'm gonna click save. I'm gonna save that little screenshot. Now I am jumping back in Claude, and I'm just gonna say, I don't like how Claude forces you to put something in to get their normal layout. Okay.

Jordan Wilson [00:10:58]:
Anyways, so this new artifacts feature, I love. So a couple other things to keep in mind, this new version, 3.5 Sonnet, is free, which is amazing, but it is very throttled. So, but if you have a paid account, much more generous rates. So, you can go into your account. You can go to feature preview. Okay? And then you just need to toggle this artifacts feature on. Okay? So, now that I did that, I previously did that. So now all I'm gonna do is I'm gonna upload this screenshot that I just took of the Everyday AI website, of the home page.

Jordan Wilson [00:11:38]:
Alright. So now I'm just saying to Claude, so I have 3.5 Sonnet here at the bottom. Hopefully, y'all can see that. So I'm just gonna say to Claude, I'm gonna say please recreate this as a front end design. I think Claude should do a pretty good job of doing this. So right away, we kind of saw a code generator come up, and I can click to open the component here on the right hand side. So now I can see on the right hand side all the code going. Alright.

Jordan Wilson [00:12:14]:
Interesting. Hey. In this example that I tried here live, it didn't work. So I tried it again. It didn't work. What a bummer. Alright. That's okay.

Jordan Wilson [00:12:24]:
So I'm I'm just gonna go ahead and say, generate a, front end website design, for a company called Everyday AI. Alright. We'll just we'll just do it this way, a little better just so you can see this new feature. Hopefully, have it working. We'll see if it actually does. Yeah. The first one didn't work. That's why I do these live.

Jordan Wilson [00:12:50]:
Nothing's nothing's scripted. Nothing's edited. So let's actually see if this one, if this one works here. I think that would that would be super helpful. Alright. There we go. Perfect. So at least now you can see, kind of kind of what this looks like.

Jordan Wilson [00:13:05]:
So for the podcast audience, I essentially said, hey. On the front end, generate me some, generate me some code, right, for a landing page design. Right? So I can click over here and you'll see this code. Alright. So it looks like it wrote some, some React. So we have all of our code here. But the cool thing is, which I love, you can preview this in real time. So I'm looking, I mean, wow.

Jordan Wilson [00:13:32]:
This is good. I have what looks like a fully functioning website. And the thing that I love is you can modify it just with your text. Right? So it's nothing great. It says Everyday AI. It says AI solutions for everyday challenges. There's a little button. It says our solutions.

Jordan Wilson [00:13:50]:
There's a contact. There's a header. There's a footer. There's a hamburger menu, because I'm in, probably, like, mobile mode here. It's super impressive. It works. Right? I'm clicking on this mobile menu, and it works. It renders in real time, which is amazing, but the thing I love is you can talk to it iteratively.

Jordan Wilson [00:14:08]:
So I can say, like, looks great. I'm gonna say, let's change the, blue color to a darker blue, and make the button a turquoise blue. I should probably spell turquoise correctly. I don't think it matters. And then I'm gonna say, let's change the logo. Right? So simple things. Right? This might be something if you've ever worked in web design. You probably do a lot of back and forth in this, but it takes days or dozens of hours to to do this kind of stuff.

Jordan Wilson [00:14:47]:
Sometimes, especially 15, 20 years ago to make these kind of changes, would take a very long time. So now on the right hand side, it is changing, the code. I'm watching it do it live very fast. Alright. So let's see how it did here. So didn't do the best. Alright. So I did ask for a main blue hero section, but it's not blue anymore.

Jordan Wilson [00:15:11]:
It's kind of a whitish gray color. It did change the logo, so we got that right. I asked for turquoise blue as the button. Didn't get it. So, I'm I'm curious. I'm gonna go ahead and click retry to see if it, to see if it gets it correct. I I really wanted this one to work. So, you you know, brand new feature.

Jordan Wilson [00:15:31]:
Maybe it's a little finicky. Maybe just the, it's the the the model itself. So again, I'm not sure if it's the model itself that's not correctly generating the code or if it's something with this new artifacts feature. I don't wanna take too long because we still gotta get our head to head, but bam. There we go. Hey. The second time, it actually worked well. So I'm glad I clicked regenerate.

Jordan Wilson [00:15:51]:
Y'all generative AI is generative. That's what people don't understand a lot of times. Right? Sometimes, you know, it's essentially next token prediction, but it's, like, insanely good next token prediction. So I regenerated, and now we have everything that I asked for. I asked for a dark blue background. It's dark blue. I asked to make the the button turquoise. It's turquoise.

Jordan Wilson [00:16:11]:
I asked for a new logo, new logo. Again, we have a working menu. It's it's actually a pretty slick looking, very simple website. Looks good. Alright. So now now what we're gonna do, y'all, let me tell you real quick. So I told you what's new. Right? So with the new model, it's benchmarking off the chart, the this artifacts.

Jordan Wilson [00:16:31]:
I'm gonna tell you what I love. Well, I love the artifacts. I think it is a great feature. I think it is going to change how we interface, with large language models. If I'm being honest, though, technically, OpenAI and ChatGPT had this first with the GPT builder because you could talk to the GPT builder conversationally on the left hand side, and then on the right hand side, it would generate your GPT that you could preview, and you could work with it iteratively. So it would essentially generate code, render it on the right hand side. So, technically, OpenAI had a feature similar to this, but, with Anthropic here, they did a little, some different things. So generating code, visualization, some basic graphics, which is really cool generating SVGs for example.

Jordan Wilson [00:17:17]:
So this I think is going to be something that we're gonna see out of all large language models. It is too good to not be included in the future of work, period. Right? Just something it seems simple. Right? But being able to render websites, visualizations, etcetera, super impressive. So that's what I like. What I don't like, and the one reason why I will still not be using Anthropic's Claude 3.5, or 3.5 Sonnet, as my main model. And why I would recommend you probably shouldn't either. Right? You should go in, play with it.

Jordan Wilson [00:17:54]:
Right? But it is not connected to the Internet. Y'all, I cannot emphasize that enough. Huge, huge miss. Huge miss from Anthropic Claude. And, you know, I'm curious because 2 of their, you know, biggest investors, I believe, are Amazon and Google. So those are 2 companies that should be able to help them with that. I do not understand for the life of me why Anthropic's Claude does not allow you to connect to real time data. For 95% of our use cases, and I would say 95% of everyone's use cases, you need to have real time Internet access.

Jordan Wilson [00:18:40]:
If you don't, you cannot be confident about the outputs. Right? The the the most important thing when you're using a large language model is is it correct? Do I feel, do I feel confident in the output? Yes. It does have, I believe, a very up to date, I think April 2024 knowledge cutoff. So it's not like we're working with data that's 2 years old. But y'all, pretty soon, the data is gonna get old. It's gonna get stale. And even think of the last 3 months. Think of how much the world has changed, how much your business has changed, how much your industry has changed in the last 3 months.

Jordan Wilson [00:19:14]:
You cannot be using a model like this for business purposes, for day to day business purposes. You can't. I wish I could because I actually really like it, but I cannot and I cannot recommend it to anyone else unless you're using it with a third party API that taps into, you know, an Internet connection. You know, Perplexity as an example. I switched over my Perplexity account from GPT-4o to Claude 3.5 Sonnet. That's a no brainer. Right? But working with Claude out of the box, I still cannot recommend it for the majority of people. If you're using it for very specific use cases or if you're manually bringing that data in at all times, sure.

Jordan Wilson [00:20:00]:
But that's a lot of data to bring in. Right? And you have to bring in data, in a static way. Right? You can't bring in dynamic data into Claude either. So, huge downside and still, it's it's such a big upside with the model and the artifacts. Amazing. But the downside, I think, is too big to recommend that people use this day to day for any business purposes. Again, unless you're trying to get it to, you know, oh, write some copy or something like that, that you don't necessarily need, you know, up to date information, or you're gonna be bringing it in yourself. Hey.

Jordan Wilson [00:20:31]:
I'm sorry. That's just the reality. Alright. So now let's get ready. Let's have some fun. Hopefully, this works. So you'll you'll you'll have to bear with me y'all. So let me go ahead and explain to you guys, what we're gonna do now, and I hope this works.

Jordan Wilson [00:20:49]:
Alright. So we are gonna be doing a live side by side battle. Alright? So we have Claude 3.5 Sonnet, and we have GPT-4o. Alright? So I'm gonna go ahead and share my window here. Hopefully, we can, let's see there. Alright. Bam. And we're in it.

Jordan Wilson [00:21:18]:
And we're in it, y'all. Alright. So let's get this thing going. So we are gonna go 1 at a time. 1 at a time. And let me just tell you a couple things. So if you are joining us, on the podcast or even if you're joining us live, I have to say a couple of things. Number 1, this is not a scientific test.

Jordan Wilson [00:21:42]:
Alright? This is not scientific. Alright? This is not definitive. I just wanted to be able to do some real life use cases. Okay? But I am only testing similar capabilities that the models share. Right? So I'm not gonna be asking it questions that require active information because Claude cannot do that. You know, I can't be asking it about things that have happened, you know, with, oh, the, you know, NVIDIA and Microsoft and Apple. They've been going back and forth, and, you know, I can't, you know, ask it current event questions, pop culture, because it doesn't know anything. It doesn't know anything recent.

Jordan Wilson [00:22:23]:
So I'm only kind of comparing capabilities that the models share. Also, I'm zero-shotting all these prompts. Right? Copying and pasting them in, which, if you want actual good outputs, you should never do. Right? You always need to go through prompt engineering basics. I recommend you do a methodology similar to our Prime Prompt Polish, which is a free course we do usually multiple times a week. So, hey, if you've taken it before, if you're joining us live, if you're still around, let people know if you like the course or not. If you wanna access, just type in PPP. I'll get around to it.

Jordan Wilson [00:23:01]:
I swear. Alright. Enough jibber jabber. Let's go. A couple other things. I'm starting with a simple prompt here just to kind of set the stage for both models. Alright? So, essentially, what I'm saying, and we're gonna zoom in here and hopefully everyone can see. Alright.

Jordan Wilson [00:23:21]:
We are gonna be working in one chat window the whole time. Why that's important is, well, it's technically keeping all of this in its context. Right? So I'm not gonna ask it really, aside from maybe once, to draw on something that was a previous, that was previously in the context of this chat. But, again, this is also not how you would normally want to be using a large language model. You would go in, use it for one specific purpose, one specific purpose only, and then go back to that chat whenever you need it. Right? We are not doing that here. Alright? We are doing this just for the sake of easy copy and paste. Alright.

Jordan Wilson [00:24:00]:
So with that, let's go ahead and get started. So what I did to get this kind of prompt started, I said, for this chat, please respond with proper formatting and structured bullet points. Do not waste words and answer in the shortest way possible while still being detailed enough to fill the request and answer the questions. Do not answer in vague or general statements. Please take your time before each response. Go step by step and do your best to answer each question to the best of your abilities. Are you ready to start? Alright. Essentially, large language models have the tendency to be overly verbose.

Jordan Wilson [00:24:33]:
Right? They're gonna jam in like so many words. We don't need that. We're just doing comparison, so I'm telling both models, just give me the answers. You don't gotta take me on a journey here, y'all. Alright. Here we go. Let's go with the first one. We are gonna start with some logic questions.

Jordan Wilson [00:24:48]:
Alright. So, hopefully, I mean well, I guess we'll see if, if they get it right. So I'm gonna try to always hit enter at the same time. So what I'm saying is I said, I just woke up today with 6 apples and 3 bananas. Yesterday, I ate a banana and 2 apples. This morning, I will eat 1 apple and no bananas. However, I don't really like apples, and 1 banana may turn brown tomorrow. Assuming nothing else changes, how many apples and bananas will I have tonight? So, obviously, I threw in some irrelevant information to see if I could throw the models off.

Jordan Wilson [00:25:20]:
The correct answer should be 5 apples and 3 bananas. So let's see, let's see how they did. Alright. So it looks like I threw off both of them. Alright. Both of them answered wrong. They said tonight you will have 3 apples and 2 bananas. So I threw them both for a loop.

Jordan Wilson [00:25:45]:
The only thing that matters is that I woke up today with 6 apples and 3 bananas. So yesterday, it's both of them did the same. It I kind of tricked them. Right? So I I put in some information about yesterday because, well, it said yesterday, and it's subtracting that from today's total. Well, no. Yesterday, I actually had, 4 bananas and 8 apples. Right? I'm saying today, I woke up with this many apples. I'm eating 1 apple, so it should have been 6 apples minus 1, 5 apples.
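
The arithmetic here can be written out as a short Python sketch. The variable and function names are made up for this note. It contrasts the intended reading, where only what happens today changes the count, with the mistake both models appear to have made, which is to subtract yesterday's fruit as well.

```python
# Illustrative sketch of the fruit puzzle; all names are hypothetical.
WOKE_UP_WITH = {"apples": 6, "bananas": 3}      # counts this morning
EATEN_YESTERDAY = {"apples": 2, "bananas": 1}   # already reflected above
EATEN_TODAY = {"apples": 1, "bananas": 0}


def tonight_correct() -> dict:
    """Only today's eating changes this morning's counts."""
    return {k: WOKE_UP_WITH[k] - EATEN_TODAY[k] for k in WOKE_UP_WITH}


def tonight_model_mistake() -> dict:
    """Wrongly subtracts yesterday's fruit from this morning's counts too."""
    return {
        k: WOKE_UP_WITH[k] - EATEN_YESTERDAY[k] - EATEN_TODAY[k]
        for k in WOKE_UP_WITH
    }


if __name__ == "__main__":
    print(tonight_correct())        # {'apples': 5, 'bananas': 3}
    print(tonight_model_mistake())  # {'apples': 3, 'bananas': 2}
```

The second result matches the 3 apples and 2 bananas that both models reported.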

Jordan Wilson [00:26:16]:
I'm not eating any bananas. They both got it wrong. Alright. So, generally, we can do a little bit of prompt engineering, and it probably would have done a little bit better. But out of the box, again, we're just comparing 3.5 Sonnet with GPT-4o. I'm gonna try as well. Sorry y'all. I know hey.

Jordan Wilson [00:26:38]:
If if you're joining us on the podcast, I'm sorry. This is one of those ones we don't really edit this, but I do wanna keep a a running track of score to see who's winning. So so far, we have it is 0 to 0. That makes it super easy to keep score here. Alright. Next question. Let's go. We're gonna get a little faster going from this.

Jordan Wilson [00:27:01]:
Alright. So this is some of these are common problems that large language models generally, get wrong. So I'm saying here a man and his dog are standing on one side of the river. There's a boat with enough room for 1 human and one animal. How can a man get across with his dog in the fewest number of trips? Alright. So both of them got it wrong. Both of them said the same thing. Again, this is a very famous, problem that large language models always get wrong.

Jordan Wilson [00:27:29]:
So it says trip 1, the man crosses the river with his dog. Trip 2, the man, returns home alone. Trip 3, the man crosses the river again with the dog. Yeah. It should be one trip. Each of them said, said 3 trips. So, there we go. We're starting off with 2 wrong.

Jordan Wilson [00:27:46]:
But, again, these are common types of problems that large language models get wrong. The reason why I'm doing this still is, well, I wanna see if Claude 3.5 Sonnet that just came out, I haven't tried these, if it finally figured these out. And I do have to figure that eventually these models are gonna get these because all of these models are now, or sorry, all of these common problems are on the Internet. They're being scraped. So, I'm actually not sure why Claude 3.5 Sonnet didn't get these correctly because the knowledge cutoff is April 2024. Right? You know, GPT-4o knowledge cutoff October 2023. Maybe these weren't on the Internet, but these are, I mean, you can find these in hundreds of places on the Internet, so I'm surprised that it got it wrong.

Jordan Wilson [00:28:32]:
Alright. Next logic question. We're still doing some logic here. Alright. This is if it takes 3 days sorry. I'm I'm gonna zoom in here. Sorry. I said if it takes 3 hours to dry 10 t shirts in the sun, how long will it take to dry 30 t shirts in the sun? The correct answer should be 3 hours.

Jordan Wilson [00:28:53]:
It's the same amount of time. It doesn't triple it, so let's see if it got it. So both of them, Claude and, ChatGPT, were not fooled on this one and got them both correct. Alright. Finally, large language models. You were having us worried for a second. Alright. Another simple again, we're going with some simple logic, kind of some brain teasers here.

Jordan Wilson [00:29:16]:
So let's go ahead. We're doing our next one. Our next one is if you have a single match and you walk into a room with an oil lamp, a candle, and a fireplace, which do you light first? Right? You light the match first. So both of them got this correct. Alright. Good. Yeah. Again, we're going with some trick questions.

Jordan Wilson [00:29:38]:
You know, some people might say, you know or you might think or maybe some of the older models might say, oh, you know, you would do the candle first. Right? Or, you know, I don't know. Alright. So got that right at least. Our next question. And hey. Live stream audience, I'm gonna give you the question first. I want you to guess this.

Jordan Wilson [00:29:55]:
Let's see if if we're all smart smarter than a large language model. What color is an airplane's black box? I'm gonna take a sip. Everyone joining this live, please get your get your, guesses in right now. Alright. Here we go. Let's see. I really hope they get it right. Alright.

Jordan Wilson [00:30:19]:
So ChatGPT got it right, an airplane's black box is actually bright orange. Alright. And Claude got it correct as well. Alright. So they both got it correct. Fantastic. And we have 2 more logic questions. People wanted logic questions, so I'm giving you all logic questions.

Jordan Wilson [00:30:38]:
Alright. Our next one is I'm saying, please give me 7 jokes that end in the word blue. 2 should be about animals. 3 should be about some other topic in the body of this chat, and you can make up the other 2. Alright. So let's see how these worked. Alright. So they both okay.

Jordan Wilson [00:31:02]:
They both sorted it out. They said animal joke, animal joke. Claude got blue, blue. Alright. Alright. And let's see. ChatGPT. Good.

Jordan Wilson [00:31:13]:
I personally like this joke better from ChatGPT, which said, what do you call a sad polar bear? A burr blue. Alright. Okay. So here we go. I'm seeing here 2... okay. They got 2 of them about something else in the chat. Oh, no. I said 3 should be about some other topic in the body of this chat.

Jordan Wilson [00:31:38]:
Okay. So it looks like Claude did, and they all 3 end in oh, they don't end in blue. So Claude already failed. This one ends in the word burn. Let's see if alright. So, OpenAI is actually pulling in some information from somewhere else, that it shouldn't be pulling in. Let's see here. So, yeah, it's it's it's pulling in some, some information I had from a different chat.

Jordan Wilson [00:32:11]:
So but it still does say blue. That one, not really. Alright. So they both they both failed. Alright. So they both failed. They're both getting I mean, they kinda got some of it right, but they both failed. It wasn't all blue.
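The "ends in the word blue" requirement is easy to check mechanically rather than by eye. A minimal sketch in Python, assuming each joke is available as a plain string; the helper name `ends_with_word` is hypothetical, the first sample is the punchline quoted on the show, and the second is a made-up placeholder standing in for the joke that ended in "burn":

```python
import string


def ends_with_word(text: str, word: str) -> bool:
    """Return True if the last word of `text` equals `word`,
    ignoring case and any trailing punctuation."""
    tokens = text.split()
    if not tokens:
        return False
    last = tokens[-1].strip(string.punctuation).lower()
    return last == word.lower()


# First entry: quoted in the episode. Second entry: placeholder text only,
# not the model's actual output.
jokes = [
    "What do you call a sad polar bear? A burr blue.",
    "Placeholder joke whose last word is burn.",
]

results = [ends_with_word(joke, "blue") for joke in jokes]
all_pass = all(results)  # the prompt only passes if every joke ends in "blue"

for joke, ok in zip(jokes, results):
    print("PASS" if ok else "FAIL", "-", joke)
```

A check like this only covers the ending word. Whether 2 jokes are about animals and 3 are about a topic from the chat still needs a human read.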

Jordan Wilson [00:32:26]:
Alright. And then last and, actually, because of ChatGPT's memory and how it works, it's pulling things in from previous chats from other chats, which, it shouldn't have done. Alright. So now a little bit of a brain teaser here. So I'm saying a box is locked with a 3 digit numerical code. All we know is that all digits are different. The sum of all digits is 9, and the digit in the middle is the highest. What is the code? Alright.

Jordan Wilson [00:33:01]:
So, Claude says the only possible combination is 1 + 7 + 1 equals 9. So it says 172. Right? Okay, middle number's the highest. 3 can't be the same. All add up to 9, but that is not right. So 172, those digits do not add up to 9. Those digits add up to 10. So Claude got that wrong.

Jordan Wilson [00:33:27]:
Alright. Let's look at ChatGPT here. So it's going through step by step. Again, we instructed the models in the beginning to have this kind of step by step problem solving logic, to hopefully give them both the highest likelihood of getting this correct. Alright. So, ChatGPT got it correct. No. It didn't.

Jordan Wilson [00:33:49]:
It got it wrong. So it said 126, middle digit is the highest, which that is not. So they both got it wrong. They got it wrong in different ways. Alright. Next one y'all. Let me know. Let me know what you think.
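Both answers can be checked by brute force, since there are only 1,000 three-digit codes. A minimal sketch in Python, assuming the three clues exactly as read out on the show (all digits different, digits sum to 9, middle digit strictly the highest) and assuming leading zeros are allowed on a lock; the function name `matches` is hypothetical:

```python
from itertools import product


def matches(code: str) -> bool:
    """Check a 3-digit code against the three clues as stated in the episode."""
    a, b, c = (int(ch) for ch in code)
    all_different = len({a, b, c}) == 3
    sums_to_nine = a + b + c == 9
    middle_highest = b > a and b > c
    return all_different and sums_to_nine and middle_highest


# Enumerate every code from 000 to 999 and keep the ones that satisfy the clues.
candidates = ["".join(map(str, digits)) for digits in product(range(10), repeat=3)]
solutions = [code for code in candidates if matches(code)]

print(len(solutions), solutions)

# The two answers given on the show.
for answer in ("172", "126"):
    print(answer, "passes" if matches(answer) else "fails")
```

Under these three clues alone the search finds 14 codes (162 and 243, for example), so the riddle as read aloud does not pin down a single answer; the original puzzle presumably carries another clue that is not in the transcript. Either way, 172 fails on the sum and 126 fails on the middle digit.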

Jordan Wilson [00:34:03]:
Are you surprised or not surprised with this? Alright. Here we go. So I'm gonna say our next little quiz here. So now we're going into some brainstorming. So, my prompt here is: generate unique and creative marketing, advertising strategies to grow the Everyday AI podcast. Do not suggest general or run of the mill ideas. Only pitch clever advertising and marketing tactics to specifically grow the Everyday AI podcast. Alright.

Jordan Wilson [00:34:35]:
So there's no right or wrong here. So this is going to be, kind of just judging, but more or less, I'm just seeing if it if it did the if it followed the directions. Right? I said do not suggest general or run of the mill ideas. Only pitch clever advertising and marketing tactics. So I'm just looking. Did it at least follow directions or and did it not hallucinate? That's what I care about. Right? Okay. So it looks like Claude as an example said, okay.

Jordan Wilson [00:35:04]:
AI generated episode teasers, virtual AI cohost challenge, AI powered listener Q&A, augmented reality podcast experience, AI ethics dilemma game. Okay. So those are fine. They're not really marketing or advertising necessarily. These look like features of different shows. Alright. Let's see how ChatGPT did. So it said monthly AI puzzles, AI art contest, custom episode recommendations, personalized episode.

Jordan Wilson [00:35:40]:
It did the same with a guest AI cohost, which I actually just did with Hour One last week. So make sure to go check that out. So, pretty similar. I would say I'm not super impressed by kind of the prompt following because it gave a lot of just very general ideas. So, it's not that I'm failing either of these here, but, you know, nothing really blew me away. So, no one's getting right or wrong. We're just kind of getting passes. So, we're getting passes there.

Jordan Wilson [00:36:11]:
Alright. So let's keep it going. Our next one here in our brainstorming. Alright. So our next one, I'm saying create a new company. Yeah. Let's see how it does here. Create a new company and brand for a future smart home device.

Jordan Wilson [00:36:26]:
This will solve a problem that does not currently exist. To start, come up with the company's name and its first flagship product. Give the product a name, branding campaign, go to market strategy, tagline, and rationale for why it will work. Respond in a succinct way, keeping responses to short bullet points, but with ultra specific facts. Alright. So one thing, I mean, and this is more of a Claude thing. It can't really respond with formatting, which stinks. Personally, I find Claude hard to read for that reason.

Jordan Wilson [00:36:57]:
I don't know if it's my eyes or what. I like to have different weights of bold, different text styling. You don't really get that in Claude. You get that in ChatGPT. That's a small thing. Alright. So let's just go over some of the things quick. So, ChatGPT said company name is Quantum Haven, and the flagship product is the SleepSync Sphere.

Jordan Wilson [00:37:22]:
Alright. And then, Claude has Neural Dwell, and the product is Mood Morph. So Mood Morph is a wall mounted AI powered emotional environment adjuster. Let's see if it gave us the problem that does not exist. K. I don't really see anything there. Alright. Let's look at the motto at least.

Jordan Wilson [00:37:42]:
It says live in sync with your feelings. Alright. Didn't really give us what problem it's solving, but it did give us everything else. It gave us a branding campaign, go to market strategy, and the rationale behind some of these things. Alright. So ChatGPT: Quantum Haven, SleepSync Sphere. That is a mouthful. Alright.

Jordan Wilson [00:38:02]:
So it harmonizes household sleep cycles by analyzing and optimizing environmental factors. What's interesting here is they kind of created similar products. Okay. And branding campaign, it says dream in harmony. Visuals. Let's see, visuals, channel. So we have our branding campaign, our go to market strategy, our tagline, which is sync your sleep, transform your life. Rationale. Here, we at least have the problem solved.

Jordan Wilson [00:38:30]:
That was kind of a big part of it. Claude just didn't address that. The whole point of this was to, you know, create a product that solves a problem that doesn't exist. So I'm I'm not gonna give it a pass fail, but I will say, Claude didn't pass very well. I'll give them both a pass, but, ChatGPT did a little better. Alright. Here we go. This is where my screen sharing might get a little crazy, so let's hope this works okay.

Jordan Wilson [00:39:00]:
So now what we're doing is I'm saying, please tell me what this chart is, list everything by category, and give me 13 of the best ways that Everyday AI could use these things. So here's... let's see if I can bring this up here. Alright. Actually, this is running long. We're gonna skip this one. We're gonna go to the next one. Alright. Because these ones are kinda similar.

Jordan Wilson [00:39:22]:
Alright. So this one I have I have a picture here. It's just some random food items. Right? But, it's kinda hard to see, even with the human eye when you zoom in. So what I'm asking it to do here, let's just go ahead. I just wanted everyone to be able to see that. So I'm I'm asking it to please identify, what these food items are, save the data in an organized spreadsheet. So, again, something that people don't know about large language models is they can see.

Jordan Wilson [00:39:52]:
They have computer vision capabilities, which is, you know, obviously a pretty, a pretty important thing for for models to have. So I'm gonna go ahead and drop, this in here. I'm gonna go ahead and drop the food label in Claude. So the good thing is in both of them, you can just drag and drop. So I'm saying, please identify what these food items are. Save the data in an organized spreadsheet that I can download. After that, please create a JSON file with all the structured data. Alright.

Jordan Wilson [00:40:22]:
I'm kind of shooting them off at the same time, so we'll also see which ones kinda go quicker. Alright. So both of them going pretty quick. Let's see if one of them, finishes before the other. Okay. So ChatGPT created the spreadsheet first. It looks like okay. I did not know this actually.

Jordan Wilson [00:40:45]:
Again, I haven't used this new Claude 3.5 Sonnet too much. So it says Claude does not have the ability to run the code it generates yet. However, it literally does. We saw that. But for whatever reason, it did not save a downloadable spreadsheet. Okay. But Claude did go ahead and do the JSON. However, it only was able to recognize 3 different items.

Jordan Wilson [00:41:14]:
Okay. Interesting. There were 10 items in the photo. So it only got 3 of them, and it did not get a lot of the ingredient I I I guess I didn't ask for ingredients. So what I said is save the data in an organized spreadsheet. So I didn't ask it to, I probably should do that. So I'm gonna go ahead and run this one more time. I didn't wanna rerun these.

Jordan Wilson [00:41:40]:
I'm gonna say, please I'm gonna say, please, list all of the ingredients. All of the not the ingredients. The nutritional information. So I'm gonna say, please list all of the nutritional information in the chart and the JSON file. Alright. So let's go ahead and try to run this one more time. Alright. I didn't wanna have to do these twice, but it is what it is y'all.

Jordan Wilson [00:42:14]:
Alright. So let's go ahead and try that one more time and see if we can get it. So on the first pass, let's see how ChatGPT did. Similarly, yeah, I didn't really tell it what to do enough. So let's go ahead and look. So same thing here. It looks like Claude was only able to identify 3 of the items, and it did not get much of the nutritional content. Let's see here.

Jordan Wilson [00:42:49]:
K. Wow. So ChatGPT got almost all of them. However, I'm looking. I do think it made up. Okay. We got it looks like we got some hallucinations. It did get all of the content, so we got some of them exactly right, but some of this was not, was not items that were in there.

Jordan Wilson [00:43:20]:
So, like, okay, as an example, the diced tomatoes, yeah, those were in there and it looks like it got all of the nutritional information correct. The pasta unknown was correct, but I don't think we had green peas. I think this information is correct. I think those were pumpkin seeds. It said granola bars, Nature Valley. I don't think we had those either. Now I'm looking at the photo here. Yeah.

Jordan Wilson [00:43:48]:
We didn't. We had... it was an Elevation bar. So I think actually ChatGPT did a little bit better, but they actually both failed. But, you know, Claude did a pretty terrible job if I'm being honest. Right? You know, Anthropic was really adamant that their, you know, their vision was so improved, and it didn't look improved at all. Alright. We're gonna do one more vision prompt here, and then we're gonna get going. I know this is a longer episode y'all, but, hey.

Jordan Wilson [00:44:14]:
I asked in my newsletter. I said, what do you guys wanna see? And this is what you all wanted to see. So, I'm gonna give give the people what they want, I guess. Alright. So now I'm saying I have a photo. I'm gonna show the photo here to our livestream audience. Nothing crazy here. Just a simple simple photo here.

Jordan Wilson [00:44:33]:
So it's just the Chicago skyline, but we're on 90/94. That's the highway that we're on right there. Alright. So we're driving near Chicago. You should be able to tell that it is Chicago. Chicago has some iconic skyscrapers as well as, you know, the computer vision should be able to see these signs. Right? So it should know, California Avenue, California Avenue, Diversey Avenue, Fullerton Avenue, the exit numbers. It should be able to see those things, and it should know.

Jordan Wilson [00:45:02]:
Alright. So let's go ahead and see if it gets it right. So I'm saying, please identify where this picture is located, what direction the photo is facing, and every other detail that you can make out. Alright. So, we're giving them here a second. Alright. So let's see how it did. Alright.

Jordan Wilson [00:45:31]:
So Claude says, picture is in Chicago. The photo was taken on a highway or expressway leading into downtown Chicago. The photo, it says, is facing southeast. Alright. It's it's talking about the traffic, the road signs, the street lights, the skyline. Okay. Pulls out the Willis Tower, formerly the Sears Tower. So pretty good.

Jordan Wilson [00:45:54]:
Not great. I would have liked it to identify the highway. That's pretty important when I say where is this located. That's the exact same first thing. All it says is Chicago, Illinois. It should have been pretty easy to know that this should have been 90/94. Alright. So let's see how ChatGPT did.

Jordan Wilson [00:46:13]:
Alright. So it says city, Chicago, Illinois, facing south. Landmark, same thing. Willis Tower. Road, it says likely an expressway or major highway leading into downtown Chicago. So heavy traffic. It's talking about things. The signs, environmental details, vegetation.

Jordan Wilson [00:46:32]:
Alright. Here we go. Hey. It got it right. Good job, ChatGPT. So it said, it appears to be an expressway, likely I-90/I-94. So it got it right. So both of them got it right.

Jordan Wilson [00:46:44]:
We're seeing a pattern here. Both of them are getting it right, but ChatGPT is getting it more correct. Right? So that one there, Claude didn't know that that was 90/94. It should have known. It's one of the, you know, I don't know, 10 most popular highways in the United States, and there were plenty of telltale signs that that's what it should have been. But it didn't get it wrong. So alright. Now let's go ahead and get our next one here.

Jordan Wilson [00:47:14]:
I know this episode's going a little long y'all. Thanks for sticking around. I hope you're enjoying this. Like I said, if you are listening to this on the podcast, this is one of those you might wanna go just watch the replay if you wanna see these things. If you don't wanna do all this for yourself, I did all the prep work for you. Alright. So now let's go ahead and let's make sure I get the right thing here. Here we go.

Jordan Wilson [00:47:34]:
Alright. So now we're gonna be uploading a spreadsheet. We're gonna be doing some data visual, data visualization. Okay. So alright. I gotta make sure I get this key in here. This one this one's a little tricky. Alright.

Jordan Wilson [00:47:49]:
Let me explain what we have going on here. Alright. So I'm gonna be uploading a spreadsheet. Okay. So and I have some direction, some instructions. So I said, this is a huge dataset. It's a publicly available dataset, hundreds of thousands of rows of data. And I'm saying identify the 10 best selling video games worldwide from 1971 to 2024.

Jordan Wilson [00:48:13]:
For each game, provide the total sales, the regions where it was sold the most, and its critics score. Create a spreadsheet with that information and visualize the data in a graph. Alright. And then I give it a key so it knows what the data is, and then I'm gonna go ahead and upload the file there. So I uploaded it to ChatGPT. I'm gonna try to get these going at the same time if I can. Alright.
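For listeners wondering what this prompt asks the model to compute, the aggregation can be sketched with Python's standard library. This is a sketch under stated assumptions, not what ChatGPT actually ran: the column names, the region list, and every row in `SAMPLE_CSV` are made-up placeholders (the real file's layout comes from the key, which is not in the transcript), and averaging the critic score across platform rows is just one possible choice.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for the uploaded file: column names and rows are invented.
SAMPLE_CSV = """title,console,critic_score,na_sales,jp_sales,pal_sales
Game A,X1,9.0,5.0,1.0,4.0
Game A,X2,8.0,3.0,0.5,2.5
Game B,X1,7.5,1.0,6.0,1.0
Game C,X2,,0.5,0.25,0.25
"""

REGIONS = ("na_sales", "jp_sales", "pal_sales")  # assumed regional columns


def top_sellers(csv_text, n=10):
    """Sum sales per title across platform rows, then rank titles by total sales."""
    totals = defaultdict(lambda: {region: 0.0 for region in REGIONS})
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        for region in REGIONS:
            totals[row["title"]][region] += float(row[region] or 0)
        if row["critic_score"]:  # skip rows with no score
            scores[row["title"]].append(float(row["critic_score"]))

    table = []
    for title, by_region in totals.items():
        title_scores = scores[title]
        table.append({
            "title": title,
            "total_sales": sum(by_region.values()),
            "top_region": max(by_region, key=by_region.get),
            "critic_score": sum(title_scores) / len(title_scores) if title_scores else None,
        })
    table.sort(key=lambda entry: entry["total_sales"], reverse=True)
    return table[:n]


top = top_sellers(SAMPLE_CSV)
for entry in top:
    print(entry)
```

The step that is easy to miss is grouping by title first: a game released on several platforms shows up as several rows, so ranking raw rows would understate its worldwide total.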

Jordan Wilson [00:48:41]:
I'm gonna go ahead now and drop that in. There we go. Alright. And I'm gonna go ahead and click go. I gave Sonnet Claude Sonnet a little bit of a head start. Alright. Interesting. So it already says, your message will exceed the length limit for this chat on Claude.

Jordan Wilson [00:49:01]:
Bummer. Alright. So I have to start a new conversation in Claude. That is weak. That is weak, Claude. Why? If you have that long 200K memory, like, why can't I actually use it? That's interesting, Claude. Alright. Regardless, let's go ahead and give it a try.

Jordan Wilson [00:49:20]:
So now, you know, GPT-4o, I'm still working in the same chat. I just had to start a new chat in Anthropic's Claude. I don't know why. Maybe too many files. I'm not sure. Let's see. So now it says... alright.

Jordan Wilson [00:49:37]:
Bummer. So, alright, well, I guess Claude cannot even handle this spreadsheet. Alright. Interesting, Claude. Kinda disappointing. Let's see how ChatGPT did. I mean, it's a huge spreadsheet, but, Anthropic, if one of your things is you're talking about this large context window, I should be able to upload large files.

Jordan Wilson [00:50:03]:
Right? Even if it is a, document, a spreadsheet with tens of thousands of rows of data, I should be able to do that. Let's see how ChatGPT did. I'm gonna go ahead and download this file here. Alright. And let's go ahead and look at the visualization. Alright. Let's go ahead. It looks like it's giving me Alright.

Jordan Wilson [00:50:34]:
So it looks like we got an error from ChatGPT. However, it did... oh, no. There it is. Oh, okay. Okay, ChatGPT, GPT-4o. So not only did it actually give me the CSV that I downloaded, I'm looking at this here. There we go. There it is.

Jordan Wilson [00:50:56]:
Alright. But now, also, it gave me a visualization. Yeah. So, apparently, Grand Theft Auto, Grand Theft Auto is number 1. Grand Theft Auto: Vice City is number 2. Call of Duty, whatever whatever. It got it right. We broke Anthropic.

Jordan Wilson [00:51:14]:
Alright. So this is the first time I'm saying definitively that Claude got it wrong, ChatGPT got it right. The last couple, Claude got it kinda right, ChatGPT got it kinda better. But there we go. Claude failed, Plato. Not even close. Alright. So I'm also gonna go back.

Jordan Wilson [00:51:37]:
Alright. We got a couple more here. Let's go ahead and finish up. I think we just got 1 or 2 more. Alright. Well, actually, the next one, we can't even do, we can't even do inside of Claude. I'll just do it inside of ChatGPT to really see if we can do it. So I said create a line graph showing the total video game sales per year from 1971 to 2024.

Jordan Wilson [00:52:01]:
This is actually really hard. And then I said highlight any significant spikes or drops in sales and provide possible explanations for these trends based on historical events or industry changes. This is a lot. So now we see here, we have the Python code that the advanced data analysis version 2 is running inside of GPT-4o. Super powerful data analysis model. So it's going through it super quick. My gosh. That's good.

Jordan Wilson [00:52:32]:
My gosh. We just went through literally tens of thou... it might even be more than 100,000 rows of data. I couldn't even upload it into Claude even though it has this super big context window. Couldn't upload it. Look at this graph, y'all. This is... oh my gosh. This graph is interactive. Look at this.

Jordan Wilson [00:52:52]:
I am hovering over this, and it's showing me year by year trends. Y'all, my gosh. I am blown away by this. This is a freaking huge spreadsheet. Huge. Right? I just I just wanna show some of y'all. Let's just go ahead, let's just go ahead and show this y'all, because this is look at this. Look at all this data.

Jordan Wilson [00:53:22]:
Look at this. This is so much. Let's see how many rows. How many rows of data is this? And ChatGPT just ate it for breakfast. My gosh. Let's see. 64,000 rows of data, columns A through M. So again, yeah, that is hundreds of thousands of cells.

Jordan Wilson [00:53:46]:
ChatGPT just ate it for breakfast. Claude couldn't even take it. Claude couldn't take it. Alright. Anyways, let's keep going. Alright. Next one. This is our last one, y'all.

Jordan Wilson [00:53:54]:
Alright. So this one, I think, is probably where Claude is going to shine, but let's see. I actually have 2 files that I need to upload, so, bear with me for a second. Alright. And then I'll explain to you exactly what we're doing. Let's go ahead and upload these files, and then I'm gonna get it going at the same time. And hopefully, we can watch these go side by side. Alright.

Jordan Wilson [00:54:20]:
Also, let me put this out there. I'm a human. I write my newsletter, but I just wanted to do this as an example. So I see yeah. All the other, AI newsletters, they all brag about how AI writes my newsletter. I spend no time on it. Guess what? I spend a stupid amount of time writing the newsletter. I'm a human.

Jordan Wilson [00:54:39]:
I write the newsletter. Former journalist, I don't do this, but I just wanted to do this as an example. So I'm saying for this chat, you will turn a podcast transcript of me, Jordan, the host of Everyday AI, talking about the AI news that matters. So show yesterday, and turn it into choppy and engaging newsletter copy. I've attached examples of previous newsletters and how they should be written as well as my most recent podcast transcript. Please write a newsletter for the attached transcript, mimicking the style as closely as possible to the examples given. The priority is to write the newsletter in the exact same format, tone of voice, and style as the examples, but for this episode with the attached transcript, please complete this task. Alright.

Jordan Wilson [00:55:26]:
So we're gonna do it side by side here. Alright. Ready? And we'll see who is faster and who is better. Alright. So essentially, again, giving examples of the newsletter. I say these are how the newsletter usually is. Then I say, here's the transcript. Write it for this transcript.

Jordan Wilson [00:55:44]:
Alright. So we'll see I'm gonna see who actually finishes first. So they're going both pretty quickly. Let's see who finishes first. Alright. They're both pounding out content. They literally both finished at the exact same time. How is that even possible? Alright.

Jordan Wilson [00:56:06]:
Let's take a look at which one did better. Alright. So let's see here. On the left yeah. I can yeah. Claude Claude cleaned up here. Claude cleaned up. Alright.

Jordan Wilson [00:56:22]:
So it says, is ChatGPT in trouble? Question mark. Anthropic's Claude just dropped 3.5 Sonnet with Artifacts. Speaking of dropping, NVIDIA fell. Yeah. So it's actually kind of taking my transcript almost verbatim because I actually said those nearly same words. But that's fine. That's fine. Let's actually look at the body.

Jordan Wilson [00:56:44]:
Right? So, in our Monday newsletters, we essentially break down our main news stories. And the first part is a breakdown, and then the second part is, like, what it means. So let's see how it did. So, let's just kinda compare 1 by 1. This is meta, isn't it? So okay. Anthropic got this part wrong. So the first story was actually Runway Gen-3 Alpha, not Anthropic. So I'm wondering if Anthropic hallucinated here.

Jordan Wilson [00:57:17]:
Let's see here. I'm looking at the different ones. Okay. No. It didn't. Okay. So Anthropic just put it in a different order, which is okay. But, again, it's not what I wanted.

Jordan Wilson [00:57:30]:
I mean, the OpenAI, or sorry, the GPT-4o, the content tone is garbage. Right? So it starts with, hey there, AI enthusiast. That's what you always get out of ChatGPT by default. You know, one thing I think is clear, in terms of, you know, writing engaging copy: out of the box, Anthropic's Claude is better. ChatGPT is great. You just have to work with it. So, you know, overall, let's see kind of the... we'll just do the comparison here of Anthropic.

Jordan Wilson [00:58:02]:
So, okay. I mean, neither of them, if I'm being honest, so they didn't really follow the correct format. I'm looking at yeah. So what Anthropic did is it just bullet pointed everything, which is not how I normally do it. And then ChatGPT kind of, just created different subpoints per each category, which is also something I don't do. But let's let's look at the what it means and see if we can, you know, find any difference. So let's see what it means. Okay.

Jordan Wilson [00:58:40]:
So Anthropic says it's a slugfest between Anthropic and OpenAI, and poor Google Gemini can't seem to catch a break. Claude 3.5 Sonnet's Artifacts feature could be a game changer, potentially altering how we use large language models. So, again, the copy is pretty good from Anthropic. It's much better than ChatGPT. But if I'm being honest, it kinda just took my actual words, which is not what I told it to do. I told it to reflect the tone and mimic the tone in the newsletter exactly, which it really didn't do that well. However, the overall copy is much better, but it didn't follow the instructions. So let's read the what it means for ChatGPT here for this newsletter piece.

Jordan Wilson [00:59:19]:
It says the introduction of Artifacts in Claude 3.5 Sonnet enhances user experiences by enabling tasks such as generating text documents, code. Okay. So neither of them did a great job. I will say they both passed because they both correctly... here's the thing y'all. I uploaded 2 separate PDFs. One of them was more than 20 pages. The other one was 8 pages. I'm going through here, and it got everything correct.

Jordan Wilson [00:59:44]:
So first and foremost, that's super important. So there's no hallucinations. They didn't get everything in the correct order, didn't get the format right, but they both did a serviceable job. Essentially, I mean, Anthropic's Claude just really took the words out of my mouth, which is what I told it really not to do. I said, no, use that information from the transcript, but write it like the newsletter examples. So neither of them did a great job. I will say, you know, if I had to choose here, I'll give the slight nod to Claude.

Jordan Wilson [01:00:19]:
They both passed, but I think Claude did a little bit better. Wow. So that was a ton, y'all. That was a giant, super long, super in-depth episode. Let me give you just a quick recap. Alright. So like I said, what's new in Claude 3.5 Sonnet? Well, the model's really fantastic. It is fast.

Jordan Wilson [01:00:48]:
It is powerful. And at least according to Anthropic's benchmarks, it is much more capable than any other model. Some things I like and don't like. Love the Artifacts feature when it works. I had to do it twice in the beginning there, but you saw it actually created a very nice website, and it generated it in real time. It rendered it. Love to see it. What I don't like: it's not connected to the Internet.

Jordan Wilson [01:01:11]:
And, apparently, you you can't upload long documents. So couple things I didn't like. You can't upload long documents, and also the chat history broke. Right? It said, hey. This is too much information for this chat. You gotta start a new chat. What happened to that long context window? Anthropic, is that gone now? Was it just because I was uploading a super long file? Regardless, don't like it. Also, the fact that it's not connected to the Internet, I cannot recommend Claude to any business person.

Jordan Wilson [01:01:38]:
I cannot recommend that you use Claude until they offer that. Period. I already gave you my rundown on why I think that's important. Alright. So a couple thoughts on the large language model race that's going on. I didn't really get into that too much. Let me just give you the super hot take right now. It's just gonna be a lot of back and forth if I'm being honest. Right? So, you know, first you had Claude 3 and then there's GPT-4 Turbo, and now we have Claude 3, or sorry, then you had GPT-4o.

Jordan Wilson [01:02:04]:
So sorry. It was GPT-4 Turbo first and then Claude 3. And then GPT-4o. Now you have Claude 3.5 Sonnet. So it's literally been back and forth, back and forth with who has the most powerful and capable model according to the benchmarks. Here's the thing. If I'm being honest, if I was OpenAI, I'm not rushing to release my next model. Here's why.

Jordan Wilson [01:02:28]:
And I think that Anthropic actually got this wrong with the timing. Because all OpenAI has to do is release those other features, and all of a sudden people are gonna forget about this new 3.5 Sonnet. Alright? So, OpenAI announced their GPT-4o model, but with a lot of other features that are not out yet. Kind of this more neural and, you know, more relatable voice. This, you know, they call it her, also kind of this live omni is what we're calling it. So the ability for the model to see in real time and interact in real time. You know what? OpenAI could release this today or tomorrow, and I think people would stop caring as much about this new 3.5 Sonnet because all eyes are gonna be on that. Here's the other thing.

Jordan Wilson [01:03:18]:
They did this with their middle model, Anthropic did. So I do assume that they know that OpenAI is gonna clap back. And when they do, I'm guessing they are going to release this new 3.5 Opus, which presumably would be at or above OpenAI's next model. Presumably. I don't know if they will. I think once OpenAI releases, whether it's 4.5 or 5 or they might call it GPT Next. I'm not sure what it's gonna be called. Whenever they do, if I'm being honest, I think it's gonna take everyone a couple years to catch up, like, if we're being honest.

Jordan Wilson [01:03:55]:
However, it is just gonna be this back and forth, and I just don't like the timing here from Anthropic. Right? Apple just, you know, had their Apple Intelligence announcement, like, 10 days ago, partnering up with OpenAI. You just have the Copilot Plus PCs that are shipping right now from Microsoft. Yeah. Like, come on, Anthropic. Timing is everything. And the timing on this one, the timing was off. Right? You have a very great model, very capable, very fast, some very impressive things on it, and they released it on, like, a what was it? A Thursday, like, a Thursday afternoon, right in the middle of all these other things that are going on.

Jordan Wilson [01:04:36]:
So, not crazy about the timing, and I do think that OpenAI at any point can make this kind of irrelevant. Right? The benchmarks are great. The Artifacts thing I think is gonna change how people are using large language models. And then when we look at the head to head, I'm looking at my score here. Technically, ChatGPT was slightly ahead. GPT-4o. Right? They all kind of got the same amount of things right, except Claude failed. It couldn't accept a spreadsheet that had 600,000 cells.

Jordan Wilson [01:05:15]:
Right? So maybe it's not built for that, but yo, like, ChatGPT munched it. It crushed it. It literally gobbled up 600,000 cells and created an interactive spreadsheet as fast as I could read it. That is extremely impressive. So on the head to head, it was kind of close, but there were certain things that, you know, they kind of both got right, and ChatGPT just did a little bit better. I gave you some of those examples, and then obviously, Claude just couldn't handle the size of that spreadsheet even with this long 200,000 token context window. Not sure why, but it couldn't. This was a long one, y'all.
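A back-of-envelope calculation suggests one possible reason, though it is not a confirmed explanation of how Claude's upload limit works. The sketch below, in Python, uses the figures quoted earlier in the episode (64,000 rows, columns A through M) and the 200,000-token context window. The one-token-per-cell figure is an assumption chosen as an optimistic lower bound; real cells with titles, dates, and decimals would usually cost more.

```python
import string

# Figures quoted in the episode.
rows = 64_000
first_col, last_col = "A", "M"
context_window_tokens = 200_000

# Columns A through M inclusive.
letters = string.ascii_uppercase
cols = letters.index(last_col) - letters.index(first_col) + 1

cells = rows * cols

# Assumption: every cell costs at least one token when the sheet is read as text.
tokens_per_cell = 1
estimated_tokens = cells * tokens_per_cell

fits = estimated_tokens <= context_window_tokens

print(f"{cols} columns x {rows:,} rows = {cells:,} cells")
print(f"estimated tokens >= {estimated_tokens:,}; fits in window: {fits}")
```

With 13 columns the sheet comes to 832,000 cells, the same order of magnitude as the roughly 600,000 quoted on the show, and several times the 200,000-token window even under the most generous assumption. A tool that runs code over the file, as ChatGPT's data analysis does, never has to read the whole sheet into the context window, which would explain the difference.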

Jordan Wilson [01:05:56]:
I hope this was helpful. If so, please give us a rating. Give us a rating on the podcast. If you're listening on the podcast, we'd appreciate it if you could rate or subscribe. If you're still online... are you still online? This is an hour. This is a marathon episode. Thank you for listening. Please, if this is helpful, repost this, tag a friend, something.

Jordan Wilson [01:06:17]:
Let me know. Drop me a comment. But please join me tomorrow and every day for more Everyday AI. Thanks, y'all.
