Luma Dream Machine 1.5 vs 1.0 - Comparison through Luma API
Hi everyone!
On August 20th, 2024, Luma - the team behind one of the best text/image-to-video generative AI models currently on the market - announced on X that their new 1.5 version is now available for the public to try!
Luma Dream Machine 1.5 Release
With the new 1.5 version update, Luma promises better overall quality, better prompt adherence, and more accurate custom text rendering in the generated videos.
Unfortunately, Luma does not offer an option to select the previous model version when generating videos on their platform, so the difference between the former model and the new v1.5 model is hard to evaluate directly.
However, given PiAPI's position as the market-leading generative AI API provider, whose offerings include the Luma API (or Dream Machine API), we are able to perform this comparison: we test our own products extensively before releasing them to the market, and continuously after the initial release.
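To illustrate how such a side-by-side test can be run programmatically, here is a minimal sketch in Python. The endpoint path, auth header, and payload field names are assumptions made purely for this example - refer to PiAPI's Luma API (Dream Machine API) documentation for the actual request schema.

```python
import os
import requests

# Illustrative sketch only: the endpoint path, auth header, and payload field
# names below are assumptions; consult PiAPI's Luma API docs for the real schema.
API_KEY = os.environ["PIAPI_API_KEY"]                      # hypothetical env var
ENDPOINT = "https://api.piapi.ai/dream-machine/generate"   # hypothetical path


def generate_video(prompt: str, model_version: str) -> dict:
    """Submit a text-to-video generation task for a given model version."""
    payload = {
        "prompt": prompt,
        "model_version": model_version,  # e.g. "1.0" or "1.5" (assumed field)
    }
    resp = requests.post(
        ENDPOINT,
        headers={"X-API-Key": API_KEY},  # assumed auth header
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # typically a task object/id to poll for the video URL


# The same prompt is submitted to both versions for a side-by-side comparison.
prompt = ("A peaceful Zen garden with carefully raked sand, "
          "bonsai trees, and a small koi pond.")
task_v10 = generate_video(prompt, "1.0")
task_v15 = generate_video(prompt, "1.5")
```

Keeping the prompt identical across versions (as above) is what makes the per-aspect comparison below meaningful.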
Text-to-Video Evaluation Framework
For this comparison, we have taken the comprehensive text-to-video evaluation framework from Labelbox and added "Text Adherence" into the mix, since, judging from our user feedback, text adherence is an important aspect for a generative video model to become a prevalent productivity tool rather than just a fleeting entertainment toy.
Below are the various aspects (and their respective explanations) we will use to compare the output videos of the two model versions.
Prompt Adherence
We assess how well the output video matches the given text prompt. For example, for the prompt "A peaceful Zen garden with carefully raked sand, bonsai trees, and a small koi pond," we check for the presence of the prompt's key concepts:
- Is there a garden?
- Does it look peaceful?
- Is the sand present, and is it raked?
- Are there bonsai trees?
- Is there a small koi pond?
Scoring
- High: If all or most of the key concepts are present.
- Medium: If about half of the key concepts are present.
- Low: If fewer than half of the key concepts are present.
Text Adherence
We assess how accurately any text required by the prompt is reproduced in the generated video.
- High: Text is reproduced accurately in the generated video.
- Medium: Text is reproduced with minor mistakes in the generated video.
- Low: Text is not reproduced, or is reproduced with major mistakes, in the generated video.
Video Realism
We assess how closely the generated video resembles reality, although reality is not always an appropriate benchmark depending on the context of the prompt.
- High: Realistic lighting, textures, and proportions.
- Medium: Somewhat realistic but with slight issues in shadows or textures.
- Low: Animated or artificial appearance.
Artifacts
We scan for any visible artifacts, distortions, or errors in the video, such as:
- Unnatural distortions in objects or backgrounds
- Misplaced or floating elements
- Inconsistent lighting or shadows
- Unnatural repeating patterns
- Unnatural movements
- Blurred or pixelated areas
Scoring
- High: If 4 or more of the listed error types are present.
- Medium: If 2 or 3 error types are present.
- Low: If 0 or 1 error types are present.
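For readers who want to apply the same rubric to their own generations, the following is a minimal sketch (in Python) that encodes the two count-based ratings as threshold functions. The exact numeric cutoffs are our reading of the rubric above, not something defined by Labelbox.

```python
from typing import List


def rate_prompt_adherence(concepts_present: int, concepts_total: int) -> str:
    """Rate how many of the prompt's key concepts appear in the video."""
    ratio = concepts_present / concepts_total
    if ratio > 0.5:
        return "High"    # all or most key concepts present
    if ratio == 0.5:
        return "Medium"  # about half of the key concepts present
    return "Low"         # fewer than half of the key concepts present


def rate_artifacts(artifact_types_found: List[str]) -> str:
    """Rate the level of artifacting by counting distinct error types observed."""
    n = len(artifact_types_found)
    if n >= 4:
        return "High"    # heavily artifacted output
    if n >= 2:
        return "Medium"
    return "Low"         # clean or nearly clean output
```

For example, spotting four of the five Zen-garden concepts gives `rate_prompt_adherence(4, 5) == "High"`, while observing two artifact types (say, floating elements and inconsistent shadows) gives a "Medium" artifact rating.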
v1.0 and v1.5 Video Comparison
Now, let's check out the same-prompt comparisons between Dream Machine 1.0 and the new Dream Machine 1.5.
Prompt Adherence: Both videos display an old man walking in a park, although the v1.5 version has a prolonged shot where no old man is present; this could be due to a lack of specificity in the prompt.
Video Realism: Both versions show the basic elements of the anime style, and the v1.5 version shows shadow details from the trees in the park.
Artifacts: The v1.0 version shows a significant error in which the elderly man faces away from the camera while walking towards it, and the v1.5 version shows a prolonged shot with no human figure present.
Overall, we see a slight improvement in the v1.5 video.
Prompt Adherence: Both videos display a tornado, with the v1.5 version much more realistic than the v1.0.
Video Realism: The proportions in the v1.0 video are quite off, and the tornado portrayed is barely visible. In the v1.5 video we can see flying elements slowly circling the eye of the tornado.
Text Adherence: The v1.0 video spelled out "BYLD Network" quite accurately, with the Y and L overlapping slightly. The v1.5 video spelled the word with a double B.
Artifacts: The v1.0 version portrayed a much less accurate representation of the tornado compared to the v1.5 video.
Overall, we see a general improvement in the v1.5 video, with the exception of the Text Adherence aspect.
Prompt Adherence: Both videos display a teddy bear donning a pair of sunglasses, although if you google "teddy bear" most results bear more resemblance to the one from v1.5. Both videos show a running waterfall amidst a jungle. However, v1.0 shows much better "headbanging" and playing motion compared to the relatively still v1.5 video.
Video Realism: The level of realism displayed in the v1.0 video is very high, given the teddy's dynamic movements, the shadows that follow them, and the realistic human-like motions. The v1.5 video, on the other hand, is very underwhelming given the static look of the character.
Artifacts: The v1.5 video showed a more static version of the teddy, whereas the prompt specifically asked for "dancing and headbanging".
For this prompt, it is quite clear that the v1.0 video is of higher quality.
Prompt Adherence: Both videos display a black Dodge Challenger on asphalt, with a red sign visible and the view from above, but neither car is drifting, which could be due to a lack of movement-specific training data.
Video Realism: The level of realism displayed in the v1.5 video is quite high - the lighting on the car, the texture of the car itself, and the shadows of the surroundings are all comparatively better than in the v1.0 version.
Text Adherence: The accuracy of the text reproduced in the v1.5 video is higher than in the v1.0 version; the latter spelled "Kassar" whereas the former spelled "Kassi 34", which is just one letter short of the text specified in the prompt.
Artifacts: Neither video showed the drifting motion specified in the prompt. Other than that, there do not appear to be any major artifacts in either video.
For this prompt, the v1.5 video is of higher quality compared to the v1.0 version.
Conclusion
Based on the various examples provided above, we can see that Dream Machine 1.5 from Luma is indeed better than the previous 1.0 version in overall quality, text adherence, and video realism. It should be noted that the improvement does not hold for every generation, as can be observed from the teddy bear example.
However, given the probabilistic nature of generative AI models, this is to be expected.
We hope you found our comparison useful! And if you're interested, check out the generative AI APIs from PiAPI!