Cogito 671B v2.1 is Deep Cogito's 671-billion-parameter Mixture-of-Experts (MoE) language model. It uses FP8 dynamic quantization for efficient inference while maintaining high-quality outputs across reasoning, coding, and general knowledge tasks.

Try in playground

Test Cogito 671B v2.1 in the Runpod Hub playground.
Endpoint: https://api.runpod.ai/v2/cogito-671b-v2-1-fp8-dynamic/runsync
Pricing: $0.50 per 1M tokens
Type: Text generation
This endpoint is fully compatible with the OpenAI API. See the OpenAI compatibility examples below.

Request

All parameters are passed within the input object in the request body.
input.prompt (string, required)
Prompt for text generation.

input.max_tokens (integer, default: 512)
Maximum number of tokens to generate.

input.temperature (float, default: 0.7)
Randomness of the output. Lower values make the output more predictable and deterministic. Range: 0.0-1.0.

input.top_p (float)
Nucleus sampling threshold. Samples from the smallest set of tokens whose cumulative probability exceeds this threshold.

input.top_k (integer)
Restricts sampling to the K most probable tokens.

input.stop (string)
Stops generation when the given string is encountered.
curl -X POST "https://api.runpod.ai/v2/cogito-671b-v2-1-fp8-dynamic/runsync" \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "prompt": "Write a detailed analysis of the economic impacts of renewable energy adoption:",
      "max_tokens": 1024,
      "temperature": 0.7
    }
  }'
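The same request can be sent from Python; below is a minimal sketch using only the standard library. The helper names (build_request, run_sync) are illustrative, not part of any Runpod SDK, and the key is assumed to live in the RUNPOD_API_KEY environment variable.

```python
import json
import os
import urllib.request

ENDPOINT = "https://api.runpod.ai/v2/cogito-671b-v2-1-fp8-dynamic/runsync"

def build_request(prompt: str, max_tokens: int = 512, temperature: float = 0.7) -> dict:
    # All generation parameters must sit inside the `input` object.
    return {
        "input": {
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
        }
    }

def run_sync(prompt: str, api_key: str, **params) -> dict:
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_request(prompt, **params)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example call (requires a valid key; not executed here):
# result = run_sync("Hello", os.environ["RUNPOD_API_KEY"], max_tokens=64)
```

Because /runsync blocks until generation finishes, consider a generous client-side timeout for long prompts.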

Response

id (string)
Unique identifier for the request.

status (string)
Request status. Returns COMPLETED on success or FAILED on error.

delayTime (integer)
Time in milliseconds the request spent in queue before processing began.

executionTime (integer)
Time in milliseconds the model took to generate the response.

output (array)
Array of generation results. Each entry contains the generated text, cost, and usage information.

output[].choices (array)
Array containing the generated text.

output[].cost (float)
Cost of the generation in USD.

output[].usage (object)
Token usage information with input and output counts.
{
  "delayTime": 45,
  "executionTime": 8234,
  "id": "sync-a1b2c3d4-e5f6-7890-abcd-ef1234567890-u1",
  "output": [
    {
      "choices": [
        {
          "tokens": [
            "The economic impacts of renewable energy adoption are multifaceted and far-reaching. Here's a comprehensive analysis:\n\n1. Job Creation and Labor Markets..."
          ]
        }
      ],
      "cost": 0.0005,
      "usage": {
        "input": 20,
        "output": 980
      }
    }
  ],
  "status": "COMPLETED"
}
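Because output is an array, extracting the generated text takes one extra index. A short sketch against the sample response above (the field paths come from the example; the surrounding code is illustrative):

```python
# Parsed JSON from a /runsync call, matching the sample response above.
response = {
    "delayTime": 45,
    "executionTime": 8234,
    "id": "sync-a1b2c3d4-e5f6-7890-abcd-ef1234567890-u1",
    "output": [
        {
            "choices": [{"tokens": ["The economic impacts of renewable energy adoption..."]}],
            "cost": 0.0005,
            "usage": {"input": 20, "output": 980},
        }
    ],
    "status": "COMPLETED",
}

if response["status"] == "COMPLETED":
    # `output` is a list; the first element holds choices, cost, and usage.
    result = response["output"][0]
    text = result["choices"][0]["tokens"][0]
    print(text)
    print(f"Tokens: {result['usage']['input']} in, {result['usage']['output']} out")
```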

OpenAI API compatibility

Cogito 671B v2.1 is fully compatible with the OpenAI API format. You can use the OpenAI Python client to interact with this endpoint.
Python (OpenAI SDK)
import os

from openai import OpenAI

client = OpenAI(
    # Read the API key from the environment rather than hard-coding it.
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url="https://api.runpod.ai/v2/cogito-671b-v2-1-fp8-dynamic/openai/v1",
)

response = client.chat.completions.create(
    model="cogito-671b-v2-1-fp8-dynamic",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant with expertise in economics and analysis.",
        },
        {
            "role": "user",
            "content": "Analyze the economic impacts of renewable energy adoption.",
        },
    ],
    max_tokens=1024,
)

print(response.choices[0].message.content)
For streaming responses, add stream=True:
Python (Streaming)
response = client.chat.completions.create(
    model="cogito-671b-v2-1-fp8-dynamic",
    messages=[
        {"role": "user", "content": "Explain the principles of machine learning."}
    ],
    max_tokens=1024,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
For more details, see Send vLLM requests and the OpenAI API compatibility guide.

Cost calculation

Cogito 671B v2.1 charges $0.50 per 1M tokens. Example costs:
Tokens          Cost
1,000           $0.0005
10,000          $0.005
100,000         $0.05
1,000,000       $0.50
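The table follows from a single linear formula: cost = tokens / 1,000,000 × $0.50. A one-function sketch (the helper name generation_cost is illustrative):

```python
PRICE_PER_MILLION_TOKENS = 0.50  # USD, from the pricing above

def generation_cost(total_tokens: int) -> float:
    """Cost in USD for a given total (input + output) token count."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens -> ${generation_cost(n):.4f}")
```

Note that billing counts both input and output tokens, so the usage object in the response gives the numbers to plug in.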